SQL XML Bulk Load and line breaks

by ASH February 09, 2010 13:12

I have been working with SQL XML Bulk Load (version 4.0) over the last period of time, and suddenly I encountered an issue where the imported data didn’t match the data in the XML files I received.
Looking closer into the issue, I discovered it was because the SQL XML Bulk Load removed my line breaks entirely from the imported data.

In XML line breaks are defined as standard using only "line feed" (LF). However working in a windows environment line breaks are almost always both carriage return and line feed combined (CR+LF).
However when the SQL XML Bulk Load parser (I would think it is a parsing issue) encounters a CR+LF in a node value it strips both characters from the text which results in the data being imported without the line break. If however the parser encounters only a LF, then that LF is retained in the imported data.

Now this seemed like a strange behaviour, because why would both CR and LF be removed and not only the CR, if it was to rely on the standard. However even more annoying, when the CR+LF is used as line breaks most every other place in Windows environments, why even remove those two characters to begin with?

I contacted Microsoft technical support which seems to have confirmed that it isn’t an error on my account, and that they maybe would look at it for the next release.  (*fingers crossed*)
Whether or not they view it as a bug I do not know – but there is definitely something wrong there. And I would call it a bug, because removing extra information can’t be intentional.

To illustrate my issue I made a very simple XML file containing two nodes:

<root>
<
myNode>Before CRLF
After CRLF
</myNode>
<
myNode>Before LF
After LF
</myNode>
</
root>

In this file, the first line break was made with a CR+LF (or rather Environment.NewLine via .NET, but parsing it matches CR+LF characters)  and the second was made only with the line feed as a line break.
I then imported that file into a database using the schema file:

<?xml version="1.0" encoding="utf-8"?>
<
xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:dt="urn:schemas-microsoft-com:datatypes"
xmlns:sql="urn:schemas-microsoft-com:mapping-schema"> 

  <xs:element name="root" sql:is-constant="1">
   <
xs:complexType>
    <
xs:sequence>
     <
xs:element xml:space="preserve" type="xs:token" dt:type="string" name="myNode" sql:relation="LineBreak" />
   </
xs:sequence>
  </
xs:complexType>
 </
xs:element>
</
xs:schema>

Where I tried to do all sorts of XSD “keep my line breaks as they are” tricks (space=”preserve” and so on).
I then ran the import (vb.net code)

Dim strInFile As String = "c:\linebreak.xml" 
Dim strSchema As String = 
"c:\linebreak.xsd" 
Dim objBL As Object = CreateObject("SQLXMLBulkLoad.SQLXMLBulkload.4.0"
)
Try
    objBL.ConnectionString 
"connection string"
    
objBL.KeepIdentity 
= False
    
objBL.XMLFragment 
= True
    
objBL.ErrorLogFile 
"error log file path"
    
objBL.Execute(strSchemastrInFile
)
Catch ex As 
Exception
    
Debug.WriteLine(ex.ToString
)
Finally
    
If Not objBL 
Is Nothing Then
        
objBL 
= Nothing
    End If
End 
Try
  

And when I viewed the imported data, the CR+LF was removed from the first “myNode” and it had no breaks what so ever, and in the second “myNode” the LF was retained.
I was asked by the Microsoft employee I was in contact with to create the issue on Microsoft Connect, so I eagerly await what comes of this for the next release(s) of SQL XML Bulk Load. Hopefully they’ll have this fixed.

The only work around I have found myself is to pre-/post process the files. Before importing, run through the XML file and replace your line breaks with some special character not present in the file (usually ¤ is a good candidate) – and once the import has run, then post process your data and replace the used character with CR+LF again.
It is not a beautiful solution, but it works around the issue and helped myself move forward with the project.
 

Aggregated functions and CASE

by ASH February 03, 2010 13:11

I was stuck with a problem at work where I had a very complex query pulling data out of some tables, some XML and what not, and a lot of data manipulating in the query to be able to easier populate a DataWarehouse.
I then had to expand this query and pull some extra Boolean information out, which were based on a logical comparison of two strings, which I would normal do with a CASE like CASE string1 = string2 THEN 1.

However because I had a GROUP BY clause, I could not simply do this, because the columns the strings were taken from was not in the group clause and I didn’t want them there.
So I found out that you can actually put an aggregated function around the case using the following syntax:

SELECT 
<GroupByFields
>,
MAX
(
    
CASE 
        
WHEN <NonGroupField=  'some value' THEN 
1
        
ELSE 
0
    
END
)
FROM <TABLE
>
GROUP BY <GroupByFields

 

This saved me from using a common table expression or two more, and kept things simple. In essense this allows you to make conditional SUM and similar using this technique.
Just more evidence that a lot is possible in SQL, and just trying something often reveals interesting results.

Selecting latest order

by ASH October 31, 2009 12:15

A common problem to solve is to list, for example the current price for a product or latest order for a customer or similar.
It is a problem which quickly can seem complex, but once understanding the situation, then – as everything else – it is relative simple, and I’ll show how using both window functions (in this instance, RANK) from SQL Server 2005+ and using a sub-query in case working on an earlier version, or other databases.

I’ll take offset in the Northwind database, which I’ve installed on a SQL Server 2008.
It has an Order table with a foreign key to a Customer table; however for this the Order table is the only one of interest:
The Order table contains the following information:
SELECT [OrderID]
      
,
[CustomerID]
      
,
[EmployeeID]
      
,
[OrderDate]
      
,
[RequiredDate]
      
,
[ShippedDate]
      
,
[ShipVia]
      
,
[Freight]
      
,
[ShipName]
      
,
[ShipAddress]
      
,
[ShipCity]
      
,
[ShipRegion]
      
,
[ShipPostalCode]
      
,
[ShipCountry]
  
FROM [Northwind].[dbo].[Orders]
 


In this situation, we’ll get the latest order per customer, but the problem is the same as if you’d need the current price – the only thing which differs will be the tables and where clauses and so on.
Solving the issue using the RANK window function, and a common table expression, it would look like this:

;WITH CTE AS (
SELECT 
RANK() OVER (Partition BY T1.CustomerID ORDER BY OrderDate DESCAS OrderRank
,
T1.
*
FROM 
Orders T1
)
SELECT FROM CTE WHERE OrderRank 
1
ORDER BY CustomerID
   

What happens here is that we use RANK to give us a number partitioned (grouped) by CustomerID and sorted by OrderDate descending. This will give each row selected the number/rank that order is historically, from newest to earliest. That means each row which have the rank 1 will be the latest/newest order.
You can then expand on the joins inside/outside the common table expression to get information about customers or order details or what not.

It is also solvable without using the window function RANK, and would look for example something like this:

SELECT 
FROM 
Orders T1
INNER JOIN 
(
   
SELECT CustomerIDMAX(OrderDateAS 
OrderDate
   
FROM 
Orders T2
   
GROUP BY 
T2.CustomerID
T3 ON T3.OrderDate T1.OrderDate AND T1.CustomerID 
T3.CustomerID
ORDER BY T1.CustomerID
    

Here we utilize a sub-query where we select the CustomerID and the highest order date from Orders, which we then can join into Orders again with a self-join on customer id and order date.

Note that these are just examples. Many similar solutions exists, but they all follow the same methodology.
I’ve also not taken optimizaiton or anything like that into account. It was just to illustrate a solution to a common problem.

 

SQL XML Bulk Load - some common errors

by ASH October 22, 2009 11:50

Seeing as I have now worked somewhat the SQL XML Bulk load, I have decided to start compiling a list of the errors I got and what I did to solve them to help myself (and others) the next time I encounter one of these errors.

The statement has been terminated.
This usually means some SQL Error has occurred within the bulk load, for example a wrong data type, truncating of data and so on.
Check the ErrorLog file for more information.

Schema: the relationship XXX referred to in the node 'YYY' is not defined.
This means – as the error says – that a relationship is missing in the schema file. Usually more often than not, this means I’ve misspelled something in my relationship definitions or where I use the relationship. Otherwise make sure your relationship is defined if it is missing.

Schema: the annotation 'field' on the empty content element 'XXX' is not supported.
The “field” annotation is set on what is a container tag. You can’t assign a field to a container tag, but it needs to be set on an element.

Schema: the parent/child table of the relationship on 'XXX' does not match.
This for me usually means that I’ve forgotten to add  the relation on the field which has a relationship, or that I’m accessing the wrong tables.

The column 'XXX' was defined in the schema, but does not exist in the database.
Well, obviously, this means that there’s a field mapping in the schema file which doesn’t match the database. Check the spelling of the field/column name in the schema file.

Invalid character value for cast specification.
This means that you’re trying to make an invalid cast, for example trying to cast a string into a number field.

 

The IsNumeric Trap

by ASH September 04, 2009 08:13

As mentioned – casually – in the msdn documentation the IsNumeric function will return 1 for some values which aren’t actually numbers.  The currency sign $ is mentioned as well are plus/minus.
Period and comma will also be returning 1.

This essentially means that you can’t be sure that it is actually a number/numeric value which passes the IsNumeric check.
And least of all, you can’t be assured of the semantic value returned is what you expect.
This can – if unaware and not paying attention – be a problem when dealing with number values for countries which do not follow the same period/decimal convention as the US.

Also – “funnily” enough due to the caveats in the IsNumeric, you can’t actually be assured that it can be converted to a Numeric data type. (It can normally always be converted into a Money, but then the name is misleading :) )
To illustrate notice for example the following snippet:

DECLARE @VAR VARCHAR(20
SET @VAR '€,.,,'   
  

This will pass a “IsNumeric” check with the result 1, but it can’t be cast into a numeric/decimal. It can be cast into money but will return 0.00
And that is despite it is not in any form an actual numeric value.
There are some semantic checks built into the IsNumeric such as you can’t have values infront of the currency sign, so

SET @VAR '123$' 

would fail a “IsNumeric” check. Also some checks on the numbers of “plus/minus” signs and so on......
So when using IsNumeric, it is time to be careful and not just accept a success for actual success.

Tags:

SQL

Enabling moderated comments due to spam

by ASH July 26, 2009 18:07

Some time ago Google announced more insight in how the nofollow attribute worked, and since then I’ve received an explosion in comment spam.

So now I’ve taken the liberty of enabling moderating of comments, to stem up against this inrease in spams - too bad people have to ruin it for others.
Funnily enough some of these spamming idiots have the gall to put “nospam” in as e-mail address ….. so I guess spam is an issue for them as well. 

With any luck, moderated comments will just be temporary, but well – it never hurt anybody to wait a little while with putting a comment on-line.

Tags:

Local variables and batch scope

by ASH July 14, 2009 10:36

I was recently debugging a problem in a Stored Procedure where the wrong values where inserted into a table, in the middle of a long running cursor operation.
The culprit turned out to be the scope of a variable in SQL versus the scope most (object orientated) developers are used to in the language they usually code in.

The problem can be illustrated with this syntax:
DECLARE @Counter INT = 1

WHILE @Counter 
BEGIN
   DECLARE 
@Var VARCHAR(MAX

   
IF @Counter 

       
SET @Var 
'SomeValue'
   
SELECT 
@Var
   
SET @Counter +
1
END

The way scope works for variables in T-SQL is "batch scope", whereas most OO developers are used to the scope being limited within the innermost context.

The way it functions in SQL is that in the first iteration of the while loop the variable "@Var" gets declared but not assigned a value.
Thus, when selecting it out, the result will be NULL.

However in the second iteration, the variable is already declared once, so SQL Server will not "recreate" it, because the scope is "batch scope".
So here it’ll just be assigned the value "SomeValue".

The third iteration is like the second. The variable is already declared, and because it in the previous iteration had the value set to "SomeValue", that value will persist in this iteration, and all subsequent iterations until another value is set.

This is a source of errors if not careful, because the object orientated developer will – if not aware of this – most likely read the above as only iteration 2 will contain the value "SomeValue", whereas all other iterations will contain NULL.
But because of the batch scope, it is not so.

As the result from the query shows:

Tags: ,

SQL

SQL XML Bulk Load - basic XSD syntax

by ASH July 13, 2009 20:05

Some time ago I started a blog post on how to use SQL XML Bulk Load for SQL Server, and this time around I’ll tackle some of the syntax for how to map a XML file to the Bulk Load.

When mapping an XML file to the schema for XML SQL Bulk Load the easiest method is to get an IDE (for example Visual Studio) to create the schema file for you based on your XML, if you don’t have the schema outright.

In this blog post, I’ll use this XML file as an example, because it holds both how to read fields, attributes and relationships, and thus should cover many of the basics of the bulk load. I made my schema file via Visual Studio, and once you have a schema file you can start mapping the elements to the database tables and columns.

The first things first is to import the namespace: urn:schemas-microsoft-com:xml-sql into the file with a prefix. Usually I suggest (as others do) using "sql" for this prefix, simply to illustrate clearly that the namespace is tied to the bulk import and not to confuse it with other namespaces.
Now –a word of warning. The bulk load can only insert values into the database, so the way I normally handle imports in a production environment is to make an "import structure" which mirrors the imported data and will act as temporary storage. Then I make some SQL scripts to moving the data over into the production tables from my temporary import structure.
This also allows for interjecting a sort of validation layer for your data if you choose.

For this xml import, I make this import structure in the database:

CREATE TABLE [dbo].[BulkLoadedProduct](
   
[InternalProductID] [int] IDENTITY(1,1
) NOT NULL,
   
[ProductID] [nvarchar](200
) NOT NULL
ON 
[PRIMARY]

CREATE TABLE [dbo].[BulkLoadedProductDetails]
(
   
[InternalProductID] [int] 
NOT NULL,
   
[LanguageCode] [varchar](5
) NOT NULL,
   
[Name] [nvarchar](255
) NOT NULL,
   
[ProductDescription1] [nvarchar](MAX
) NOT NULL
ON [PRIMARY]
  

Now the sharp observer will notice some inconsistency in the naming of my columns; it is quite intentional, because I’ll use it later to illustrate how to map elements to fields in two different methods.

Just to recap, the method of calling the Bulk Load was written in an earlier blog piece which can be found here.

So, down to business. The first keyword which is needed to be mentioned is the "is-constant".
This is used to indicate an element which is only appearing once per “relationship”, and which isn’t to be mapped to a field in the database.
So in my instance the “Products” root element is one of those, so I annotate this field with the is-constant="1" like this:
<xs:element Name="Products" sql:is-constant="1">

The next annotation which is important to know is "relation". This is used to map elements to a table in the database. So in this case, I want to map Products elements to the table BulkLoadedProduct which happens like this:
<xs:element Name="Product" maxOccurs="unbounded" sql:relation="BulkLoadedProduct">    

After this we need to tell the bulk load where to put the field "ID" into the database. That’s done with the sql:field annotation like this:
<xs:element Name="ID" Type="xs:string" sql:field="ProductID" />

Right now I want to introduce an annotation called "mapped" which tells the bulk load that the field is not mapped to the database. So adding the sql:mapped="0" (or false if you’d rather use that) to an element then you’ll not get the field into the database.
So if we for example add the sql:mapped="0" to the element LanguageValues like this
<xs:element Name="LanguageValues" sql:mapped="0">  
then it will tell the bulk load to ignore all child structure under that element.
But seeing as I want the details imported, I’ll skip that part myself - but it is useful for debugging purpose as well, so it is important to know. 

So, the next problem is getting the details loaded. This is done with relationships.
So remove the mapped annotation and write the following in the top of your schema file. I personally prefer to keep the relationships in the top of my schema file for easier reference, but they can be added directly on the element. I’ll however only show the one way of doing it.
<xs:annotation
  <
xs:appinfo

    <
sql:relationship Name=
"Detail" 
                      
parent=
"BulkLoadedProduct" 
                      
parent-key=
"ProductID" 
                      
child=
"BulkLoadProductDetails" 
                      
child-key="ProductID" 
/> 
  </
xs:appinfo

</
xs:annotation>
  

This tells the bulk load to create a relationship between the two tables based on the parent/child key naming.
This relationship is then added to the LanguageValue element:
<xs:element maxOccurs="unbounded" Name="LanguageValue" 
      
sql:relation="BulkLoadProductDetails" sql:relationship="Detail">
  

One thing to be wary of is that the more relationships you have, i.e. a more complex XML structure, the slower your import will be. So if speed is important, make your XML import file simple as possible, then you can load big files very fast.

Finally, as I mentioned earlier, you can map elements to columns in two different ways. One was via the “field” annotation, however if the element is called the very same as the column in the target table relation, you do not need to add a field. So, for the fields Name and LangaugeCode, I do not need to add any “sql:field” annotation because the Bulk Load automatically will map it to the fields in the table when they’re named the same. This can simplify a schema file tremendously, so it is worth remembering.

Anyways, putting it all together and I end up with the following schema file.
Running the XML through the XML SQL Bulk Load with the scehma file, and I end up with a result which looks like this:

Result SQL selection from XML SQL Bulk Load
(click for larger view)

and now I’ve successfully imported the data from my XML into the database.

Sorting integers in a string using XQuery

by ASH June 29, 2009 18:13

I stumbled across the task of sorting numbers present in a comma separated string, based on their numerical value in SQL.
Curious as always I thought it would be possible to solve such a problem using XML and XQuery in SQL and indeed it was.

Given a numerical string, then in SQL Server 2005 and up, you can do something like this:
DECLARE @X XML ''

DECLARE @STR VARCHAR(255CAST(@X.query(
'for $i in (11, 9, 10)
                  order by $i
                  return fn:concat(xs:string($i), ",")'
AS VARCHAR(255
))
                  
SELECT LEFT(@STRLEN(@STR) -1)
  
(Remember, in SQL 2005, you need to put assignment on another line, this is 2008 syntax)

The output from this query will be the value: '9, 10, 11'

Now, it might not be special or terrible useful as such – but the point of the post is also much more that untraditional methods for solving problems exists, and one can think outside the box and solve many problems by doing so.

A problem with this is that if you want the input created dynamically - you either need to build the entire segment within a string and use EXEC, or perhaps you can use the sql:column and sql:variable to help you along.

Tags: , ,

SQL | XML

SQL XML Bulk Load - introduction

by ASH June 22, 2009 17:22

I’ve been working a great deal with SQL XML Bulk Import into SQL Server 2005 lately and wanted to document it for myself for later use. And perhaps others might find the findings useful, as the documentation isn't always very helpful.
So this means I’ll dedicate some blog posts to the XSD for the import, the syntax for the bulk import, the COM stream conversion, and how I've used the import, and various problems I’ve run into.
Even if this then will be very XML inspired, I’ve decided to post it here, instead of another blog, because it fits the SQL Server theme as well.

To start off soft and not make a too long blog post, I’ll provide the syntax for providing the import itself.
The object reference is created by adding a reference to "Microsoft XML BulkLoad for SQL Server 4.0 Type Library" in the version I’m using.
When using a reference, make sure that your thread is running in STA mode and not MTA. Then you’ll get an InvalidCastException (QueryInterface for interface SQLXMLBULKLOADLib.ISQLXMLBulkLoad failed). More about this can be read at MSDN.

The code for the import is in C# as following:

SQLXMLBULKLOADLib.SQLXMLBulkLoad4 _bulkload = new SQLXMLBULKLOADLib.SQLXMLBulkLoad4();
try 
{
            _bulkload.ConnectionString 
_connection
;
            
_bulkload.KeepIdentity = false
;
            
_bulkload.XMLFragment = true
;
            
_bulkload.ErrorLogFile ""
;
            
_bulkload.Execute(_XSDFile_DataFileOrCOMStream
);
}
catch (Exception ex

}
finally 
{
            
if ((_bulkload != null)) 
{
            _bulkload 
= null

}
  

The connection string needs to be an OLE DB connection string and can for example look like the following:
provider=SQLOLEDB;data source=[datasource];database=[database];integrated security=SSPI

KeepIdentity is important if you wish SQL Server to handle identities itself, otherwise set to true.
XML Fragment is used if you do not get a full XML document starting with the XML declaration tag.
Error log file is as the name suggest for the error file if something goes wrong. It is a good debugging tool, because otherwise the exception messages you get can be quite tricky.

Next blog instalment I’ll start with showing some XSD syntax for various XML files and go more hands on.

Powered by BlogEngine.NET 1.4.5.0
Theme by Mads Kristensen | Modified by Mooglegiant

About me

My real name is Allan Svelmøe Hansen.

I live in Denmark, where I work as a developer for hedal:kruse:brohus using SQL Server and the .NET framework since 2004.  My primary fields of expertise is back end data integration, database design and optimization. But I also work with website development as well as application/services for server and SEO of websites.

Disclaimer

The opinions expressed herein are my own personal opinions and thoughts and does not represent my employer's view in any way, nor are my results guarentees for all situations.

Content is presented “as is”, with no warranty.