SQL Server Central is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
Search:  
 
 

Pseudo-Random

Add to Technorati Favorites Add to Google
More Posts Next page »
All Posts

SQL 2008 RTM'd

By Michael Coles in Pseudo-Random 08-07-2008 11:22 AM | Categories: Filed under: , , ,
Rating: (not yet rated) Rate this |  Discuss | 226 Reads | 123 Reads in Last 30 Days |no comments

SQL Server 2008 was RTM'd yesterday (Aug. 6th).  It's currently available for download on MSDN and TechNet.  More info here: http://www.microsoft.com:80/presspass/press/2008/aug08/08-06SQLServer2008PR.mspx.


Why SELECT Queries Are Not DML

By Michael Coles in Pseudo-Random 03-08-2008 11:09 PM | Categories: Filed under: , , , ,
Rating: (not yet rated) Rate this |  Discuss | 559 Reads | 131 Reads in Last 30 Days |no comments

I had a recent conversation with someone about SELECT statements, and where they fit into the grand scheme of things. There's a common misconception that SELECT is a data manipulation language (DML) statement. I think this idea has been perpetuated somewhat by Books Online with entries that group SELECT queries into DML (like this one: http://msdn2.microsoft.com/en-us/library/ms177591.aspx). Not that BOL deserves all the blame; there are plenty of misled souls out there.

The SQL:2003 standard defines DML as INSERT, UPDATE, DELETE, and MERGE, plus some cursor-specific statements. To be fair there's also a specialized form of SELECT (SELECT ... INTO ...), but SELECT queries are classified in their own section of the standard as -- what else? -- "Queries".

The main difference between SELECT queries and DML is that DML statements by and large affect database state. INSERT, UPDATE, DELETE, and MERGE all modify data in the database. SELECT queries don't.

Another aspect to this is the "transactional nature" of DML statements and SELECT queries. Some believe that SELECT queries, like DML statements, are automatically wrapped by SQL Server in implicit transactions. The reasoning goes something like this: SELECT queries can initiate locks on rows, tables, etc., therefore they must have implicit transactions.

SQL Server does in fact wrap SELECT queries in transactions, but not the standard DML transactions that enforce ACID. SELECT queries are wrapped in read-only transactions that do not enforce ACID. While it's true that SQL Server locking mechanisms and isolation levels enforce the concept of isolation, DML read-write transactions serve a different purpose. Read-write transactions enforce atomicity and durability - two "ACID" concepts (it can probably be argued that SELECT queries enforce consistency to a degree as well, but that's another discussion). SELECT queries have zero durability and it can be easily proven that they are non-atomic in nature.

If you cut the power after a successful SELECT query, the results just sort of disappear. Durability is the assurance that after the power goes out the results of a successful transaction are persisted. So SELECT queries have no durability. Likewise, you can cancel a long-running SELECT query midstream and retrieve partial results, which is a direct violation of atomicity. If you cancel a long-running UPDATE statement, on the other hand, the entire update is rolled back. The bottom line is that SELECT queries use read-only transactions, locking, and isolation levels to enforce isolation, and they enforce consistency to some extent. Read-write transactions on the other hand, enforce the other half of acid - atomicity and durability, items not assured by isolation levels and locking mechanisms.

In the end it all sounds like another one of those fascinating trick interview questions we all love so much, doesn't it?

Add to Technorati Favorites


SQL Server 2008 Feb. CTP (CTP-6) Available

By Michael Coles in Pseudo-Random 02-20-2008 12:57 AM | Categories: Filed under: , , ,
Rating: (not yet rated) Rate this |  Discuss | 454 Reads | 122 Reads in Last 30 Days |no comments

Ladies and Gentlemen--Start Your VMs!  Microsoft has released SQL 2008 CTP 6 for download at the Connect Website: https://connect.microsoft.com/SQLServer/content/content.aspx?ContentID=5395.  There's a whole laundry list of new features in this CTP:

- iFTS
- filtered indexes
- MS Word rendering
- sparse column support
- SQL Audit
- data compression
- performance data collection
- and more...

The download is about 1.45 GB for the executable and this one is supposed to be "feature complete".  Apparently they're giving away an XBox or something in some bug-reporting contest.  I suppose you can only count on people to locate and report bugs to you for free for so many years.  Unfortunately the links to the additional information pages (like the big-reporting contest page) on the SQL 2008 download page appear to be broken right now.  Anyway, if you're interested in checking out the new features and functionality in 2008, now's your chance.

P.S. - I also checked CodePlex and it looks like there's no update to AdventureWorks for the Feb. CTP; the latest version of AdventureWorks is for the Nov. CTP. 

UPDATE:  The new version of AdventureWorks is now available at: http://www.codeplex.com/MSFTDBProdSamples/Release/ProjectReleases.aspx?ReleaseId=10901

 


Bye Bye FIPS... We Hardly Knew Ye...

By Michael Coles in Pseudo-Random 02-18-2008 1:55 PM | Categories: Filed under: , , , ,
Rating: (not yet rated) Rate this |  Discuss | 310 Reads | 120 Reads in Last 30 Days |no comments

SQL Server has an option called SET FIPS_FLAGGER that many developers don't even know about. This option warns you if your code does not comply with the SQL-92 levels of compliance specified by FIPS 127-2. This used to be a requirement if you wanted to sell your DBMS to the government. Effective February 5, 2008, FIPS 127-2 has been rescinded. Apparently the National Institute of Standards and Technology (NIST) wants to get out of the "software testing business".

So the question is now that FIPS 127-2 no longer exists, how long will it be before vendors stop testing for code compliance with a 16 year-old standard?  I'm sort of expecting to see a deprecation notice pop up on the SQL 2008 SET FIPS_FLAGGER page at some point: http://msdn2.microsoft.com/en-us/library/ms189781.aspx, although they may save deprecation for a future version of SQL Server.  On the other hand, it may get carried forward until it's testing against a 30-year-old standard...  I guess you never can tell.


This Just In...

By Michael Coles in Pseudo-Random 02-05-2008 10:55 AM | Categories: Filed under: , , ,
Rating: (not yet rated) Rate this |  Discuss | 321 Reads | 116 Reads in Last 30 Days |no comments

I was as surprised as anyone to watch the Giants beat the Pats on Sunday.  I take that back, apparently I wasn't as surprised as these folks:

 

I doubt I lost as much money as they probably did.  Hell, I'd buy a copy of this book just to have it as a collector's item.  Do you think they have any lying around in a warehouse somewhere?  "Sports Publishing", are you reading this?

Truman would be proud to see history repeating itself:

 

Of course that's not entirely fair since the Chicago Daily Tribune didn't cover their rear end with a disclaimer like "To be published only in the event of a win."  What a way to start the day :)

FOLLOW-UP:  It looks like the publisher has replaced this initial publication with a new "Giants World Champions" book sometime between last night and this morning at http://www.amazon.com/s/ref=nb_ss_gw/102-7534258-3982567?url=search-alias%3Daps&field-keywords=19+-+0.

ONE MORE TIME: Apparently you can still purchase the title "New England Patriots: 2008 Super Bowl Champions" on Amazon, at http://www.amazon.com/New-England-Patriots-Super-Champions/dp/1596703067/ref=sr_1_1?ie=UTF8&s=books&qid=1202366284&sr=1-1. Surely not as good a read as "19 - 0", and the title's not as catchy, but I think I need one for my bookshelf.


Stop Using OPENXML (Please...)

By Michael Coles in Pseudo-Random 01-20-2008 8:33 PM | Categories: Filed under: , , ,
Rating: (not yet rated) Rate this |  Discuss | 1,025 Reads | 192 Reads in Last 30 Days |2 comment(s)

As we all know by now, the xml data type is one of the new features introduced in SQL Server 2005. The xml data type has a handy feature for shredding XML data, the nodes() method; but based on newsgroup postings and various articles on SQL Server 2005 XML it seems that developers are still stuck in an OPENXML state of mind. OPENXML is a rowset provider that was sort of thrown onto the tail end of SQL Server 2000, along with a couple of XML-specific system stored procedures, to implement XML document shredding. OPENXML shredding, circa 2000, looked like this:

DECLARE @xml VARCHAR(8000);

-- Character representation of the XML
SET @xml = '<capitals>
<state name="Alabama"
  abbreviation="AL"
 
 capital="Montgomery"
 
flag="AL.gif"
  date="December 14, 1819"
  fact="Rosa Parks refused to give up her seat on a Montgomery bus in 1955. The Montgomery Bus Boycott kicked off the Civil Rights era a few days later."
 
address="600 Dexter Ave"
  zip="36130"
  long="-86.301963"
  lat="32.377189" />
<state name="Alaska"
  abbreviation="AK"
  capital="Juneau"
  flag="AK.gif"
  date="January 3, 1959"
  fact="In 1867 United States Secretary of State William H. Seward offered Russia $7,200,000, or two cents per acre, for Alaska."
  address="120 4th Street"
  zip="99801"
  long="-134.410699"
  lat="58.301072" />
<state name="Arizona"
  abbreviation="AZ"
  capital="Phoenix"
  flag="AZ.gif"
  date="February 14, 1912"
  fact="Arizona is home of the Grand Canyon National Park."
  address="1700 West Washington St"
  zip="85007"
  long="-112.095704"
  lat="33.448543" />
</capitals>'
;

-- Internal handle for the XML DOM document
DECLARE @iDoc INT;

-- Create an XML DOM representation of the document
EXEC sp_xml_preparedocument @iDoc OUTPUT, @xml;

-- Shred the document with OPENXML
SELECT *
FROM OPENXML(@iDoc, '/capitals/state')
WITH (Name VARCHAR(100) '@name',
  Abbreviation VARCHAR(2) '@abbreviation',
  Capital VARCHAR(100) '@capital',
  Flag VARCHAR(100) '@flag',
  Date VARCHAR(100) '@date',
  Fact VARCHAR(500) '@fact',
  Address VARCHAR(100) '@address',
  ZIP VARCHAR(5) '@zip',
  Long FLOAT '@long',
  Lat FLOAT '@lat');

-- You have to manually remove the DOM document from memory
EXEC sp_xml_removedocument @iDoc;

The XML document used here is borrowed from a sample I created for an article a while back. Apart from the complexity involved with the OPENXML syntax, on SQL 2000 you also had to manually manage memory allocation. The sp_xml_preparedocument procedure invoked MSXML through COM, and automatically allocated 1/8th of the server's total memory to the XML cache. If your server has 2 GB of RAM, the short sample above will allocate 250 MB (!) to process the 1,200 byte document. At the end of processing, you have to call sp_xml_removedocument to free up the allocated memory. Dealing with XML documents larger than 8,000 bytes on SQL 2000 was a seriously convoluted undertaking. The xml data type nodes() method eliminates the sp_xml_preparedocument and sp_xml_removedocument steps, while easily handling very large documents (up to 2.1 GB). The xml data type nodes() method is shown below:

DECLARE @xml XML;

-- Populate the XML variable
SET @xml = '<capitals>
<state name="Alabama"
  abbreviation="AL"
 
 capital="Montgomery"
 
flag="AL.gif"
  date="December 14, 1819"
  fact="Rosa Parks refused to give up her seat on a Montgomery bus in 1955. The Montgomery Bus Boycott kicked off the Civil Rights era a few days later."
 
address="600 Dexter Ave"
  zip="36130"
  long="-86.301963"
  lat="32.377189" />
<state name="Alaska"
  abbreviation="AK"
  capital="Juneau"
  flag="AK.gif"
  date="January 3, 1959"
  fact="In 1867 United States Secretary of State William H. Seward offered Russia $7,200,000, or two cents per acre, for Alaska."
  address="120 4th Street"
  zip="99801"
  long="-134.410699"
  lat="58.301072" />
<state name="Arizona"
  abbreviation="AZ"
  capital="Phoenix"
  flag="AZ.gif"
  date="February 14, 1912"
  fact="Arizona is home of the Grand Canyon National Park."
  address="1700 West Washington St"
  zip="85007"
  long="-112.095704"
  lat="33.448543" />
</capitals>'
;

SELECT Node.value('@name', 'VARCHAR(100)') AS Name,
  Node.value('@abbreviation', 'VARCHAR(2)') AS Abbreviation,
 
Node.value('@capital', 'VARCHAR(100)') AS Capital,
  Node.value('@flag', 'VARCHAR(100)') AS Flag,
  Node.value('@date', 'VARCHAR(100)') AS Date,
  Node.value('@address', 'VARCHAR(100)') AS Address,
  Node.value('@zip', 'VARCHAR(5)') AS ZIP,
  Node.value('@long', 'FLOAT') AS Long,
  Node.value('@lat', 'FLOAT') AS Lat
FROM @xml.nodes('/capitals/state') TempXML (Node);

This produces the exact same results as the OPENXML method, but the syntax is much cleaner and the code is more manageable. You also don't have to worry about COM and the manual memory management.

The nodes() method simply generates a tabular result set that contains a row for each XML node that matches the path expression. In this example it returns a row for each '/capitals/state' node. Then the value() method is used to retrieve scalar values for each attribute in the row. You could also use the query() method to retrieve nodes instead of scalar values.

As for speed, the difference between using the xml data type nodes() method and OPENXML is negligible. If you're shredding XML data stored in a table, however, a primary XML index cuts the processing time down by nearly 50%.

For more manageable code and more robust and scalable applications, use the nodes() method to shred your XML instead of OPENXML.


Stoopid Tech Interviewer Tricks

By Michael Coles in Pseudo-Random 01-16-2008 1:19 AM | Categories: Filed under: , , , ,
Rating: (not yet rated) Rate this |  Discuss | 2,284 Reads | 187 Reads in Last 30 Days |9 comment(s)

OK, so you’ve got your suit fresh from the drycleaners. Your shirt is pressed and your shoes are shined. You’ve gone over the job requirements a dozen times, and even took the time to scan BOL for obscure tips the interviewer might try to turn into trick questions. All that hard work and now you find yourself on the losing end of a conversation like this:

 

Interviewer: “Tell me about your favorite new feature of SQL Server 2005.”

You: “I like the new XML data type, I’ve found that...”

Interviewer: “That’s not a new feature. That was around in SQL 2000.”

 

[‘Nuff said.]

 

So what do you do?  As I told my friend (the job applicant) my first instinct would have been to stand up, shake the interviewer’s hand, and politely tell him that the interview was over.  My friend, however, has a much higher threshold for mental anguish than I do.

 

Just for grins, here are some more recent gems from various friends on tech. interviews:

 

Interviewer: “What’s the difference between a temporary variable and a temporary table?”

Applicant: “The temporary variable has well-defined scope, they cause fewer stored procedure recompilations than temporary tables, and they are not affected by transaction rollbacks.”

Interviewer: “You left out the most important part – table variables are always stored in memory, they don’t clog up tempdb!”

 

[Yes, he used the highly technical term "clog up". Microsoft weighs into the discussion with the following from http://support.microsoft.com/default.aspx/kb/305977: A table variable is not a memory-only structure. Because a table variable might hold more data than can fit in memory, it has to have a place on disk to store data. Table variables are created in the tempdb database similar to temporary tables. If memory is available, both table variables and temporary tables are created and processed while in memory (data cache).”]

 

Interviewer: “Why do we normalize our tables?”

Applicant: “To preserve data integrity.”

Interviewer: “No, that’s not right.”

 

[No further explanation was apparently necessary or forthcoming from this idiot. Dr. Codd had a few words on this in his paper "Further Normalization of the Data Base Relational Model". One of his objectives of normalization was to free the database from certain insert, update, and delete anomalies. Sounds like a data integrity issue to me, but I ain't no hiring manager...]

 

Interviewer: “What’s the difference between using DELETE to delete all rows from a table and using TRUNCATE?”

Applicant: “DELETE logs every row as it’s being deleted, TRUNCATE logs page deallocations so TRUNCATE executes much faster. Oh, and TRUNCATE reseeds the IDENTITY column if the table has one.”

Interviewer: “No. TRUNCATE is not logged.”

 

[Microsoft’s take on the situation is at http://technet.microsoft.com/en-us/library/ms177570.aspx: “TRUNCATE TABLE removes the data by deallocating the data pages used to store the table data and records only the page deallocations in the transaction log.” You be the judge.]

 

Interviewer: “Why does BULK INSERT provide better performance than individual INSERT statements?”

Applicant: “For one thing if you’re loading 1,000,000 rows you’re not running 1,000,000 individual INSERT statements. Also BULK INSERT queues up batches of rows and commits them in batches. BULK INSERT can also take advantage of minimal logging if your database is configured appropriately.”

Interviewer: “Wrong, BULK INSERT doesn’t write to the transaction log!”

 

[Hmmm, Microsoft seems to think it does. In reference to BULK INSERT with minimal logging “...the Database Engine still logs extent allocations each time a new extent is allocated to the table.” If your database is in full recovery mode, “all row-insert operations that are performed by bulk import are fully logged in the transaction log.”]


"Exact Numerics" Re-visited

By Michael Coles in Pseudo-Random 10-21-2007 12:12 AM | Categories: Filed under: , , , ,
Rating: (not yet rated) Rate this |  Discuss | 1,168 Reads | 125 Reads in Last 30 Days |8 comment(s)

Wow, seems like I have a lot to write about today.  I just noticed a blog by SQL Server guru Hugo Kornelis about the difference between "exact" and "approximate" numeric data types.  (Hugo's blog entry is here:  http://sqlblog.com/blogs/hugo_kornelis/archive/2007/10/17/so-called-exact-numerics-are-not-at-all-exact.aspx).  This is a topic that seems to pop up on the radar regularly on the SQL Server newsgroups, and Hugo attempts to dispel the myth of "exact" data types.  I appreciate Hugo's opinion on the matter, but I do disagree with some of his points and conclusions, as I'll try to explain below.  For those who aren't familiar with the differences between "exact" and "approximate" data types, here's the short version:

* Approximate data types are the FLOAT and REAL data types.  These are floating point data types, and they are stored and manipulated according to the IEEE-754 IEEE Standard for Binary Floating-Point Arithmetic.  (A good overview of IEEE-754 is available here: http://en.wikipedia.org/wiki/IEEE_floating-point_standard).

* Exact data types include the integer types (INT, SMALLINT, BIGINT, etc.) and fixed-point NUMERIC/DECIMAL types.  Exact NUMERIC and DECIMAL data types are stored and manipulated in a Binary-Coded Decimal (BCD for all you old school Assembly Language programmers out there), or possibly packed BCD format.

Approximate data types are converted to the IEEE-754 exponent-biased sign-magnitude format for storage and manipulation.  This can result in significant, often unexpected rounding errors during processing and computation.  Exact data types, however, are guaranteed to store an exact representation of any decimal value (this is a very important concept - contrast with the word fraction) with precision and scale equal to, or less than, the precision and scale specified for the data type.  In other words, if I store the constant value 1234.5678 in a NUMERIC(10, 4) variable or column, there will be no rounding errors or data loss during the assignment and storage process.  FLOAT and REAL do not make this guarantee.  For instance, consider the following code:

DECLARE @f REAL;
SET @f = 1234.5678
;
SELECT CAST(@f AS NUMERIC(20, 15));

This query returns 1234.567749023437500, due to the approximate REAL representation.  As Hugo said, however, exact numeric types can also suffer from rounding errors.  Here are the four situations in which exact numeric types can be rounded:

1) If you try to store a decimal value with greater scale than the declared NUMERIC or DECIMAL data type the value will be rounded to fit.  Consider the following code:

DECLARE @d DECIMAL(10, 4);
SET @d = 1234.56785;
SELECT @d;

The DECIMAL variable is declared with 4 digits after the decimal point, but we're trying to assign 5 digits after the decimal point.  SQL Server resolves this issue by rounding the assigned value up to 1234.5679.

2) If you try to store an irrational number like the √2 or π or a recurring decimal like 0.333333... in an exact data type.  It seems obvious that you cannot store a number with an infinite number of digits in the finite memory of a computer.

General rule of thumb #0 - You cannot store 100 gallons of water in a 2 liter bottle.

3) If you perform a calculation that results in an irrational number, a recurring decimal, or a result with greater scale than was declared for the exact numeric type it will be rounded before being stored.  This includes calculations like 1/3, 22/7, etc.  I would challenge anyone with a lot of spare time and nothing better to do to provide an exact decimal value for 1/3; once it's delivered, I'll personally demonstrate how to put that exact decimal value in a SQL Server exact numeric type for them with no rounding issues.

General rule of thumb #1 - A decimal number is not a fraction, although decimal numbers and fractions can be converted to one another.  This process often requires rounding of some sort, whether you do it with a computer or with a pen and paper.

4) If you CAST or CONVERT a numeric value to another data type with lower scale, it will be rounded according to the rules described in BOL.  If you use SQL rounding fuctions like ROUND your numeric value will be rounded as well.

Another misconception that people seem to have is that the exact numeric data types in SQL are designed to store something other than decimal data.  People throw out fractions like 1/3 and 1/6, and then claim that SQL exact numeric types are not exact because they can't store them.  SQL exact numeric types are not designed to store or manipulate fractional representations of numbers.  They are designed to store and manipulate integer and fixed-point decimal representations of data; therefore they suffer from the same limitations as the decimal system.  One of those limitations is that there are certain fractions, irrational numbers, and recurring decimals that absolutely cannot be represented exactly in the decimal system.

General rule of thumb #2 - if you can't represent a decimal number exactly with a pen and paper, then a computer cannot accurately represent it either.

The main problem here is more semantic than anything else.  As Hugo points out:

It [SQL Server Books Online] also says that “floating point data is approximate; therefore, not all values in the data type range can be represented exactly”, thereby suggesting that other numeric data types are capable of representing all values in the data type range. The latter is of course not true, for there is no way that values such as 1/3, π, or √2 can ever be represented exactly in any of SQL Server’s data types.

This makes a big assumption that fractions and irrational numbers are part of the "data type range" for a given exact numeric type.  However, exact numeric data type ranges are specifically decimal numbers with the precision and scale specified for the data type.  No more, no less.  A TINYINT, for example, can store the numbers 0 to 255 with no decimal places.  Does its inability to store 254.8765463 or -12345 make the TINYINT data type "inexact"?  Similarly, the DECIMAL (2, 1) data type can store values between -9.9 and 9.9.  Does its inability to store -10.0 or -9.987654321 make it "inexact"?  What about complex numbers?  Like fractions, they represent a whole class of numbers that can't be represented in SQL Server's numeric data types - so how does that fit into the exactness formula?

General rule of thumb #3 - SQL data types cannot be more exact than the subset of the decimal system they are designed to represent.

There are also some who are upset that the SQL Server Books Online documentation team chose to use the words "exact" and "approximate" to describe these data types.  They seem to think it is an error that Microsoft should resolve.  That probably won't happen, however, since the BOL team took their cue from the ISO SQL Standard, which specifically defines these data types as - what else? - "exact" and "approximate".  Of course the ISO SQL Standard borrowed their terminology from the IEEE-754 standard, which differentiates between "exact" and "approximate" numeric types as well.  So it's unlikely Microsoft will change their terminology anytime in the near future.


The Problem With Joins

By Michael Coles in Pseudo-Random 10-20-2007 2:46 PM | Categories: Filed under: , , , ,
Rating: (not yet rated) Rate this |  Discuss | 750 Reads | 117 Reads in Last 30 Days |no comments

I recently came up against a pretty big problem with outer joins on SQL Server, and thought I'd talk about it a bit.  First let me give you a little background on the problem.

<Begin Background>

It's common in data warehousing and datamarts to "band" data.  By "banding" I'm referring to the practice of assigning identifiers to ranges of values, like dollar amounts.  As a simple example, you might classify your sales dollar amounts into four categories like the ones represented in this sample AmountRange table:

AmountRange Table
Id   MinAmount    MaxAmount
--   ---------    ---------
1         0.00         5.00
2         5.00        10.00
3        10.00        25.00
4        25.00        50.00
5        50.00       100.00

The idea is to determine the pair of (MinAmount, MaxAmount) that a given dollar amount falls between and return that Id number.  A simple WHERE clause to do this might look like the following:

WHERE SalesAmount >= AmountRange.MinAmount
  AND SalesAmount < AmountRange.MaxAmount

A dollar amount of 8.17 will get an Id of 2, while 58.93 will return an Id of 5.

<End Background>

So back to the problem at hand.  I was attempting to do this for two dollar amount values using LEFT OUTER JOINs.  The query I used looked like the following:

SELECT p.Id AS ProductId,
 
cr.Id AS CostRangeId,
  pr.Id AS PriceRangeId
FROM
Products p
LEFT
OUTER JOIN AmountRange cr
  ON p.Cost >= cr.
MinAmount
  AND p.Cost < cr.MaxAmount
LEFT
OUTER JOIN AmountRange pr
  ON p.Price >= cr.
MinAmount
  AND p.Price < cr.MaxAmount;

With a Products table that contains 19,000,000 rows, SQL Server comes up with a query plan like this (edited for readability):

|--Nested Loops(Left Outer Join, WHERE:([p].[Price]>=[pr].[MinAmount] AND [p].[Price]<[pr].[MaxAmount])
  
|--Nested Loops(Left Outer Join, WHERE:([p].[Cost]>=[cr].[MinAmount] AND [p].[Cost]<[cr].[MaxAmount])
   | |--Clustered Index Scan(OBJECT:([p].[PK_Products]))
   | |--Clustered Index Scan(OBJECT:([cr].[PK_AmountRange]))
   |--Clustered Index Scan(OBJECT:([pr].[PK_AmountRange]))

Not a bad plan overall, until you start looking at the cost and cardinality estimates.  Cardinality estimates are the optimizer's estimates of how many rows are generated between steps and for the final result set.  In this case we know that we started with 19,000,000 rows, and we will end up with 19,000,000 rows.  But SQL Server estimates the first join will generate an intermediate result set of 26,576,500 rows, and the second join will generate a final result set of 31,360,300 rows!  The total cost is 899.479.

Apparently SQL Server does not keep good enough statistics to figure out that the ranges defined in the AmountRange table do not overlap one another.  While we know that LEFT OUTER JOINs like these could only generate more rows than we started with if the ranges in the AmountRange table actually did overlap.  The SQL optimizer apparently isn't smart enough to figure this out.

So what do we do about it?

Well, we can use subqueries instead, like this:

SELECT p.Id AS ProductId,
  (SELECT Id
   FROM
AmountRange
   WHERE p.Cost >=
MinAmount
     AND p.Cost < MaxAmount),
  (SELECT Id
   FROM
AmountRange
   WHERE p.Price >=
MinAmount
     AND p.Price < MaxAmount)
FROM Products p;

The query plan for the updated query looks like this (again edited for readability):

|--Compute Scalar(DEFINE:([Expr1013]=[Expr1013], [Expr1015]=[Expr1015]))
   |--Nested Loops(Left Outer Join, OUTER REFERENCES:([p].[Price]))
   |--Nested Loops(Left Outer Join, OUTER REFERENCES:([p].[Cost]))
   | |--Clustered Index Scan(OBJECT:([p].[PK_Products]))
   | |--Assert(WHERE:(CASE WHEN [Expr1012]>(1) THEN (0) ELSE NULL END))
   | |--Stream Aggregate(DEFINE:([Expr1012]=Count(*), [Expr1013]=ANY([cr].[Id])))
  
| |--Clustered Index Scan(OBJECT:([cr].[PK_AmountRange]), WHERE:([p].[Cost]>=[cr].[MinAmount] AND [p].[Cost]<[cr].[MaxAmount]))
   |--Assert(WHERE:(CASE WHEN [Expr1014]>(1) THEN (0) ELSE NULL END))
   |--Stream Aggregate(DEFINE:([Expr1014]=Count(*), [Expr1015]=ANY([pr].[Id])))
   |--Clustered Index Scan(OBJECT:([pr].[PK_AmountRange]), WHERE:([p].[Price]>=[pr].[MinAmount] AND [p].[Price]<[pr].[MaxAmount]))

The total cost for this plan is 516.937, significantly lower than the previous version.  The updated query plan is also notable for its accurate cardinality estimates.  At every step of the process SQL Server estimates 19,000,000 rows of input, 19,000,000 rows of output, which matches up with our expectations.  The better estimates that come with the second version help SQL Server make better decisions along the way and help it come up with a much better query plan.

If you have to perform a "banding" function like this, I highly recommend considering using subqueries instead of outer joins.


SSIS Custom Logging the Easy Way

By Michael Coles in Pseudo-Random 10-09-2007 10:44 PM | Categories: Filed under: , , ,
Rating: |  Discuss | 4,460 Reads | 232 Reads in Last 30 Days |2 comment(s)

SSIS provides several log providers, including the Log Provider for SQL Server (my personal favorite).  I decided pretty quickly that I needed to customize my logging capabilities.  There are two main methods for custom logging in SSIS:

  • Create a custom log provider in managed code, as detailed on MSDN.
  • Use event handlers in SSIS to implement custom logging.

The custom log provider option requires you to implement some .NET interfaces, compile, sign your assembly with a strong-name key, register the assembly with the GAC, etc.  Based on the deadlines I'm up against, and the extensive testing required before I could move something like this into production, this was not really option for me.

Using event handlers is an option described in detail by Jamie Thomson at http://blogs.conchango.com/jamiethomson/archive/2005/06/11/SSIS_3A00_-Custom-Logging-Using-Event-Handlers.aspx.  This method has some limitations, however.  No fault of Jamie's, but SSIS makes this a little more complex than it needs to be.  Some of the items involved are downright counterintuitive (variable scope issues with the OnVariableValueChanged event, for instance).  The custom log providers, and even stock log providers, provide more information than is accessible through event handlers.

I toyed with both options before coming up with option number 3 for the stock Log Provider for SQL Server.  The SSIS Log Provider for SQL Server writes entries to the sysdtslog90 table of the target database.  The log provider creates a stored procedure named sp_dts_addlogentry, which it calls to insert each row into the log table.  Option 3 is just to simply modify the sp_dts_addlogentry procedure to perform custom logging/auditing to your specifications.  In the example below I'm still writing the original log entry to the sysdtslog90 table, but I've extended the stored procedure to write entries to a second table called AuditPackage.  The AuditPackage table stores a summary of each package as it executes.

CREATE TABLE AuditPackage (Id INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
  PackageName VARCHAR(100) NOT NULL,
  PackageGuid UNIQUEIDENTIFIER,
  ExecutionGuid UNIQUEIDENTIFIER,
  StartTime DATETIME,
  EndTime DATETIME,
  ElapsedTime INT,
  Status VARCHAR(100));
GO

CREATE
PROCEDURE dbo.sp_dts_addlogentry @event sysname,
 
@computer nvarchar(128),
 
@operator nvarchar(128),
 
@source nvarchar(1024),
 
@sourceid uniqueidentifier,
 
@executionid uniqueidentifier,
 
@starttime datetime,
 
@endtime datetime,
 
@datacode int,
 
@databytes image,
 
@message nvarchar(2048)
AS
BEGIN
 
INSERT INTO sysdtslog90 (event, computer, operator, source, sourceid, executionid, starttime, endtime, datacode, databytes, message)
 
VALUES (@event, @computer, @operator, @source, @sourceid, @executionid, @starttime, @endtime, @datacode, @databytes, @message);

  INSERT INTO AuditPackage (PackageName, PackageGuid, ExecutionGuid, StartTime, ElapsedTime)
 
SELECT @source, @sourceid, @executionid, GETDATE(), 0
 
WHERE (@event = 'PackageStart');

  UPDATE
AuditPackage
 
SET EndTime = GETDATE(),
   
ElapsedTime = DATEDIFF(ms, StartTime, GETDATE()),
   
Status = 'Complete'
  WHERE (@event = 'PackageEnd'
   
AND PackageGuid = @sourceid
   
AND ExecutionGuid = @executionid);

  UPDATE AuditPackage
  SET Status = 'Error'
  WHERE (@event = 'OnError'
   
AND PackageGuid = @sourceid
    AND ExecutionGuid = @executionid);
END

On the SSIS side you need to enable the Log Provider for SQL Server on your packages, and select the events you want to capture.  Some possible extensions to this method include grabbing the OnPipelineRowsSent event and parsing the message to get more information for each data flow component, or logging other events like OnWarning, etc.  You could also add additional code to fire off a call to send email via database mail for errors or to write certain events to the Windows Event Log.

NOTE:  Just for clarification, the sp_dts_addlogentry procedure is *not* a "system" stored procedure.  It's simply a stored procedure that is created by the SSIS Log Provider for SQL Server if it does not currently exist in the database you are writing to.  If the procedure does exist, it is not overwritten by the Log Provier for SQL Server.  Appropriate care should be taken not to mess up your procedure by introducing logic or other errors into the procedure.  On the other hand, if you do mess something up beyond repair the Log Provider will recreate the procedure if you drop it.


The Great SQL 2005 SP:CacheMiss-tery

By Michael Coles in Pseudo-Random 08-09-2006 1:41 AM | Categories: Filed under: , ,
Rating: (not yet rated) Rate this |  Discuss | 1,393 Reads | 120 Reads in Last 30 Days |no comments

OK, so I'm using SQL Server Profiler on SQL 2005 to determine for myself if the rules for stored procedure name resolution are the same as for SQL 2000.  So like any curious SQL programmer I jump in with a simple test.  I ran the following script in SQL 2000:

CREATE PROCEDURE dbo.Test
AS
SELECT 'Test';
GO

EXEC dbo.Test;
GO

I then ran the EXEC dbo.Test line several times in a row.  The results were pretty much as I expected in SQL Profiler (abbreviated for brevity):

SP:CacheMiss
SP:CacheInsert
...
SP:ExecContextHit
...
SP:ExecContextHit
...

I then ran the exact same script in SQL 2005 with SQL Server Profiler running.  Like before I ran EXEC dbo.Test several times in a row.  The results were a little surprising:

SP:CacheMiss
SP:CacheInsert
...
SP:CacheMiss
SP:CacheHit
...
SP:CacheMiss
SP:CacheHit
...

So why all the SP:CacheMiss events when we know that dbo.Test is in the cache?  It turns out that dbo.Test is in the cache, but the cost of an ad hoc query (a single SELECT, INSERT, UPDATE or DELETE statement - or apparently in this case an EXEC statement) is zero, so the EXEC by itself is not cached.  A more complex batch eliminates all of the SP:CacheMiss events:

EXEC dbo.Test;
SELECT TOP 10 * FROM Person.Contact
;
GO

Putting a SELECT statement in the same batch as the EXEC eliminates the SP:CacheMiss events.  Miss-tery solved!  Obviously this isn't a cure-all or an option for every case, but it made me aware that the cost of an ad hoc query is zero which prevents it from being cached.  A batch with multiple statements, however, has a cost associated which causes SQL Server 2005 to cache the query plan.

A couple of other things to know about the cache:

- The cache is case-sensitive (regardless of your database or server collation).  Using the name DBO.TEST will cause an SP:CacheInsert since we previously called it with dbo.Test.

- The cache requires an exact match.  Even extra white-space in your queries will affect your cache performance and cause SP:CacheInserts.

- SQL 2005 has several improvements in caching including differences in "sufficient difference" and recompilation thresholds from SQL 2000, forced parameterization, parameter sniffing, etc.

 


DUDE!? Where's my Upsert?

By Michael Coles in Pseudo-Random 05-22-2006 12:56 PM | Categories: Filed under: , ,
Rating: (not yet rated) Rate this |  Discuss | 1,224 Reads | 105 Reads in Last 30 Days |1 comment(s)

In the build-up to the release of SQL 2005, some great new features were introduced.  One of these was the implementation of the mythical "Upsert" ("Update or Insert") statement.  ANSI SQL:2003 defines this as the MERGE statement.  When I saw this I envisioned the end to convoluted upsert stored procedures that 1) SELECT to verify if a row already exists in a table, 2) Uses an IF statement to decide whether to INSERT or UPDATE the row depending on the result of step 1.

Microsoft introduced a MERGE statement with the following syntax:

MERGE INTO DestinationTable
     USING SourceTable
          ON SourceTable.ColumnX = DestinationTable.ColumnX
WHEN MATCHED THEN
     UPDATE ColumnY = SourceTable.ColumnY
WHEN NOT MATCHED THEN
     INSERT VALUES(SourceTable.ColumnX, SourceTable.ColumnY)

Touted in several books and blogs prior to the RTM release, the MERGE statement was unceremoniously dropped sometime prior to RTM.  It's a shame to drop such a useful piece of functionality, already implemented in competing products (Oracle, DB2).  I wonder why it was dropped...  there doesn't appear to be anything on Microsoft's website explaining it.  Until they bring it back, it looks like it's back to the SELECT...IF...UPDATE...ELSE...INSERT method.


SQL Basics: Part 2, Nonclustered Indexes

By Michael Coles in Pseudo-Random 05-15-2006 12:56 AM | Categories: Filed under: , , ,
Rating: (not yet rated) Rate this |  Discuss | 1,580 Reads | 131 Reads in Last 30 Days |1 comment(s)

Nonclustered Indexes

If you caught the first article in the series, you know that clustered indexes are pretty handy.  But sometimes you need to rapidly retrieve data without physically re-ordering your table.  Going back to the clustered index analogy, some books (phone books, dictionaries, etc.) benefit from having their content ordered alphabetically -- but most do not.  To demonstrate, I've reorganized the words in Act I of Shakespeare's Hamlet in alphabetical order -- here are the first few words:

Fig. 4. Act I of Shakespeare's Hamlet in alphabetical order

As you can see it has the same words as the original, but it's even harder to read (some of us might consider making Shakespeare harder to read a real feat).  Fortunately for us, some really smart guys came up with another method of indexing a long time ago.  The "book index" is an alphabetically ordered listing of words at the back of the book.  With a book index, you first locate the word you're searching for in the list.  Each word has page numbers listed next to it indicating the pages on which it can be found.  You then turn the book to the referenced page(s) to read the sentences that contain the word in question.  SQL Server's analogous indexing mechanism is called the nonclustered index.

Unlike a clustered index, nonclustered indexes are stored separately from the main table.  These indexes are not dependent on the physical ordering of the table.  Nonclustered indexes require a trade-off between storage and data retrieval efficiency.  Here are some of the factors to take into consideration:

  • Clustered indexes are automatically added to all non-clustered indexes, creating what is called a covering index.  A covering index helps reduce bookmark lookups as we’ll see below.
  • Nonclustered indexes can take up considerably more space than a clustered index, since they represent copies of the specified columns from the table, plus the nonclustered index.
  • You can also expect to encounter "bookmark lookups" when using nonclustered indexes.  Considering the book ind