SQL Server Central is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
Search:  
 
 

adam machanic :: sql server programming

Add to Technorati Favorites Add to Google
February 2006 - Posts

Running sums, redux

By Adam Machanic in adam machanic :: sql server programming 02-28-2006 3:03 AM | Categories:
Rating: (not yet rated) Rate this |  Discuss | 2,269 Reads | 106 Reads in Last 30 Days |3 comment(s)
Siddhartha Gautama, the Buddha, taught us to understand that the key to enlightenment is following the Middle Path.  And today I learned a valuable lesson in extremes.  You can file this one in the "Doh!  Wrong again!" category...

A fairly common question on SQL Server forums is, "how can I get the running sum of the data in this column?"  Being the fan of set-based queries that I am, I always answer the exact same way.  I show the person asking the question how to do a self-join on the grouped column, getting all of the "previous" values to create a running sum.  The following example shows how you might do this against the AdventureWorks Production.TransactionHistory table:

SELECT
    TH1.TransactionID,
    TH1.ActualCost,
    SUM(TH2.ActualCost) AS RunningTotal
FROM Production.TransactionHistory TH1
JOIN Production.TransactionHistory TH2 ON TH2.TransactionID <= TH1.TransactionID
GROUP BY TH1.TransactionID, TH1.ActualCost
ORDER BY TH1.TransactionID

Pretty simple query.  For each row of the "TH1" alias, every row with a lesser-or-equal TransactionID will be summed.  Thereby creating a running total for every row of the table.  Note, I've used the IDENTITY column instead of the date column.  I'd generally suggest not doing so because, e.g., you might need to insert some post-dated rows at some point and relying on the IDENTITY for a time sequence will thereby not work.  But in this case it's a lazy solution because the TransactionDate column isn't indexed, and it's also not unique.  I want to test a lot of rows (TransactionHistory has around 113,000), but I don't want to skew the test by forcing a table scan on every iteration!

But I digress.  The point is, I've given this answer more than a few times and, well, I'd like to apologize.  Just now I went ahead and ran this query on my powerful test server--err, my laptop. 

As you might guess, since I'm performance-minded I also happen to be extremely impatient--so I went ahead and killed the query at the five-minute mark.  SSMS's result grid showed the first 26,568 rows, so obviously there was a long way to go to hit that 113,000 mark.  And with an estimated cost of 38,086 for the query, I can't say I'm surprised.

A few moments of head scratching and the following re-write was issued:

SELECT
    TH1.TransactionID,
    TH1.ActualCost,
    (
        SELECT SUM(TH2.ActualCost)
        FROM Production.TransactionHistory TH2
        WHERE TH2.TransactionID <= TH1.TransactionID
    ) AS RunningTotal
FROM Production.TransactionHistory TH1
ORDER BY TH1.TransactionID

With an estimated cost of only 6,630, I had high hopes for this one.  Alas, once again I was forced to cancel the query at the five-minute mark.  27,683 rows.  Not much better, I'm afraid.  And, as an aside, I'm starting to wonder about these estimated costs.  But that's another post for another day.

So where am I going with all of this?  Well, there's a reason I haven't given any indication up until this point in the post.  You see, it's utterly painful to write this, but...

In this case, a cursor is faster.

Yes, I said it.  That evil construct which we as database developers despise, the cursor.  Thanks to Paul Nielsen, who revealed this ugly fact to me in a conversation today, I was forced to test this for myself (hoping to prove him wrong, of course).  Which is why I started playing around with the solution that I've given so many times on forums.  Unfortunately, he is correct.

My next test query, using the first cursor I've written in several years:

DECLARE RunningTotalCursor
CURSOR LOCAL FAST_FORWARD FOR
    SELECT TransactionID, ActualCost
    FROM Production.TransactionHistory
    ORDER BY TransactionID

OPEN RunningTotalCursor

DECLARE @TransactionID INT
DECLARE @ActualCost MONEY

DECLARE @RunningTotal MONEY
SET @RunningTotal = 0

DECLARE @Results TABLE
(
    TransactionID INT NOT NULL PRIMARY KEY,
    ActualCost MONEY,
    RunningTotal MONEY
)

FETCH NEXT FROM RunningTotalCursor
INTO @TransactionID, @ActualCost

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @RunningTotal = @RunningTotal + @ActualCost

    INSERT @Results
    VALUES (@TransactionID, @ActualCost, @RunningTotal)

    FETCH NEXT FROM RunningTotalCursor
    INTO @TransactionID, @ActualCost
END

CLOSE RunningTotalCursor

DEALLOCATE RunningTotalCursor

SELECT *
FROM @Results
ORDER BY TransactionID

What's really unfortunate about the cursor approach is that you need to use a temporary table if you want to return a single result set to the client. I figured the additional I/O due to the temp table would balance any improvement gains from the cursor approach, thereby rendering my forum responses correct, and Paul wrong.  Well, 14 seconds and 113,443 rows later, SSMS and my laptop declared Paul the undisputed Champion of the Cursor.

This cursor makes a lot of sense in this case.  The set-based query works by looping over each row of the table, taking a sum of every previous row.  So for the 10th row, 10 previous rows also need to be visited.  For the 1000th row, 1000 previous rows need to be visited.  And so on.  The larger the set gets, the worse performance will be--and that's not going to be a merely linear decrease in performance.  Think about this:  Using the set-based method to find the running sum over a set of 100 rows, 5050 total rows need to be visited.  For a set of 200 rows, the query processor needs to visit 20100 total rows -- a four-fold increase in the amount of work that must be done to satisfy the query (O((N^2)/2), for those who are a bit more algorithmically minded.)

The cursor, on the other hand, needs to visit each row exactly once (O(N)). By maintaining the running count in a variable, there is no need to re-visit previous rows.  And as my laptop was so happy to show me, the I/O cost due to the temp table does not overshadow the performance improvement of having to visit so many less rows.

So what have we learned today?  In my set-based singlemindedness I failed to realize that the cursor does, indeed, have utility.  Everything in moderation.

Next steps?  I get the feeling that this can be made even faster by employing a CLR routine.  Pull the data into a DataReader and loop over that instead, which will completely eliminate the need for a temporary table.  Watch for that experiment coming to this space soon.

And next time you hear someone mention how horrible cursors are, remind that person that there is a time and place for everything (and it's called college).


Is all of your data in the right place?

By Adam Machanic in adam machanic :: sql server programming 02-16-2006 2:12 PM | Categories:
Rating: (not yet rated) Rate this |  Discuss | 915 Reads | 91 Reads in Last 30 Days |no comments
Jeff Smith of SQLTeam brings us a great blog post about data in tables vs. data in code.

I especially like his fourth example, which involves "lookup" tables and what I like to call "magic keys" (generally referred to as "magic numbers" in other types of code, but who said that keys have to be numbers?)  How often do you see particular surrogate key values hardcoded into SQL or application code?  I see this practice all the time in my work, and I completely agree with Jeff: It's a VERY dangerous habit to get into.

One client I recently did some work for has hundreds of different lookup codes, each with both a text-based representation (usually in the format, "domain.code" where domain is the business vertical and code is the actual code within that vertical) and a surrogate integer key (IDENTITY).  The text-based codes, I don't have that much of a problem with, but in many cases developers chose instead to hardcode the integer keys -- from the development database they happened to be working on!  And no one ever stopped this practice; instead, the database people supported the effort by creating data load scripts to ensure that the integer keys in the development database matched those in QA and production systems.

So now you have a bunch of code hanging around that looks like:

WHERE SomeId IN (44332, 45084, 59005)

Of course, there are no comments anywhere to help someone understand what a key such as 44332 might represent.  And as a result, there are people there who actually have big chunks of these codes memorized, because it's faster to remember than to look them up every time while maintaining the scripts.

This problem is certainly not restricted to that shop.  I've seen it all over the place both in my work and in forums.  It's a very common anti-pattern and I think it highlights the need for developers to take a step back from coding and think just for a moment.  Ponder maintainability.  Do you really think that the next developer who looks at your code will know what these numbers represent?  Do you think you will know, when you look at your code again six months or a year down the road?  Furthermore, do you really want to have to memorize codes?  Don't you have better things to do with your time and brain power?

Thanks, Jeff, for bringing up such an important issue!

GETDATE in a UDF?

By Adam Machanic in adam machanic :: sql server programming 02-12-2006 12:47 PM | Categories:
Rating: (not yet rated) Rate this |  Discuss | 1,003 Reads | 91 Reads in Last 30 Days |no comments
An extremely common question in forums is, "How can I use the GETDATE() function in a UDF?"  Because the GETDATE() function is nondeterministic, it is not allowed in SQL Server 2000 UDFs.

To date, I've always made it a point to answer the question the same way:  "Add a  DATETIME parameter to the UDF and pass it in."  Yes, you can create a view that selects GETDATE() and select from the view in your UDF to work around the restriction, but that really seems like a dirty hack to me.  I'm not sure why, but passing in the date just feels cleaner...

But after answering this question the same way so many times, I was shocked to read a blog post by Louis Davidson in which he points out that the restriction has been lifted in SQL Server 2005.  You can now use GETDATE() within a UDF; no need to either pass it in or create the view. 

Great job spotting that change, Louis!  I never would have thought to look for a modification of that behavior.  This is one of those little tiny annoyances that crops up every once in a while (always at a very bad time), and it's good to know that we no longer have to be concerned with how to work around this issue.