SQL Server Central is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
Search:  
 
 

adam machanic :: sql server programming

Add to Technorati Favorites Add to Google
August 2005 - Posts

The dark side of publishing

By Adam Machanic in adam machanic :: sql server programming 08-22-2005 8:41 PM | Categories:
Rating: (not yet rated) Rate this |  Discuss | 939 Reads | 54 Reads in Last 30 Days |1 comment(s)
Recently Steve and Andy have both blogged about the publishing industry.  They both talked about positive aspects.  Learning a lot about writing, and learning a lot about SQL Server by tech editing a new book.

But today I unfortunately encountered a much darker side of the system.

I was asked last week by a major publisher to tech edit a SQL Server 2005 performance book by a new author.  I agreed, and got to work on the first chapter, which I found to be very poorly written.  At that point, I almost quit.  I didn't want to wade through hundreds of pages of dense, badly formed sentences.  But I decided to continue on to the second chapter.

I was immediately struck by the improvement in overall writing quality.  There were still sections of broken text, but suddenly I felt like I was reading something written by someone who really knew the subject, and furthermore really knew how to explain the subject.  This was nothing like the first chapter had been.  And, I thought, that first chapter must have been a fluke.

I kept reading, and found myself engrossed in the topic.  Until I came upon a sentence I didn't quite understand.  So I looked in a very well-known white paper on the topic.

And found the exact same sentence.

And found the exact same section, word-for-word.  And, looking closer, realized that almost the entire white paper was duplicated.

So I wrote an e-mail to the editor.  I still haven't heard back yet, but I decided to look further.  I found that this same author has written for several SQL Server websites, and that many of his articles are ripped, all or in part, from Technet and MSDN offerings.  Even worse, this author has published a book with another major publisher, and the sample chapter available on the publisher's web site is ripped almost exactly from a section of SQL Server Books Online!  How long did this person think this could go on?  Could this person really believe that they could make an entire career of stealing others' material?

You'll notice I'm being vague here.  I'm leaving open the slight chance that this is all some honest mistake.  But I really don't think so. 

This situation makes me physically ill.  I take a lot of pride in my writing, spending hours carefully crafting many of my blog posts -- and a tremendous amount of time on articles that I'm getting paid for.  For this person to take other peoples' writing, re-publish it under his own name, and get paid for it, is, in my opinion, a heinous, unforgivable offense. 

I've alerted the various web sites and publishers who I believe are being ripped off by this person, and I hope that, assuming I'm correct, someone takes legal action... But in the meantime this situation has really hit me hard.  I feel like I've seen a small dose of pure evil... and I don't think I'll ever fully be able to restore my faith in the publishing industry.


Swinging From Tree to Tree Using CTEs, Part 2: Adjacency to Nested Intervals

By Adam Machanic in adam machanic :: sql server programming 08-15-2005 10:08 PM | Categories:
Rating: (not yet rated) Rate this |  Discuss | 1,000 Reads | 51 Reads in Last 30 Days |no comments
In our previous installment, we saw how to convert Adjacency Lists into Nested Sets using a CTE.

In this episode, we will convert the Adjacency List into a Nested Intervals encoding.  Specifically, this encoding will make use of the Nested Intervals with Continued Fractions technique that Tropashko presented in a later paper.

The key to this technique lies in using a slightly different form of materialized path than was used in the last post.  Rather than materializing the EmployeeIds into a path, the path will be created as an enumerated representation, based on sibling ordering.  For example, the first two levels of the AdventureWorks HumanResources.Employee table's tree look like:

EmployeeId 109 (CEO)
EmployeeId 6 (Marketing Manager)
EmployeeId 12 (VP Engineering)
EmployeeId 42 (IS Manager)
EmployeeId 140 (CFO)
EmployeeId 148 (VP Production)
EmployeeId 273 (VP Sales)

Using the materialized path representation from the previous post, the paths to the second-level employees would be:

6: 109.6
12: 109.12
42: 109.42
140: 109.140
148: 109.148
273: 109.273

However, for this post, paths will instead be materialized based on sibling ordering.  We don't know anything about how siblings should be ordered in the AdventureWorks employee hierarchy (that's probably a business rule question, if it matters at all).  So ordering will be done by EmployeeId:

6: 1.1
12: 1.2
42: 1.3
140: 1.4
148: 1.5
273: 1.6
The significance of this encoding is that for each path a rational number can be generated, using a Euclidian algorithm (described very well on this web site).  The algorithm works by iterating over each element on the path, building a rational number as it goes.  The beauty of this, as shown by Tropashko in his paper, is that by using these paths we can determine an interval, beween which all children of a given path will fall.

In order to accomplish getting the path, the ROW_NUMBER function will be used, but slightly differently than last time.  Instead of creating a second CTE to reference the first, thereby getting the row number for each element of the path, the ROW_NUMBER function will be embedded within the recursive CTE itself, as in the following example:
WITH EmployeeRows AS
(
    SELECT
        EmployeeId,
        ManagerId,
        ROW_NUMBER() OVER (ORDER BY EmployeeId) AS theRow
    FROM HumanResources.Employee
    WHERE ManagerId IS NULL

    UNION ALL

    SELECT
        e.EmployeeId,
        e.ManagerId,
        ROW_NUMBER() OVER (ORDER BY e.EmployeeId) AS theRow
    FROM EmployeeRows x
    JOIN HumanResources.Employee e ON e.ManagerId = x.EmployeeId
)
SELECT *
FROM EmployeeRows
ORDER BY
    ManagerId,
    EmployeeId

Interestingly, this example as-is will return exactly the same results as the following, non-recursive CTE example:
SELECT
    EmployeeId,
    ManagerId,
    ROW_NUMBER() OVER (PARTITION BY ManagerId ORDER BY EmployeeId) AS theRow
FROM HumanResources.Employee
ORDER BY
    ManagerId,
    EmployeeId
So, why do we care?  Take a close look at the two examples.  In the CTE example, the ROW_NUMBER function does not use PARTITION BY.  Yet, results are implicitly partitioned.  This gives an interesting view into the inner-workings of CTEs.  As it turns out, the recursive part of the CTE is called once per row returned by the anchor or previous recursion.  This is not how I originally expected CTEs to behave (I thought the recursive part would be called once per rowset returned by the anchor or previous recursion), but it does help us with this particular task!

The current row number, at any given point in the recursion, represents the enumeration for that node.  But because we're using a recursive CTE, we also have access to the parent's enumaration.  For instance, to build an enumerated path, the following T-SQL would be used:
WITH EmployeeRows AS
(
    SELECT
        EmployeeId,
        ManagerId,
        CONVERT(VARCHAR(MAX), ROW_NUMBER() OVER (ORDER BY EmployeeId)) AS thePath
    FROM HumanResources.Employee
    WHERE ManagerId IS NULL

    UNION ALL

    SELECT
        e.EmployeeId,
        e.ManagerId,
        x.thePath + '.' + CONVERT(VARCHAR(MAX), ROW_NUMBER() OVER (ORDER BY e.EmployeeId)) AS thePath
    FROM EmployeeRows x
    JOIN HumanResources.Employee e ON e.ManagerId = x.EmployeeId
)
SELECT *
FROM EmployeeRows
ORDER BY
    thePath
Note that this sample can be made a bit more readable (and more functional for later) by embeding the ROW_NUMBER in a derived table:
WITH EmployeeRows AS
(
    SELECT
        y.EmployeeId,
        y.ManagerId,
        CONVERT(VARCHAR(MAX), y.theRow) AS thePath
    FROM
    (
        SELECT
            EmployeeId,
            ManagerId,
            ROW_NUMBER() OVER (ORDER BY EmployeeId) AS theRow
        FROM HumanResources.Employee
        WHERE ManagerId IS NULL
    ) y

    UNION ALL

    SELECT
        y.EmployeeId,
        y.ManagerId,
        y.thePath + '.' + CONVERT(VARCHAR(MAX), y.theRow) AS thePath
    FROM
    (
        SELECT
            e.EmployeeId,
            e.ManagerId,
            x.thePath,
            ROW_NUMBER() OVER (ORDER BY e.EmployeeId) AS theRow
        FROM EmployeeRows x
        JOIN HumanResources.Employee e ON e.ManagerId = x.EmployeeId
    ) y
)
SELECT *
FROM EmployeeRows
ORDER BY
    thePath
From here, it's a simple step to implement the Euclidian algorithm.  The algorithm is quite simple:
  1. Set parentNumerator <- 1
  2. Set parentDenominator <- 0
  3. Set theElement <- first enumeration in the path
  4. Set currentNumerator <- theElement
  5. Set currentDenominator <- 1
  6. Set theElement <- next enumeration in the path
  7. Set previousParentNumerator <- parentNumerator
  8. Set previousParentDenominator <- parentDenominator
  9. Set parentNumerator <- currentNumerator
  10. Set parentDenominator <- currentDenominator
  11. Set currentNumerator <- (parentNumerator * theElement) + previousParentNumerator
  12. Set currentDenominator <- (parentDenominator * theElement) + previousParentDenominator
  13. If the current element is not the final node in the path, goto 6.
This seems a bit hairy, but I think that looking at the algorithm and spending a few minutes with our old friends pencil and paper will make it quite clear.  Also, look at the web site linked above.  Here is how I've implemented the algorithm using a CTE:
WITH EmployeeRows AS
(
    SELECT
        EmployeeId,
        CONVERT(VARCHAR(MAX), theRow) AS thePath,
        CONVERT(BIGINT, 1) AS prevNumer,
        CONVERT(BIGINT, 0) AS prevDenom,
        CONVERT(BIGINT, theRow) AS currNumer,
        CONVERT(BIGINT, 1) AS currDenom
    FROM
    (
        SELECT
            EmployeeId,
            ROW_NUMBER() OVER (ORDER BY EmployeeId) AS theRow
        FROM HumanResources.Employee
        WHERE ManagerId IS NULL
    ) y
   
    UNION ALL
   
    SELECT
        y.EmployeeId,
        y.thePath + '.' + CONVERT(VARCHAR(MAX), y.theRow) AS thePath,
        prevNumer = y.currNumer,
        prevDenom = y.currDenom,
        (y.currNumer * y.theRow) + y.prevNumer  AS currNumer,
        (y.currDenom * y.theRow) + y.prevDenom  AS currDenom
    FROM
    (
        SELECT
            e.EmployeeID,
            x.thePath,
            x.currNumer,
            x.currDenom,
            x.prevNumer,
            x.prevDenom,
            ROW_NUMBER() OVER (ORDER BY e.EmployeeID) AS therow
        FROM EmployeeRows x
        JOIN HumanResources.Employee e ON e.ManagerId = x.EmployeeId
    ) y
)
Note that in this case I didn't require the use of the temporary variables; the previous anchor/recursive parts act as temporary storage enough. 

Readers will also hopefully notice that I haven't yet included a SELECT to get the data from the CTE!  This is because I'd like to explain briefly what it will do.  In his paper, Tropashko explains that for each node, the intervals for the children of that node will fall into an interval between the encoding of that node (currNumer / currDenom) and the numerator of the previous node plus the numerator for the current node, divided by the denominator of the previous node plus the denominator for the current node ((currNumer + prevNumer) / (currDenom + prevDenom)).  Quite wordy here.  Refer to the paper for a better explanation and proof.
Anyway, the completed query follows:
WITH EmployeeRows AS
(
    SELECT
        EmployeeId,
        CONVERT(VARCHAR(MAX), theRow) AS thePath,
        CONVERT(BIGINT, 1) AS prevNumer,
        CONVERT(BIGINT, 0) AS prevDenom,
        CONVERT(BIGINT, theRow) AS currNumer,
        CONVERT(BIGINT, 1) AS currDenom
    FROM
    (
        SELECT
            EmployeeId,
            ROW_NUMBER() OVER (ORDER BY EmployeeId) AS theRow
        FROM HumanResources.Employee
        WHERE ManagerId IS NULL
    ) y
   
    UNION ALL
   
    SELECT
        y.EmployeeId,
        y.thePath + '.' + CONVERT(VARCHAR(MAX), y.theRow) AS thePath,
        prevNumer = y.currNumer,
        prevDenom = y.currDenom,
        (y.currNumer * y.theRow) + y.prevNumer  AS currNumer,
        (y.currDenom * y.theRow) + y.prevDenom  AS currDenom
    FROM
    (
        SELECT
            e.EmployeeID,
            x.thePath,
            x.currNumer,
            x.currDenom,
            x.prevNumer,
            x.prevDenom,
            ROW_NUMBER() OVER (ORDER BY e.EmployeeID) AS therow
        FROM EmployeeRows x
        JOIN HumanResources.Employee e ON e.ManagerId = x.EmployeeId
    ) y
)
SELECT
    EmployeeId,
    thePath,
    currNumer AS startNumer,
    currDenom AS startDenom,
    currNumer + prevNumer AS endNumer,
    currDenom + prevDenom AS endDenom
FROM EmployeeRows
For each node (EmployeeId), you now have an interval (start and end) within which all children intervals will fall.  Note that computation must be done by the interval, not by the current node's encoding.  The reason becomes apparent when looking at the encodings for 1.1.1 and 1.2.  They are the same; however, their intervals do not overlap.  As a matter of fact, encodings will be the same for every next sibling/first child pair.  But the intervals remain nested, and if proper queries are written there will be no confusion.

So that's a first step towards using the Nested Intervals Model in SQL Server 2005.  Stay tuned for more... And as always, feel free to post questions or comments.  I know some of this material can be confusing (at least, it was to me before I wrote this post!)

New England Code Camp 4: Developers Gone Wild

By Adam Machanic in adam machanic :: sql server programming 08-10-2005 10:38 PM | Categories:
Rating: (not yet rated) Rate this |  Discuss | 494 Reads | 47 Reads in Last 30 Days |no comments
Okay, so I'm way behind in announcing this (almost a month), but it's still not too late, sooo...

September 24/25, 2005

 

Register Today

 

September 24 – 8:30 AM – 9PM

September 25 – 8:30 AM – 4PM

 

Microsoft Waltham
201 Jones Rd.

Waltham, Ma.

 

Call for Speakers

 

Are you a developer interested in improving your .NET skills? Then this is the event to attend. Code Camp 4: Developers Gone Wild promises to be both bigger and better than anything we have done before. This free two day seminar is designed as a series of intensive code related demos and technical sessions to guide the developer to the next skill level. The continuing goal of the Code Camps is to provide an intensive developer to developer learning experience that is fun and technically stimulating. The focus is on delivering programming information and sample code that can be used immediately. All training, slides, manuals and demo code is provided free!

 

This two day camp is hosted in our Waltham facility. The leading technical camp counselors from both Microsoft and the New England Developer Community will share their technical expertise and experiences. This code camp is divided into three tracks – Smart Client, Web Development and Data Technologies. Each track starts with a “get the code” basics before advancing to more advanced topics.


(via Thom Robbins)

I'll probably be doing a "chalk talk" (basically, a guided discussion).  If you have any requests for topics, let me know.



Swinging From Tree to Tree Using CTEs, Part 1: Adjacency to Nested Sets

By Adam Machanic in adam machanic :: sql server programming 08-04-2005 5:16 PM | Categories:
Rating: (not yet rated) Rate this |  Discuss | 2,037 Reads | 56 Reads in Last 30 Days |1 comment(s)
I'm not sure how many times over the last several years I've seen the same tired article titles... "Climbing Trees in SQL," "Climbing Up the SQL Tree," or maybe, "Naked Coeds Playing in the Trees!" ... Oh wait, I think that last one might be something else.

But anyway, the point is, I'm going to adhere to that standard.  But I'm adding a Tarzan-esque theme to this post, because we're not going to just climb a tree.  We're going to swing about between trees.  Different types of tree representations are appropriate for different scenarios.  Which is why, as I pointed out in my recent SearchSQLServer article on recursive Common Table Expressions, we have so many different ways of representing them.  Adjacency Lists, Materialized Paths, Nested Sets, and Nested Intervals spring to mind.  And there are probably others.

My article shows how to use the CTEs, in conjunction with a dynamically generated materialized path, to manipulate an Adjacency List, getting many of the benefits associated with using the Nested Sets Model.  And that's great.  But the Nested Sets Model itself might be useful in your endeavors.  So in this post, I will show how to extend the CTE discussed in that article.

The end-result CTE in the article, which can be run in the AdventureWorks database, looks something like this (renamed for the sake of this post):

WITH EmployeeLevels AS
(
    SELECT
        EmployeeId,
        CONVERT(VARCHAR(MAX), EmployeeId) AS thePath,
        1 AS Level
    FROM HumanResources.Employee
    WHERE ManagerId IS NULL

    UNION ALL

    SELECT
        e.EmployeeId,
        x.thePath + '.' + CONVERT(VARCHAR(MAX), e.EmployeeId) AS thePath,
        x.Level + 1 AS Level
    FROM EmployeeLevels x
    JOIN HumanResources.Employee e on e.ManagerId = x.EmployeeId
)
thePath is the materialized path, a '.' delimited breadcrumb from the root to each node.  We also end up with a level, which represents how many nodes away from the root in the hierarchy each employee sits (very important for those upwardly-mobile junior execs, no doubt!)

I'm going to assume that readers of this post are familiar with the Nested Sets Model, as popularized by Joe Celko.  If not, read the link.

After staring at the Nested Sets for a while, I arrived at the following mathematical properties:

  • Value of Lft for the root node is 1
  • Value of Rgt for the root node is 2 * (Number of nodes)
  • Value of Lft for any node is ((Number of nodes visited) * 2) - (Level of current node)
  • Value of Rgt for any node is (Lft value) + ((Number of subnodes) * 2) + 1
I think the only factor here that requires further explanation is "(Number of nodes visited)".  By this, I mean the number of nodes that would have been visited (including the current node) if one were doing a preorder traversal of the tree. Luckily, this number is quite easy to determine.  The row number for each row, as determined by ordering by the materialized path, is this number.  I encourage readers to validate this with pencil and paper if it doesn't quite make sense reading on the screen.  Draw a simple tree and traverse, starting from the lefthand side of the root node.

But how to translate that into T-SQL?  Luckily, SQL Server 2005, in addition to CTEs, also includes the very useful ROW_NUMBER() function.  So to get the row number, we simply need to add another CTE into the chain (did you know that successive CTEs can use the results of previous CTEs?):

WITH EmployeeLevels AS
(
    SELECT
        EmployeeId,
        CONVERT(VARCHAR(MAX), EmployeeId) AS thePath,
        1 AS Level
    FROM HumanResources.Employee
    WHERE ManagerId IS NULL

    UNION ALL

    SELECT
        e.EmployeeId,
        x.thePath + '.' + CONVERT(VARCHAR(MAX), e.EmployeeId) AS thePath,
        x.Level + 1 AS Level
    FROM EmployeeLevels x
    JOIN HumanResources.Employee e on e.ManagerId = x.EmployeeId
),
EmployeeRows AS
(
    SELECT
         EmployeeLevels.*,
         ROW_NUMBER() OVER (ORDER BY thePath) AS Row
    FROM EmployeeLevels
)
We now have current level and number of nodes visited, which gives us the Lft value for each node.  But how to determine the number of subnodes, in order to get the Rgt value?  Luckily, the materialized path also gives us that capability...

For any given node, number of subnodes can be determined by counting all nodes whose path value is LIKE the current path value + '.%'.

The resultant query should be fairly obvious at this point:

WITH EmployeeLevels AS
(
    SELECT
        EmployeeId,
        CONVERT(VARCHAR(MAX), EmployeeId) AS thePath,
        1 AS Level
    FROM HumanResources.Employee
    WHERE ManagerId IS NULL

    UNION ALL

    SELECT
        e.EmployeeId,
        x.thePath + '.' + CONVERT(VARCHAR(MAX), e.EmployeeId) AS thePath,
        x.Level + 1 AS Level
    FROM EmployeeLevels x
    JOIN HumanResources.Employee e on e.ManagerId = x.EmployeeId
),
EmployeeRows AS
(
    SELECT
         EmployeeLevels.*,
         ROW_NUMBER() OVER (ORDER BY thePath) AS Row
    FROM EmployeeLevels
)
SELECT
     ER.EmployeeId,
     ER.thePath,
     ER.Level,
     ER.Row,
     (ER.Row * 2) - ER.Level AS Lft,
     ((ER.Row * 2) - ER.Level) +
        (
            SELECT COUNT(*) * 2
            FROM EmployeeRows ER2
            WHERE ER2.thePath LIKE ER.thePath + '.%'
        ) + 1 AS Rgt
FROM EmployeeRows ER
ORDER BY thePath
... And that's it!  We have now converted the Adjacency List tree into a Nested Sets tree.

In the next installment, I will show how to determine the Nested Intervals encoding from an Adjacency List tree, also using recursive CTEs.

Questions?


October 22: SQL Server Mini-Code Camp

By Adam Machanic in adam machanic :: sql server programming 08-03-2005 5:28 PM | Categories:
Rating: (not yet rated) Rate this |  Discuss | 1,304 Reads | 53 Reads in Last 30 Days |1 comment(s)

Just announced on Thom Robbins' blog, I'm doing a SQL Server Mini-Code Camp on October 22 at the Microsoft offices in Waltham, MA. This will be a full day of content.

The morning session will focus on data fundamentals and set-based thinking. This will include a study of First Normal Form and some brief discussion of higher normal forms, a very basic intro to referential integrity, and an introduction to 3-valued logic and NULLs. I'll then proceed into set-based vs. procedural thinking, and show how to write better queries without using loops or cursors.

The afternoon session will be a deep dive into SQL Server 2005 CLR integration. I'll cover primarily how to program--and best practices for using--CLR UDTs, UDFs, aggregates, and stored procedures. I'll also briefly discuss triggers, but honestly I don't see much value in that feature so I'm going to spend very little time discussing them.

I'm very excited to be doing this! Let me know (soon, please) if you have any specific areas you'd like covered and I'll see if I can work it into the content.

You can register here.