in

SQLServerCentral.com

The largest free SQL Server community.

Ken Kaufman

June 2008 - Posts

  • SQL Server Time-Dates

     

    The other day in our architecture group meeting the question was raised on how SQL Server handled future dates in relation to lights savings time.  For example if a play was scheduled in January for a date in July at 1:00 PM would reflect this time or 2:00 PM because of the jump in a hour for day light savings time.  This is a common issue with many applications, but not SQL Server.  It all relates to the time source the application is getting the time from.  Applications use one of two sources the CPU which works on UTC or it uses an OS api which reflects the time zone setup in the OS.  Of course you’re simply transferring the burden from one program to the next (OS), but it has its advantage to work off the OS time rather then the CPU, but it also has its disadvantages as well. 

     

    Before going into how SQL uses its source, you need to understand how SQL maintains its date.  The “datetime” datatype is eight bytes, or 2 ints, divided between date and time.    The first int represents the date where the base date is 1900 and increments are set at one day.  The second int or four bytes is the time with a base of 12:00 AM, increments are 1/300 of a second.  If you want to see how this works run the following queries:

     

    --Convert date to a binary

    select convert(varbinary(100), getdate())

     

    --Take the first 8 hex values and convert ensure the "0x" is prefix

    select convert(int, 0x00009AC4)

     

    --Do a date diff with 1/1/1900 The dates should be the same

    select datediff(day, '1/1/1900', getdate())

     

    Now that we know how data is stored there is the critical nature of what source does sql use.  As I said earlier applications can use the converted OS time or do the coversion itself against the cpu’s time that’s using UTC.  SQL uses both known as “High Resolution Timer” and “Low Resolution Timer”.  The High resolution timer works off the cpu timer or UTC, which SQL queries for internal processes such as workers, locks and data cleanup.  The Low resolution timer calls the windows api GetTickCount which returns the OS time.  This is used by external functions such as Getdate().   Since these external needs are based on the OS time, there unaffected by changes in day light savings time, all time is still based on delta of midnight as the OS represents the time zone. 

     

    This type of tradeoff between High and Low resolution seems to fit most needs, unless you get into time comparisons, such as Dateadd or Datediff functions.  These are unaware  of day light savings time or UTC, there given a value,  and compute the deltas and assume no day light time.   As with SQL your application will probably require different time values based on regional time and UTC.  SQL provides this functionality in the form of the function getutcdate. 

  • Maintenance Plan Quirks

     

    The move to maintenance plans based on SSIS was a very impressive step.  As an administrator it provides the ease and flexibility of creating and maintaining plans through an object based interface.  There are several differences in the Maintenance plans implementation of SSIS compared to the full blown product, primarily in its extensibility, as it doesn't expose all the features.  However where it beats SSIS is on the management of DBA specific tasks and ease of use.  Not to say SSIS is difficult to use, but changing and troubleshooting can be cumbersome.  If you need to change a SSIS package it requires opening up Visual Studio, making the changes then saving to the appropriate location where a job (Assuming your using jobs to schedule) expects it.  Under maintenance plans all setup and changes occur within Management Studio.  If you're trying to determine what a co-worker setup it's much easier then tracking down the package, moving to a dev environment then reviewing.

     

     

    There are some quirks with SSIS which I'm not sure will be addressed with 2008, predominately in the way the Maint jobs are tied to Jobs.  If you don't understand how this works, these quirks could cause headaches.  First unlike SSIS maintenance plans are tied to sql jobs, or to be more specific sub plans.  Every Maint plan has at least one sub-plan, which is represented as a job after it's created.  If you try to delete the job, you'll get a foreign key violation, which ties the sysmaintplan_subplans and the sysjobs table.  The main oddity comes in the form of saving plans.  When you create a sub-plan it creates a new job, however when you edit a plan then save it overrides the job, by wiping out all the steps then rewriting only the single step that represents the sub plan.  In essence this doesn't allow you to append to that job, if you need other processes to occur after the maint plan runs, because every edit wipes out your appended steps.  An example of this is after backing up locally, you might want a second step to copy your data to a backup server. 

     

    There are a few ways to get around this, the first is the obvious, which is put your serialized processes into the package.  The issue here is that maintenace plans is a scaled down version of SSIS, and doesn't provide the same functionality, such scripting or running executables without hacky workarounds.  The second is to put your serialized processes into a seperate job that performs these other processes, and as part of the maint plan run the execute job task that points to this new job as the last step. The issue here is it takes away your ability to separate processes.  For example in the backup/copy example, this prevents you from backing up without copying and vice-versa. 

     

    My preferred solution is to call the maint plan with a hand rolled job.  The key is to remember the plan is simply an SSIS package stored in msdb, in a psuedo directory "Maintenance Plans".  To call the plan simply select the "SQL Server Intergration Service Package" type, navigate to the Maintenance Plans directory, and select your package.  Since sub plans are actually individual SSIS packages under the umbrella of the Maint plan you need to specify this package.  This is accomplished by navigating to the "Set Values" tab.  Under "Property Path" type "\package\[name of sub plan].disabled", then under the corresponding "value" type "false".  For Example if your sub plan is called sub_plan2 the path would be \package\sub_plan2.disabled.  Once in place any saving and/or editing will not effect this job since it references the plan and not the job.  The downside here is you have two jobs, the one created through Maint plans and the one hand rolled.

     

    Before ending there's one more point to be made about scheduling through a separate job, which is the ability to run multiple sub plans through using the set values.  If you have multiple sub plans you want to run serially, under the default behavior you need to call one sub plan job then the other.  However using the Set values allows you to run serially within one job step by simply adding the two sub plans into the set values tab. 

       

     

  • Performance Effects on Column Type Changes Another Point

     

    In a previous blog http://blogs.sqlservercentral.com/ken_kaufman/archive/2008/05/14/performance-effects-on-column-type-changes.aspx  I talked about how SQL handled switching data types and lengths of the same data type, in particular increasing a varchar length, as well as the cost of moving a varchar to an int.  One of the points left out is moving from a varchar to a char and vice-versa.   Even though these look similar from a high level to be the same data type, there stored completely different at the record level.  Storage of data in a row involves two primary locations for data.  The first in where your fixed data types resides, this is all data that is fixed in size resides, including the char data type.  The second is your variable length section.  This second section contains some critical meta-data about your variable data, there's a global 2 byte section that tracks all the variable columns in your record, and there is an additional 2 bytes for each fields that has the offset of each variable field.  What this all means for converting char into varchar and the reverse is that it requires moving data within the record, even if your going from a char(5) to a varchar(10).  Depending on the type of conversion and underlying data could also mean significant page splits

     

  • Leveraging Clustered Indexes in a Covered Index

     

    In my last blog I talked about covering indexes.  In doing so I slightly touched up non-clustered indexes containing a pointer to the clustered index of the table.    There are often times when you want to cover a portion of the a composite clustered index.  When doing so SQL Server is smart enough to realize this, and makes adjustment. 

     

    When creating a non clustered index on a table that is built on a clustered index (Not a heap) the leaf level of the index has a pointer to the back to the clustered index.  For example if you have an index  clustered index of col1, col2 and you put a non-clustered index on col3, this index would contain all three fields within the non clustered index, col3 for sorting and col1/2 to point back to the clustered index, again this is only at the leaf level and not the root or intermediate levels.

     

    Where you can leverage this is when your query looks something like this:

     

    Select col1, col2, col3 from table1

    Where col3 > 5.

     

    In this case assuming selectivity is high, SQL would use a partial scan against col3, and pick up the rest of the data without having to go back to clustered index, hence covering it.  Where SQL gets smart is when you want to use a field inside your composite index.  If you need to satisfy the below query

     

    Select col1, col2, col3 from table1

    Where col3 > 5 and col2 < 3

     

    The following index might be optimal assuming col3 has a higher selectivity then col2

     

    Create index nonclusteredtest on table1(col3, col2)

     

    In this case at the root and intermediate levels col3 and col2 are included in that sort order.  When you get down to the leaf level SQL knows not to add a double entry for col2 one for the index and one for the clustered index.  Rather it's only in there one time and represents both indexes.  This is a significant space savings particularly if you have a wide field. 

     

  • Covering Indexes

    Covering an index allow you to use a subset of data in an underlying table to answer queries without having to go to the underlying table, whether the underlying table is based on a clustered index or a heap.  Using the “include” feature allows you to add to the non-clustered index without having to worry about sorting as it comes in.   It also provides the benefit of not storing unneeded columns in the intermediate level, as well as going above the 900 byte limit placed on indexes.  Much of following refers to working with b-trees and assumes the reader has a general understanding of how they work.  The best way to visualize this is through example by working inside the Adventureworks DB on the SalesOrderDetail which has the below table definition.  The clustered index (b-tree) is sorted on Salesorderid, salesorderdetailid.  What this means that each row at the leaf level (bottom level) of the index contains all the data within each record.  However the data on this leaf level is sorted in ascending order by SalesOrderID, and SalesOrderDetailID.    The intermediate levels of the clustered index contain only the fields used in the clustered index, and serve the sole purpose to allow SQL to traverse (walk down) the tree based on the index sorted fields.  One note before moving on is the CarrierTrackingNumber field has been modified from a nvarchar 25 to a char 200, as well as additional records have been added to the table.  This was done to make a point.

     

    CREATE TABLE [Sales].[SalesOrderDetail](

          [SalesOrderID] [int] NOT NULL,

          [SalesOrderDetailID] [int] IDENTITY(1,1) NOT NULL,

          [CarrierTrackingNumber] [char](200) NULL,

          [OrderQty] [smallint] NOT NULL,

          [ProductID] [int] NOT NULL,

          [SpecialOfferID] [int] NOT NULL,

          [UnitPrice] [money] NOT NULL,

    [UnitPriceDiscount] [money] NOT NULL CONSTRAINT [DF_SalesOrderDetail_UnitPriceDiscount]  DEFAULT ((0.0)),

    [LineTotal]  AS (isnull(([UnitPrice]*((1.0)-[UnitPriceDiscount]))*[OrderQty],(0.0))),

    [rowguid] [uniqueidentifier] ROWGUIDCOL  NOT NULL CONSTRAINT [DF_SalesOrderDetail_rowguid]  DEFAULT (newid()),

          [ModifiedDate] [datetime] NOT NULL CONSTRAINT [DF_SalesOrderDetail_ModifiedDate]  DEFAULT (getdate()),

     CONSTRAINT [PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID] PRIMARY KEY CLUSTERED

    (

          [SalesOrderID] ASC,

          [SalesOrderDetailID] ASC

    )

     

     

    This structure is optimally sorted if you want to grab a subset of data based on salesorderid.  If you look at the query below, the physical path would be:

    • Traverse the tree vertically and find where salesorderid = 46321
    • Once at the root level go horizantal until until you hit salesorderid 47321
    • While moving horizontal grab the “CarrierTrackingNumber” value from each row it passes over.  This is accomplished by the fact that in a clustered index all the data is contained at the root level

     

     

     

    select CarrierTrackingNumber from Sales.SalesOrderDetail

    where salesorderid between 46321 and 47321

     

    This query retrieves 6228 rows using on 212 reads.  The rows to reads ratio is 29 to 1.  Meaning every IO to memory is grabbing 29 rows.  To prove out the effiency, the size of the row is approximately 258 bytes which translates into 31 rows per page. If we’re picking up 29 rows for every write almost all the IO is working out the leaf level and not traversing the tree on the intermediate levels.

     

    Tables are often used for multiple requests.  For example if you want to look up all orders between Aug 1st and Sept 1st in 2001 that had an orderqty greater then three, and was only concerned with gathering the CarrierTrackingNumber You would run the following query:

     

    select CarrierTrackingNumber

     from Sales.SalesOrderDetail

    where modifieddate between '2001-08-01' and '2001-09-01'

    and orderqty > 3

     

     

    The problem here is there is no index, and therefore would scan the entire table.  On this small table of only 120K rows a scan would do 4055 reads, to retrieve only 301 rows.  This is almost 15 reads per row.  Quite a difference from the partial scan above.  The quick fix here is to place a non-clustered index on the predicates (modifieddate, orderqty).  By creating this index your now moving from a scan to a seek.  This performs 937 reads, and get’s your ratio down to 3 reads per row.  This is a much better then the scan, but doesn’t come close to the first query.

     

    create index testnonclst

    on Sales.SalesOrderDetail(modifieddate, orderqty)

     

    Before going any further you need to understand the structure of a non-cluster index so we can see the benefits of covering and additionally leveraging the “include” clause.  The last query performed a seek, and a key lookup.  For each value found on the index meeting the predicates requirements it would traverse back to the clustered index to retrieve the actual data, hence the 3:1 ratio of reads to rows.  The 3 reads represent the the levels of the index in the b-tree of the clustered index, and an additional read of the the non-clustered leaf data.  Every value found in the non-clustered index requires traversing the  levels of the clustered index to get the data (CarrierTrackingNumber). 

     

    A non-clustered index has the same b-tree structure as a clustered index.  The primary difference is within the record.  Each record has an additional field or fields that point back to the clustered index.  For example inside the testnonclst index it contains both “modifieddate, orderqty” the data being indexed and the clustered index “SalesOrderID, SalesOrderDetailID” fields that it references.  Understanding this is a b-tree structure you can put all your data inside this non-clustered index that satisfiys your query.  Another way of saying this is any fields used by your query will reside within your index.  The outcome is the loss of the two IO’s back to the clustered index, for each record since all your data is covered inside your non-clustered index.

     

     

    drop index testnonclst on Sales.SalesOrderDetail

     

    create index testnonclst_Covered

    on Sales.SalesOrderDetail(modifieddate, orderqty, CarrierTrackingNumber)

     

    If you examine the query below and compare it to the above index creation, your covering all three fields in the query, CarrierTrackNumber which is being selected, and your two predicates, Modifieddate and orderqty.  Determining which field goes on the left should be based on selectivity of each field. The highest selectivity of the field used in your predicates should go to the left. 

     

    select CarrierTrackingNumber

     from Sales.SalesOrderDetail

    where modifieddate between '2001-08-01' and '2001-09-01'

    and orderqty > 3

     

     

    When we run the query above our IO’s drop to 31 IO’s to return 301 records, almost 10 records per IO.  This is because it does a partial scan on the leaf level of the index rather then needing to due a seek.  The numbers are close to the 29:1 ratio of the first query run against the clustered index.  However because your carrying the extra fields of the pointer back to the clustered index, you’ve added 8 bytes to each record.

     

    The final improvement to be made is remove the carrierTrackingNumber field from the index and using the “include” clause as part of the index.  By doing so you’re appending this field to the non-clustered index at the leaf level, and ignoring any sorting of this field.  This serves two purposes the first is space savings on the intermediate levels of the b-tree which will give a small performance gain, and the second is fact that you don’t have to sort your data as it comes in, and avoid page splits at every level.  In the following example splitting and sorting are not an issue, but if modifieddate and orderqty had a significant amount of duplicates it would have a good size impact during inserts and updates.

     

     

    drop index testnonclst_Covered on Sales.SalesOrderDetail

     

    create index testnonclst_Covered_include

    on Sales.SalesOrderDetail(modifieddate, orderqty)include( CarrierTrackingNumber)

     

     

    After running the same query with this structure there’s a 7 percent gain dropping to 29 IO’s.  The benefits gained through using include are dependent on sorting as mentioned earlier, and the size of the included field.  The larger this field the greater the benefits.

     

     

     

Copyright Red Gate Software
Powered by Community Server (Commercial Edition), by Telligent Systems