in

SQLServerCentral.com

The largest free SQL Server community.

Ken Kaufman

Covering Indexes

Covering an index allow you to use a subset of data in an underlying table to answer queries without having to go to the underlying table, whether the underlying table is based on a clustered index or a heap.  Using the “include” feature allows you to add to the non-clustered index without having to worry about sorting as it comes in.   It also provides the benefit of not storing unneeded columns in the intermediate level, as well as going above the 900 byte limit placed on indexes.  Much of following refers to working with b-trees and assumes the reader has a general understanding of how they work.  The best way to visualize this is through example by working inside the Adventureworks DB on the SalesOrderDetail which has the below table definition.  The clustered index (b-tree) is sorted on Salesorderid, salesorderdetailid.  What this means that each row at the leaf level (bottom level) of the index contains all the data within each record.  However the data on this leaf level is sorted in ascending order by SalesOrderID, and SalesOrderDetailID.    The intermediate levels of the clustered index contain only the fields used in the clustered index, and serve the sole purpose to allow SQL to traverse (walk down) the tree based on the index sorted fields.  One note before moving on is the CarrierTrackingNumber field has been modified from a nvarchar 25 to a char 200, as well as additional records have been added to the table.  This was done to make a point.

 

CREATE TABLE [Sales].[SalesOrderDetail](

      [SalesOrderID] [int] NOT NULL,

      [SalesOrderDetailID] [int] IDENTITY(1,1) NOT NULL,

      [CarrierTrackingNumber] [char](200) NULL,

      [OrderQty] [smallint] NOT NULL,

      [ProductID] [int] NOT NULL,

      [SpecialOfferID] [int] NOT NULL,

      [UnitPrice] [money] NOT NULL,

[UnitPriceDiscount] [money] NOT NULL CONSTRAINT [DF_SalesOrderDetail_UnitPriceDiscount]  DEFAULT ((0.0)),

[LineTotal]  AS (isnull(([UnitPrice]*((1.0)-[UnitPriceDiscount]))*[OrderQty],(0.0))),

[rowguid] [uniqueidentifier] ROWGUIDCOL  NOT NULL CONSTRAINT [DF_SalesOrderDetail_rowguid]  DEFAULT (newid()),

      [ModifiedDate] [datetime] NOT NULL CONSTRAINT [DF_SalesOrderDetail_ModifiedDate]  DEFAULT (getdate()),

 CONSTRAINT [PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID] PRIMARY KEY CLUSTERED

(

      [SalesOrderID] ASC,

      [SalesOrderDetailID] ASC

)

 

 

This structure is optimally sorted if you want to grab a subset of data based on salesorderid.  If you look at the query below, the physical path would be:

  • Traverse the tree vertically and find where salesorderid = 46321
  • Once at the root level go horizantal until until you hit salesorderid 47321
  • While moving horizontal grab the “CarrierTrackingNumber” value from each row it passes over.  This is accomplished by the fact that in a clustered index all the data is contained at the root level

 

 

 

select CarrierTrackingNumber from Sales.SalesOrderDetail

where salesorderid between 46321 and 47321

 

This query retrieves 6228 rows using on 212 reads.  The rows to reads ratio is 29 to 1.  Meaning every IO to memory is grabbing 29 rows.  To prove out the effiency, the size of the row is approximately 258 bytes which translates into 31 rows per page. If we’re picking up 29 rows for every write almost all the IO is working out the leaf level and not traversing the tree on the intermediate levels.

 

Tables are often used for multiple requests.  For example if you want to look up all orders between Aug 1st and Sept 1st in 2001 that had an orderqty greater then three, and was only concerned with gathering the CarrierTrackingNumber You would run the following query:

 

select CarrierTrackingNumber

 from Sales.SalesOrderDetail

where modifieddate between '2001-08-01' and '2001-09-01'

and orderqty > 3

 

 

The problem here is there is no index, and therefore would scan the entire table.  On this small table of only 120K rows a scan would do 4055 reads, to retrieve only 301 rows.  This is almost 15 reads per row.  Quite a difference from the partial scan above.  The quick fix here is to place a non-clustered index on the predicates (modifieddate, orderqty).  By creating this index your now moving from a scan to a seek.  This performs 937 reads, and get’s your ratio down to 3 reads per row.  This is a much better then the scan, but doesn’t come close to the first query.

 

create index testnonclst

on Sales.SalesOrderDetail(modifieddate, orderqty)

 

Before going any further you need to understand the structure of a non-cluster index so we can see the benefits of covering and additionally leveraging the “include” clause.  The last query performed a seek, and a key lookup.  For each value found on the index meeting the predicates requirements it would traverse back to the clustered index to retrieve the actual data, hence the 3:1 ratio of reads to rows.  The 3 reads represent the the levels of the index in the b-tree of the clustered index, and an additional read of the the non-clustered leaf data.  Every value found in the non-clustered index requires traversing the  levels of the clustered index to get the data (CarrierTrackingNumber). 

 

A non-clustered index has the same b-tree structure as a clustered index.  The primary difference is within the record.  Each record has an additional field or fields that point back to the clustered index.  For example inside the testnonclst index it contains both “modifieddate, orderqty” the data being indexed and the clustered index “SalesOrderID, SalesOrderDetailID” fields that it references.  Understanding this is a b-tree structure you can put all your data inside this non-clustered index that satisfiys your query.  Another way of saying this is any fields used by your query will reside within your index.  The outcome is the loss of the two IO’s back to the clustered index, for each record since all your data is covered inside your non-clustered index.

 

 

drop index testnonclst on Sales.SalesOrderDetail

 

create index testnonclst_Covered

on Sales.SalesOrderDetail(modifieddate, orderqty, CarrierTrackingNumber)

 

If you examine the query below and compare it to the above index creation, your covering all three fields in the query, CarrierTrackNumber which is being selected, and your two predicates, Modifieddate and orderqty.  Determining which field goes on the left should be based on selectivity of each field. The highest selectivity of the field used in your predicates should go to the left. 

 

select CarrierTrackingNumber

 from Sales.SalesOrderDetail

where modifieddate between '2001-08-01' and '2001-09-01'

and orderqty > 3

 

 

When we run the query above our IO’s drop to 31 IO’s to return 301 records, almost 10 records per IO.  This is because it does a partial scan on the leaf level of the index rather then needing to due a seek.  The numbers are close to the 29:1 ratio of the first query run against the clustered index.  However because your carrying the extra fields of the pointer back to the clustered index, you’ve added 8 bytes to each record.

 

The final improvement to be made is remove the carrierTrackingNumber field from the index and using the “include” clause as part of the index.  By doing so you’re appending this field to the non-clustered index at the leaf level, and ignoring any sorting of this field.  This serves two purposes the first is space savings on the intermediate levels of the b-tree which will give a small performance gain, and the second is fact that you don’t have to sort your data as it comes in, and avoid page splits at every level.  In the following example splitting and sorting are not an issue, but if modifieddate and orderqty had a significant amount of duplicates it would have a good size impact during inserts and updates.

 

 

drop index testnonclst_Covered on Sales.SalesOrderDetail

 

create index testnonclst_Covered_include

on Sales.SalesOrderDetail(modifieddate, orderqty)include( CarrierTrackingNumber)

 

 

After running the same query with this structure there’s a 7 percent gain dropping to 29 IO’s.  The benefits gained through using include are dependent on sorting as mentioned earlier, and the size of the included field.  The larger this field the greater the benefits.

 

 

 

Comments

No Comments
Copyright Red Gate Software
Powered by Community Server (Commercial Edition), by Telligent Systems