Sunday, April 6, 2014

Sri Lanka crowned T20 champs - 2014

We did it again! Thanks Sanga, Mahela and the team, proud to be a Sri Lankan.

Reasons for adding a surrogate key for dimension tables

As a best practice, we always add a new key, known as a surrogate key to dimension tables for identifying the records uniquely. However, generally, the unique business key becomes the primary key of relational tables related to business entities but once the surrogate key is added to dimension tables, we make it as the primary key without reusing the business key. Here is an example from AdventureWorksDW data warehouse.

image

Surrogate key is an integer that is assigned sequentially when records are added. There are many other names for this such as artificial key, synthetic key, meaningless key, integer key or non-natural key. Do we really need this? Does it seem inappropriate using the business keys for primary keys? It may seem sensible for reusing, but here are some valid reasons for not using business keys but using surrogate keys;

  1. Dimension is populated from multiple sources
    In many cases, dimensions are populated from multiple sources. This always introduces incompatibility between business keys’ data types originating from sources. Not only that, the uniqueness of the business key is not always guaranteed too.
  2. Business key is reassigned when an object becomes obsolete
    OLTP systems always focus on current operations giving less importance to the history. If an object becomes inactive or obsolete (Example: Organization decides to discontinue a product), there is a high probability to get the business key used assigned to a new object. If the business key is used as the key of the dimension, this becomes an issue because dimension already holds an object with the same business key.
  3. Business key is a complex string or GUID
    Business keys used in source systems could be formed combining numbers and different characters or with GUIDs. In such scenario, there is no harm of using the business key as the primary key of the dimension table as long as the uniqueness is guaranteed. However, in terms of performance, it makes more effective to have an integer surrogate key than a complex string.
  4. Integer improves the performance
    As mentioned with reason 3, it is always better to have an integer key as the key of the dimension even though the business key is a simple string. This improves the efficiency of the table and speeds up the data retrieval when joining with facts.
  5. Objects without business keys
    When populating dimension records, there could be a situation that object needs to be populated without a business key. Good example on this is, holding a newly introduced product in the Product dimension, before extracting it from the source, for supporting a transaction exist in the fact table. Another example is, holding an anonymous customer in the Customer dimension for supporting promotion events recorded in the fact table. It is become possible for holding these objects in dimension tables as “Unknown”, “Not applicable”, or “Anonymous” when surrogate key is used.
  6. Objects are versioned in dimension tables
    In order to maintain the history in dimension tables, same objects are versioned by adding multiple records (These dimensions are called as Slowly Changing Dimension Type 2). Example: Customer is recorded with his country Sri Lanka. Later customer becomes a citizen of Australia. In OLTP systems, it is just a change for country column. But in data warehousing, for maintaining the history for events recorded (facts), a new version (new record) of the customer is added with Australia for the country. This is not possible if the business key of the customer is used as the key of the dimension.

Of course, the maintenance cost gets increased and a mechanism has to be introduced for generating surrogate keys and linking them with foreign keys in fact tables. But it is not as difficult as many think because there are many built-in facilities available in DBMS and ETL systems.

Saturday, April 5, 2014

Free eBook: Introducing Microsoft SQL Server 2014

Want to know some major features added to SQL Server 2014? Here is an easy way. This book explains how SQL Server 2014 leverages in-memory technology for improving the performance of OLTP and data warehousing solutions, how easily on-premises solutions can be transferred to cloud with added support for hybrid environment, etc. If you are a DBA, DBE or an engineer who wants to know the new features and capabilities, this is the book to be read.

Download links:

Chapters in this book;

  • Chapter 1: SQL Server 2014 editions and engine enhancements
  • Chapter 2: In-Memory OLTP investments
  • Chapter 3: High-availability, hybrid-cloud, and backup enhancements
  • Chapter 4: Exploring self-service BI in Microsoft Excel 2013
  • Chapter 5: Introducing Power BI for Office 365
  • Chapter 6: Big data solutions

Wednesday, April 2, 2014

Easy way of adding your Microsoft Certification to LinkedIn Profile

If you are maintaining your profile with world largest professional network (Im sure that you are) and you have done a Microsoft Certification, there is an easy way of adding the certification to your LinkedIn profile rather than manually adding it. The new way has been introduced by teaming up with LinkedIn and it is opened to you via an email which you receive with details once an exam is done. All you have to do is, hit the button populated with all details required (since I have not done a certification recently, I have no idea how this will be appeared in the mail, I believe it is similar to below image).

In order to receive the mail, make sure you have enabled “Promotional email” in MCP profile setting and you are using the same email address for both MCP profile and LinkedIn profile.

For more details, read this post published by Carrie Francey (MS): http://borntolearn.mslearn.net/btl/b/weblog/archive/2014/03/25/announcing-a-new-way-to-add-microsoft-certifications-to-your-linkedin-profile.aspx

SQL Server exams update: SQL Server 2012 and 2014

Are you planning to do Microsoft SQL Server certifications in next few weeks? If so, here is an announcement on exams updates. All MCSE related SQL Server exams (464 to 467) will be updated on April 24, 2014 adding new features introduced with SQL Server 2014. Numbers of the exams will not be changed but the title will be adjusted with 2014. Official courseware for these exams have not been updated yet, most probably, updated courseware will be published during May 2014.

Here are the exams that will be updated;

Use following links to see the changes done in skills measured;

MCSA related SQL Server exams (461 to 463) will be remain unchanged.

Tuesday, April 1, 2014

Data Warehouse does not need a large number of non-clustered indexes?

It is a fact that non-clustered indexes improve the performance of data retrieval and we tend to add them mostly based on predicates used. As Data warehouses mainly design for data retrieval, many think that a data warehouse must contain non-clustered indexes as much as possible. In a way, it is true and required, however, we need to thoroughly analyze the requirements on indexing in data warehousing because generally non-clustered indexes do not help much with dimension and fact tables in data warehouses.

Non-clustered indexes generally give better performance on high selective queries. An example for a high selective query would be a query written for getting all customers for a given last name. Mostly attributes in a dimension table are used for pivoting in reports and typically hold few distinct values. Because of that, filters used in such reports are not much selective and will not get any benefits from indexes. However, parameterized reports looking for a particular value from an attribute would benefit as it is very selective. Therefore, when determining an index on an attribute, consider the reporting requirement similar to section example and then add.

It is common practice that adding non-clustered indexes on foreign keys. A typical data warehouse created based on Star-Schema has fact tables and fact tables have foreign keys linked to dimension tables. Do we need to create non-clustered indexes on these keys? Answer is “No”. The reason for that is, “star join query optimizations” feature in SQL Server (Enterprise only). This feature recognizes star join patterns and it uses bitmap filtered hash joins for joining tables in efficient way. Therefore indexes on foreign keys are not required unless a different edition is used. However, if the required dataset is smaller, indexes would be created on foreign keys as merge and nested loops joins give better performance than hash joins. Read more on that here: http://technet.microsoft.com/en-us/library/bb522541(v=sql.105).aspx

Considering above facts, we rarely add non-clustered indexes on both fact and dimension tables. In addition to that, the volume of records involved with the query (IO), the high cost for maintaining indexes, rebuilding every time data is loaded are other factors for not using non-clustered indexes in DWs.

Sunday, March 30, 2014

Magic Quadrant for Business Intelligence and Analytics Platforms – 2014 February

Gartner has published its Magic Quadrant for Business Intelligence and Analytics platforms covering 17 capabilities grouping them into 3 categories. Business intelligence and Analytics as a software platform has been considered for this.

I have summarized capabilities used for this in a simpler format. You can read the entire document at: http://www.gartner.com/technology/reprints.do?id=1-1QYUTPJ&ct=140220&st=sb.

  • Information delivery
    • Reporting: ability to create interactive reports with or without parameters
    • Dashboards: ability to display operational or strategic information using graphical interfaces such as graphs, gauges and maps.
    • Ad hoc report/query: ability to get required information without IT support connected with predefined sources, metrics, hierarchies, etc. 
    • Microsoft Office integration: ability use Microsoft Office (Excel,..) as reporting and analytics client including advanced features like cell-locking and write-back. 
    • Mobile BI: ability to develop and deliver content to mobile devices and use native functionalities such as touchscreens, camera, etc.
  • Analysis
    • Interactive visualization: ability to interact with information with advanced visual representations that go beyond traditional visualization options such as charts, heat-maps, tree-maps, etc.
    • Search-based data discovery: ability to do search on both structured and unstructured and map easily into a classification structure of dimension and measures.
    • Geospatial and location intelligence: ability to combine geographic and location-related data from multiple sources such as aerial maps, GISs with enterprise data, and display them by overlaying on maps. Advanced features such as distance and route calculations, geofencing, 3D visualizations are considered too.
    • Embedded advanced analytics: ability to leverage statistical function library to consume analytics methods such as Predictive model markup language, R-based models, forecasting, “what-if” analysis with visualizations.
    • Online analytical processing (OLAP): ability to provide OLAP features such as drilling-down, slicing and dicing, write-back with “what-if”, etc. using data architectures like relational, multidimensional or hybrid and storage architectures like disk-based or in-memory. 
  • Integration
    • BI infrastructure and administration: ability to use all components such as security, metadata, etc. by all tools in the platform. The platform should support multitenancy.
    • Metadata management: ability to manage systems-of-records semantic and metadata centrally, including user-defined data mashup and metadata.
    • Business user data mashup and modeling: ability to create user-defined analytical models connecting with multiple multi-structured sources using code-free, drag-and-drop functionalities. Advanced features such as semantic auto-discovery, intelligence join etc. are considered too.
    • Development tools: ability to provide set of programmatic and visual tools for developing objects such as reports, dashboards, models etc.
    • Embeddable analytics: ability to create and modify analytic content, visualizations, and application embedding them into business processes.
    • Collaboration: ability to start discussion thread on information, analysis, analytic content, etc.
    • Support for big data sources: ability to connect with hybrid, columnar and array-based sources such as MapReduce.

Thursday, March 20, 2014

Door-Quiz – SQL Server Sri Lanka User Group Meet-up

We had another Door-Quiz at SS SLUG yesterday and as usual, the question asked was tricky. However, many had circled the correct answer and one lucky person grabbed the goodie-pack. Here is the question asked;

image

Whenever we see a numerator is getting divided by 0, all we expect  is an error because, it is a common fact that division by zero has no value. If we execute the SELECT statement separately, SQL Server will definitely throw an error saying it. But the tricky part of this question is, combining this fact with EXISTS predicate. EXISTS in this scenario only considers the existence of result-set produced by SELECT rather executing it as a general SELECT. Since the predicate becomes TRUE, SQL Server goes to PRINT ‘yes’ without producing any error.

Answer for the question is “yes"Smile.

Thursday, March 13, 2014

Who invented dimensional modeling?

Who introduced or invented Dimensional Modeling? Was it invented by Ralph Kimball? Since he completely aligns with Dimensional Modeling in his business intelligence strategies, many believe that it was from him. Unfortunately, it is one of the misperceptions persist in the industry related to Dimensional Modeling.

What is Dimensional Modeling?
Dimensional Modeling is a technique that has been used for decades for structuring databases in simple manner, supporting business users’ requirements such as understanding data, navigating through them easily, and running both simple and complex data analysis queries. Dimensional model designed with relational database management systems is called as Start-Schemas and dimensional model designed with multi-dimensional databases is called as OLAP cubes.

Who invented this?
History is limited on this, however, as per the record exist;

  • The terms “Facts” and “Dimension” were introduced with a joint project developed by General Mills and Dartmouth University in 1960’s.
  • During 1970’s, Nielsan Marketing Research team used these techniques for grocery and drug store data.
  • In 1980’s Nielsen Marketing Research and IRI used grocery and drug store scanner data to link with customer’s internal shipment data.

As per the history, no single person holds the credit of introducing the concept but it is something that have been evolved with contributions of many.

Read more about Dimensional Modeling: http://en.wikipedia.org/wiki/Dimensional_modeling

Why SQL Server does not drop the procedure and recreate the object when the code is written properly?

This is bit funny, it took few minutes to figure out the reason for getting “There is already an object named '' in the database.” error when the code for dropping and recreating is properly written. It is one of key things to remember when coding, hence sharing the experience.

Have a look at the code and the error thrown. Note that it is not the actual code, the code below is written to show the scenario. Please assume that there are many other statements before and after the code given.

  1.  
  2. ...
  3. ...
  4.  
  5. GO
  6.  
  7. -- dropping GetSalesFor2007 if exist
  8. -- and creating
  9. IF OBJECT_ID('dbo.GetSalesFor2007') IS NOT NULL
  10.     DROP PROC dbo.GetSalesFor2007
  11. GO
  12. CREATE PROCEDURE dbo.GetSalesFor2007
  13. AS
  14. BEGIN
  15.     
  16.     SELECT SalesOrderID, SalesOrderNumber
  17.     FROM Sales.SalesOrderHeader
  18.     WHERE OrderDate BETWEEN '01/01/2007'
  19.         AND '12/31/2007 23:59:59.000'
  20. END
  21.  
  22. -- dropping GetSalesFor2008 if exist
  23. -- and creating
  24. IF OBJECT_ID('dbo.GetSalesFor2008') IS NOT NULL
  25.     DROP PROC dbo.GetSalesFor2008
  26. GO
  27. CREATE PROCEDURE dbo.GetSalesFor2008
  28. AS
  29. BEGIN
  30.     
  31.     SELECT SalesOrderID, SalesOrderNumber
  32.     FROM Sales.SalesOrderHeader
  33.     WHERE OrderDate BETWEEN '01/01/2008'
  34.         AND '12/31/2008 23:59:59.000'
  35. END
  36.  
  37. ...
  38. ...

image

As you see, SQL Server tries to create the procedure (Line no: 27) without dropping it using the DROP PROC statement given (Line no: 25). Simply, the reason for this is, missing the batch separator between end of first procedure and DROP statement of second procedure.

What is a batch?
T-SQL Batch is a collections of SQL statements that need to be parsed, normalized and executed as a single unit. The end of the batch is indicated using GO statement (Read more on GO: http://dinesql.blogspot.com/2009/01/separating-batches-using-new-word.html). There are two important points we need to remember when working with T-SQL batches;

  1. The boundaries for scopes of variables are determined using batch separators, hence variables declared in one batch cannot be used beyond the GO statement. In other words, variables declared cannot be used in a different batches other than the batch it is declared.
  2. Most DDL statements require separate batches and may not be combined with other statements.

If the second point is clearly understood, then the reason for above error can be figured out. Statements like CREATE PROC, CREATE VIEW cannot be combined with other statements in the batch and these statements must start with a new batch. If you note the DROP statement for the second procedure (Line no: 25), you will see that there is no GO statement in between end of first procedure and beginning of DROP statement for the second procedure. Therefore the DROP statement becomes a part of the first batch which contains the first procedure, making the DROP statement as part of first procedure. Have a look on below output;

image

As you see, the DROP statement is in the body of first procedure. If we place a GO statement at the end of the first procedure, it will not become a part of first procedure.

Here are all statements that require a separate batch; CREATE DEFAULT, CREATE FUNCTION, CREATE PROCEDURE, CREATE RULE, CREATE SCHEMA, CREATE TRIGGER, and CREATE VIEW.

Sunday, March 9, 2014

Cumulative update package 11 for SQL Server 2008 R2 Service Pack 2

Cumulative update #11 is available for SQL Server 2008 R2 SP2. Refer the following link for downloading it and understanding the fixes done.

http://support.microsoft.com/kb/2926028

For more info on SQL Server versions and service packs, refer:http://dinesql.blogspot.com/2014/01/versions-and-service-packs-of-sql.html

Thursday, February 27, 2014

SQL Server: Calculating running totals using T-SQL

This is not a new topic. If you search, you will surely find many posts on this, mostly with traditional techniques but not using newest capabilities such as SQL Windowing. Since I wrote a post on Analysis Services for the same, thought to write the same on T-SQL too. Here is the way of calculating running totals using Window components and functions which provides an efficient way of calculating and simple code structure.

The following query shows the way of calculating. The first code creates a window based on SalesOrderId (which is unique) and get the running totals over SubTotal for a given year. The second code creates a window on OrderDate (which is not unique). This will show the totals for the date instead of Running-Totals for the date unless the range is specified using boundaries. That is the reason for adding upper and lower boundaries using ROW, UNBOUNED PRECEDING and CURRENT ROW inside the window for restricting rows to be participated for the calculation.

  1. -- Window based on OrderId which is unique
  2. SELECT OrderDate, SalesOrderID, SalesOrderNumber, CustomerID
  3.     , SubTotal Total
  4.     , SUM(SubTotal) OVER(ORDER BY SalesOrderID) RunningTotal
  5. FROM Sales.SalesOrderHeader
  6. ORDER BY OrderDate, SalesOrderID
  7.  
  8. -- Window based on OrderDate which is not unique
  9. SELECT OrderDate, SalesOrderID, SalesOrderNumber, CustomerID
  10.     , SubTotal Total
  11.     , SUM(SubTotal) OVER(ORDER BY OrderDate
  12.         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) RunningTotal
  13. FROM Sales.SalesOrderHeader
  14. ORDER BY OrderDate, SalesOrderID

Both queries produce same result;

image

Here is a comparison on the same using traditional techniques. Though it shows that the query uses window is faster than other queries, always check and pick the best.

  1. -- Using Window components
  2. SELECT OrderDate, SalesOrderID, SalesOrderNumber, CustomerID
  3.     , SubTotal Total
  4.     , SUM(SubTotal) OVER(ORDER BY SalesOrderID) RunningTotal
  5. FROM Sales.SalesOrderHeader
  6. WHERE YEAR(OrderDate) = 2006
  7. ORDER BY OrderDate, SalesOrderID
  8.  
  9. -- Using self-join
  10. SELECT h1.OrderDate, h1.SalesOrderID, h1.SalesOrderNumber, h1.CustomerID
  11.     , h1.SubTotal Total
  12.     , SUM(h2.SubTotal) RunningTotal
  13. FROM Sales.SalesOrderHeader h1
  14.     INNER JOIN Sales.SalesOrderHeader h2
  15.         ON h1.SalesOrderID >= h2.SalesOrderID
  16.             AND YEAR(h2.OrderDate) = 2006
  17. WHERE YEAR(h1.OrderDate) = 2006
  18. GROUP BY h1.OrderDate, h1.SalesOrderID, h1.SalesOrderNumber
  19.     , h1.CustomerID, h1.SubTotal
  20. ORDER BY h1.OrderDate, h1.SalesOrderID
  21.  
  22. -- Using sub query
  23. SELECT h1.OrderDate, h1.SalesOrderID, h1.SalesOrderNumber, h1.CustomerID
  24.     , h1.SubTotal Total
  25.     , (SELECT SUM(h2.SubTotal) RunningTotal
  26.         FROM Sales.SalesOrderHeader h2
  27.         WHERE h1.SalesOrderID >= h2.SalesOrderID
  28.             AND YEAR(h2.OrderDate) = 2006)
  29. FROM Sales.SalesOrderHeader h1
  30. WHERE YEAR(h1.OrderDate) = 2006
  31. ORDER BY h1.OrderDate, h1.SalesOrderID

image

Wednesday, February 26, 2014

Analysis Services (SSAS): Calculating running totals for a given period using MDX

Calculating running totals against a numeric is not an easy task with both relational databases using T-SQL and multidimensional databases using MDX. In a way, calculation on relational databases is little bit easier using new window functions than using multi-dimensional queries. But it does not mean that the calculation using MDX is much more complex and difficult to write. However it depends on the familiarity. Comparatively we do not write MDX statements as much as we write T-SQLs. That is the reason for tagging “complex” on MDX, that is the reason for spending much time even on a simple query than the time spend with T-SQL for similar implementations.

I had to write a similar query today for calculating running totals mixing with YTD and previous year values. As usual I searched for best practices, there were many posts on this based on either calendar year or financial year but could not find a specific one for calculating running totals for a given period. Wrote it, and thought to share it. Here is the MDX, it is written on AdventureWorksDW. You may find it interesting and helpful.

  1. WITH MEMBER [Measures].[Running Total]
  2. AS
  3. (
  4.     AGGREGATE([Date].[Calendar].[Month].&[2006]&[10]
  5.             :[Date].[Calendar].CurrentMember
  6.             , [Measures].[Sales Amount])
  7. )
  8. SELECT
  9.     {[Measures].[Sales Amount]
  10.     , [Measures].[Running Total]} ON 0
  11.     , [Date].[Calendar].[Month].Members ON 1
  12. FROM (SELECT {[Date].[Calendar].[Month].&[2006]&[10]
  13.             :[Date].[Calendar].[Month].&[2007]&[11]} ON 0
  14.         FROM [Adventure Works])

image

Sunday, February 23, 2014

SQL Server Management Studio: Exploring General Options for T-SQL – Part I

SQL Server Management Studio is an integrated management, development and querying application that we use for working with SQL Server instances and databases. It is based on on Visual Studio shell and it is the key application for both developers and administrators for working with SQL Server.

Just like other Microsoft applications, Management Studio allows you to adjust/customize the environment as you wish. It offers many settings, mostly unknown to us, that can be used for making the application more user-friendly and more flexible. Thought to explore some of important ones via number of notes. Here is the first one, this note explores settings listed under Text Editing for T-SQL .

The general options related to T-SQL can be found under Tools Menu –> Options menu –> Text Editor –> Transact-SQL –> General;

image

There are 10 options that can be set under T-SQL –> General. However some of them are not related to T-SQL. Let’s explore one by one.

  1. Auto list members
    Selecting this option lists columns, functions, tables, etc. out on a pop-up menu based on the statement you write, making intelliSense enabled. By selecting the relevant item, the code can be completed without typing the whole word avoiding mistakes caused by misspelling and increasing the typing time.

    image
  2. Hide advanced members
    I believe that this option is not applicable for SQL Server but yet to be confirmed. As per BOL, this hides members marked as “advanced” limiting items loaded to the pop-up. This option is disabled when “Auto list members” is cleared or no members are marked as “advanced”.
  3. Parameter information
    If this option is selected, the complete syntax of current declaration or procedure is displayed with its parameters. The parameter which is bold shows the one needs to be set as the next parameter.

    image
  4. Enable virtual space
    Selecting this makes the position of cursor consistence with all the lines in the code regardless of the length of the line. By default, this is disable. Therefore the position of the cursor is not consistence when moving up and down. Have a look on below image;

    image

    Now, if the cursor is moved down, it will be positioned to column 31.

    image

    If the option is selected, position of the cursor will be remained in same column.

    image

    The reason for this is, when the option is selected, tabs or spaces are automatically added to complete the line.
  5. Word wrap
    This makes the entire line you have typed visible in viewable editor area even though it has extended beyond the area horizontally.

    Editor with option cleared.
    image

    Editor with option selected.
    image
  6. Show visual glyphs for word wrap
    This option comes as a sub option of “Word wrap” hence it is only enabled when “Word wrap” is selected. Selecting this makes a glyph (a graphical symbol that shows a returned-arrow) appeared on wrapped lines indicating that the lines are wrapped.

    image
  7. Apply Cut or Copy commands to blank line when there is no selection
    This setting allows us to cut or copy blank lines and paste without selecting anything. Look at the below image. It has a blank line and the cursor is positioned in it. Now press Ctrl+C for copying;

    image

    If the option is selected, Ctrl+C will copy the blank line and Ctrl+V will insert a new blank line.

    image
  8. Line numbers
    Selecting this options displays line numbers for each line;

    image
  9. Enable single-click URL navigation
    Selecting this option makes URL in the editor clickable for opening the web page. If the option is cleared, there will be no change on the cursor when passing over the URL but if it is selected, URL will be shown as a hyperlink and can be click on it while holding the Ctrl key. The first image shows the editor with the option cleared and the second shows with the option selected.

    image

    image
  10. Navigation bar
    This option is for getting all objects and procedures displayed in drop-downs at the top of the editor for easy navigation. However this option is not enabled for SQL Server.

Friday, February 21, 2014

SQL Server 2012: Getting Previous and Next values using Offset functions

How can we access other rows in a set while accessing one particular row? In other words, can we access values in other rows other than current row while the current row is being processed? There were no built-in functions for supporting this functionality with previous versions of SQL Server but there were many ways of getting the required result generated. One common way was linking the same table to itself either using as a derived table or CTE. Microsoft SQL Server offers four offset functions for supporting this requirement. Here is note on it;

SQL Server 2012 Offset Functions: LAG, LEAD, FIRST_VALUE, LAST_VALUE
Offset functions allow to access values located in other rows while accessing the current row;

Function Description
LAG LAG works on window partition and window order clauses. It allows to access a value of a row at a certain offset from the current row which appears before the current row based on the order specified. It accepts three parameters; value (or column) which needs to be returned, offset as optional (1 is default), and default value to be returned in case of no row at the specified offset (null is default).
LEAD LEAD works on same manner, just like LAG. Only different is, while LAG is looking for records before the current row, LEAD is looking for records after the current row.
FIRST_VALUE This allows to access values from the first row in the window frame. The first value of the first row is accessed with a window frame extent ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
LAST_VALUE This allows to access values from the last row in the windows frame. The extent ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING should be used with LAST_VALUE.

Here is an example for offset functions;

  1. USE AdventureWorks2012
  2. GO
  3.  
  4. -- Create a view for forming a set that contains sales amount for years and months.
  5. CREATE VIEW dbo.SalesView
  6.     AS
  7.     SELECT YEAR(h.OrderDate) OrderYear, MONTH(OrderDate) OrderMonth
  8.         , SUM(h.SubTotal) Total
  9.     FROM Sales.SalesOrderHeader h
  10.     GROUP BY YEAR(h.OrderDate), MONTH(OrderDate)
  11. GO
  12.  
  13.  
  14. SELECT
  15.     OrderYear, OrderMonth, Total
  16.     -- Getting previous month value for the year
  17.     , LAG(Total, 1, 0) OVER (PARTITION BY OrderYear ORDER BY OrderMonth) PreviousMonthTotal
  18.     -- Getting next month value for the year
  19.     , LEAD(Total, 1, 0) OVER (PARTITION BY OrderYear ORDER BY OrderMonth) NextMonthTotal
  20.     -- Getting first month value for the year
  21.     , FIRST_VALUE(Total) OVER (PARTITION BY OrderYear ORDER BY OrderMonth ROWS BETWEEN UNBOUNDED PRECEDING
  22. AND CURRENT ROW) FirstMonthValue
  23.     -- Getting last month value for the yea
  24.     , LAST_VALUE(Total) OVER (PARTITION BY OrderYear ORDER BY OrderMonth ROWS BETWEEN CURRENT ROW
  25. AND UNBOUNDED FOLLOWING) LastMonthValue
  26. FROM dbo.SalesView
  27. ORDER BY 1, 2

image

Monday, February 17, 2014

UNION ALL is possible. How about INTERSECT ALL and EXCEPT ALL?

Not expecting duplicates but a significant improvement in performance, we tend to use UNION ALL when no duplicates are guaranteed in combining two sets. The reason is, UNION ALL does not performing an additional task for filtering (or removing) duplicates, hence gives a better performance than UNION. However, the purpose of UNION ALL is not for improving the performance but for producing the result with duplicates if exist. While this was being taught during my classes, a thought came into my mind: Why other set operators such as INTERSECT and EXCEPT do not offer the same built-in functionality like INTERSECT ALL and EXCEPT all?

A usual quick search did not give me the required answer but proved that it is not available as a built-in functionality. There were few possible solutions but the one given by T-SQL expert Itzik Ben-Gan was quite interesting. Here is a small note on INTERSECT and EXCEPT and implementation suggested by Itzik Ben-Gan for INTERSECT ALL and EXCEPT ALL.

INTERSECT and INTERSECT ALL
INTERSECT allows us to retrieve only rows that are available in both two sets. This removes duplicates if found. It basically behaves as DISTINCT INTERSECT. If duplicates to be included to the result, a workaround is required and it can be easily done with ROW_NUMBER ranking function. All required columns need to be added under PARTITION BY clause and SELECT < constant> needs to be added for ODER BY clause for instructing SQL Server to not consider the order of the window. This generates aN unique number of all records, making them available in the final resultset of INTERSECT. Here is a sample code for this;

  1. USE tempdb
  2. GO
  3.  
  4. -- creating table 1
  5. CREATE TABLE dbo.InternetSales
  6. (
  7.     SaleDate datetime NOT NULL
  8.     , Product varchar(100) NOT NULL
  9.     , Amount money NOT NULL
  10. )
  11. GO
  12.  
  13. -- creating table 2
  14. CREATE TABLE dbo.ResellerSales
  15. (
  16.     SaleDate datetime NOT NULL
  17.     , Product varchar(100) NOT NULL
  18.     , Amount money NOT NULL
  19. )
  20. GO
  21.  
  22. -- inserting values for table 1
  23. INSERT INTO dbo.InternetSales
  24.     (SaleDate, Product, Amount)
  25. VALUES
  26.     ('01/01/2014', 'A', 100)
  27.     , ('02/01/2014', 'B', 100)
  28.     , ('03/01/2014', 'B', 100)
  29.     , ('04/01/2014', 'B', 100)
  30.     , ('05/01/2014', 'C', 100)
  31.     , ('05/01/2014', 'C', 100)
  32.  
  33. -- inserting values for table 2
  34. INSERT INTO dbo.ResellerSales
  35.     (SaleDate, Product, Amount)
  36. VALUES
  37.     ('01/01/2014', 'A', 100)
  38.     , ('02/01/2014', 'B', 100)
  39.     , ('03/01/2014', 'B', 100)
  40.     , ('04/01/2014', 'D', 100)
  41.     , ('05/01/2014', 'D', 100)
  42.  
  43. -- checking INTERSECT, will return only A, B
  44. -- If duplicates were included, should return A, B, B
  45. SELECT Product FROM dbo.InternetSales
  46. INTERSECT
  47. SELECT Product FROM dbo.ResellerSales
  48.  
  49. -- Adding ROW_NUMBER for creating unique numbers
  50. -- And using CTE for removing the number
  51. -- This returns A, B, B
  52. WITH cte
  53. AS
  54. (
  55.     SELECT
  56.         ROW_NUMBER() OVER (PARTITION BY Product ORDER BY (SELECT 0)) AS number
  57.         , Product
  58.     FROM dbo.InternetSales
  59.     INTERSECT
  60.     SELECT
  61.         ROW_NUMBER() OVER (PARTITION BY Product ORDER BY (SELECT 0)) AS number
  62.         , Product
  63.     FROM dbo.ResellerSales
  64. )
  65. SELECT Product FROM cte;

EXCEPT and EXCEPT ALL
EXCEPT returns all distinct rows found in “left” (or first) set that are not found in “right” (or second) set. This discards duplicates too. If duplicates are required, same workaround can be applied. Here is the code for it;

  1. -- checking EXCEPT, will return only C
  2. -- If duplicates were included, should return B, C, C
  3. SELECT Product FROM dbo.InternetSales
  4. EXCEPT
  5. SELECT Product FROM dbo.ResellerSales
  6.  
  7. -- Adding ROW_NUMBER for creating unique numbers
  8. -- And using CTE for removing the number
  9. -- This returns B, C, C (Duplicates)
  10. WITH cte
  11. AS
  12. (
  13.     SELECT
  14.         ROW_NUMBER() OVER (PARTITION BY Product ORDER BY (SELECT 0)) AS number
  15.         , Product
  16.     FROM dbo.InternetSales
  17.     EXCEPT
  18.     SELECT
  19.         ROW_NUMBER() OVER (PARTITION BY Product ORDER BY (SELECT 0)) AS number
  20.         , Product
  21.     FROM dbo.ResellerSales
  22. )
  23. SELECT Product FROM cte;
  24.  
  25. -- Cleaning
  26. DROP TABLE dbo.InternetSales
  27. DROP TABLE dbo.ResellerSales

Sunday, February 16, 2014

Best way to find records exist in one set that do not appear in other set: NOT IN | LEFT OUTER | EXCEPT

Finding records exist only in one set that do not exist in another is a common requirement in database developments. Finding customers who have not placed orders, finding products that have not been purchased by any customers are general examples for it. There are many difference ways of obtaining the required record set and the mentioned methods/operators in the title are the commonly used ones for achieving this. However which gives the better performance is questionable, hence let’s analyze them and see.

Let’s try to get all products that have not been purchased by customers from AdventureWorks database. Have a look on all three SELECTs. They return product ids from Production.Product table which are not exist in Sales.SalesOrderDetail table.

  1. USE AdventureWorks2012
  2. GO
  3.  
  4. -- Finding unsold products using NOT IN
  5. SELECT p.ProductID
  6. FROM Production.Product p
  7. WHERE p.ProductID NOT IN (SELECT ProductID from Sales.SalesOrderDetail)
  8. ORDER BY 1
  9.  
  10. -- Finding unsold products using LEFT OUTER JOIN
  11. SELECT p.ProductID
  12. FROM Production.Product p
  13.     LEFT OUTER JOIN Sales.SalesOrderDetail d
  14.         ON p.ProductID = d.ProductID
  15. WHERE d.ProductID IS NULL
  16. ORDER BY 1
  17.  
  18. -- Finding unsold products using EXCEPT
  19. SELECT p.ProductID
  20. FROM Production.Product p
  21. EXCEPT
  22. SELECT ProductID
  23. FROM Sales.SalesOrderDetail
  24. ORDER BY 1

All three produce the same result;

image

Let’s analyze the execution plans of all three. Note the “Query Cost – Relative to the batch” too.

image

As you see, plans for NOT IN and EXCEPT are same and the performance of them are good but LEFT OUTER is different. LEFT OUTER has used “Merge Join” whereas other two have used “Nested Loops” which is low cost join. In this scenario, LEFT OUTER does not offer much benefits but NOT IN and EXCEPT give better performance.

However, this behavior is not guaranteed with all scenario hence we cannot conclude that NOT IN and EXCEPT provide better performance rather than LEFT OUTER. This is totally depend on factors such as index availability and number of records. Therefore, best way is, trying with all ways and pick the best for the situation. In addition to the mentioned methods, there are few more popular ways such as NOT EXISTS and OUTER APPLY. All these can be used for retrieving the required result, however, as mentioned above, best way can be determined only by trying the same with all the ways.