Tuesday, December 28, 2010

Finding Gaps in Numbers

During a recent project I had to find gaps in a table’s based on the table’s primary key which was also an identity.  I came up with two different solutions that provide the missing numbers and identical execution plans. 

The below code will create the temporary table that will be used in each example:

USE tempdb;
GO
CREATE TABLE #temp(
ID     INT PRIMARY KEY
);
GO
DECLARE @id INT
SET
@id = 1000
WHILE @id > 0
BEGIN
PRINT
@id
 
IF @id%2 <> 0
  
INSERT #temp
  
VALUES(@id)
  
SET @id = @id - 1
PRINT @id
 
END;
GO

This code creates a temporary table with a single column and populates it with values from 1 to 1000.  The “%” sign is the modulo operator and, in this case, will only insert a number if it is not divisible by 2, the result is only odd numbers, 1, 3, 5, etc., are inserted.

The first solution uses a common table expression along with the ROW_NUMBER() function that creates a contiguous list of “ID”s based on the ID column in the temporary table:

--Query using a common table expression
WITH cte
AS (
SELECT ROW_NUMBER() OVER(ORDER BY r.ID) AS rownumberFROM #temp r
)
SELECT CASE
     
WHEN r.ID < 2 AND r.ID IS NULL THEN
      
1
     
WHEN r.ID IS NULL THEN
      
rr.rownumber
     
ELSE r.ID
     
END AS 'MissingID'
FROM #temp r RIGHT OUTER JOIN cte rr
ON r.ID = rr.rownumber
WHERE r.ID IS NULL

The “rownumber” column provides a contiguous range of numbers ordered by the ID column of the temporary table.  The common table expression is joined to the temporary table using a RIGHT OUTER JOIN which will return a NULL value for all ID’s that are missing, in this case all even numbers.  The CASE statement is used to evaluate if the temporary tables ID is NULL and if so then the common table expressions rownumber is returned, if the value is not NULL then the ID is returned.  The WHERE clause filters the results by returning only rows where the temporary tables ID IS NULL, gaps in the contiguous range.

***The first assessment done in the case statement is used to evaluate if the ID column is missing “1”

The second solution uses a sub query that creates a derived table, again using the ROW_NUMBER() function to create the contiguous “ID”s.

--Query using a sub-query creating a derived table
SELECT CASE
      
WHEN r.ID < 2 AND r.ID IS NULL THEN 1
      
WHEN r.ID IS NULL THEN rr.rownumber
      
ELSE r.ID
     
END AS 'MissingID'
FROM #temp r RIGHT JOIN
(SELECT ROW_NUMBER() OVER(ORDER BY innerr.ID) AS rownumber
     
FROM #temp innerr) rr
ON r.id = rr.rownumber
WHERE r.ID IS NULL

The dynamics are the same as the query using the common table expression. 

The execution plans for each query are identical so the deciding factor would be preference:

image

2 comments:

  1. Hi David,

    I've been looking at similar problems recently. The tally table style approach you're taking is really cool.

    One note about this particular tactic, where you're generating the tally table on the fly by using the rownumber of the fact table itself-- it naturally implies that it will only returns results if the number of gaps you have in your data is less than the number of rows in the table itself.

    To demonstrate, insert this final row into your table, and create a large gap:


    INSERT #temp(id) VALUES(9999)

    You'll only find missing rows up to 500, because of the limited number of rows in #temp.

    If you build a fixed tally table with a larger number of rows, you won't hit this issue.

    In starting to work with tally tables I have been trying to determine the number of rows to populate by default. I'm leaning toward billions, because I sometimes need to work with quite granular data. Of course it does then help to make sure you're running a seek against the tally table whenever possible.

    I enjoyed reading this, thanks for posting!

    ReplyDelete
  2. Kendra,
    Thanks for the respose and an excellent point!! I am hoping to post a worthy solution to for such cases.

    Thanks again.

    ReplyDelete