Do NOT use CHECKSUM or BINARY_CHECKSUM as the sole comparison in SQL; use HASHBYTES instead

We had a process that used BizTalk to take in a reference file, then send the data to a stored procedure via a table-type parameter.  The stored procedure uses a MERGE statement to insert and update; the update branch compares CHECKSUM values first to determine whether an update is even necessary.
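
Here is a minimal sketch of the pattern (the table, table type, and column names are invented for illustration, not our actual schema):

MERGE dbo.RefData AS tgt
USING @IncomingRows AS src   -- the table-type parameter
    ON tgt.RefKey = src.RefKey
WHEN MATCHED
    -- the risky part: CHECKSUM alone decides whether anything changed
    AND CHECKSUM(tgt.Col1, tgt.Col2) <> CHECKSUM(src.Col1, src.Col2)
    THEN UPDATE SET tgt.Col1 = src.Col1, tgt.Col2 = src.Col2
WHEN NOT MATCHED BY TARGET
    THEN INSERT (RefKey, Col1, Col2)
         VALUES (src.RefKey, src.Col1, src.Col2);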

Then it happened: we had a reference row that should have been updated by the latest file, but it wasn't.  We checked the BizTalk message coming into and going out of the pipeline; everything looked normal.  Then we took out the CHECKSUM comparison, and the update succeeded.

Reading over the Microsoft documentation, I was absolutely shocked to find this in the Remarks section:

"However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change."

http://msdn.microsoft.com/en-us/library/ms189788%28v=SQL.100%29.aspx

Then I checked the page for BINARY_CHECKSUM; a similar remark exists there.  http://msdn.microsoft.com/en-us/library/ms173784%28v=SQL.100%29.aspx

According to these two articles, we should be using HASHBYTES() instead.  http://msdn.microsoft.com/en-us/library/ms174415%28v=SQL.100%29.aspx
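
For example, the change-detection comparison from the sketch above could be rewritten along these lines (a sketch only; it assumes Col1 and Col2 are character columns, and it concatenates them with a '|' delimiter, which the comments below discuss):

-- Rows that actually changed, detected with HASHBYTES instead of CHECKSUM.
SELECT src.RefKey
FROM @IncomingRows AS src
JOIN dbo.RefData AS tgt ON tgt.RefKey = src.RefKey
WHERE HASHBYTES('SHA1', src.Col1 + '|' + src.Col2)
   <> HASHBYTES('SHA1', tgt.Col1 + '|' + tgt.Col2);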

Personally, I was really shocked to find a function in any database that "most of the time" does its job.  I really do not know of any system that tolerates "occasional" mistakes, or maybe it's that I haven't worked in enough industries.  I equate this to getting a notice on your bank statement saying "we might not have all your transactions listed, and we don't know for sure whether we have them all".

I love quantum physics, but not in my computer systems.


[Update 2012-10-19]

I updated the title of this post to give more clarity to what I was trying to say.  You can read the comments below to follow the discussion.  I'm really fortunate that so many smart people decided to participate in this discussion to make this post better.

Here is an article that articulates the original idea of this post:  http://www.bidn.com/blogs/TomLannen/bidn-blog/2265/using-hashbytes-to-compare-columns 

Posted on Friday, May 20, 2011 8:50 PM

Feedback

# re: Do NOT trust CHECKSUM or BINARY_CHECKSUM in SQL, use HASHBYTES instead

Left by Jon of All Trades at 9/22/2011 10:26 AM
Any hash function has a chance of collision; that's just mathematics. Unfortunately CHECKSUM()'s hash function is not very good, so its chance is relatively high.

One option to consider: use CHECKSUM() to scan for changes during near-realtime updates, when you need speed, and fall back on slow but sure field-by-field comparisons during daily updates.

# re: Do NOT trust CHECKSUM or BINARY_CHECKSUM in SQL, use HASHBYTES instead

Left by NiteFrog at 11/19/2011 12:17 AM
I have to agree with what you are saying. I am working on a system where I need to sync data across networks, and I don't want to move data I don't need to. Using HASHBYTES is a lifesaver, as it gives me a true reading of which rows have actually changed in the database and need to be pushed back out to the target systems. Great article, thanks.

I am so sick of developers who do just good enough; it is time to raise the bar.

# re: Do NOT trust CHECKSUM or BINARY_CHECKSUM in SQL, use HASHBYTES instead

Left by ron at 5/17/2012 5:54 PM
I think when you use CHECKSUM you should also use some other value with it for comparison, if you need to do an incremental load, as I was having a problem with the same checksum for different column values.

# re: Do NOT trust CHECKSUM or BINARY_CHECKSUM in SQL, use HASHBYTES instead

Left by Steve at 6/2/2012 1:12 PM
CHECKSUM and BINARY_CHECKSUM return 32-bit values, so what do you expect?

No matter how good a 32-bit hash function is, the probability of a collision among N items reaches about 50% when N is about the square root of the size of the hash space. So in this case roughly 64K items means that a collision is more likely than not.
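
In numbers: the standard birthday approximation gives p(collision) ≈ 1 − exp(−N²/(2·2^32)). With N = 65,536 the exponent is exactly 1/2, so p ≈ 1 − exp(−1/2) ≈ 0.39, and p crosses 50% near N ≈ 77,000.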

HASHBYTES exists in order to provide larger hashes with correspondingly fewer collisions.

# re: Do NOT trust CHECKSUM or BINARY_CHECKSUM in SQL, use HASHBYTES instead

Left by tobi at 6/11/2012 9:24 AM
Looks like you don't understand what a hash function is. This blog post is misleading for others.

# re: Do NOT trust CHECKSUM or BINARY_CHECKSUM in SQL, use HASHBYTES instead

Left by LifeLongTechie at 6/11/2012 9:54 AM
Tobi,

Can you elaborate? Some of the comments left by others have more details and can help readers learn more about this topic.

I'd love to learn about the scenarios where CHECKSUM is useful, like the one "Jon of All Trades" proposed: use it when the system needs speed. I think a good business case will help readers of this post decide whether they should be using these functions.

# re: Do NOT trust CHECKSUM or BINARY_CHECKSUM in SQL, use HASHBYTES instead

Left by ToddM at 10/19/2012 12:08 PM
Agreed - this post is VERY misleading. CHECKSUMs are useful for one main purpose - indexing. Say you have a column containing a large chunk of text, and you frequently want to use it in equality checks, but indexing this column is not practical because of the amount of space it would consume. Add a CHECKSUM column and index that instead. We know that two chunks of text whose CHECKSUMs are equal have a very high probability of being equal, but that's not guaranteed. So always do your comparisons on both the CHECKSUM column (so SQL can take advantage of the index) AND the text column itself (to guarantee equality). Now your queries are fast and guaranteed accurate, and your index storage is low. Best of all worlds.
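
A sketch of that pattern (dbo.Documents and the other names are invented for illustration):

-- Persisted computed checksum of the large text column, then indexed.
ALTER TABLE dbo.Documents
    ADD BodyChecksum AS CHECKSUM(BodyText) PERSISTED;
CREATE INDEX IX_Documents_BodyChecksum ON dbo.Documents (BodyChecksum);

-- Compare on BOTH columns: the checksum gets the index seek,
-- the full column guarantees true equality.
DECLARE @SearchText varchar(max) = 'some long chunk of text';
SELECT DocumentId
FROM dbo.Documents
WHERE BodyChecksum = CHECKSUM(@SearchText)
  AND BodyText = @SearchText;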

# re: Do NOT trust CHECKSUM or BINARY_CHECKSUM in SQL, use HASHBYTES instead

Left by LifeLongTechie at 10/19/2012 1:09 PM
Todd,

You can index a HASHBYTES() column, too, but CHECKSUM() is faster to calculate.

Consider the scenario where you run a batch load process at night, when resources and time are more abundant. I'd rather incur the cost at that time: add a HASHBYTES() column and index it while loading the data. Then during business hours, when we need to find whether there is a duplicate value, I can just do a single-column lookup against an index.
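
A sketch of what I mean, reusing the invented names from the post above (assumes character columns):

-- Computed once during the nightly load, then indexed.
ALTER TABLE dbo.RefData
    ADD RowHash AS HASHBYTES('SHA1', Col1 + '|' + Col2) PERSISTED;
CREATE INDEX IX_RefData_RowHash ON dbo.RefData (RowHash);

-- During business hours, a duplicate check is a single indexed lookup.
DECLARE @Col1 varchar(100) = 'abc', @Col2 varchar(100) = 'def';
SELECT RefKey
FROM dbo.RefData
WHERE RowHash = HASHBYTES('SHA1', @Col1 + '|' + @Col2);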

Here is a post I found that articulates what I was trying to say, but better. http://www.bidn.com/blogs/TomLannen/bidn-blog/2265/using-hashbytes-to-compare-columns

My goal in the original post was to warn people about the danger of CHECKSUM() generating duplicate values, because some people rely on that value alone to compare data rows. Your way of implementing it would definitely get around that limitation. I wish that could be added to the MSDN documentation; then I would be satisfied.

# re: Do NOT use CHECKSUM or BINARY_CHECKSUM as the sole comparison in SQL; use HASHBYTES instead

Left by Dony at 9/11/2013 10:21 AM
Although HASHBYTES has a smaller chance of collisions than BINARY_CHECKSUM, neither of these functions guarantees a unique value for every (long) character or binary string. In my humble opinion, you could add an indexed column that contains a persisted computed checksum of the original column value, and then add this column to the join condition to speed up the MERGE, but you should never omit the original full-size column from the join. Maybe you'll have to force the engine to use the index that covers the checksum column (MERGE statements often have sub-optimal query plans) to limit the number of rows to scan. Since the number of possible strings will always exceed the number of possible checksum or hash values, it is simply impossible to have a unique hash value for every (long) string.
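
For example (invented names; assumes KeyChecksum is an indexed, persisted checksum of LongKey on both sides):

MERGE dbo.Target AS tgt
USING dbo.Staging AS src
    ON  tgt.KeyChecksum = src.KeyChecksum  -- indexed, narrows the candidate rows
    AND tgt.LongKey = src.LongKey          -- full column, never omitted
WHEN MATCHED THEN
    UPDATE SET tgt.Payload = src.Payload;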

# re: Do NOT use CHECKSUM or BINARY_CHECKSUM as the sole comparison in SQL; use HASHBYTES instead

Left by Tanveer Haider at 3/3/2015 6:46 AM
I think every algorithm can have collisions. I believe it is about the data.

What is your opinion about the following? HASHBYTES is not able to distinguish these two column values:

SELECT HASHBYTES('SHA', '121' + '34'), HASHBYTES('SHA', '12' + '134'), BINARY_CHECKSUM('121', '34'), BINARY_CHECKSUM('12', '134');

# re: Do NOT use CHECKSUM or BINARY_CHECKSUM as the sole comparison in SQL; use HASHBYTES instead

Left by kenzo at 4/28/2015 2:35 PM
I add a delimiter between columns when calculating the HASHBYTES to prevent this problem, e.g.

'121' + '34' becomes '|' + '121' + '|' + '34' + '|'

which therefore does not hash the same as

'|' + '12' + '|' + '134' + '|'
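
A quick way to see the difference (sketch):

SELECT HASHBYTES('SHA1', '|' + '121' + '|' + '34' + '|'),  -- hashes '|121|34|'
       HASHBYTES('SHA1', '|' + '12' + '|' + '134' + '|');  -- hashes '|12|134|'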

# re: Do NOT use CHECKSUM or BINARY_CHECKSUM as the sole comparison in SQL; use HASHBYTES instead

Left by Thomas Franz at 6/3/2015 10:43 AM
HASHBYTES supports only inputs up to 8,000 bytes. So if you are storing files (e.g. pictures / PDFs / mail attachments) in a varbinary(max) column and want to prevent duplicates, then you have to use BINARY_CHECKSUM.

And of course you should at least compare the DATALENGTH() of files with the same checksum. If both are equal, there is a very high chance that it is the same file, but to be 100% sure you have to compare the files themselves as a third step.
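
A sketch of those three steps (invented names; assumes dbo.Files stores BINARY_CHECKSUM(FileData) in a FileChecksum column):

DECLARE @NewFile varbinary(max) = 0x0102;  -- the incoming file's bytes
SELECT f.FileId
FROM dbo.Files AS f
WHERE f.FileChecksum = BINARY_CHECKSUM(@NewFile)     -- step 1: cheap checksum filter
  AND DATALENGTH(f.FileData) = DATALENGTH(@NewFile)  -- step 2: same length
  AND f.FileData = @NewFile;                         -- step 3: byte-for-byte comparison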

# re: Do NOT use CHECKSUM or BINARY_CHECKSUM as the sole comparison in SQL

Left by R Hedaoo at 9/26/2017 4:57 AM
Check the query below and you will get the answer:
SELECT CHECKSUM('s1%2'), CHECKSUM('sah2'), BINARY_CHECKSUM('s1%2'), BINARY_CHECKSUM('va!r');


Copyright © Kevin Shyr
