Andrew

« MSVC rolls over | Main | UTF-8 Conversion Tricks »

Breaking SuperFastHash (and chipping lookup3)

Posted on March 29, 2007 3:31 AM

After the problems SuperFastHash had in Hash Algorithm Attacks, I decided to try and break it completely, i.e. generate collisions algorithmically instead of brute forcing them. The attempt was more successful than I had anticipated, although Paul is obviously aware of the weak mixing in the final bits as evidenced by his comment in the source code, "Force 'avalanching' of final 127 bits". My favorite collisions encountered would have to be "10/4 < 3", "10/5 = 2", and "10/6 > 1", which have the property of hashing to the same value while being mathematically correct!

As I was writing this, I came up with a way to attack Bob Jenkins' lookup3 as well. Unlike SuperFastHash, the lookup3 attack is due to the way the input bytes are being read in and does not indicate a deficiency in the mixing itself. If you are using lookup3 with hash tables, the core function will still be quite safe; it will only need to be modified if you are using it to generate unique 64bit identifiers and the input data could be altered for a malicious purpose.

With that said, let's look at the attacks:

SuperFastHash

To begin with, I separated the innerloop in to its 4 steps:

1:	hash  += get16bits (data);
2:	tmp    = (get16bits (data+2) << 11) ^ hash;
3:	hash   = (hash << 16) ^ tmp;
4:	hash  += hash >> 11;

and then ran a couple of known duplicate strings through and printed out the value at each step (input bytes are only consumed in steps 1 and 2):

"aaaaaaaa" vs "aaadadaf"

aa 1: 00006169 aa 2: 030b6969 3: 62626969 4: 626eb5b6 aa 1: 626f1717 aa 2: 61641f17 3: 76731f17 4: 7681ed7a

aa 1: 00006169 ad 2: 03236969 3: 624a6969 4: 6256b2b6 ad 1: 62571717 af 2: 61641f17 3: 76731f17 4: 7681ed7a

"ifkzihfe" vs "igdqhtfp"

if 1: 00006671 kz 2: 03d33e71 3: 65a23e71 4: 65aef2b8 ih 1: 65af5b21 fe 2: 66846b21 3: 3da56b21 4: 3dad1fce

ig 1: 00006771 dq 2: 038b4771 3: 64fa4771 4: 6506e6b9 ht 1: 65075b21 fp 2: 66846b21 3: 3da56b21 4: 3dad1fce

We can see that after round 2 step 1 (2:1) the lower 16 bits of hash are the same, and that after (2:2) all 32 bits of tmp are the same. Looking at step 3 reveals how this creates a collision: hash = (hash << 16) ^ tmp;. The upper 16 bits of hash are thrown away, leaving the lower 16 bits (which are the same in the observed collisions) and 32 bits of tmp (which is the same in the observed collisions). Thus if after an initial 4 bytes we can find 2 bytes that create the same lower 16 bits in hash at (2:1), and then find the remaining 2 bytes that create the same 32 bits in tmp at (2:2), we have generated a colliding 8 byte string.

This can be further generalized to any string that's a multiple of 8 bytes. If, from an initial hash value of X (the length of the string for SuperFastHash), you can find 8 bytes that collide with hash Y, then you simply use Y as the initial hash value and re-work the problem for the next 8 bytes, and so on. This works because SuperFastHash has very poor mixing for last 8 bytes that have been hashed and does not use it's avalanching fix-up until it has reached the end of the input.

Attack

The attack was fairly straightforward once I identified what needed to be done. After a couple of revisions and refinements, I came up with:

  1. Take a source string S that has a length which is a multiple of 8 bytes
  2. Hash S and for each 8 byte sequence, find the values of hash and tmp at steps (2:1) and (2:2)
  3. Hash every permutation of the first 4 bytes of your attack string:
    • Find bytes 5 and 6 for each permutation which generate the lower 16 bits for the constant in (2:1).
    • Finally find bytes 7 and 8 for which generate the constant in (2:2). If this succeeds, either print the hit or process another 8 bytes until you reach the target length for your source string

This will probably not uncover every possible duplicate, especially for keys longer than 8 bytes, but it ran fast enough and generated enough duplicates that I did not need to refine it further.

Results

When using an 8 byte source string with all possible characters as input ( bytes 0x00 to 0xff ), ~130,000,000 colliding strings can be found in around 3 minutes. When you restrict the character set to printable charactersI (symbols, numbers, uppercase, and lowercase), anywhere from 10,000 to 200,000 (possibly higher) can be found in a few seconds.

Using a random initial value (keying the hash) reduces the collisions, but does not alleviate them entirely. For example, 108,600 strings were generated that hash to the same value as "zzzzzzzz". When run through a hashtable insertion test with 131072 buckets using "0xdeadbeef + length" as the initial value instead of just "length", there were still 3,374,795 compares due to full hash collisions, and the largest bucket had 376 links. By comparison, Bob Jenkins' lookup3 had 1 compare and 8 links in it's largest bucket, and the x31 variant of the Torek/DJB hash had 0 compares and 6 links in it's largest bucket. A small win is that increasing the key length does reduce the collisions with a keyed hash: 100,000 strings hashing to the same value as "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz" resulted in only 500,000 compares with the largest bucket containing 158 links.

lookup3

The lookup3 attack is very simple and I actually feel dense for not having seen it until now. This is the basic 32bit little endian algorithm (32bit aligned reads) with input lengths assumed to be 24 or larger and a multiple of 12:

1:	a = b = c = ( 0xdeadbeef + len )

2:	for ( ; len > 12; len -= 12, key += 12 ) {
		a += ( key[0] )
		b += ( key[4] )
		c += ( key[8] )
		mix( a, b, c )
	}

3:	c += ( key[8] )
	b += ( key[4] )
	a += ( key[0] )

4:	final( a, b, c )
	
	return ( c );

mix() and final() mix every bit in a, b, and c very thoroughly, so trying to attack lookup3 algorithmically would be quite futile (for me). However, because every 12 bytes, or 96 bits, of input directly alters every bit of a, b, and c, there is a shortcut:

  1. Take any input string and run it through lookup3; record the values of a, b, and c after Step 3
  2. Take your attack string and pad it to be a multiple of 12 bytes long, then add an additional 12 bytes
  3. Run your attack string through lookup3 and stop after Step 2. We should now have the additional 12 bytes remaining to hash
  4. Construct the final 12 bytes so that key[ 8 ] = ( target.c - c ), key[ 4 ] = ( target.b - b ), and key[ 0 ] = ( target.a - a ). The internal state will now match that of the target string before the call to final(), guaranteeing a collision

Now, this attack relies completely on knowing what a, b, and c are initialized to at the start of the hash. If you are using lookup3 for hash tables, you should already be initializing a, b, and c to random values to defeat bucket attacks, i.e. attacks searching for keys where the lower 15 to 20 bits match. Even the best algorithm is vulnerable to bucket attacks, so choosing a random initial value should be mandatory no matter what.

However, if you're using lookup3 for something like a 64bit unique id or a file checksum, your initial state will need to be static and thus open to attack with this method. I'm not sure what you could do to get around this safely while retaining lookup3's high speed; adding another 32 bits of state that wasn't directly altered by the input may help, but where to stick it and how to mix it in isn't something I couldn't guess at. I've emailed Bob, but I don't know if deliberate attacks like this are something he is concerned with. The lookup3 source does state Use for hash table lookup, or anything where one collision in 2^^32 is acceptable. Do NOT use for cryptographic purposes., so we'll see.

Conclusions

I already knew SuperFastHash had some peculiar results from my previous tests, and the outcome of this experiment drove the point home. While the offending collisions will not be common (what pathological input is?), the fact that they exist, were so readily obtained, and were still somewhat evident when changing the initial value of the hash suggests that it is probably best to start over from scratch. I should have noticed a problem sooner when SuperFastHash was running in to light collisions in When bad hashing means good caching, and that wasn't even an intentional attack.

As far as lookup3, the trivial collisions are disturbing, but they are only a problem if an attacker can craft input for the hash function, and then only when you are using it to generate unique ids. It would be nice to see the issue addressed, though, as lookup3 is still unrivaled in terms of mixing and actually faster than SuperFastHash in some cases.

Source

You may download the SuperFastHash attack source and try it out for yourself. It comes preloaded with attacks against "********", yielding 194 collisions, and "10/5 = 2", yielding my favorite set of collisions. If you want to run it with all possible characters as input, make sure to turn the debug printf off or redirect output to a file; it can take a while to end process a program that has a lot of \x07 bells queued up.

TrackBack Info

TrackBack URL for this entry:
http://www.team5150.com/~andrew/blog/trackback.cgi/7

Comments (4)

Andrew --

That's a couple of interesting postings.

I'm thinking about a variant on lookup3() partly driven by your attack and partly driven by a desire to propagate small amounts of noise more quickly before they can be destroyed.

Instead of using:

a += data[0]
b += data[1]
c += data[2]

I'm looking at:
t = data[0]; a += t; t = rotate(t, const1); c ^= t;
t = data[1]; b += t; t = rotate(t, const2); a ^= t;
t = data[2]; c += t; t = rotate(t, const3); b ^= t;

This would defeat the approach you describe here. But do you see an obvious attack on this sort of approach?

You might also want to check out Google Groups' Hash Functions group: http://groups.google.com/group/hash-functions?lnk=srg

G'luck,
Cs

I embarrassingly found your comment when I saw your post linking here on the Hash Functions group. I'm not used to having actual readers, let alone commenters.

I'm not really an expert on breaking things (I had trouble re-understanding this article at first), but I did manage to somewhat compromise your approach.

If you take A, B, and C to be the target values after your modified step 3, and a, b, and c to be the arbitrary values you want to morph, then:

  • data[0] = ( A ^ rotate(data[1],rot2) ) - a
  • data[1] = ( B ^ rotate(data[2],rot3) ) - b
  • data[2] = ( C - ( c ^ rotate(data[0],rot1) ) )

You can substitute down to:

  • data[0] = ( ( A ^ rot( ( ( B ^ rot( ( C - ( c ^ rot( data[0], rot1 ) ) ), rot3 ) ) - b ), rot2 ) ) - a )

From here, you just test every value see if it remains the same after being fed through the formula. If you find a value that matches, use it as data[0] and deduce data[1] and data[2]. This method usually only results in 2-3 collisions over the entire search space for data[0] so far, and can result in 0 if you're unlucky. With real world data, you could just keep extending your bogus data by 12 bytes until you hit a match.

Here's the simple test code I used. It finds two different collisions with the current settings:

http://www.team5150.com/~andrew/blog/files/csattack.cpp

I'm not sure if you're still interested in this or not, but you might enjoy playing with MurmurHash - http://murmurhash.google.com. I'm not sure if it would be hard to break or not - it has obvious differentials, but I don't know if you can use that information to generate more than 2 colliding keys.

-Austin

I've looked at MurmurHash but haven't done anything with it yet. I actually have a hard time understanding how the bits are mixing in hash functions and how weak functions are exploited, so I can usually only do a simple brute-force approach to generate collisions from specific keys.

It would be interesting to see how colliding keys upheld when the hash was keyed (initialized to a random seed at runtime). SuperFastHash will still exhibit mild collisions with a random seed and colliding keys while lookup3 will revert to completely random behavior.

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)