Andrew

March 2007 Archives

When Bad Hashing Means Good Caching

Posted on March 7, 2007 1:54 AM

I was testing various string hashing algorithms on chained hash tables, primarily to look at the bucket distribution and number of key comparisons with both prime and power of 2 sized tables. Each table node was set up to remember it's full hash value so bucket collisions would only drop to a key comparison on a true key collision. I initially wasn't concerned with run times, but I tacked on a timer anyway so I could get a quick metric on how collisions and distribution were affecting performance and wound up running in to a rather odd situation. For the purposes of this article, the tables were also pre-sized before inserting keys to avoid resizing overhead.

While I was testing many algorithms, only two are needed for this article: The string hash from STLPort 4.6.2 and Paul Hsieh's SuperFastHash. STLPort's hash is simply "hash = ( hash * 5 ) + string[ i ]" which should be fairly fast but produce weak distribution on large amounts of clustered keys. SuperFastHash should also be quite fast while having much stronger hashing no matter the key size or clustering.

The dataset that caused the problem was a simple 6 character sequential permutation in the form of:

[ABCDEFGH][EFGHIJKL][IJKLMNOP][MNOPQRST][QRSTUVWX][UVWXYZ01]

e.g.

AEIMQU
AEIMQV
AEIMQW
AEIMQX
AEIMQY
AEIMQZ
AEIMQ0
AEIMQ1
AEIMRU
AEIMRV
AEIMRW
AEIMRX
....
HLPTWZ
HLPTW0
HLPTW1
HLPTXU
HLPTXV
HLPTXW
HLPTXX
HLPTXY
HLPTXZ
HLPTX0
HLPTX1

(262144 unique keys in all)

I was hoping to ferret out the weaker algorithms with this dataset and boy did I ever. Almost all the stats were what I expected: STLPort had ~10x bucket density for both prime/pow2 table sizes (i.e. an average of 10 extra links per filled bucket), SuperFastHash had 0.77x for prime and 0.64x for pow2; STLPort had 1.5 million key compares due to full key hashes colliding, SuperFastHash had 0.006 million (6000). Now this is where it got weird: The STLPort powered hash table ran ~55% faster on my AMD64 3200+ than the SuperFastHash powered hash table. The results were similar on my AMD 1.1ghz TBird with the STLPort hash table running ~36% faster than SuperFastHash. The raw difference on both CPUs was around the same, 90ms vs 140ms on the AMD64 and 190ms vs 260ms on the Tbird. Something was definitely up.

Initial profiling was not very helpful. The STLPort hash algorithm ran faster than SuperFastHash, but the total difference was a few milliseconds at most even though SuperFastHash's weakness is in smaller keys where it can't take advantage of it's streamlined innerloop. As far as the Insert method for the hash table, the 1.5 million string compares took ~17ms which gave SuperFastHash a slight advantage when checking for duplicates, but even when I stripped out the duplicate checking code (all the keys were unique so it was safe to do so) STLPort still had a healthy advantage.

With the hash functions coming out equal and the duplicate checking code taken out, there really wasn't much left to the method at all. The offending statements came down to

node = new Node( key, value, *head, hash );
*head = node;

which simply allocates a new node with a simple constructor (key, value, next, hash) and points the head of the bucket to our new node. Since the hashtable is using <string *, int> for it's key/value pair, this is just couple dword copies to initialize the node and update the bucket list, so how could this be slowing our code down so badly? I suspected the allocator might have something to do with it after noticing that running multiple tests in a single session resulted in increased runtime for each successive test (due to accumulated news/deletes?) for VS7 and to a lesser degree for gcc 3.4.4. I dropped in the Hoard Memory Allocator and while the code sped up a bit and multiple tests were possible in a single run with no degredation, STLPort still held the same lead (STLPort down to 55ms and SuperFastHash down to 105ms). I was about at my wits end when I decided to dump the actual bucket indices for each key to see where they were being placed:

STLPort:
273830
273831
273832
273833
273834
273835
273793
273794
273835
273836

SuperFastHash:
4267
391151
77466
272914
376435
77856
225677
489782
292888
97819
9443

Whoops. STLPort's bucket accesses were nearly linear and all clustered around the same area (only the middle 1/10th of the buckets were being used), while SuperFastHash's accesses were (properly) random across the entire list. STLPort's string comparisons were also linearly clustered due to being allocated and inserted in alphabetic order:

CKKNTX - CJPNTX
CKKNTX - CJOTW0
CKKNTX - CJOSTX
CKKNTY - CKJTW1
CKKNTY - CKJSTY
CKKNTY - CJPPR1
CKKNTY - CJPOW1
CKKNTY - CJPNTY
CKKNTY - CJOTW1
CKKNTY - CJOSTY
CKKNTZ - CKJSUU
CKKNTZ - CKJSTZ
CKKNTZ - CJPNUU
CKKNTZ - CJPNTZ
CKKNTZ - CJOSUU
CKKNTZ - CJOSTZ

When the list was randomized and inserted again (under Hoard), SuperFastHash stayed at a cool 105ms while STLPort shot up to 460ms. STLPort gained ~360ms from the now non-linear string comparisons and ~40ms from the non-linear bucket accesses. Same number of compares and resulting bucket contents as with the linear dataset, but now with huge penalties for STLPort's cache misses.

Lessons learned

  • Test results can be meaningless if you don't understand exactly what you're testing.
  • Unless one of the algorithms being tested is explicitly taking advantage of cache lines, it is possible to get highly bogus results if you don't generate your test data well. Before this I hadn't even considered that caching could have this kind of effect on chained hashtables.
  • Hashing algorithms giving the fastest possible time for a particular dataset isn't as important as giving the least worst time unless you are certain it will absolutely not be used for anything else. A flawed algorithm can even be dangerous as illustrated by Scott Crosby and Dan Wallach's Denial of Service via Algorithmic Complexity Attacks.

Source Code

You may download the C++ Source Code demonstrating the problem and try it out for yourself. While the STLPort execution time should balloon when switching to the randomized dataset, it won't always be faster than SuperFastHash on the linear dataset. On a K6-3 450mhz STLPort is actually slower: 580/1500ms for the linear/random datasets versus 500/500ms for SuperFastHash.

Example output on my AMD64 3200+:

This software uses the Hoard scalable memory allocator (version 3.5.1, libhoard). Copyright © 2005 Emery Berger, The University of Texas at Austin, and the University of Massachusetts Amherst.
For more information, see http://www.hoard.org
starting..
elapsed for STLPort-4.6.2/string-6-linear PRIME 63, comps 1581380, sdev 10.345448, active bins 27357, items 262144
elapsed for STLPort-4.6.2/string-6-linear POW2 63, comps 1581380, sdev 10.345448, active bins 27357, items 262144
elapsed for STLPort-4.6.2/string-6-random PRIME 484, comps 1581380, sdev 10.345448, active bins 27357, items 262144
elapsed for STLPort-4.6.2/string-6-random POW2 485, comps 1581380, sdev 10.345448, active bins 27357, items 262144
elapsed for Hsieh/string-6-linear PRIME 94, comps 6145, sdev 0.775544, active bins 188020, items 262144
elapsed for Hsieh/string-6-linear POW2 94, comps 6145, sdev 0.639169, active bins 202819, items 262144
elapsed for Hsieh/string-6-random PRIME 109, comps 6145, sdev 0.775544, active bins 188020, items 262144
elapsed for Hsieh/string-6-random POW2 110, comps 6145, sdev 0.639169, active bins 202819, items 262144

Hash Algorithm Attacks

Posted on March 13, 2007 10:12 PM

After my previous post on When Bad Hashing Means Good Caching, I got to thinking about Crosby/Wallach's Denial of Service via Algorithmic Complexity Attacks paper and wondered if any of the algorithms I was looking at were susceptible to induced collisions. While I had never done this before and didn't expect much success, the results were a little surprising.

Method

The basic method I used is the one as described by Crosby and Wallach:

The hash functions used by typical programs for their hash tables are generally not cryptographically strong functions like MD5 or SHA-1. Instead, they tend to be functions with 32 bits of internal state, designed primarily for speed. Because this state is limited, we need only find inputs such that the internal state after hashing is the same as the initial state.

Consider a hash function with the initial state of 0. Imagine we can find generators, or inputs k1,k2, ... ki such that 0 = Hash(k1) = Hash(k2) = ... = Hash(ki). Then the concatenation of any number of these generators in any combination and any order also hashes to 0. So, k1k2 also hashes to 0, as will k1k1 or k2k1k3k2. Thus, by finding three inputs k1, k2, k3 via exhaustive search and concatenating them combinatorially, we can generate a large number of collisions without requiring any additional searching. The number of possible collisions is bounded only by the maximum length to which we are willing to allow concatenated generators to grow.

I adapted this general idea into the following series of steps:

Step 1 - Quick Duplicates
Run through a couple hundred iterations of your test space and choose a few hashes to hopefully duplicate. Then re-run through the test space and watch for the selected hashes; weak algorithms should crack on this step without even needing to permute the duplicates you find.
Step 2 - Permute and Concatenate Duplicates
If you are not finding abundant duplicates with brute force iterations, keep going until you find at least ~20 or so duplicates for 2-3 hashes, then do a cheap permutation and concatenation of each set of keys as described by Crosby and Wallach. If the hash algorithm is weak, the keys may still generate substantial collisions and even provide full collisions without needing to match up initial states.
Step3 - Generating Proper Duplicates for Varied Initial States
To get full collisions with more complicated algorithms, some trickery is needed. This is because you can only ensure a string will generate a specific hash from a specific starting value. Consider the general form of a hash function
function Hash( string, initial value ) {
	hash = initial value
	foreach( string as item ) {
		hash <-> item
	}
	possible finalizer
	return ( hash )
}

Ignoring the finalizer:

Hash(x, A) = B
Hash(y, A) = B
Hash(z, A) = B
Hash(xy) = Hash(x,A) <-> Hash(y,B)
Hash(xz) = Hash(x,A) <-> Hash(z,B)

We have only guaranteed that y and z hash to B when initialized with A, not B! We can not be sure they will even collide with the new initial hash value B. To get around this, we need to:

  • Comment out any finalizers. We are only concerned with duplicates within the innerloop
  • Generate duplicates keys for the initial hash state A. If the hash is initialized with the length of the input string, we will need to include the length of our target permuted string as well, e.g. using 24 if we are going to permute three 8 character strings.
  • Once we have enough duplicates for initial state A, set the initial state to the hash of our duplicate keys and find another set of keys. Rinse and repeat
  • Permute and concatenate the generated keys in order.

Hash(x, A) = B
Hash(y, A) = B
Hash(z, B) = C

Hash(xz) = Hash(x,A) <-> Hash(z,B) = C
Hash(yz) = Hash(y,A) <-> Hash(z,B) = C

Results

Target Keyspace

To keep things simple and to be fair to the algorithms only intended for string hashing, I did my testing with lowercase alpha characters, i.e. 'a' - 'z'. Initial key length is 8 characters and the attack key length is a multiple of 8 depending on how many concatenations are performed. When fewer duplicates are found, more concatenations may be needed to generate enough collisions to be statistically interesting.

Algorithms
hash31
This is a variation of "djb" style hashes. The hash multiplier is usually 31/33/37 and the key operation is usually an addition or an xor.
for ( i = 0; i < length; ++i )
	hash = ( ( 31 * hash ) + key[i] );

Bruteforcing the keyspace wasn't enough to find enough duplicates in Step 1, but concatenating the duplicates in Step 2 resulted in full collisions. This one was actually took more work to attack than I had figured.

Difficulty in attacking: Easy


DooM III
The DooM III hash can be found in the DooM III SDK.
for ( i = 0; i < length; ++i )
	hash += ( key[i] * (119 + i) );

I had already known the algorithm was weak from previous testing (it turns useless for large table sizes), but it's efficacy to produce collisions was still amazing. Step 1 was enough to find many duplicates, and concatenating and permuting them in Step 2 also resulted in full key collisions.

Difficulty in attack: Effortless


STLPort 4.6.2.
STLPort 4.6.2 employs a rather weak algorithm that usually gives slightly worse distribution compared to most other algorithms, but is very fast.
for ( i = 0; i < length; ++i )
	hash = ( ( 5 * hash ) + key[i] );

Same story as DooM III, effortless collisions and successful permutation.

Difficulty in attack: Effortless


Tribes/Torque
The Tribes/Torque hash looks like it's supposed to be
for ( i = 0; i < length; ++i ) {
	uint c = ( tolower( key[ i ] ) );
	hash += hash;
	hash ^= ( c * c );
}

However, in Tribes the key is treated as signed so the multiplication is signed as well, and in Torque, the multiplication is replaced with a pre-generated lookup that truncates the upper 24 bits of the result and indexes the lookup table with a signed byte in the hash loop, i.e. any byte over 0x7f will actually use negative indices. The version attacked was the buggy Torque version, although any version is easy to attack.

As with the other weaker algorithms, duplicates were easy to generate with or without permutation.

Difficulty in attack: Effortless


SuperFastHash

Paul Hsieh's SuperFastHash was one of the hashes I was looking forward to having a crack at. I knew it was quite good as a general purpose hash, but I hadn't seen anything as to whether or not it could be easily attacked.

Unfortunately, it went downhill right off the bat. There were high concentrations of colliding hashes during Step 1, usually 50-60 for a particular hash in a tight group of keys. This "group" also seemed to repeat every 8 characters or so for the most significant digit, i.e. a group at axxxxxxx, a group at ixxxxxxx, and a group at qxxxxxxx.

The good news is that SuperFastHash was resistant to permuting and concatenating keys in Step 2. I could still get collisions and slow down the hash table, sometimes quite noticeably, but overall the chain lengths tended to stay around 10 to 30 links per active bucket.

The bad news is that the Step 3 method of generating successive collisions with adjusted initial hash values was very easy due to SuperFastHash's collision clustering problem. It took no time at all to get 3 sets of collisions and to permute and concatenate them into a deadly key set for SuperFastHash.

If anybody knows David Malone, tell him he was right!

Difficulty in attack: Easy after identifying the need for Step 3


FNV

The Fowler/Noll/Vo hash is a generally well performing hash based on multiplication by a prime number. Unlike SuperFastHash, duplicates are not clustered with FNV and are rather painful to find, requiring quite a lot of brute force iterating. The differences end there though, as, like SuperFastHash, FNV exhibits only mild collisions at Step 2 and full collisions at Step 3.

Since FNV took so long to generate duplicates, I did it in the following way:

  • Generate ~10 duplicates with hash A using the default hash B
  • Generate ~10 duplicates with hash B using A as the default hash

This way I can permute

Set1 ( A to B ) | Set2 ( B to A ) | Set3 ( A to B )

..and so on without needing to generate more duplicates. The only downside is the key gets a little large, but you can still generate huge amounts of keys with relatively little searching.

Difficulty in attack: Medium, somewhat time consuming


lookup3

Bob Jenkins' lookup3 hash laughed politely when I tried to knock it around. It can find duplicates (slowly) with 8 character keys, but concatenating keys does absolutely nothing due to it's 96bit internal state.


Conclusions

Crosby and Wallach presented a few remedies to this type of a attack:

  • Use another data structure with guaranteed worst-case performance, such as a self-balancing binary tree. A Judy Tree is another interesting alternative, offering sorted and sequential access as well as competitive performance to a hash table without being susceptible to algorithmic complexity attacks.
  • Generate a random value to use as the initial hash value, i.e. keying the hash table. Crosby and Wallach reported that even MD5 and SHA-1 were susceptible to brute-forcing collisions in the lower 19 bits of a hash value when unkeyed.
  • Addendum to the previous item: It's true! Masking with 0x0007ffff, I managed to find 2.7 SHA-1 collisions a second (497 total), and even got twenty-three pairs of 32bit collisions out of it. That's ~53 hours to compute all the collisions in the 512k bucket space. Changing just a single bit of the initial SHA-1 key dropped the distribution back to completely random.
  • Use a Universal Hashing algorithm. Crosby started work on a Universal Hashing library, but unfortunately doesn't seem to have updated it since 2003.

One thing they didn't mention is expanding your algorithm to use more than 32bits of internal state. Combining this with keyed hash values should solve both permuted collisions and brute forcing collisions for a subsection of the bitspace. Although if you're determined to stick with hash tables, modifying a flawed algorithm wouldn't make much sense when the general purpose algorithm with the best distribution is also close to the quickest and strongest, and easily adapted to keyed initial values: lookup3.

Download

You may download a slightly modified version of Paul Hsieh's htest which comes equipped with demonstration files for each attacked hash algorithm.

readme.txt:

A collection of colliding keysets for various string hashing algorithms.
Keyset size is limited to provide a smaller download, not due to
computational complexity. Test Harness used is Paul Hsieh's htest.

Run testcollisions.bat under windows to see the results, or figure
out how to convert it to something lunixy if you don't have windows.

Algorithm         Number of fully colliding keys
------------------------------------------------
DooM3                                  4096 keys
FNV                                   16384 keys
hash31                                 1024 keys
SuperFastHash                         16384 keys
STLPort 4.6.2                          4096 keys
Torque                                 4096 keys
The original htest may be found at
http://www.azillionmonkeys.com/qed/htest.zip

MSVC rolls over

Posted on March 22, 2007 12:44 AM

If you are using MSVC7 or earlier and you want to rotate some uints, you'll be unpleasantly surprised when

#define rotl(x,k) (((x)<<(k)) | ((x)>>(32-(k))))

unsigned int test( unsigned int a, unsigned int b, unsigned int c ) {
	a -= c;  a ^= rotl(c, 4);  c += b;
	b -= a;  b ^= rotl(a, 6);  a += c;
	c -= b;  c ^= rotl(b, 8);  b += a;
	a -= c;  a ^= rotl(c,16);  c += b;
	b -= a;  b ^= rotl(a,19);  a += c;
	c -= b;  c ^= rotl(b, 4);  b += a;

	return ( c );
}

compiles to code that looks like:

  mov edi,eax
  shl edi,6
  shr ecx,1A
  or ecx,edi
  xor ecx,edx
  add eax,esi
  sub esi,ecx
  mov edx,ecx
  mov edi,ecx
  shl edi,8
  shr edx,18
  or edx,edi
  xor edx,esi

No! Bad dog! Unlike gcc and icc, MSVC7 and earlier don't like to recognize rol's / ror's when they see one and will instead generate dreadful shifts; this obviously slows your code down a bit, especially if it's in a tight innerloop.

The solution to the problem is to use _rotl (or _rotr), which Microsoft compilers implicitly generate rol / ror for. If you define your macros as _rotl / _rotr, you can do something like

#if !defined(_MSC_VER)
	#define _rotl(x,k) (((x)<<(k)) | ((x)>>(32-(k))))
#endif

and let the macro fall through to the built-in functions on Microsoft compilers. MSVC7 and earlier will now generate rotates properly:

  rol ecx,6
  xor ecx,edx
  add eax,esi
  sub esi,ecx
  mov edx,ecx
  rol edx,8
  xor edx,esi

Breaking SuperFastHash (and chipping lookup3)

Posted on March 29, 2007 3:31 AM

After the problems SuperFastHash had in Hash Algorithm Attacks, I decided to try and break it completely, i.e. generate collisions algorithmically instead of brute forcing them. The attempt was more successful than I had anticipated, although Paul is obviously aware of the weak mixing in the final bits as evidenced by his comment in the source code, "Force 'avalanching' of final 127 bits". My favorite collisions encountered would have to be "10/4 < 3", "10/5 = 2", and "10/6 > 1", which have the property of hashing to the same value while being mathematically correct!

As I was writing this, I came up with a way to attack Bob Jenkins' lookup3 as well. Unlike SuperFastHash, the lookup3 attack is due to the way the input bytes are being read in and does not indicate a deficiency in the mixing itself. If you are using lookup3 with hash tables, the core function will still be quite safe; it will only need to be modified if you are using it to generate unique 64bit identifiers and the input data could be altered for a malicious purpose.

With that said, let's look at the attacks:

SuperFastHash

To begin with, I separated the innerloop in to its 4 steps:

1:	hash  += get16bits (data);
2:	tmp    = (get16bits (data+2) << 11) ^ hash;
3:	hash   = (hash << 16) ^ tmp;
4:	hash  += hash >> 11;

and then ran a couple of known duplicate strings through and printed out the value at each step (input bytes are only consumed in steps 1 and 2):

"aaaaaaaa" vs "aaadadaf"

aa 1: 00006169 aa 2: 030b6969 3: 62626969 4: 626eb5b6 aa 1: 626f1717 aa 2: 61641f17 3: 76731f17 4: 7681ed7a

aa 1: 00006169 ad 2: 03236969 3: 624a6969 4: 6256b2b6 ad 1: 62571717 af 2: 61641f17 3: 76731f17 4: 7681ed7a

"ifkzihfe" vs "igdqhtfp"

if 1: 00006671 kz 2: 03d33e71 3: 65a23e71 4: 65aef2b8 ih 1: 65af5b21 fe 2: 66846b21 3: 3da56b21 4: 3dad1fce

ig 1: 00006771 dq 2: 038b4771 3: 64fa4771 4: 6506e6b9 ht 1: 65075b21 fp 2: 66846b21 3: 3da56b21 4: 3dad1fce

We can see that after round 2 step 1 (2:1) the lower 16 bits of hash are the same, and that after (2:2) all 32 bits of tmp are the same. Looking at step 3 reveals how this creates a collision: hash = (hash << 16) ^ tmp;. The upper 16 bits of hash are thrown away, leaving the lower 16 bits (which are the same in the observed collisions) and 32 bits of tmp (which is the same in the observed collisions). Thus if after an initial 4 bytes we can find 2 bytes that create the same lower 16 bits in hash at (2:1), and then find the remaining 2 bytes that create the same 32 bits in tmp at (2:2), we have generated a colliding 8 byte string.

This can be further generalized to any string that's a multiple of 8 bytes. If, from an initial hash value of X (the length of the string for SuperFastHash), you can find 8 bytes that collide with hash Y, then you simply use Y as the initial hash value and re-work the problem for the next 8 bytes, and so on. This works because SuperFastHash has very poor mixing for last 8 bytes that have been hashed and does not use it's avalanching fix-up until it has reached the end of the input.

Attack

The attack was fairly straightforward once I identified what needed to be done. After a couple of revisions and refinements, I came up with:

  1. Take a source string S that has a length which is a multiple of 8 bytes
  2. Hash S and for each 8 byte sequence, find the values of hash and tmp at steps (2:1) and (2:2)
  3. Hash every permutation of the first 4 bytes of your attack string:
    • Find bytes 5 and 6 for each permutation which generate the lower 16 bits for the constant in (2:1).
    • Finally find bytes 7 and 8 for which generate the constant in (2:2). If this succeeds, either print the hit or process another 8 bytes until you reach the target length for your source string

This will probably not uncover every possible duplicate, especially for keys longer than 8 bytes, but it ran fast enough and generated enough duplicates that I did not need to refine it further.

Results

When using an 8 byte source string with all possible characters as input ( bytes 0x00 to 0xff ), ~130,000,000 colliding strings can be found in around 3 minutes. When you restrict the character set to printable charactersI (symbols, numbers, uppercase, and lowercase), anywhere from 10,000 to 200,000 (possibly higher) can be found in a few seconds.

Using a random initial value (keying the hash) reduces the collisions, but does not alleviate them entirely. For example, 108,600 strings were generated that hash to the same value as "zzzzzzzz". When run through a hashtable insertion test with 131072 buckets using "0xdeadbeef + length" as the initial value instead of just "length", there were still 3,374,795 compares due to full hash collisions, and the largest bucket had 376 links. By comparison, Bob Jenkins' lookup3 had 1 compare and 8 links in it's largest bucket, and the x31 variant of the Torek/DJB hash had 0 compares and 6 links in it's largest bucket. A small win is that increasing the key length does reduce the collisions with a keyed hash: 100,000 strings hashing to the same value as "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz" resulted in only 500,000 compares with the largest bucket containing 158 links.

lookup3

The lookup3 attack is very simple and I actually feel dense for not having seen it until now. This is the basic 32bit little endian algorithm (32bit aligned reads) with input lengths assumed to be 24 or larger and a multiple of 12:

1:	a = b = c = ( 0xdeadbeef + len )

2:	for ( ; len > 12; len -= 12, key += 12 ) {
		a += ( key[0] )
		b += ( key[4] )
		c += ( key[8] )
		mix( a, b, c )
	}

3:	c += ( key[8] )
	b += ( key[4] )
	a += ( key[0] )

4:	final( a, b, c )
	
	return ( c );

mix() and final() mix every bit in a, b, and c very thoroughly, so trying to attack lookup3 algorithmically would be quite futile (for me). However, because every 12 bytes, or 96 bits, of input directly alters every bit of a, b, and c, there is a shortcut:

  1. Take any input string and run it through lookup3; record the values of a, b, and c after Step 3
  2. Take your attack string and pad it to be a multiple of 12 bytes long, then add an additional 12 bytes
  3. Run your attack string through lookup3 and stop after Step 2. We should now have the additional 12 bytes remaining to hash
  4. Construct the final 12 bytes so that key[ 8 ] = ( target.c - c ), key[ 4 ] = ( target.b - b ), and key[ 0 ] = ( target.a - a ). The internal state will now match that of the target string before the call to final(), guaranteeing a collision

Now, this attack relies completely on knowing what a, b, and c are initialized to at the start of the hash. If you are using lookup3 for hash tables, you should already be initializing a, b, and c to random values to defeat bucket attacks, i.e. attacks searching for keys where the lower 15 to 20 bits match. Even the best algorithm is vulnerable to bucket attacks, so choosing a random initial value should be mandatory no matter what.

However, if you're using lookup3 for something like a 64bit unique id or a file checksum, your initial state will need to be static and thus open to attack with this method. I'm not sure what you could do to get around this safely while retaining lookup3's high speed; adding another 32 bits of state that wasn't directly altered by the input may help, but where to stick it and how to mix it in isn't something I couldn't guess at. I've emailed Bob, but I don't know if deliberate attacks like this are something he is concerned with. The lookup3 source does state Use for hash table lookup, or anything where one collision in 2^^32 is acceptable. Do NOT use for cryptographic purposes., so we'll see.

Conclusions

I already knew SuperFastHash had some peculiar results from my previous tests, and the outcome of this experiment drove the point home. While the offending collisions will not be common (what pathological input is?), the fact that they exist, were so readily obtained, and were still somewhat evident when changing the initial value of the hash suggests that it is probably best to start over from scratch. I should have noticed a problem sooner when SuperFastHash was running in to light collisions in When bad hashing means good caching, and that wasn't even an intentional attack.

As far as lookup3, the trivial collisions are disturbing, but they are only a problem if an attacker can craft input for the hash function, and then only when you are using it to generate unique ids. It would be nice to see the issue addressed, though, as lookup3 is still unrivaled in terms of mixing and actually faster than SuperFastHash in some cases.

Source

You may download the SuperFastHash attack source and try it out for yourself. It comes preloaded with attacks against "********", yielding 194 collisions, and "10/5 = 2", yielding my favorite set of collisions. If you want to run it with all possible characters as input, make sure to turn the debug printf off or redirect output to a file; it can take a while to end process a program that has a lot of \x07 bells queued up.