MurmurHash 2 Java port
Wednesday July 29, 2009 by Derek Young
I had the need for a good hash function and after some research came across MurmurHash 2.0 by Austin Appleby. This hash function improves on others out there by being faster and very well distributed. It was put through careful study and testing to prove that it is indeed a good hash function. (I’m not a hash function expert, so I won’t pretend to fully understand the claim that MurmurHash 2 has “Excellent avalanche behavior – Maximum bias is under 0.5%”).
I needed a Java version and came across a port by Andrzej Bialecki. As I scanned the code I noticed a main method at the bottom with a comment “Testing …” but no unit tests, always a red flag. The testing code runs through 1000 numbers, hashes them and prints out the hashed values. There is nothing there that verifies the output of the hash function.
I wrote some unit tests against the code by taking the original C code and running a couple sequence of hashes. First I use fixed data and vary the hash seed, then I use a fixed seed and vary the input data. The C code dumps out the results. I ran the same pattern of tests against the Java implementation. The results did not match.
Writing Java code that operates on bytes is notoriously error prone since bytes are signed where you would expect them to be unsigned. The original MurmurHash 2 code also used tricks that only work in C, processing 4 bytes at a time by treating the bytes as a single 32 bit int. The author later released an endian-neutral version of the code that did not rely on these tricks, (although at the cost of being slower).
I used the byte order safe version of the code and re-ported it to Java, being careful with signed bytes. This time my tests worked – the code generated the expected output that matched the C version.
Does the incorrect version of the code have the same desirable characteristics as the original version? It’s possible it does, but I wouldn’t trust it. The code actually found its way into Hadoop’s implementation of HDFS. The code was taken at face value, with no independent unit tests in the Hadoop code base that I could find.
I’ve attached the new Java port and the unit tests that verify it.
Update 8/6/09: I didn’t mention any license on this code in my first post. Since the original code is released as public domain, this code is public domain as well.
Update 8/11/09: Dan Chokola pointed out in the comments that the original code I posted had a bug – one statement had slipped outside of a switch statement. (And here I was presenting the code as a “tested and fixed” version!) This bug was only exposed if the key you were hashing had a length that was a multiple of four. My test cases didn’t exercise this condition. I updated the tests to vary the key length and fixed the code. Thank you Dan!
// Ported by Derek Young from the C version (specifically the endian-neutral // version) from: // http://murmurhash.googlepages.com/ // // released to the public domain - dmy999@gmail.com public class MurmurHash2 { @SuppressWarnings("fallthrough") public static int hash(byte[] data, int seed) { // 'm' and 'r' are mixing constants generated offline. // They're not really 'magic', they just happen to work well. int m = 0x5bd1e995; int r = 24; // Initialize the hash to a 'random' value int len = data.length; int h = seed ^ len; int i = 0; while (len >= 4) { int k = data[i + 0] & 0xFF; k |= (data[i + 1] & 0xFF) << 8; k |= (data[i + 2] & 0xFF) << 16; k |= (data[i + 3] & 0xFF) << 24; k *= m; k ^= k >>> r; k *= m; h *= m; h ^= k; i += 4; len -= 4; } switch (len) { case 3: h ^= (data[i + 2] & 0xFF) << 16; case 2: h ^= (data[i + 1] & 0xFF) << 8; case 1: h ^= (data[i + 0] & 0xFF); h *= m; } h ^= h >>> 13; h *= m; h ^= h >>> 15; return h; } }
// released to the public domain - dmy999@gmail.com public class MurmurHash2Test { // expected values are generated from the output of a C driver that // ran against the same input @Test public void testChangingSeed() { // use a fixed key byte [] key = new byte[] { 0x4E, (byte) 0xE3, (byte) 0x91, 0x00, 0x10, (byte) 0x8F, (byte) 0xFF }; int [] expected = { 0xeef8be32, 0x8109dec6, 0x9aaf4192, 0xc1bcaf1c, 0x821d2ce4, 0xd45ed1df, 0x6c0357a7, 0x21d4e845, 0xfa97db50, 0x2f1985c8, 0x5d69782a, 0x0d6e4b85, 0xe7d9cf6b, 0x337e6b49, 0xe1606944, 0xccc18ae8 }; for (int i = 0; i < expected.length; i++) { int expectedHash = expected[i]; int hash = MurmurHash2.hash(key, i); assertEquals("i = " + i, expectedHash, hash); } } @Test public void testChangingKey() { byte [] key = new byte[133]; int [] expected = { 0xd743ae0b, 0xf1b461c6, 0xa45a6ceb, 0xdb15e003, 0x877721a4, 0xc30465f1, 0xfb658ba4, 0x1adf93b2, 0xe40a7931, 0x3da52db0, 0xbf523511, 0x1efaf273, 0xe628c1dd, 0x9a0344df, 0x901c99fc, 0x5ae1aa44 }; for (int i = 0; i < 16; i++) { // keep seed constant, generate a known key pattern setKey(key, i); int expectedHash = expected[i]; int hash = MurmurHash2.hash(key, 0x1234ABCD); assertEquals("i = " + i, expectedHash, hash); } } @Test public void testChangingKeyLength() { int [] expected = { 0xa0c72f8e, 0x29c2f97e, 0x00ca8bba, 0x88387876, 0xe203ce49, 0x58d75952, 0xab84febe, 0x98153c65, 0xcbb38375, 0x6ea1a28b, 0x9afa8f55, 0xfb890eb6, 0x9516cc49, 0x6408a8eb, 0xbb12d3e6, 0x00fb7519 }; // vary the key and the length for (int i = 0; i < 16; i++) { byte [] key = new byte[i]; setKey(key, i); int expectedHash = expected[i]; int hash = MurmurHash2.hash(key, 0x7870AAFF); assertEquals("i = " + i, expectedHash, hash); } } /** Fill a key with a known pattern (incrementing numbers) */ private void setKey(byte [] key, int start) { for (int i = 0; i < key.length; i++) { key[i] = (byte) ((start + i) & 0xFF); } } }

A better MySQL replication heartbeat Some notes on Google development process

Cool. Thanks for this!
At the risk of an annoying question, any chance of an explicit license statement?
— David R. MacIver Aug 5, 09:28 PM #
Hi David – I updated the post and marked the code public domain (the original C code is also public domain).
— Derek Young Aug 6, 07:20 AM #
I noticed in the original C++ implementation, the line h *= m;
is inside the switch..case statement. Your tests don’t check any keys with a length that is a multiple of four, so it went unnoticed.
— Dan Chokola Aug 9, 07:08 AM #
Thank you Dan! I updated the code and tests and put a note in the post.
— Derek Young Aug 11, 09:48 AM #
Excellent implementation. Note that this hash function works particularly well for improving performance in Bloom Filters (since blooms don’t need super-low collision-rate behaviors – just a relatively even bit set distribution for a given key). I have a Java-based Bloom Filter that integrates this hash function for the keys if anyone is interested.
— Darrell Teague Jan 20, 09:26 AM #
We have a port of the Bialecki code in Mahout, but I regenerated test vectors from the cpp version for my unit tests.
I will see about adding your tests to mine.
— Ted Dunning Sep 16, 03:32 PM #
Hmm…. never mind. You only implemented the 32 bit hash. I use the 64 bit version.
— Ted Dunning Sep 16, 03:36 PM #
I went ahead and added these tests to the Mahout implementation. It passed :-).
Btw… in the Mahout implementation I used ByteBuffer to avoid thinking about endian-ness and integer construction. It simplified my code substantially.
— Ted Dunning Sep 17, 02:58 PM #
The return type in C is unit32_t. So won’t the above function return -ive numbers in java?
— ARC Apr 16, 03:04 AM #
Hey Derek! Awesome post. I’m using your murmur hash in my bloom filter. Distribution is really good. Thanks!
— Anandan Sep 30, 12:01 PM #
Are all the i + 0 needed here?
— Vipul A M Apr 13, 03:40 AM #