MurmurHash 2 Java port
Wednesday July 29, 2009 by Derek Young
I had the need for a good hash function and after some research came across MurmurHash 2.0 by Austin Appleby. This hash function improves on others out there by being faster and very well distributed. It was put through careful study and testing to prove that it is indeed a good hash function. (I’m not a hash function expert, so I won’t pretend to fully understand the claim that MurmurHash 2 has “Excellent avalanche behavior – Maximum bias is under 0.5%”).
I needed a Java version and came across a port by Andrzej Bialecki. As I scanned the code I noticed a main method at the bottom with a comment “Testing …” but no unit tests, always a red flag. The testing code runs through 1000 numbers, hashes them and prints out the hashed values. There is nothing there that verifies the output of the hash function.
I wrote some unit tests against the code by taking the original C code and running a couple sequence of hashes. First I use fixed data and vary the hash seed, then I use a fixed seed and vary the input data. The C code dumps out the results. I ran the same pattern of tests against the Java implementation. The results did not match.
Writing Java code that operates on bytes is notoriously error prone since bytes are signed where you would expect them to be unsigned. The original MurmurHash 2 code also used tricks that only work in C, processing 4 bytes at a time by treating the bytes as a single 32 bit int. The author later released an endian-neutral version of the code that did not rely on these tricks, (although at the cost of being slower).
I used the byte order safe version of the code and re-ported it to Java, being careful with signed bytes. This time my tests worked – the code generated the expected output that matched the C version.
Does the incorrect version of the code have the same desirable characteristics as the original version? It’s possible it does, but I wouldn’t trust it. The code actually found its way into Hadoop’s implementation of HDFS. The code was taken at face value, with no independent unit tests in the Hadoop code base that I could find.
I’ve attached the new Java port and the unit tests that verify it.
Update 8/6/09: I didn’t mention any license on this code in my first post. Since the original code is released as public domain, this code is public domain as well.
Update 8/11/09: Dan Chokola pointed out in the comments that the original code I posted had a bug – one statement had slipped outside of a switch statement. (And here I was presenting the code as a “tested and fixed” version!) This bug was only exposed if the key you were hashing had a length that was a multiple of four. My test cases didn’t exercise this condition. I updated the tests to vary the key length and fixed the code. Thank you Dan!
// Ported by Derek Young from the C version (specifically the endian-neutral // version) from: // http://murmurhash.googlepages.com/ // // released to the public domain - dmy999@gmail.com public class MurmurHash2 { @SuppressWarnings("fallthrough") public static int hash(byte[] data, int seed) { // 'm' and 'r' are mixing constants generated offline. // They're not really 'magic', they just happen to work well. int m = 0x5bd1e995; int r = 24; // Initialize the hash to a 'random' value int len = data.length; int h = seed ^ len; int i = 0; while (len >= 4) { int k = data[i + 0] & 0xFF; k |= (data[i + 1] & 0xFF) << 8; k |= (data[i + 2] & 0xFF) << 16; k |= (data[i + 3] & 0xFF) << 24; k *= m; k ^= k >>> r; k *= m; h *= m; h ^= k; i += 4; len -= 4; } switch (len) { case 3: h ^= (data[i + 2] & 0xFF) << 16; case 2: h ^= (data[i + 1] & 0xFF) << 8; case 1: h ^= (data[i + 0] & 0xFF); h *= m; } h ^= h >>> 13; h *= m; h ^= h >>> 15; return h; } }
// released to the public domain - dmy999@gmail.com public class MurmurHash2Test { // expected values are generated from the output of a C driver that // ran against the same input @Test public void testChangingSeed() { // use a fixed key byte [] key = new byte[] { 0x4E, (byte) 0xE3, (byte) 0x91, 0x00, 0x10, (byte) 0x8F, (byte) 0xFF }; int [] expected = { 0xeef8be32, 0x8109dec6, 0x9aaf4192, 0xc1bcaf1c, 0x821d2ce4, 0xd45ed1df, 0x6c0357a7, 0x21d4e845, 0xfa97db50, 0x2f1985c8, 0x5d69782a, 0x0d6e4b85, 0xe7d9cf6b, 0x337e6b49, 0xe1606944, 0xccc18ae8 }; for (int i = 0; i < expected.length; i++) { int expectedHash = expected[i]; int hash = MurmurHash2.hash(key, i); assertEquals("i = " + i, expectedHash, hash); } } @Test public void testChangingKey() { byte [] key = new byte[133]; int [] expected = { 0xd743ae0b, 0xf1b461c6, 0xa45a6ceb, 0xdb15e003, 0x877721a4, 0xc30465f1, 0xfb658ba4, 0x1adf93b2, 0xe40a7931, 0x3da52db0, 0xbf523511, 0x1efaf273, 0xe628c1dd, 0x9a0344df, 0x901c99fc, 0x5ae1aa44 }; for (int i = 0; i < 16; i++) { // keep seed constant, generate a known key pattern setKey(key, i); int expectedHash = expected[i]; int hash = MurmurHash2.hash(key, 0x1234ABCD); assertEquals("i = " + i, expectedHash, hash); } } @Test public void testChangingKeyLength() { int [] expected = { 0xa0c72f8e, 0x29c2f97e, 0x00ca8bba, 0x88387876, 0xe203ce49, 0x58d75952, 0xab84febe, 0x98153c65, 0xcbb38375, 0x6ea1a28b, 0x9afa8f55, 0xfb890eb6, 0x9516cc49, 0x6408a8eb, 0xbb12d3e6, 0x00fb7519 }; // vary the key and the length for (int i = 0; i < 16; i++) { byte [] key = new byte[i]; setKey(key, i); int expectedHash = expected[i]; int hash = MurmurHash2.hash(key, 0x7870AAFF); assertEquals("i = " + i, expectedHash, hash); } } /** Fill a key with a known pattern (incrementing numbers) */ private void setKey(byte [] key, int start) { for (int i = 0; i < key.length; i++) { key[i] = (byte) ((start + i) & 0xFF); } } }
Comment [5]

A better MySQL replication heartbeat
Wednesday February 4, 2009 by Derek Young
If you’ve used MySQL replication you’ve probably discovered that slave machines can lag behind the master. Replication can also break completely, requiring hours (or days) for the slave hours to catch up. Monitoring is required to catch issues before the slaves get too far behind.
Jeremy Zawodny has suggested a heartbeat mechanism to monitor the delay between the master and the slave. (I’m not sure if he came up with this solution). His suggestion is to periodically insert a row into a heartbeat table on the master. Then you poll the table on the slave, waiting for the row to appear. The length of time you spend polling is a rough estimate for how far behind the slave is at that moment.
There are a few problems with this solution. Your have to write code to poll the slave. If you poll very frequently (every second) you’ll be polling too often if replication is actually hours behind. When do you stop polling? If you poll less frequently (every minute) your estimate gets that much less accurate. You also have to poll every slave if there are more than one.
A new solution
You can get MySQL to do the hard work for use by taking advantage of the difference in behavior between SYSDATE and CURRENT_TIMESTAMP. In almost all cases when a slave runs a SQL statement it temporarily sets the “current time” to the time the statement was executed on the master. If you insert NOW at 12:00:04 on the master the row will hold exactly 12:00:04 on the slave, not matter when it’s run. However, the SYSDATE function does not follow this behavior. It always uses the value of the slave’s system clock.
If you insert a row with one column holding the value of NOW or CURRENT_TIMESTAMP and the other holding the value of SYSDATE into the master, you can use the difference between the two values on the slave to see how far behind it is. If the slave is in sync the two values will be identical. If the slave is one second behind the column holding SYSDATE will be one second ahead of the column holding NOW. No polling is required to determine the current lag.
Implementation
First, create the heartbeat table on the master. master_time wil hold the time the row was inserted on the master. slave_time will hold the time was inserted on the slave.
create table heartbeat( master_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP, slave_time TIMESTAMP NOT NULL ) ENGINE=MyISAM;
Periodically (I do it every minute), insert a row into the heartbeat table on the master.
insert into heartbeat(slave_time) values(SYSDATE());
To see the current replication lag, at any time calculate the difference between the current time and the time the most recent row was inserted on the master. (This estimate can be off by up to one heartbeat period). This query is run on a slave.
select timediff(NOW(), max(master_time)) from heartbeat;
You can see how the replication delay changed over time by selecting all rows within a range. This example shows delay for every minute of the current day. The delays are accurate to within 1 second (the max resolution of MySQL).
select master_time, timediff(slave_time, master_time) from heartbeat where DATE(master_time) = DATE(NOW()) order by master_time;
Comment [5]

Improvement as a developer
Friday November 21, 2008 by Derek Young
I was doing some thinking about areas where I’m lacking as a developer. There are many, but I’ll list a few here.
Testing Improvements
I think I’ve come a long way in the last few years in the area of testing. I’m disciplined about testing now and don’t write code without writing unit tests at the same time. If I’m modifying existing code without any tests I spend the time to write tests first, even if this takes longer than making the change. That said, my tests still need work.
Most of my tests hit the positive cases. I achieve good coverage and verify most features of the code but only really in the positive cases. I need to write most negative test cases.
I write unit tests which occasionally end up more like functional/integration tests. I don’t always set out to write good, complete functional tests. Most of the bugs I find after a release would be found with good functional tests written at a higher level than my unit tests, verifying the interaction between classes. My unit tests do catch a whole class of bugs that used to slip through before I started writing them, but going a level up as well would catch more (and would also be a good source of documentation).
I have a habit of testing for performance only when it’s too late. If I’m writing code that I know will be performance critical I’ll profile. It’s the code I had no idea would be a bottleneck that ends up being the problem. This ties back to having better functional tests. These would make it possible to profile more realistic chunks of code. Profiling a unit test usually doesn’t help discover the real bottlenecks.
Like performance testing, multi-threaded testing is something I need to be more proactive about. The biggest problem is it’s hard to do. You really can’t write a test to prove code operates correctly in the presence of multiple threads. If your test does uncover a problem, great. If it doesn’t it might just mean you haven’t looked hard enough. My defense against multi-threading issues is to be extremely careful when writing the code in the first place. I carefully double check the code to make sure everything is properly synchronized and thread-to-thread communication is safe. Static analysis tools like Findbugs also help. Adding good multi-threaded testing would be another safeguard.
While I avoid copy and paste and code duplication in my code as much as possible, it tends to end up in my tests more than I would like it to. If I want to make a second test like the first I try to extract methods out of the first as much as possible to make the test small. Then I clone the first test and modify it to make it test the new condition. This does lead to some duplication. It’s usually something I can live with but I’d like to be more disciplined about avoiding it.
Project Issues
I find that I get distracted more frequently when I hit a tough part of a project. I bring up the code and start working on it, then find myself reading my news reader or trying to fix some unimportant issue with my machine (maybe getting sound working in Flash). I also get distracted like this when the requirements of what I’m working on are not yet well defined. I don’t know which way to start moving forward and so I just end up sitting still.
The last 10% is always the toughest in any project. Even when the code is out the door it’s not really complete. There are always a handful of small but difficult issues that need addressing. Too often I find myself relieved to have made the release and start working on something new instead of pushing to get that last 10% completed.
Skills
If I could choose one area where I would most like to improve it would be writing. I consider myself an average writer but I would like to work to become a good writer. Trying to write more is one way I’m working on this but it seems like something you’re either born with or you’re not.
I’m not good at delegating. My first instinct when I see an interesting problem is to solve it myself. When I see a boring or tedious problem I want someone else to do it, but I usually end up deciding it would take me longer to explain the issue than to do the work and end up doing it myself. I would like to be more selective about the work I do myself.
Making improvements
I could list ten things here I’d like to do to address the weaknesses listed above but that’s kind of like making a new years resolution. I’m going to start by trying harder to avoid distraction and go from there.
Comment [1]

Two useless Java customs
Wednesday September 10, 2008 by Derek Young
A couple common Java practices that annoy me.
Calling super() in a constructor
If you use Eclipse to generate a constructor it will look something like this:
public MyClass(int x) { super(); this.x = x; }
Eclipse always adds the call to super(). PMD has a rule for this (in its Controversial Rules category). The explanation is simply “It is a good practice to call super() in a constructor”.
If you omit super() in a constructor the compiler adds the call to super() for you. The compiled code is identical, whether you write the call or not. Effective Java doesn’t seem to address this. Is there any benefit to writing the call in all your constructors?
Supplying messages to JUnit assert calls
JUnit assert methods support a message argument, displayed if the test fails:
assertEquals("a != b", a, b);
This message does make a test failure slightly easier to read but I don’t believe the benefit outweighs the tedious overhead of typing messages for all your asserts. Tests should fail rarely, and when they do they almost always require stepping through or at least looking at the source of the test to figure out why they are failing.
I always omit the message unless it’s explaning something that isn’t obvious when looking at the code.
Comment [1]

32 bit JDK on a 64 bit Ubuntu system
Thursday September 4, 2008 by Derek Young
If you have more than 3GB on your machine and you’re running Ubuntu you’ve probably had to figure out how to access that additional memory – the default Ubuntu desktop kernel will only allow access to the first 3GB. You can install the server kernel, but that’s been tuned for a server with different latency settings, etc. You can recompile the desktop kernel with HIGHMEM64 set, but then you’re stuck building the video drivers yourself.
My latest strategy has been to use the 64 bit kernel. 64 bit support is not bad now and most apps run normally. Of course they use about double the memory. If you’re running a lot of Java processes this 64 bit tax is very noticeable. For my needs, 32 bit Java is fine, even with a 64 bit kernel. Ubuntu/Debian ship a 32 bit JRE (ia32-sun-java6-bin). This package provides only the runtime environment (no javac) and the client VM so it has limited usefulness for a developer.
To install the 32 bit JDK from Sun on a 64 bit system you can use java-package. I’ve been running Eclipse and all my development applications and finally have some free memory again.
Installation
First, download the latest 32 bit JDK (not JRE) from Sun. At the time this was jdk-6u7-linux-i586.bin for me.
Install java-package:
sudo apt-get install java-package
Now use java-package to build a .deb package from the binary you downloaded. You have to trick it into building the 32 bit package:
DEB_BUILD_GNU_TYPE=i486-linux-gnu DEB_BUILD_ARCH=i386 fakeroot make-jpkg jdk-6u7-linux-i586.bin
This should generate a .deb package. For some reason the package name has the _amd64 suffix. Install the package:
sudo dpkg -i sun-j2sdk1.6_1.6.0+update7_amd64.deb
Use update-alternatives to select the new JDK. It was installed at /usr/lib/j2sdk1.6-sun for me.
sudo update-alternatives --config java
If you run java -version you should see the correct version:
java version "1.6.0_07" Java(TM) SE Runtime Environment (build 1.6.0_07-b06) Java HotSpot(TM) Server VM (build 10.0-b23, mixed mode)
32 bit Eclipse
I had to reinstall the 32 bit version of Eclipse (since SWT contains native code). I also had to delete my ~/.eclipse directory or Eclipse wouldn’t start (this requires reinstalling new versions of any plugins). Finally, add the new JRE in Java->Installed JREs using the install location (/usr/lib/j2sdk1.6-sun) and select it as the default.
Comment [2]

