Geohashes and the problem of Greenwich
8 June 2008
A geohash is a way to encode latitude/longitude into a string such as gcpvjk3c that includes only number and letters, making it easy to pass around in URLs, etc. The Wikipedia entry makes the algorithm seem more complicated than it actually is. The steps are:
- normalise latitude and longitude to the range [0, 1)
- convert normalised latitude and longitude to binary (in the conventional way)
- interlace the bits of the two binary representations to create one binary number
- convert this number to base-32 (i.e. every 5 bits becomes a character) using a alphabet that includes all the numbers and all the alphabet, with the exception of a, i, l, o
A geohash also supposedly has the property that similar locations have a similar prefix, making it easier for computer programs to figure out similar locations, without resorting to complicated and slow trigonometry that won’t scale to hundreds of thousands of points.
Geohashes do have this property for many geospatial neighbours, but there’s still a most significant bit, and locations either side of it are going to have very different geohashes. Unfortunately, given the location of Greenwich—about 10km to the east of central London—a whole bunch of London locations have very different geohashes. u10hb7951 and gcpuzewfz, for example, are very close to each other.
Moving the 0, 0 point to into the middle of the Atlantic (or anywhere not as inhabited as London) helps with this, but it’s obviously not a robust fix. I wonder how the geospatial extensions to MySQL and PostgreSQL actually work? It’s fairly efficient to convert hashes based on one origin to another, so perhaps they store multiple hashes per location? There must be some clever algorithms for this.