Consistent String#hash based only on the string's content

Consistent String#hash based only on the string's content

GOAL: Map every URL handled by a server to 0, 1, 2, or 3, distributing as uniformly as possible.

While the documentation for ruby's String#hash method says it will "return a hash based on the string‘s length and content," this clearly isn't the whole story. A given string's hash is not consistent across invocations of the interpreter:

$ irb ruby-1.9.2-p180 :001 > "foo".hash => 360517580588231756 ruby-1.9.2-p180 :002 > ^D $ irb ruby-1.9.2-p180 :001 > "foo".hash => -2716152678666510148

This means a particular string's hash value may differ across, say, servers. Rails uses String#hash internally to map a URL path to one of four asset hosts (if the app's asset_host is so configured), but this feature is a lot less efficient than it could be because of the cross-machine inconsistencies; different servers may map the same URL to different asset hosts, reducing the effectiveness of caches, clouding skies, cooling cups of tea prematurely, besmirching the reputations of otherwise fine programmers.

String#hash

Can you suggest an alternate hash function that could effectively and speedily distribute hashes across a typical app's URL space, preferably one that produces a Fixnum since, in the end, I'll want to map it into one of four asset hosts?

Just wondering, did you ever find a good solution to this?
– robd
Mar 15 '13 at 17:43

CRC may be a good solution - see stackoverflow.com/questions/4452161/…
– robd
Mar 15 '13 at 17:55

4 Answers
4

there are lot of such functionality in ruby's digest module: http://ruby-doc.org/stdlib/libdoc/digest/rdoc/index.html

simple example:

require 'digest/sha1' Digest::SHA1.hexdigest("some string")

True, but are SHA1 and MD5 overkill? Too slow?
– Rob Davis
Jun 30 '11 at 15:09

fast enough i'd say. you can check google's CityHash which targets performance and minimizing collisions in general text strings but give sha1/md5 a try and test the performance.
– keymone
Jun 30 '11 at 15:11

@RobDavis the "hex" in #hexdigest suggests that output is a hexadecimal number, to convert it to int all you have to do is call to_i(16)
– keymone
Jun 12 '15 at 12:38

i don't know constraints of your system but sha is a real solution to your problem, just as md5 is and murmur hash and crc32. the difference is the space you're mapping urls onto - only 4bn numbers in 32 bits which is not good enough for almost any problem i can think of. instead of restricting yourself to short 32bit integers, try representing large integers in higher base - base64/62 is a great way to represent huge integers with few characters.
– keymone
Jun 15 '15 at 8:10

There is tiny library xxHash:

XXhash.xxh32('qwe') #=> 2396643526 XXhash.xxh64('qwe') #=> 9343136760830690622

Maybe it will have more collisions but it is 10x faster than SHA1:

Benchmark.bm do |x| n = 100_000 str = 'qweqweqwe' x.report('xxhash32') n.times XXhash.xxh32(str) x.report('xxhash64') n.times XXhash.xxh64(str) x.report('hexadigest') n.times Digest::SHA1.hexdigest(str) end;1 # user system total real # xxhash32 0.020000 0.000000 0.020000 ( 0.021948) # xxhash64 0.040000 0.000000 0.040000 ( 0.036340) # hexadigest 0.240000 0.030000 0.270000 ( 0.276443)

You can try to_i(36).

"Hash me please :(".to_i(36) => 807137

Although that only seems to look at the first four characters: "Hash something else".to_i(36) also produces 807137
– Rob Davis
Jun 30 '11 at 16:47

"Hash something else".to_i(36)

It works only until first space. So it is usable for URLs. There is similar method on Fixnum called to_s(36).
– retro
Jun 30 '11 at 23:33

Oh I see. That may just do the trick.
– Rob Davis
Jun 30 '11 at 23:39

Actually, it seems to stop at the first slash.
– Rob Davis
Jul 1 '11 at 13:20

It's going to stop at the first non-alphanumeric character, because you are specifying a number base. You could strip such characters, but then you might have to worry about overflow (since you are basically asking it to parse a huge number).
– benzado
Mar 21 '12 at 4:59

The easiest (and consistent) way may be this (and it's fast):

"https://www.example.com/abc/def/123?hij=345".sum % 4

That will always produce an integer 0 - 3, is quite fast, and should be fairly well distributed (though I haven't actually run tests on distribution).

Required, but never shown

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

FG6K60M1HlgLWWExmK762w9AzN tvSj,PFmutTOguY7dwYSVsYC4P9L,MMnFma3g mMnor,O7 nsMaD5da09fYTeaMFOHy R8ZHqU

搜尋此網誌

Dfyjkt