Re: I'm a total push-over..

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On Wed, 23 Jan 2008, Andreas Ericsson wrote:
> 
> Insofar as hashes go, it's not that shabby for hashing filenames.

Hashing filenames is pretty easy. You can do a reasonable job with any 
"multiply by an odd number, add in value". Picking the odd number is a 
random choice, some are better than others (and it depends on whether you 
end up always having a power-of-two bucket size etc), but it generally 
won't be horrible.

And considering that we generally won't have tons and tons of pathnames 
(ie we'd generally have thousands, not millions), and that the underlying 
hash not only resizes itself but actually uses the ful 32 bits as the 
lookup key, I don't worry too much. I suspect my random choice is fine.

So no, I didn't think my hash would _suck_, although I also didn't 
research which odd numbers to pick.

No, it's a joke because it doesn't really give an example of how to do the 
_expected_ hash collissions.

Here's another hash that is actually going to collide *much* more (and on 
purpose!), but is actually showing an example of something that actually 
hashes UTF strings so that upper-case and lower-case (and normalization) 
will all still hash to the same value:

	static unsigned int hash_name(const char *name, int namelen)
	{
	        unsigned int hash = 0x123;

	        do {
	                unsigned char c = *name++;
			if (c & 0x80)
				c = 0;
			c &= ~0x20;
	                hash = hash*101 + c;
	        } while (--namelen);
	        return hash;
	}

but the above does so by making the hash much much worse (although 
probably still acceptable for "normal source code name distributions" 
that don't have very many same-name-in-different-cases and high bit 
characters anyway).

The above is still fairly fast, but obviously at a serious cost in hash 
goodness, to the point of being totally unusable for anybody who uses 
Asian characters in their filenames.

To actually be really useful, you'd have to teach it about the character 
system and do a lookup into a case/normalization table.

So *that* is mainly why it's a joke. But it should work fine for the case 
it is used for now (exact match).

Picking a better-researched constant might still be a good idea.

		Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux