Re: I'm a total push-over..

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Fri, 25 Jan 2008 14:35:45 -0800 (PST)

On Fri, 25 Jan 2008, Linus Torvalds wrote:
> 
> I can write a hash function that reliably does 8 bytes at a time for the 
> common case on a 64-bit architecture, exactly because it's easy to do 
> "test high bits in parallel" with a simple bitwise 'and', and we can do 
> the same with "approximate lower-to-uppercase 8 bytes at a time" for a 
> hash by just clearing bit 5.

Side note: you *can* get better approximations fairly cheaply if you care.

If you want to distinguish the characters 0-31 from the characters 31-63 
in your hash (pointless for filenames, but it can be worthwhile for some 
other string cases), you can decide to clear bit#5 only if bit#6 in that 
byte was also set, with just a few bitwise operations.

Eg, imagine that you have "unsigned long x" containing eight bytes of 
ascii data (ie you already did the test by 0x8080808080808080), you can do 
things like

	unsigned long bit6 = x & 0x4040404040404040;
	x &= ~(bit6 >> 1);

which will only clear bit5 if bit6 in the same byte was set..

So you can do tricks like that, and it will still be plenty fast. And 
nobody will ever care that while it collides 'A' with 'a' (by design), it 
also causes '{' and '[' to be considered "case collisions".

[ Amusing side note: '{' and '[' *are* case collisions in legacy 7-bit 
  "Finnish ASCII". The editor I use still "upper-cases" '{' to '['. I'm 
  not kidding, and yes, it really does it on purpose!

  It used to be that before everybody turned to Latin1, the {|} characters 
  were re-used in Finland (and Sweden, for that matter) for the 
  extra characters needed in Finnish. Because obviously nobody ever
  needed them for any real work.

  I (and probably every Finnish C UNIX programmer) used to be very good at 
  reading C source code even when it was full of odd finnish characters 
  with dots on top, instead of curly braces! ]

And yes, from a performance standpoint, things liek this probably do realy 
matter. For the kernel tree, the average pathname length is ~28 
characters. If you can do it with three iterations that do the first 24 
characters eight characters at a time, and then four iterations over the 
four last ones, rather than 28 iterations with byte->longword and 
multiplications in each, I bet it's quite visible.

Of course, it's going to be visible only if everything else is fast too, 
but git has been pretty good at that in general.

			Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html