Re: I'm a total push-over..

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 1/27/08, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Sat, 26 Jan 2008, Marko Kreen wrote:
> >
> > Here you misunderstood me, I was proposing following:
> >
> > int hash_folded(const char *str, int len)
> > {
> >    char buf[512];
> >    do_folding(buf, str, len);
> >    return do_hash(buf, len);
> > }
> >
> > That is - the folded string should stay internal to hash function.
>
> If it's internal, it's much better, but you still missed the performance
> angle.
>
> The fact is, hashing can take shortcuts that folding cannot do!
>
> Case folding, by definition, has to be "exact" (since the whole point is
> what you're going to use the same folding function to do the compare, so
> if you play games with folding, the compares will be wrong).
>
> But hashing doesn't have to be exact. It's ok to hash '{' and '[' as if
> they were different cases of the same character, if that gives you a
> faster hash function. Especially as those charactes are rather rare in
> filenames.
>
> So if you do hashing as a function of its own, you can simply do a better
> job at it.
>
> I do agree that the functions that create a folded set of characters from
> a _complex_ UTF-8 character should be shared between folding and hashing,
> since that code is too complex and there are no simple shortcuts for doing
> a faster hash that still retains all the properties we want.

Well, you can always have fold_quick_and_dirty() function that
is used only internally in hash_folded() function, which can:

- fold with simple |= 0x20202020..
- write out full uint32/64, no need to make result proper string
- zero-fill at the end, so hash function does not need to check
  for partial block, which is pretty expensive part of hashing.

The win would be:
- more modularized code
- can use faster/any hash
- hash function can be certain to work on aligned data
  (win on non-x86)

The minus:
- some memory i/o overhead which may or may not matter
- the parts would not be fully generic, but special to hashing


-- 
marko


PS. Typo in last mail - "inner loop should be reversible - that
means that details from beginning of data should shift out of
horizon."  That obviously means "data should _not_ shift
out of horizon.

btw, "reversible" for integer hashes means that there is 1:1
mapping between input and output - no collisions.  Thus
no info loss.
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux