Re: [RFC][PATCH] <linux/stringhash.h>: fix end_name_hash() for 64bit long

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Fri, 9 Feb 2018 10:33:15 -0800

On Mon, Feb 5, 2018 at 9:32 AM, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> The comment claims that this helper will try not to loose bits, but
> for 64bit long it looses the high bits before hashing 64bit long into
> 32bit int. Use the helper hash_long() to do the right thing for 64bit
> long. For 32bit long, there is no change.

Ok, sorry for the delay, I only got back to this now because the merge
window is calming down (famous last words - sometimes Friday ends up
being busy, but I might decide that it's too late if somebody sends me
an annoying late pull request now).

And after having looked more at it, I take back all my complaints
about the patch, you were right and I was mis-reading things or just
being stupid.

I also don't worry too much about the possible performance impact of
this on 64-bit, since most architectures that actually care about
performance end up not using this very much (the dcache code is the
most performance-critical, but the word-at-a-time case uses its own
hashing anyway).

So this ends up being mostly used for filesystems that do their own
degraded hashing (usually because they want a case-insensitive
comparison function).

A _tiny_ worry remains, in that not everybody uses DCACHE_WORD_ACCESS,
and then this potentially makes things more expensive on 64-bit
architectures with slow or lacking multipliers even for the normal
case.

That said, realistically the only such architecture I can think of is
PA-RISC. Nobody really cares about performance on that, it's more of a
"look ma, I've got warts^W an odd machine" platform.

So I think your patch is fine, and all my initial worries were just
misplaced from not looking at this properly.

Sorry.

I *would* love to get some kind of estimate on how this changes the
hash distribution. Do you have anything like that? The way this
function is designed to be used, the upper bits should be used for the
hash bucket information, so doing some bucket size distribution on
(say) the upper ~12 bits or something with some real string input
would be really good.

                Linus