Hi, On Fri, 24 Jul 2009, Tony Finch wrote: > On Fri, 24 Jul 2009, Johannes Schindelin wrote: > > > > Think about it, absent any further information than "it is a hash, i.e. > > distributed pretty equally in _any_ byte", even subsets of a sorted list > > will me more or less linear. And assuming that they are linear is _still_ > > your best bet. > > The even distribution of the lower-order bytes is irrelevant. I was not talking about lower-order bytes. All bytes are pretty much evenly distributed. That's why SHA-1 is a good hash. > We're looking at the top 20-ish bits for a pack with a million-ish > objects. The more you zoom in the less linear a sorted list of hashes > will be, so assuming linearity at all scales is wrong. It's a bit like > fractal mountains. If you really find irregularities like that, then SHA-1 is really a lousy hash. Irregularities like this are typically exploitable. If you know of such an irregularity, you might want to write a paper that SHA-1 is broken and get famous. > > There is no way to achieve [O(1) seeks], best thing you can hope for > > is _expected_ O(1) (e.g. with a hashmap, with exponential worst case). > > Of course it's expected. However the worst case is nowhere near > exponential: it's linear because the second-order search is a linear > pagewise scan. But I think in practice, the larger the pack the more that > the randomization of the hash function will smooth out performance > oddities. (Sorry, I don't know enough statistics to be able to say what > the expected error of the linear interpolation is, though I expect it's a > fairly simple formula.) For small packs the number of seeks is 1 anyway. I will believe it when I see it. Ciao, Dscho -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html