Re: heads-up: git-index-pack in "next" is broken

Davide Libenzi <davidel@xxxxxxxxxxxxxxx> · Wed, 18 Oct 2006 14:21:58 -0700 (PDT)

On Wed, 18 Oct 2006, Linus Torvalds wrote:

> On Wed, 18 Oct 2006, Davide Libenzi wrote:
> >
> > Speaking in general, seen at the hash function level, of course an interface 
> > should not give different results for different word sizes or word endianness. 
> > Considering the diff algorithm as interface, as I said, the output was 
> > unaffected by the 64 bits word size. It was just very slow.
> 
> Well, even the output may actually be affected, in the case of _real_ hash 
> collisions (as opposed to just the hash _list_ collision that XDL_HASHLONG 
> caused).
> 
> So I actually think it would be better to have "uint32_t" as the hash 
> value - because that would mean that all diffs (or, in the case of the 
> block-algorithm, the deltas) are guaranteed to give the same results 
> regardless of architecture.
> 
> Right now, we actually generate a 64-bit hash value (BUT: for short lines, 
> it's likely only _interesting_ in the low bits, so the high bits tend to 
> have a very high likelihood of being zero). So hash collisions are 
> different: on a 32-bit architecture, two lines may have the same hash, 
> while on a 64-bit one, they are different.
> 
> And together with some of the limiters we have (eg XDL_MAX_EQLIMIT) hash 
> collisions can sometimes affect the output.
> 
> Admittedly, in _practice_ this is really unlikely to affect anything 
> (you'd get a valid diff in either case, they'd just possibly be subtly 
> different, and the input data must be _really_ strange to even see that 
> case), but I do think that the hash algorithm can matter.
> 
> NOTE! I'm not talking about XDL_HASHLONG(), I'm talking about the 
> xdl_hash_record() hash, which returns differently-sized hash results on 
> 32-bit and 64-bit. And there are cases where we _only_ compare the hashes, 
> and don't actually double-check the contents.

The hash value (hence the hash bucket index) simply directs you to the 
bucket where a real record-compare loop is performed. Collisions here 
mean only a performance loss (the records are not spread across enough 
buckets), nothing more. So the different behaviour does not apply to the 
diff algo.
But it would apply if knowledge of the hash indexing were "exported", 
for example inside an external file. Think about a trivial DB file where 
you store hash buckets in a file and you want to look up records based on 
the stored hash layout. In that case, of course, if the hash function 
that generated the DB bucket layout is different from the one that you 
use to get the bucket index at lookup time, you're in trouble.
IOW if the hash function result is not "exported" in some way, it doesn't 
really matter if it's an 'unsigned long' or a 'uint32_t'. In the same way, 
you cannot export binary structures using 'int' or 'long' and expect them 
to be compatible across different architectures.
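
To make the argument concrete, here is a minimal sketch. It is not the 
xdiff code: toy_hash(), struct table, table_add() and table_find() are 
hypothetical names made up for illustration, and the hash is a plain 
djb2-style stand-in. The lookup always ends in a real length and content 
compare, so the hash width only decides which bucket chain gets walked.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBUCKETS 8   /* hypothetical bucket count */
#define MAXRECS  16  /* hypothetical capacity */

enum hash_mode { HASH_64, HASH_32, HASH_CONST };

struct rec {
    const char *ptr;  /* record (line) contents */
    size_t len;
    uint64_t hash;    /* hash cached at insert time */
    int next;         /* next record in the same bucket, -1 ends the chain */
};

struct table {
    struct rec recs[MAXRECS];
    int nrecs;
    int heads[NBUCKETS];
};

/* Toy djb2-style hash; a stand-in, NOT xdl_hash_record(). */
static uint64_t toy_hash(const char *s, size_t len, enum hash_mode mode)
{
    uint64_t h = 5381;
    size_t i;

    if (mode == HASH_CONST)
        return 0;  /* worst case: every record collides */
    for (i = 0; i < len; i++)
        h = h * 33 + (unsigned char)s[i];
    return mode == HASH_32 ? (uint32_t)h : h;
}

static void table_init(struct table *t)
{
    int i;

    t->nrecs = 0;
    for (i = 0; i < NBUCKETS; i++)
        t->heads[i] = -1;
}

/* Adds a record and returns its index, or -1 if the table is full. */
static int table_add(struct table *t, const char *s, enum hash_mode mode)
{
    struct rec *r;
    int b;

    if (t->nrecs == MAXRECS)
        return -1;
    r = &t->recs[t->nrecs];
    r->ptr = s;
    r->len = strlen(s);
    r->hash = toy_hash(s, r->len, mode);
    b = (int)(r->hash % NBUCKETS);
    r->next = t->heads[b];
    t->heads[b] = t->nrecs;
    return t->nrecs++;
}

/* Returns the index of the record equal to s, or -1 if there is none. */
static int table_find(const struct table *t, const char *s, enum hash_mode mode)
{
    size_t len = strlen(s);
    uint64_t h = toy_hash(s, len, mode);
    int i;

    for (i = t->heads[h % NBUCKETS]; i >= 0; i = t->recs[i].next) {
        const struct rec *r = &t->recs[i];

        /* The hash only prefilters; the memcmp() is the real compare. */
        if (r->hash == h && r->len == len && memcmp(r->ptr, s, len) == 0)
            return i;
    }
    return -1;
}
```

If you build and query a table with the same mode, table_find() gives the 
same answer for HASH_64, HASH_32 and even HASH_CONST; the last one just 
degrades to one long chain. If you build it with HASH_64 and query it with 
HASH_32, which is the "exported layout" situation, the stored and computed 
hashes no longer agree and lookups of records that are present fail.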

- Davide
