Thanks for your comments.

On Fri, 2 Jun 2017 18:16:45 -0400 Jeff King <peff@xxxxxxxx> wrote:

> On Fri, Jun 02, 2017 at 12:38:43PM -0700, Jonathan Tan wrote:
>
> > > Do we need to future-proof the output format so that we can later
> > > use 32-byte hash? The input to pack-objects (i.e. rev-list --objects)
> > > is hexadecimal text, and it may not be so bad to make this also
> > > text, e.g. "<hash> SP <length> LF". That way, you do not have to
> > > worry about byte order, hash size, or length limited to uint64.
> >
> > The reason for using binary is for the convenience of the client to
> > avoid another conversion before storing it to disk (and also network
> > savings). In a large repo, I think this list will be quite large. I
> > realized that I didn't mention anything about this in the commit
> > message, so I have added an explanation.
> >
> > I think this is sufficiently future-proof in that the format of this
> > hash matches the format of the hashes used in the objects in the
> > packfile. As for object size being limited to uint64, I think the
> > benefits of the space savings (in using binary) outweigh the small risk
> > that our files will get larger than that before we upgrade our protocol
> > :-P
>
> The rest of the pack code uses a varint encoding which is generally
> much smaller than a uint64 for most files, but can handle arbitrary
> sizes.
>
> The one thing it loses is that you wouldn't have a fixed-size record, so
> if you were planning to dump this directly to disk and binary-search it,
> that won't work. OTOH, you could make pseudo-pack-entries and just
> index them along with the rest of the objects in the pack .idx.
>
> The one subtle thing there is that the pseudo-entry would have to say
> "this is my sha1". And then we'd end up repeating that sha1 in the .idx
> file. So it's optimal on the network but wastes 20 bytes on disk (unless
> index-pack throws away the in-pack sha1s as it indexes, which is
> certainly an option).

If we end up going with the varint approach (which seems reasonable),
maybe the client could just expand the varints into uint64s so that it
has a binary-searchable file. I think it's better to keep this list
separate from the pack .idx file (there has been some discussion on
this - [1] and its replies).

[1] https://public-inbox.org/git/777ab8f2-c31a-d07b-ffe3-f8333f408ea1@xxxxxxxxxxxxxxxxx/
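To be concrete about the expansion, I'm imagining something like this on
the client side (untested sketch - the function name and the
<20-byte hash, varint size> wire layout are just placeholders for
illustration; decode_varint() is the one from varint.h):

#include "git-compat-util.h"
#include "varint.h"

/*
 * Untested sketch: expand one <20-byte hash, varint size> pair from the
 * wire into a fixed-width <20-byte hash, 8-byte big-endian size> record,
 * so that the resulting file can still be binary-searched by hash.
 * (Error checking omitted.)
 */
static void expand_omitted_entry(const unsigned char **cur, FILE *out)
{
	const unsigned char *sha1 = *cur;
	uint64_t size;
	unsigned char be[8];
	int i;

	*cur += 20;			/* skip past the hash */
	size = decode_varint(cur);	/* consume the varint-encoded size */

	/* store the size as a fixed-width big-endian uint64 */
	for (i = 7; i >= 0; i--) {
		be[i] = size & 0xff;
		size >>= 8;
	}

	fwrite(sha1, 20, 1, out);
	fwrite(be, sizeof(be), 1, out);
}

That keeps each on-disk record a fixed 28 bytes, so a plain binary
search by hash still works.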
> > > Can this multiplication overflow (hint: st_mult)?
> > Thanks - used st_mult.
>
> Actually, I think this is a good candidate for ALLOC_ARRAY().

Thanks - I've made this change in my local version.

> > > This sorting is a part of external contract (i.e. "output in hash
> > > order" is documented), but is it necessary? Just wondering.
> > It is convenient for the client because the client can then store it
> > directly and binary search when accessing it. (Added a note about
> > convenience to the commit message.)
>
> In general the Git client doesn't trust the pack data coming from a
> remote, and you can't corrupt a client by sending it bogus data. Either
> the client computes it from scratch (e.g., the sha1s of each object) or
> the client will reject nonsense (missing bases, refs pointing to objects
> that aren't sent, etc).
>
> I know this feature implies a certain amount of trust (after all, the
> server could claim that it omitted any sha1 it likes), but we should
> probably still be as strict as possible that what the other side is
> sending makes sense. In this case, we should probably hashcmp() each
> entry with the last and make sure they're strictly increasing (no
> out-of-order and no duplicates).

Good point.
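Something along these lines when reading the list back, perhaps
(untested sketch - the entry layout and names are made up for
illustration; hashcmp() is the usual one from cache.h):

#include "cache.h"

/*
 * Untested sketch: verify that the received entries are in strictly
 * increasing hash order (no out-of-order entries, no duplicates).
 * Each entry is assumed to start with a 20-byte hash.
 */
static int check_entries_sorted(const unsigned char *entries, size_t nr,
				size_t entry_size)
{
	size_t i;

	for (i = 1; i < nr; i++) {
		const unsigned char *prev = entries + (i - 1) * entry_size;
		const unsigned char *cur = prev + entry_size;

		if (hashcmp(prev, cur) >= 0)
			return error("entries not in strictly increasing hash order");
	}
	return 0;
}

Returning error() rather than die()ing would let the caller decide
whether to discard the whole list.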