Re: [WIP v2 0/2] Modifying pack objects to support --blob-max-bytes

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Mon, 5 Jun 2017 10:35:23 -0700

Thanks for your comments.

On Fri, 2 Jun 2017 18:16:45 -0400
Jeff King <peff@xxxxxxxx> wrote:

> On Fri, Jun 02, 2017 at 12:38:43PM -0700, Jonathan Tan wrote:
> 
> > > Do we need to future-proof the output format so that we can later
> > > use 32-byte hash?  The input to pack-objects (i.e. rev-list --objects)
> > > is hexadecimal text, and it may not be so bad to make this also
> > > text, e.g. "<hash> SP <length> LF".  That way, you do not have to
> > > worry about byte order, hash size, or length limited to uint64.
> > 
> > The reason for using binary is for the convenience of the client to
> > avoid another conversion before storing it to disk (and also network
> > savings). In a large repo, I think this list will be quite large. I
> > realized that I didn't mention anything about this in the commit
> > message, so I have added an explanation.
> > 
> > I think this is sufficiently future-proof in that the format of this
> > hash matches the format of the hashes used in the objects in the
> > packfile. As for object size being limited to uint64, I think the
> > benefits of the space savings (in using binary) outweigh the small risk
> > that our files will get larger than that before we upgrade our protocol
> > :-P
> 
> The rest of the pack code uses a varint encoding which is generally
> much smaller than a uint64 for most files, but can handle arbitrary
> sizes.
> 
> The one thing it loses is that you wouldn't have a fixed-size record, so
> if you were planning to dump this directly to disk and binary-search it,
> that won't work. OTOH, you could make pseudo-pack-entries and just
> index them along with the rest of the objects in the pack .idx.
> 
> The one subtle thing there is that the pseudo-entry would have to say
> "this is my sha1". And then we'd end up repeating that sha1 in the .idx
> file. So it's optimal on the network but wastes 20 bytes on disk (unless
> index-pack throws away the in-pack sha1s as it indexes, which is
> certainly an option).

If we end up going with the varint approach (which seems reasonable),
maybe the client could just expand the varints into uint64s so that it
has a binary-searchable file. I think it's better to keep this list
separate from the pack .idx file (there has been some discussion on this
- [1] and its replies).

[1] https://public-inbox.org/git/777ab8f2-c31a-d07b-ffe3-f8333f408ea1@xxxxxxxxxxxxxxxxx/

> > > Can this multiplication overflow (hint: st_mult)?
> > 
> > Thanks - used st_mult.
> 
> Actually, I think this is a good candidate for ALLOC_ARRAY().

Thanks - I've made this change in my local version.

> > > This sorting is a part of external contract (i.e. "output in hash
> > > order" is documented), but is it necessary?  Just wondering.
> > 
> > It is convenient for the client because the client can then store it
> > directly and binary search when accessing it. (Added a note about
> > convenience to the commit message.)
> 
> In general the Git client doesn't trust the pack data coming from a
> remote, and you can't corrupt a client by sending it bogus data. Either
> the client computes it from scratch (e.g., the sha1s of each object) or
> the client will reject nonsense (missing bases, refs pointing to objects
> that aren't sent, etc).
> 
> I know this feature implies a certain amount of trust (after all, the
> server could claim that it omitted any sha1 it likes), but we should
> probably still be as strict as possible that what the other side is
> sending makes sense. In this case, we should probably hashcmp() each
> entry with the last and make sure they're strictly increasing (no
> out-of-order and no duplicates).

Good point.