[WIP v2 0/2] Modifying pack objects to support --blob-max-bytes

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Fri, 2 Jun 2017 12:38:43 -0700

Here's a new version addressing Junio's comments.

> Hmph, that statement is a hard to read and agree to.  I thought an
> ignored object that is not going to be packed is one that won't hit
> to_pack?

That is true currently, but will not be the full truth once the 2nd
patch is applied.  I have expanded the commit message to explain the
plan in more detail.

> It would be nice if the name of the parameter hints that this is
> counted in bytes, e.g. "--blob-max-bytes"; 

Done. Originally I wanted to avoid terms like "max" because this
disallowed the user from requesting that even 0-byte blobs not be sent,
but realized that I probably shouldn't bother about that. :-)

> Do we need to future-proof the output format so that we can later
> use 32-byte hash?  The input to pack-objects (i.e. rev-list --objects)
> is hexadecimal text, and it may not be so bad to make this also
> text, e.g. "<hash> SP <length> LF".  That way, you do not have to
> worry about byte order, hash size, or length limited to uint64.

The reason for using binary is for the convenience of the client to
avoid another conversion before storing it to disk (and also network
savings). In a large repo, I think this list will be quite large. I
realized that I didn't mention anything about this in the commit
message, so I have added an explanation.

I think this is sufficiently future-proof in that the format of this
hash matches the format of the hashes used in the objects in the
packfile. As for object size being limited to uint64, I think the
benefits of the space savings (in using binary) outweigh the small risk
that our files will get larger than that before we upgrade our protocol
:-P

>> ++
>> +If pack-objects cannot efficiently determine, for an arbitrary blob,
>> +that no to-be-packed tree has that blob with a filename beginning with
>> +".git" (for example, if bitmaps are used to calculate the objects to be
>> +packed), it will pack all blobs regardless of size.
>> +
> This is a bit hard to understand, at least to me.  Let me step-wise
> deconstruct what I think you are saying.
> 
>  - A blob can appear in multiple trees, and each tree has name
>    (i.e. pathname component) for the blob.
> 
>  - Sometimes we may know _all_ trees that contain a particular
>    blob and hence _all_ names these trees call this blob.  In such a
>    case, we can _prove_ that the blob never is used as ".git<something>"
>    in any tree.
> 
>  - We exclude a blob that is larger than the specified limit, but
>    only when we can prove the above.  If it is possible that the
>    blob might appear as ".git<something>" in some tree, the blob is
>    not omitted no matter its size.
> 
> And this is a very conservative way to avoid omitting something that
> could be one of those control files like ".gitignore".
> 
> Am I following your thought correctly?

We only need to prove for *to-be-packed* trees (the trees that
pack-objects are about to pack in the packfile(s) that it is printing),
but otherwise, you are correct. I was trying to be thorough in saying
that "if we don't have bitmaps, we know if a blob's name starts with
'.git', so we can do filtering by size, but if we have bitmaps, we don't
have any relevant name information" - any rewording suggestions are
appreciated.

> Can this multiplication overflow (hint: st_mult)?

Thanks - used st_mult.

> This sorting is a part of external contract (i.e. "output in hash
> order" is documented), but is it necessary?  Just wondering.

It is convenient for the client because the client can then store it
directly and binary search when accessing it. (Added a note about
convenience to the commit message.)

> You do want to exclude the trailing thing out of the checksum, but
> CSUM_CLOSE would close the fd and that is why you pass 0 here.  OK.
>
> Even though we are in "pack_to_stdout" codepath, I'd prefer to
> consistently use f->fd, not a hardcoded 1 here.  Of course,
> sha1close() would have freed 'f' at this point, so we need to grab
> the return value from the function to learn which fd is connected to
> our original "stdout" at this point.
> 
> Likewise for write_oversized_info(), which probably should just take
> "int fd" as parameter.

Done for write_oversized_info(), although now that we are moving the
invocation of write_oversized_info() out of write_pack_file(), this
might not be so important. (For the rest, I have reverted the code
because we are now printing the excluded hashes before the packfile
instead of after.)

> I do not necessarily think it is a bad idea to show this as trailing
> data after the pack, but when we come to this function, do we have
> enough information to make the "oversize" information appear before
> the pack data if we wanted to?  I have a suspicion that it is easier
> to handle at the receiving end, after the capability exchange
> decides to use this "omit oversized ones" extension, to receive this
> before the packdata begins.  If it is at the end, the receiver needs
> to spawn either unpack-objects or index-objects to cause it write to
> the disk, but when it does so, it has to have a FD open to read from
> their standard output so that the receiver can grab the "remainder"
> that these programs did not process.  "Write to the end of the pack
> stream data to these programs, and process the remainder ourselves"
> is much harder to arrange, I'd suspect.

Yes, we can write at the beginning - done.

> > +/*
> > + * Returns 1 and records this blob in oversized_blobs if the given blob does
> > + * not meet any defined blob size limits.  Blobs that appear as a tree entry
> > + * whose basename begins with ".git" are never considered oversized.
> > + *
> > + * If the tree entry corresponding to the given blob is unknown, pass NULL as
> > + * name. In this case, this function will always return 0 to avoid false
> > + * positives.
> > + */
> > +static int oversized(const unsigned char *sha1, const char *name) {  
>
> Not object_id?

This would require a "(const struct object_id *) sha1" cast in its
caller that potentially violates alignment constraints, as far as I
know, so I used "const unsigned char *" for now. If the object_id
conversion doesn't include this file in the near future, I'll try to
send out a patch for this file.

> When we have two paths to a blob, and the blob is first fed to this
> function with one name that does not begin with ".git", we would
> register it to the hashmap.  And then the other name that begins
> with ".git" is later discovered to point at the same blob, what
> happens?  Would we need to unregister it from the hashmap elsewhere
> in the code?
[snip]
> Ah, is this where the "unregistering" happens?

Yes.

> >  static int add_object_entry(const unsigned char *sha1, enum object_type type,
> > -			    const char *name, int exclude)
> > +			    const char *name, int preferred_base)  
> 
> This belongs to the previous step that renamed "exclude", no?

Yes - done.

Jonathan Tan (2):
  pack-objects: rename want_.* to ignore_.*
  pack-objects: support --blob-max-bytes

 Documentation/git-pack-objects.txt |  19 ++++-
 builtin/pack-objects.c             | 159 ++++++++++++++++++++++++++++++-------
 t/t5300-pack-object.sh             |  71 +++++++++++++++++
 3 files changed, 220 insertions(+), 29 deletions(-)

-- 
2.13.0.506.g27d5fe0cd-goog