On Sat, 12 May 2012, Nguyễn Thái Ngọc Duy wrote:

> git usually streams large blobs directly to packs. But there are cases
> where git can create large loose blobs (unpack-objects or hash-object
> over pipe). Or they can come from other git implementations.
> core.bigfilethreshold can also be lowered and introduce a new wave of
> large loose blobs.
>
> Use the streaming interface to read these blobs and compress/write at
> the same time.
>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx>

Comments below.

> ---
>  index-pack's streaming support is on the way. unpack-objects is
>  another story, because I'm thinking of merging it back into
>  index-pack first, which may take more than one release cycle.
>
>  builtin/pack-objects.c | 73 ++++++++++++++++++++++++++++++++++++++++++++----
>  t/t1050-large.sh       | 16 ++++++++++
>  2 files changed, 83 insertions(+), 6 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 1861093..98b51c1 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -259,9 +309,14 @@ static unsigned long write_object(struct sha1file *f,
>  	if (!to_reuse) {
>  	no_reuse:
>  		if (!usable_delta) {
> -			buf = read_sha1_file(entry->idx.sha1, &type, &size);
> -			if (!buf)
> -				die("unable to read %s", sha1_to_hex(entry->idx.sha1));
> +			type = sha1_object_info(entry->idx.sha1, &size);

Please don't use sha1_object_info() lightly. This is a potentially
expensive operation, and you really don't want to do it for each object.

As a matter of fact, the information you are looking for has already
been determined earlier. See the code in check_object(), which tries
hard to avoid sha1_object_info() as much as possible. You should
therefore have entry->type and entry->size already set and ready for
you to use.

Nicolas
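
For illustration, a minimal sketch of what the hunk could do with those
cached fields instead. This is untested and assumes the surrounding
context of the quoted hunk plus the open_istream() API from streaming.h
that the patch already builds on; it is not the submitted patch itself:

	if (!usable_delta) {
		struct git_istream *st = NULL;

		/*
		 * check_object() has already determined this object's
		 * type and size; reuse them rather than paying for a
		 * sha1_object_info() lookup once per object.
		 */
		type = entry->type;
		size = entry->size;

		if (type == OBJ_BLOB && size > big_file_threshold) {
			/* large loose blob: stream and deflate on the fly */
			st = open_istream(entry->idx.sha1, &type, &size, NULL);
			if (!st)
				die("unable to stream %s",
				    sha1_to_hex(entry->idx.sha1));
			buf = NULL;
		} else {
			buf = read_sha1_file(entry->idx.sha1, &type, &size);
			if (!buf)
				die("unable to read %s",
				    sha1_to_hex(entry->idx.sha1));
		}
		/* remaining header/deflate/write logic unchanged */
	}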