Re: [PATCH] Enhance unpack-objects for extracting large objects

Junio C Hamano <junkio@xxxxxxx> · Fri, 25 May 2007 12:22:28 -0700

Dana How <danahow@xxxxxxxxx> writes:

> Nicolas Pitre wrote:
>> I wouldn't mind a _separate_ tool that would load a pack index,
>> determine object sizes from it, and then extract big objects to write
>> them as loose objects ...
>
> Below we add two new options to git-unpack-objects:
>
> --min-blob-size=<n>::  Unpacking is only done for objects
> larger than or equal to n kB (uncompressed size by Junio).

Elsewhere you wanted to use --max-* and that was counted in megs;
isn't using kilo here and meg there inconsistent?

> --force::  Loose objects will be created even if they
> already exist in the repository packed.  This is an option
> I've wanted before for other reasons.

        ... but if they already exist in the repository as loose
        objects, do not replace it.

Usually we do not overwrite existing loose objects and it is one
of the security measure --- if you have an object already, that
cannot be touched by somebody who maliciously creats a hash
colliding loose object and tries to inject it into your
repository via unpack-objects.  It's good that you kept this
behaviour intact.

> This passes the tests in "t" but has not yet been used on my large repos.
> Based on "next" but should apply to "master" as well.
>
> Signed-off-by: Dana L. How <danahow@xxxxxxxxx>
> ---
>  Documentation/git-unpack-objects.txt |   17 +++++++++++++----
>  builtin-unpack-objects.c             |   20 ++++++++++++++++++--
>  cache.h                              |    2 ++
>  sha1_file.c                          |   11 +++++++++--
>  4 files changed, 42 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/git-unpack-objects.txt b/Documentation/git-unpack-objects.txt
> index ff6184b..4513d8d 100644
> --- a/Documentation/git-unpack-objects.txt
> +++ b/Documentation/git-unpack-objects.txt
> ...
> @@ -17,9 +17,10 @@ Read a packed archive (.pack) from the standard input, expanding
>  the objects contained within and writing them into the repository in
>  "loose" (one object per file) format.
>  
> -Objects that already exist in the repository will *not* be unpacked
> -from the pack-file.  Therefore, nothing will be unpacked if you use
> -this command on a pack-file that exists within the target repository.
> +By default,  objects that already exist in the repository will *not*
> +be unpacked from the pack-file.  Therefore, nothing will be unpacked
> +if you use this command on a pack-file that exists within the target
> +repository,  unless you specify --force.

I would want to add:

	If an object already exists unpacked in the repository,
	it will not be replaced with the copy from the pack,
	with or without `--force`.

> diff --git a/builtin-unpack-objects.c b/builtin-unpack-objects.c
> index a6ff62f..a42bf0d 100644
> --- a/builtin-unpack-objects.c
> +++ b/builtin-unpack-objects.c
> @@ -10,13 +10,16 @@
>  #include "progress.h"
>  
>  static int dry_run, quiet, recover, has_errors;
> -static const char unpack_usage[] = "git-unpack-objects [-n] [-q] [-r] < pack-file";
> +static const char unpack_usage[] =
> +"git-unpack-objects [-n] [-q] [-r] [--force] [--min-blob-size=N] < pack-file";

Maybe we would want to call it '-f' for consistency.  Another
possibility is the other way around, giving others a longer
synonyms, like --quiet, but this command is plumbing and I do
not think long options matters that much, so my preference is to
do '-f' not '--force'.

> @@ -131,7 +134,9 @@ static void added_object(unsigned nr, enum object_type type,
>  static void write_object(unsigned nr, enum object_type type,
>  			 void *buf, unsigned long size)
>  {
> -	if (write_sha1_file(buf, size, typename(type), obj_list[nr].sha1) < 0)
> +	int force2 = size < min_blob_size ? -1 : force;
> +	if (write_sha1_file_maybe(buf, size, typename(type),
> +				  force2, obj_list[nr].sha1) < 0)
>  		die("failed to write object");
>  	added_object(nr, type, buf, size);
>  }

Without --min-blob-size option, min_blob_size is initialized to
0u and force2 always gets the value of force.  With the option,
blobs smaller than the threshold gets -1 and otherwise the value
of force.

"write_sha1_file_maybe()" can take 0, 1, or -1 as its fourth
parameter.  The reader is left puzzled what the distinction
among these three and decides to read on to figure it out before
complaining too much about the code, but no matter what it does,
doesn't the above logic already feel wrong?

 * You already have the size here, so if min_blob_size is set
   and the size is larger, you do not even have to call
   write_sha1_file() at all.

 * If you do so, write_sha1_file_maybe()'s additional parameter
   can be "skip the check to see if we have one packed".

> diff --git a/cache.h b/cache.h
> index ec85d93..d0c3030 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -343,6 +343,8 @@ extern int sha1_object_info(const unsigned char *, unsigned long *);
>  extern void * read_sha1_file(const unsigned char *sha1, enum object_type *type, unsigned long *size);
>  extern int hash_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *sha1);
>  extern int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *return_sha1);
> +extern int write_sha1_file_maybe(void *buf, unsigned long len, const char *type,
> +				 int ignore, unsigned char *return_sha1);
>  extern int pretend_sha1_file(void *, unsigned long, enum object_type, unsigned char *);
>  
>  extern int check_sha1_signature(const unsigned char *sha1, void *buf, unsigned long size, const char *type);

... and it says "ignore".  The reader is still puzzled and reads on...

> diff --git a/sha1_file.c b/sha1_file.c
> index 12d2ef2..68b8db8 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1979,7 +1979,8 @@ int hash_sha1_file(const void *buf, unsigned long len, const char *type,
>  	return 0;
>  }
>  
> -int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *returnsha1)
> +int write_sha1_file_maybe(void *buf, unsigned long len, const char *type,
> +			  int ignore, unsigned char *returnsha1)
>  {
>  	int size, ret;
>  	unsigned char *compressed;
> @@ -1997,7 +1998,7 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha
>  	filename = sha1_file_name(sha1);
>  	if (returnsha1)
>  		hashcpy(returnsha1, sha1);
> -	if (has_sha1_file(sha1))
> +	if (ignore < 0 || !ignore && has_sha1_file(sha1))
>  		return 0;
>  	fd = open(filename, O_RDONLY);
>  	if (fd >= 0) {

So "ignore" means:

        negative:       never write it out, even if it does not exist.

        zero:           do not write it out if it is available (in pack,
                        or loose, either local or alternate), do
                        write it out otherwise; it is the same
                        as the current behaviour of write_sha1_file().

        positive:       always write it out.

That does not sound like "ignore".

My suggestion would be:

>  static void write_object(unsigned nr, enum object_type type,
>  			 void *buf, unsigned long size)
>  {
        if (!min_blob_size || size < min_blob_size) {
             if (write_sha1_file_maybe(buf, size, typename(type),
                             force, obj_list[nr].sha1) < 0)
                     die("failed to write object");
             }
        }
 	added_object(nr, type, buf, size);
>  }

And then.

int write_sha1_file_maybe(void *buf, unsigned long len, const char *type,
		int make_loose, unsigned char *returnsha1)
{
	...
>  	filename = sha1_file_name(sha1);
>  	if (returnsha1)
>  		hashcpy(returnsha1, sha1);
	if (!make_loose && has_sha1_file(sha1))
>  		return 0;
>  	fd = open(filename, O_RDONLY);
>  	if (fd >= 0) {

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html