Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs

Stefan Beller <sbeller@xxxxxxxxxx> · Tue, 11 Jul 2017 15:02:09 -0700

On Tue, Jul 11, 2017 at 12:48 PM, Jonathan Tan <jonathantanmy@xxxxxxxxxx> wrote:
> Currently, Git does not support repos with very large numbers of blobs
> or repos that wish to minimize manipulation of certain blobs (for
> example, because they are very large) very well, even if the user
> operates mostly on part of the repo, because Git is designed on the
> assumption that every blob referenced by a tree object is available
> somewhere in the repo storage.
>
> As a first step to reducing this problem, introduce the concept of
> promised blobs. Each Git repo can contain a list of promised blobs and
> their sizes at $GIT_DIR/objects/promisedblob. This patch contains
> functions to query them; functions for creating and modifying that file
> will be introduced in later patches.
>
> A repository that is missing a blob but has that blob promised is not
> considered to be in error, so also teach fsck this.

Here I wondered what this file looks like, in a later patch you
add documentation:

  +objects/promisedblob::
  +       This file records the sha1 object names and sizes of promised
  +       blobs.
  +

but that leaves more fundamental questions:
* Is that a list of sha1s, separated by LF? (CRLF? How would Windows
  users interact with it, are they supposed to ever modify this file?)
* Similar to .git/packed-refs, would we want to have a first line
  that has some sort of comment?
* In the first question I assumed a linear list, will that be a sorted list,
  or do we want to have some fancy data structure here?
  (critbit tree, bloom filter)
* is the contents in ASCII or binary? (What are the separators?)
* In the future I presume we'd want to quickly answer "Is X in the
  promised blobs list?" so would it be possible (in the future) to
  improve the searching e.g. binary search?
* ... I'll read on to see my questions answered, but I'd guess
  others would prefer to see it in the docs. :)

> + * A mmap-ed byte array of size (missing_blob_nr * ENTRY_SIZE). Each
> + * ENTRY_SIZE-sized entry consists of the SHA-1 of the promised blob and its
> + * 64-bit size in network byte order. The entries are sorted in ascending SHA-1
> + * order.

This seems to be the answer to the docs. :)
So binary representation, the size variable takes a fixed amount, such
that the n-th element in the file is at n * ENTRY_SIZE.

> +       if (st.st_size % ENTRY_SIZE) {
> +               die("Size of %s is not a multiple of %d", filename, ENTRY_SIZE);
> +       }

So it looks as if the file format is an array of ENTRY_SIZE.

Similar to other files, would we want to prefix the file with
a 4 letter magic number and a version number?

> +       prepare_promised_blobs();
> +       result = sha1_entry_pos(promised_blobs, ENTRY_SIZE, 0, 0,
> +                               promised_blob_nr, promised_blob_nr, oid->hash);

IIRC, this is a binary search.

> +int for_each_promised_blob(each_promised_blob_fn cb, void *data)
> +{
> +       struct object_id oid;
> +       int i, r;
> +
> +       prepare_promised_blobs();
> +       for (i = 0; i < promised_blob_nr; i++) {
> +               memcpy(oid.hash, &promised_blobs[i * ENTRY_SIZE],
> +                      GIT_SHA1_RAWSZ);
> +               r = cb(&oid, data);
> +               if (r)
> +                       return r;
> +       }
> +       return 0;
> +}
> diff --git a/promised-blob.h b/promised-blob.h
> new file mode 100644
> index 000000000..a303ea1ff
> --- /dev/null
> +++ b/promised-blob.h
> @@ -0,0 +1,14 @@
> +#ifndef PROMISED_BLOB_H
> +#define PROMISED_BLOB_H
> +
> +/*
> + * Returns 1 if oid is the name of a promised blob. If size is not NULL, also
> + * returns its size.
> + */
> +extern int is_promised_blob(const struct object_id *oid,
> +                           unsigned long *size);
> +
> +typedef int each_promised_blob_fn(const struct object_id *oid, void *data);
> +int for_each_promised_blob(each_promised_blob_fn, void *);
> +
> +#endif
> diff --git a/t/t3907-promised-blob.sh b/t/t3907-promised-blob.sh
> new file mode 100755
> index 000000000..827072004
> --- /dev/null
> +++ b/t/t3907-promised-blob.sh
> @@ -0,0 +1,29 @@
> +#!/bin/sh
> +
> +test_description='promised blobs'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'fsck fails on missing blobs' '
> +       rm -rf repo &&

We should not need to delete the repo here, this is the first test?

> +       rm repo/.git/objects/$(echo $HASH | cut -c1-2)/$(echo $HASH | cut -c3-40) &&

Later down the road, do we want to have a
(plumbing) command to move an object from
standard blob to promised blob (as of now I'd think
it would perform this rm call as well as an insert into
the promised blobs file) ?
(Well, we'd also have to think about how to get objects
out of a pack)

With such a command you can easily write your own custom
filter to free up blobs again.

> +       test_must_fail git -C repo fsck

    test_i18n_grep missing out ?

maybe, too? (Maybe that is already tested elsewhere,
so no need for it)

> +'
> +
> +test_expect_success '...but succeeds if it is a promised blob' '
> +       printf "%s%016x" "$HASH" "$(wc -c <repo/1.t)" |
> +               hex_pack >repo/.git/objects/promisedblob &&
> +       git -C repo fsck
> +'
> +
> +test_expect_success '...but fails again with GIT_IGNORE_PROMISED_BLOBS' '
> +       GIT_IGNORE_PROMISED_BLOBS=1 test_must_fail git -C repo fsck &&
> +       unset GIT_IGNORE_PROMISED_BLOBS
> +'

This patch makes sense, so far.

Thanks,
Stefan