On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan <jonathantanmy@xxxxxxxxxx> wrote: > Currently, Git does not support repos with very large numbers of objects > or repos that wish to minimize manipulation of certain blobs (for > example, because they are very large) very well, even if the user > operates mostly on part of the repo, because Git is designed on the > assumption that every referenced object is available somewhere in the > repo storage. > > As a first step to reducing this problem, introduce the concept of > promised objects. Each Git repo can contain a list of promised objects > and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch > contains functions to query them; functions for creating and modifying > that file will be introduced in later patches. > > A repository that is missing an object but has that object promised is not > considered to be in error, so also teach fsck this. As part of doing > this, object.{h,c} has been modified to generate "struct object" based > on only the information available to promised objects, without requiring > the object itself. > > Signed-off-by: Jonathan Tan <jonathantanmy@xxxxxxxxxx> > --- > Documentation/technical/repository-version.txt | 6 ++ > Makefile | 1 + > builtin/fsck.c | 18 +++- > cache.h | 2 + > environment.c | 1 + > fsck.c | 6 +- > object.c | 19 ++++ > object.h | 19 ++++ > promised-object.c | 130 +++++++++++++++++++++++++ > promised-object.h | 22 +++++ > setup.c | 7 +- > t/t3907-promised-object.sh | 41 ++++++++ > t/test-lib-functions.sh | 6 ++ > 13 files changed, 273 insertions(+), 5 deletions(-) > create mode 100644 promised-object.c > create mode 100644 promised-object.h > create mode 100755 t/t3907-promised-object.sh > > diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt > index 00ad37986..f8b82c1c7 100644 > --- a/Documentation/technical/repository-version.txt > +++ b/Documentation/technical/repository-version.txt > @@ -86,3 +86,9 @@ for testing format-1 compatibility. > When the config key `extensions.preciousObjects` is set to `true`, > objects in the repository MUST NOT be deleted (e.g., by `git-prune` or > `git repack -d`). > + > +`promisedObjects` > +~~~~~~~~~~~~~~~~~ > + > +(Explain this - basically a string containing a command to be run > +whenever a missing object needs to be fetched.) > diff --git a/Makefile b/Makefile > index 9c9c42f8f..c1446d5ef 100644 > --- a/Makefile > +++ b/Makefile > @@ -828,6 +828,7 @@ LIB_OBJS += preload-index.o > LIB_OBJS += pretty.o > LIB_OBJS += prio-queue.o > LIB_OBJS += progress.o > +LIB_OBJS += promised-object.o > LIB_OBJS += prompt.o > LIB_OBJS += quote.o > LIB_OBJS += reachable.o > diff --git a/builtin/fsck.c b/builtin/fsck.c > index 462b8643b..49e21f361 100644 > --- a/builtin/fsck.c > +++ b/builtin/fsck.c > @@ -15,6 +15,7 @@ > #include "progress.h" > #include "streaming.h" > #include "decorate.h" > +#include "promised-object.h" > > #define REACHABLE 0x0001 > #define SEEN 0x0002 > @@ -44,6 +45,7 @@ static int name_objects; > #define ERROR_REACHABLE 02 > #define ERROR_PACK 04 > #define ERROR_REFS 010 > +#define ERROR_PROMISED_OBJECT 011 > > static const char *describe_object(struct object *obj) > { > @@ -436,7 +438,7 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid, > { > struct object *obj; > > - obj = parse_object(oid); > + obj = parse_or_promise_object(oid); > if (!obj) { > error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid)); > errors_found |= ERROR_REACHABLE; > @@ -592,7 +594,7 @@ static int fsck_cache_tree(struct cache_tree *it) > fprintf(stderr, "Checking cache tree\n"); > > if (0 <= it->entry_count) { > - struct object *obj = parse_object(&it->oid); > + struct object *obj = parse_or_promise_object(&it->oid); > if (!obj) { > error("%s: invalid sha1 pointer in cache-tree", > oid_to_hex(&it->oid)); > @@ -635,6 +637,12 @@ static int mark_packed_for_connectivity(const struct object_id *oid, > return 0; > } > > +static int mark_have_promised_object(const struct object_id *oid, void *data) > +{ > + mark_object_for_connectivity(oid); > + return 0; > +} > + > static char const * const fsck_usage[] = { > N_("git fsck [<options>] [<object>...]"), > NULL > @@ -690,6 +698,11 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) > > git_config(fsck_config, NULL); > > + if (fsck_promised_objects()) { > + error("Errors found in promised object list"); > + errors_found |= ERROR_PROMISED_OBJECT; > + } This got me thinking: It is an error if we do not have an object and also do not promise it, but what about the other case: having and object and promising it, too? That is not harmful to the operation, except that the promise machinery may be slower due to its size. (Should that be a soft warning then? Do we have warnings in fsck?) > * The object type is stored in 3 bits. > */ We may want to remove this comment while we're here as it sounds stale despite being technically correct. 1974632c66 (Remove TYPE_* constant macros and use object_type enums consistently., 2006-07-11) > struct object { > + /* > + * Set if this object is parsed. If set, "type" is populated and this > + * object can be casted to "struct commit" or an equivalent. > + */ > unsigned parsed : 1; > + /* > + * Set if this object is not in the repo but is promised. If set, > + * "type" is populated, but this object cannot be casted to "struct > + * commit" or an equivalent. > + */ > + unsigned promised : 1; Would it make sense to have a bit field instead: #define STATE_BITS 2 #define STATE_PARSED (1<<0) #define STATE_PROMISED (1<<1) unsigned state : STATE_BITS This would be similar to the types and flags? > +test_expect_success 'fsck fails on missing objects' ' > + test_create_repo repo && > + > + test_commit -C repo 1 && > + test_commit -C repo 2 && > + test_commit -C repo 3 && > + git -C repo tag -a annotated_tag -m "annotated tag" && > + C=$(git -C repo rev-parse 1) && > + T=$(git -C repo rev-parse 2^{tree}) && > + B=$(git hash-object repo/3.t) && > + AT=$(git -C repo rev-parse annotated_tag) && > + > + # missing commit, tree, blob, and tag > + rm repo/.git/objects/$(echo $C | cut -c1-2)/$(echo $C | cut -c3-40) && > + rm repo/.git/objects/$(echo $T | cut -c1-2)/$(echo $T | cut -c3-40) && > + rm repo/.git/objects/$(echo $B | cut -c1-2)/$(echo $B | cut -c3-40) && > + rm repo/.git/objects/$(echo $AT | cut -c1-2)/$(echo $AT | cut -c3-40) && This is a pretty cool test as it promises all sorts of objects from different parts of the graph.