Re: [RFC PATCH] Updated "imported object" design

Ben Peart <peartben@xxxxxxxxx> · Thu, 17 Aug 2017 16:07:26 -0400

On 8/15/2017 8:32 PM, Jonathan Tan wrote:
This patch is based on an updated version of my refactoring of
pack-related functions [1].

This corresponds to patch 1 of my previous series [2]. I'm sending this
patch out (before I update the rest) for 2 reasons:
  * to provide a demonstration of how the feature could be implemented,
    in the hope of restarting the discussion
  * to obtain comments about this patch to see if I'm heading in the
    right direction

In an earlier e-mail [3], I suggested that loose objects can also be
marked as ".imported" (formerly ".remote" - after looking at the code, I
decided to use "imported" throughout, since "remote" can be easily
confused as the opposite of "local", used to represent objects in the
local store as opposed to an alternate store).

However, I have only implemented the imported packed objects part -
imported loose objects can be added later.

It still remains to be discussed whether we should mark the imported
objects or the non-imported objects as the source of promises, but I
still think that we should mark the imported objects. In this way, the
new properties (the provision of promises and the mark) coincide on the
same object, and the same things (locally created objects, fetches from
non-lazy-object-serving remotes) behave in the same way regardless of
whether extensions.lazyObject is set (allowing, for example, a repo to
be converted into a promise-enabled one solely through modifying the
configuration).

This illustrates another place we need to resolve the naming/vocabulary. 
 We should at least be consistent to make it easier to discuss/explain. 
 We obviously went with "virtual" when building GVFS but I'm OK with 
"lazy" as long as we're consistent.  Some examples of how the naming can 
clarify or confuse:

'Promise-enable your repo by setting the "extensions.lazyObject" flag'

'Enable your repo to lazily fetch objects by setting the 
"extensions.lazyObject"'

'Virtualize your repo by setting the "extensions.virtualize" flag'

We may want to carry the same name into the filename we use to mark the 
(virtualized/lazy/promised/imported) objects.

(This reminds me that there are only 2 hard problems in computer 
science...) ;)

It is true that converting a normal repo to 
virtualized/lazy/promise-enabled repo wouldn't have to munge with the 
object marking but since this only happens once, I wouldn't make that 
primary decision maker (and switching back would require munging the 
objects anyway).  I think we can make either work (ie tagging local vs 
non-local/remote/imported/promised/lazy/virtual).

I'm fine leaving "how" the objects are marked as an implementation 
detail - the design requires them to be marked, the person writing the 
code can figure out the fastest way to accomplish that.

In the current patch/design, the access time to check for the existence 
of an "imported" file is going to be dwarfed by the cost of opening and 
parsing every object (loose and pack) to get its oid so that isn't 
really an issue.

+/*
+ * Objects that are believed to be loadable by the lazy loader, because
+ * they are referred to by an imported object. If an object that we have
+ * refers to such an object even though we don't have that object, it is
+ * not an error.
+ */
+static struct oidset promises;
+static int promises_prepared;
+
+static int add_promise(const struct object_id *oid, struct packed_git *pack,
+		       uint32_t pos, void *data)
+{
+	struct object *obj = parse_object(oid);
+	if (!obj)
+		/*
+		 * Error messages are given when packs are verified, so
+		 * do not print any here.
+		 */
+		return 0;
+	
+	/*
+	 * If this is a tree, commit, or tag, the objects it refers
+	 * to are promises. (Blobs refer to no objects.)
+	 */
+	if (obj->type == OBJ_TREE) {
+		struct tree *tree = (struct tree *) obj;
+		struct tree_desc desc;
+		struct name_entry entry;
+		if (init_tree_desc_gently(&desc, tree->buffer, tree->size))
+			/*
+			 * Error messages are given when packs are
+			 * verified, so do not print any here.
+			 */
+			return 0;
+		while (tree_entry_gently(&desc, &entry))
+			oidset_insert(&promises, entry.oid);
+	} else if (obj->type == OBJ_COMMIT) {
+		struct commit *commit = (struct commit *) obj;
+		struct commit_list *parents = commit->parents;
+
+		oidset_insert(&promises, &commit->tree->object.oid);
+		for (; parents; parents = parents->next)
+			oidset_insert(&promises, &parents->item->object.oid);
+	} else if (obj->type == OBJ_TAG) {
+		struct tag *tag = (struct tag *) obj;
+		oidset_insert(&promises, &tag->tagged->oid);
+	}
+	return 0;
+}
+
+static int is_promise(const struct object_id *oid)
+{
+	if (!promises_prepared) {
+		if (repository_format_lazy_object)
+			for_each_packed_object(add_promise, NULL,
+					       FOR_EACH_OBJECT_IMPORTED_ONLY);
+		promises_prepared = 1;
+	}
+	return oidset_contains(&promises, oid);
+}
+

I think this all works and would meet the requirements we've been 
discussing.  The big trade off here vs what we first discussed with 
promises is that we are generating the list of promises on the fly when 
they are needed rather than downloading and maintaining a list locally.

My biggest concern with this model is the cost of opening and parsing 
every imported object (loose and pack for local and alternates) to build 
the oidset of promises.

In fsck this probably won't be an issue as it already focuses on 
correctness at the expense of speed.  I'm more worried about when we add 
the same/similar logic into check_connected.  That impacts fetch, clone, 
and receive_pack.

I guess the only way we can know for sure it to do a perf test and 
measure the impact.