Re: [RFC PATCH] Updated "imported object" design

Thanks for your comments. I'll reply to both your e-mails in this one
e-mail.

> This illustrates another place we need to resolve the
> naming/vocabulary.  We should at least be consistent to make it easier
> to discuss/explain.  We obviously went with "virtual" when building
> GVFS but I'm OK with "lazy" as long as we're consistent.  Some
> examples of how the naming can clarify or confuse:
> 
> 'Promise-enable your repo by setting the "extensions.lazyObject" flag'
> 
> 'Enable your repo to lazily fetch objects by setting the
> "extensions.lazyObject"'
> 
> 'Virtualize your repo by setting the "extensions.virtualize" flag'
> 
> We may want to carry the same name into the filename we use to mark
> the (virtualized/lazy/promised/imported) objects.
> 
> (This reminds me that there are only 2 hard problems in computer
> science...) ;)

Good point about the name. Maybe the 2nd one is the best? (Mainly
because I would expect a "virtualized" repo to have virtual refs too.)

But if there were a good way to refer to the "anti-projection" in a
virtualized system (that is, the "real" thing or "object" behind the
"virtual" thing or "image"), then maybe the "virtualized" language is
best. (And I would gladly change - I'm having a hard time coming up
with a name for the "anti-projection" in the "lazy" language.)

Also, I should probably standardize on "lazily fetch" instead of "lazily
load". I didn't want to overlap with the existing notion of fetching, but
after some thought, it's probably better to do so. The explanation would
then be that you can either use the built-in Git fetcher (to be built,
although I have an old version here [1]) or supply a custom fetcher.

[1] https://github.com/jonathantanmy/git/commits/partialclone
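
To make the "fetch" framing concrete, the configuration could look
something like the following. (This is only a sketch: extensions.lazyObject
is the name floated above, and lazyObject.fetchCommand is a purely
hypothetical key for pointing at a custom fetcher - neither name is
settled.)

  [extensions]
          lazyObject = true
  [lazyObject]
          # Hypothetical key: an external command to invoke when an
          # object needs to be lazily fetched; leave it unset to use
          # the built-in fetcher instead.
          fetchCommand = /path/to/custom-fetcher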

> I think this all works and would meet the requirements we've been
> discussing.  The big trade off here vs what we first discussed with
> promises is that we are generating the list of promises on the fly
> when they are needed rather than downloading and maintaining a list
> locally.
> 
> My biggest concern with this model is the cost of opening and parsing
> every imported object (loose and pack for local and alternates) to
> build the oidset of promises.
> 
> In fsck this probably won't be an issue as it already focuses on
> correctness at the expense of speed.  I'm more worried about when we
> add the same/similar logic into check_connected.  That impacts fetch,
> clone, and receive_pack.
> 
> I guess the only way we can know for sure is to do a perf test and
> measure the impact.

As for fetching from the main repo, the connectivity check does not need
to be performed at all, because all objects are "imported", so its
performance does not matter. The same goes for cloning.

This is not true if you're fetching from another repo or if you're using
receive-pack, but (1) I think these are used less often in such a
situation, and (2) if you do use them, the slowness only "kicks in" if
you do not have the objects referred to (whether non-"imported" or
"imported") and thus have to check the references in all "imported"
objects.
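
In code, the short-circuit I'm describing is roughly the following
(a sketch only - the function and parameter names are made up for
illustration and are not the actual check_connected() interface):

  static int connectivity_check_needed(int lazy_object_extension,
                                       int fetching_from_main_repo)
  {
          /*
           * Fetching or cloning from the main repo: every object we
           * obtain is itself "imported", so nothing can dangle and
           * the walk can be skipped entirely.
           */
          if (lazy_object_extension && fetching_from_main_repo)
                  return 0;
          /*
           * Fetching from another repo, or receive-pack: a walk is
           * still needed, but it only gets expensive when a referenced
           * object is missing and the "imported" objects have to be
           * consulted.
           */
          return 1;
  }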

> I think this topic should continue to move forward so that we can 
> provide reasonable connectivity tests for fsck and check_connected in 
> the face of partial clones.  I'm not sure the prototype implementation 
> of reading/parsing all imported objects to build the promised oidset is 
> the most performant model but we can continue to investigate the best 
> options.

Agreed - I think the most important thing here is settling on the API
(the name of the extension and the nature of the object mark).

> Given all we need is an existence check for a given oid,

This is true...

> I wonder if it 
> would be faster overall to do a binary search through the list of 
> imported idx files + an existence test for an imported loose object.

...but what we're checking is the existence of a reference, not the
existence of an object. For a concrete example, consider what happens if
we have both an "imported" tree and a non-"imported" tree that reference
a blob we do not have. When checking the non-"imported" tree for
connectivity, we have to iterate through all "imported" trees to see if
any can vouch for the existence of that blob. We cannot merely
binary-search the .idx file.
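
To spell that out, here is a toy sketch in plain C (stand-in types
only, not Git's actual data structures or APIs) of the difference
between the two questions - "do we have this object?", which a sorted
.idx-style list answers with a binary search, and "does some "imported"
object refer to this oid?", which forces us to open and parse every
"imported" object:

  #include <stdlib.h>
  #include <string.h>

  /* Illustrative stand-ins only, not Git's real data structures. */
  struct oid { unsigned char hash[20]; };

  struct imported_obj {
          struct oid self;     /* the object's own id */
          struct oid *refs;    /* ids of the objects it references */
          size_t nr_refs;
  };

  static int oid_cmp(const void *a, const void *b)
  {
          return memcmp(a, b, sizeof(struct oid));
  }

  /*
   * "Do we have this object?" - answerable by binary search over a
   * sorted list of imported object ids (what a .idx lookup gives us).
   */
  static int have_imported_object(const struct oid *sorted, size_t nr,
                                  const struct oid *want)
  {
          return bsearch(want, sorted, nr, sizeof(*sorted), oid_cmp) != NULL;
  }

  /*
   * "Is this oid promised?" - i.e. does *any* imported object refer
   * to it?  The .idx says nothing about references, so every imported
   * object has to be opened and parsed to see what it points to.
   */
  static int oid_is_promised(const struct imported_obj *objs, size_t nr,
                             const struct oid *want)
  {
          size_t i, j;

          for (i = 0; i < nr; i++)
                  for (j = 0; j < objs[i].nr_refs; j++)
                          if (!oid_cmp(&objs[i].refs[j], want))
                                  return 1;
          return 0;
  }

Only the second question is the one fsck or check_connected has to
answer for the missing blob, and it is the expensive one; building an
oidset of everything the "imported" objects refer to (as in the
prototype) is one way to pay that cost up front.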
