On 8/17/2017 5:39 PM, Jonathan Tan wrote:
Thanks for your comments. I'll reply to both your e-mails in this one
e-mail.
This illustrates another place we need to resolve the
naming/vocabulary. We should at least be consistent to make it easier
to discuss/explain. We obviously went with "virtual" when building
GVFS but I'm OK with "lazy" as long as we're consistent. Some
examples of how the naming can clarify or confuse:
'Promise-enable your repo by setting the "extensions.lazyObject" flag'
'Enable your repo to lazily fetch objects by setting the
"extensions.lazyObject"'
'Virtualize your repo by setting the "extensions.virtualize" flag'
We may want to carry the same name into the filename we use to mark
the (virtualized/lazy/promised/imported) objects.
(This reminds me that there are only 2 hard problems in computer
science...) ;)
Good point about the name. Maybe the 2nd one is the best? (Mainly
because I would expect a "virtualized" repo to have virtual refs too.)
But if there was a good way to refer to the "anti-projection" in a
virtualized system (that is, the "real" thing or "object" behind the
"virtual" thing or "image"), then maybe the "virtualized" language is
the best. (And I would gladly change - I'm having a hard time coming up
with a name for the "anti-projection" in the "lazy" language.)
The most common "anti-virtual" language I'm familiar with is "physical."
Virtual machine <-> physical machine. Virtual world <-> physical
world. Virtual repo, commit, tree, blob - physical repo, commit, tree,
blob. I'm not thrilled but I think it works...
Also, I should probably standardize on "lazily fetch" instead of "lazily
load". I didn't want to overlap with the existing fetching, but after
some thought, it's probably better to do that. The explanation would
thus be that you can either use the built-in Git fetcher (to be built,
although I have an old version here [1]) or supply a custom fetcher.
[1] https://github.com/jonathantanmy/git/commits/partialclone
I think this all works and would meet the requirements we've been
discussing. The big trade-off here vs. what we first discussed with
promises is that we are generating the list of promises on the fly
when they are needed, rather than downloading and maintaining a list
locally.
My biggest concern with this model is the cost of opening and parsing
every imported object (loose and packed, for local and alternates) to
build the oidset of promises.
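To make the cost concrete, the loop I picture is roughly the following
(just a sketch: oidset is real, but for_each_imported_object() and
for_each_referenced_oid() are made-up names for "walk every imported
loose/packed object, including alternates" and "parse it and report
every oid it points at"):

  #include "cache.h"
  #include "oidset.h"

  static struct oidset promised = OIDSET_INIT;

  static int note_reference(const struct object_id *referenced, void *data)
  {
          oidset_insert(&promised, referenced);
          return 0;
  }

  static int scan_imported(const struct object_id *oid, void *data)
  {
          /* opening and parsing every imported object is the cost in question */
          return for_each_referenced_oid(oid, note_reference, NULL);
  }

  static void build_promised_oidset(void)
  {
          for_each_imported_object(scan_imported, NULL);
  }
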
In fsck this probably won't be an issue as it already focuses on
correctness at the expense of speed. I'm more worried about when we
add the same/similar logic into check_connected. That impacts fetch,
clone, and receive_pack.
I guess the only way we can know for sure is to do a perf test and
measure the impact.
As for fetching from the main repo, the connectivity check does not need
to be performed at all because all objects are "imported", so the
performance of the connectivity check does not matter. Same for cloning.
Very good point! I got stuck on the connectivity check in general,
forgetting that we really only need to prevent sharing a corrupt repo.
This is not true if you're fetching from another repo
This isn't a case we've explicitly dealt with (multiple remotes into a
virtualized repo). Our behavior today would be that once you set the
"virtual repo" flag on the repo (this happens at clone for us), all
remotes are treated as virtual as well (i.e. we don't differentiate
behavior based on which remote was used). Our "custom fetcher" always
uses "origin" and some custom settings for a cache-server saved in the
.git/config file when asked to fetch missing objects.
This is probably a good model to stick with, at least initially, as
trying to handle multiple possible "virtual" remotes, as well as
mingling virtualized and non-virtualized remotes and all the mixed
cases that can come up, makes my head hurt. We should probably address
that in a
different thread. :)
or if you're using
receive-pack, but (1) I think these are not used as much in such a
situation, and (2) if you do use them, the slowness only "kicks in" if
you do not have the objects referred to (whether non-"imported" or
"imported") and thus have to check the references in all "imported"
objects.
Is there any case where receive-pack is used on the client side? I'm
only aware of it being used on the server side to receive packs pushed
from the client. If it is not used in a virtualized client, then we
would not need to do anything different for receive-pack.
I think this topic should continue to move forward so that we can
provide reasonable connectivity tests for fsck and check_connected in
the face of partial clones. I'm not sure the prototype implementation
of reading/parsing all imported objects to build the promised oidset is
the most performant model but we can continue to investigate the best
options.
Agreed - I think the most important thing here is settling on the API
(name of extension and the nature of the object mark).
Given all we need is an existence check for a given oid,
This is true...
I wonder if it
would be faster overall to do a binary search through the list of
imported idx files + an existence test for an imported loose object.
...but what we're checking is the existence of a reference, not the
existence of an object. For a concrete example, consider what happens if
we have both an "imported" tree and a non-"imported" tree that
reference a blob that we do not have. When checking the non-"imported"
tree for connectivity, we have to iterate through all "imported" trees
to see if any can vouch for the existence of such a blob. We cannot
merely binary-search the .idx file.
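Concretely, the check on a reference found in a non-"imported" object
would have to look something like this (sketch only; has_object_file()
and oidset_contains() exist today, but the surrounding frame is
hypothetical):

  static int referenced_object_ok(const struct object_id *referenced,
                                  const struct oidset *promised)
  {
          if (has_object_file(referenced))
                  return 1;       /* we have the object itself */
          /*
           * Missing object: acceptable only if some "imported" object
           * refers to it, i.e. it is in the set built by parsing the
           * "imported" objects, not merely listed in an "imported"
           * .idx file.
           */
          return oidset_contains(promised, referenced);
  }
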
That is another good point. Given the discussion above about not
needing to do the connectivity test for fetch/clone, the potential perf
hit of loading/parsing all the various objects to build up the oidset is
much less of an issue.