Re: Partial clone design (with connectivity check for locally-created objects)

On 8/7/2017 3:21 PM, Jonathan Nieder wrote:
Hi,

Ben Peart wrote:
On Fri, 04 Aug 2017 15:51:08 -0700
Junio C Hamano <gitster@xxxxxxxxx> wrote:
Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:

"Imported" objects must be in a packfile that has a "<pack name>.remote"
file with arbitrary text (similar to the ".keep" file). They come from
clones, fetches, and the object loader (see below).
...

A "homegrown" object is valid if each object it references:
  1. is a "homegrown" object,
  2. is an "imported" object, or
  3. is referenced by an "imported" object.
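
To make the rule concrete, here is a rough Python sketch of that
check (the set names and the references_of() helper are made up for
illustration; this is not code from the proposal):

    # An object is valid iff everything it references is homegrown,
    # imported, or referenced by some imported object.
    def homegrown_is_valid(obj, homegrown, imported,
                           referenced_by_imported, references_of):
        return all(ref in homegrown
                   or ref in imported
                   or ref in referenced_by_imported
                   for ref in references_of(obj))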

Overall it captures what was discussed, and I think it is a good
start.

I missed the offline discussion and so am trying to piece together
what this latest design is trying to do.  Please let me know if I'm
not understanding something correctly.

I believe
https://public-inbox.org/git/cover.1501532294.git.jonathantanmy@xxxxxxxxxx/
and the surrounding thread (especially
https://public-inbox.org/git/xmqqefsudjqk.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxx/)
are the discussion Junio is referring to.

[...]
This segmentation is what is driving the need for the object loader
to build a new local pack file for every command that has to fetch a
missing object.  For example, we can't just write a tree object from
a "partial" clone into the loose object store, as fsck would have no
way to treat it differently and ignore any missing objects that tree
object references.

That's related and how it got lumped into this proposal, but it's not
the only motivation.

Other aspects:

  1. using pack files instead of loose objects means we can use deltas.
     This is the primary motivation.

  2. pack files can use reachability bitmaps (I realize there are
     obstacles to getting benefit out of this because git's bitmap
     format currently requires a pack to be self-contained, but I
     thought it was worth mentioning for completeness).

  3. existing git servers are oriented around pack files; they can
     more cheaply serve objects from pack files in pack format,
     including reusing deltas from them.

  4. file systems cope better with a few large files than with many
     small files.

[...]
We all know that git doesn't scale well with a lot of pack files, as
it has to do a linear search through all of them when attempting to
find an object.  I can see that a lot of pack files would be
generated very quickly, and with gc ignoring "partial" pack files,
this would never get corrected.

Yes, that's an important point.  Regardless of this proposal, we need
to get more aggressive about concatenating pack files (e.g. by
implementing exponential rollup in "git gc --auto").

In our usage scenarios, _all_ of the objects come from "partial"
clones, so all of our objects would end up in a series of "partial"
pack files and we would have pretty poor performance as a result.

Can you say more about this?  Why would the pack files (or loose
objects, for that matter) never end up being consolidated into few
pack files?


Our initial clone is very sparse - we only pull down the commit we are about to check out and none of the blobs. All missing objects are then downloaded on demand (and in this proposal, would end up in a "partial" pack file). For performance reasons, we also (by default) download a server-computed pack file of commits and trees to pre-populate the local cache.
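
(To sketch the loading flow just described - all helper names below
are hypothetical stand-ins, not real git code:)

    # On a local miss, fetch the object from the server and store it
    # in a "partial" pack rather than the loose object store.
    def fetch_from_server(oid):
        raise NotImplementedError("transport code goes here")

    def append_to_partial_pack(oid, obj):
        pass  # write obj into a pack that carries a ".remote" marker

    def read_object(oid, local_store):
        obj = local_store.get(oid)
        if obj is None:                       # not present locally
            obj = fetch_from_server(oid)      # demand-load it
            append_to_partial_pack(oid, obj)  # keep it out of the loose store
            local_store[oid] = obj
        return obj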

Without modification, fsck, repack, prune, and gc will cause every object in the repo to be downloaded. We punted for now and just block those commands, but eventually they need to be aware of missing objects so that they do not trigger downloads. Jonathan is already working on this for fsck in another patch series.

[...]
That thinking did lead me back to wondering again if we could live
with a repo-specific flag.  If any clone/fetch was "partial", the
flag is set and fsck ignores missing objects whether they came from
a "partial" remote or not.

I'll admit it isn't as robust if someone is mixing and matching
remotes from different servers, some of which are partial and some
of which are not.  I'm not sure how often that would actually
happen, but I _am_ certain a single repo-specific flag is a _much_
simpler model than anything else we've come up with so far.

The primary motivation in this thread is locally-created objects, not
objects obtained from other remotes.  Objects obtained from other
remotes are more of an edge case.


Thank you - that helps me better understand the requirements of the problem we're trying to solve. In short, what we really need is a way to identify locally created objects so that fsck can do a complete connectivity check on them. I'll have to think about a good way to do that - we've talked about a few options, but each has a different set of trade-offs and none of them are great (yet :)).

Thanks for your thoughtful comments.

Jonathan



