Hi,

Ben Peart wrote:

>> On Fri, 04 Aug 2017 15:51:08 -0700
>> Junio C Hamano <gitster@xxxxxxxxx> wrote:
>>> Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:
>>>> "Imported" objects must be in a packfile that has a "<pack name>.remote"
>>>> file with arbitrary text (similar to the ".keep" file). They come
>>>> from clones, fetches, and the object loader (see below).
>>>> ...
>>>>
>>>> A "homegrown" object is valid if each object it references:
>>>>  1. is a "homegrown" object,
>>>>  2. is an "imported" object, or
>>>>  3. is referenced by an "imported" object.
>>>
>>> Overall it captures what was discussed, and I think it is a good
>>> start.
>
> I missed the offline discussion and so am trying to piece together
> what this latest design is trying to do. Please let me know if I'm
> not understanding something correctly.

I believe the discussion Junio is referring to is
https://public-inbox.org/git/cover.1501532294.git.jonathantanmy@xxxxxxxxxx/
and the surrounding thread (especially
https://public-inbox.org/git/xmqqefsudjqk.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxx/).

[...]
> This segmentation is what is driving the need for the object loader
> to build a new local pack file for every command that has to fetch a
> missing object. For example, we can't just write a tree object from
> a "partial" clone into the loose object store as we have no way for
> fsck to treat them differently and ignore any missing objects
> referenced by that tree object.

That's related, and it is how this got lumped into the proposal, but
it's not the only motivation. Other aspects:

 1. Using pack files instead of loose objects means we can use
    deltas. This is the primary motivation.

 2. Pack files can use reachability bitmaps. (I realize there are
    obstacles to getting benefit out of this because git's bitmap
    format currently requires a pack to be self-contained, but I
    thought it was worth mentioning for completeness.)

 3. Existing git servers are oriented around pack files; they can
    more cheaply serve objects from pack files in pack format,
    including reusing deltas from them.

 4. File systems cope better with a few large files than with many
    small files.

[...]
> We all know that git doesn't scale well with a lot of pack files as
> it has to do a linear search through all the pack files when
> attempting to find an object. I can see that very quickly, there
> would be a lot of pack files generated and with gc ignoring
> "partial" pack files, this would never get corrected.

Yes, that's an important point. Regardless of this proposal, we need
to get more aggressive about concatenating pack files, e.g. by
implementing exponential rollup in "git gc --auto" (see the sketch at
the end of this mail).

> In our usage scenarios, _all_ of the objects come from "partial"
> clones so all of our objects would end up in a series of "partial"
> pack files and would have pretty poor performance as a result.

Can you say more about this? Why would the pack files (or loose
objects, for that matter) never end up being consolidated into a few
pack files?

[...]
> That thinking did lead me back to wondering again if we could live
> with a repo-specific flag. If any clone/fetch was "partial", the
> flag is set and fsck ignores missing objects whether they came from
> a "partial" remote or not.
>
> I'll admit it isn't as robust if someone is mixing and matching
> remotes from different servers, some of which are partial and some
> of which are not. I'm not sure how often that would actually
> happen, but I _am_ certain a single repo-specific flag is a _much_
> simpler model than anything else we've come up with so far.
The primary motivation in this thread is locally-created objects, not
objects obtained from other remotes; the latter are more of an edge
case.

Thanks for your thoughtful comments.

Jonathan
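
P.S. To make the "exponential rollup" idea above concrete, here is a
rough sketch in Python. It is untested, and the names and the factor
are made up; real code would of course live in git itself. The idea
is to keep pack sizes in (roughly) a geometric progression, so that N
objects end up spread over O(log N) packs and the linear search
through the pack indexes stays cheap:

    def packs_to_consolidate(pack_sizes, factor=4):
        """Return the sizes of the small packs that should be
        concatenated into one pack, if any.

        Invariant we aim for: each pack is at least `factor` times
        as large as the sum of all smaller packs. Whenever a pack
        violates this, it and everything smaller than it get merged
        into a single pack.
        """
        sizes = sorted(pack_sizes)
        total = 0           # sum of the sizes seen so far
        merge_up_to = 0     # one past the last pack to merge
        for i, size in enumerate(sizes):
            if total > 0 and size < factor * total:
                merge_up_to = i + 1
            total += size
        return sizes[:merge_up_to]

For example, packs_to_consolidate([1, 1, 2, 3, 100]) returns
[1, 1, 2, 3]: the four small packs get rolled up into one, and the
large pack is left alone.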
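
P.P.S. Similarly, here is how I think of the validity rule quoted at
the top of this mail, as pseudocode. All of the helper names here
(references_of, is_homegrown, is_imported, referenced_by_imported)
are invented for illustration; they are not real git APIs:

    def homegrown_object_is_valid(obj, odb):
        """A "homegrown" (locally created) object passes fsck's
        connectivity check if everything it directly references is
        either present locally or promised by an "imported" pack
        (one marked with a "<pack name>.remote" file), in which case
        the missing object can be fetched on demand later.
        """
        for ref in odb.references_of(obj):
            if odb.is_homegrown(ref):   # rule 1
                continue
            if odb.is_imported(ref):    # rule 2: in a *.remote pack
                continue
            # rule 3: promised by an object in a *.remote pack
            if odb.referenced_by_imported(ref):
                continue
            return False
        return True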