Re: Removing Partial Clone / Filtered Clone on a repo

On Tue, Jun 1, 2021 at 12:39 PM Derrick Stolee <stolee@xxxxxxxxx> wrote:

> Could you describe more about your scenario and why you want to
> get all objects?

A 13GB (with 1.2GB shallow head) repo is in that in-between spot where
you want to be able to get something useful to the user as fast as
possible (read: in less than the 4 hours it would take to download the
whole thing over a mediocre VPN, with corresponding risk of errors
partway), but where a user might later (eg overnight) want to get the
rest of the repo, to avoid history inconsistency issues.

In our current mode of operation (shallow clones to 15 months' depth
by default), the initial clone can complete in well under an hour, but
the problem with the resulting clone is that normal git tooling will
see the shallow-grafted commit as the "initial commit" of all older
files, and that causes no end of confusion on the part of users, eg on
"git blame". This is the main reason why we would like to consider
moving to full-history but filtered-blob clones.

(There are other reasons around manageability, eg the git server's
behavior around --shallow-since when some branches in refspec scope
are older than that date: it sends them with all their history,
effectively downloading the whole repo. Similarly, if a refspec is
expanded and the next fetch is run without an explicit
--shallow-since, and finds new branches not already shallow-grafted,
it will download those in their entirety, because the shallow-since
date is not persisted beyond the shallow grafts themselves.)
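
For concreteness, the clones in that current mode are along the lines
of the following sketch (the URL is a placeholder, and as far as I
know --shallow-since accepts any date git's approxidate parsing
understands):

  # shallow clone limited to roughly the last 15 months of history
  git clone --shallow-since="15 months ago" https://git.example.com/big-repo.git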

With a (full-history all-trees no-blobs-except-HEAD) filtered clone,
the initial download can be quite a bit smaller than the shallow clone
scenario above (eg 1.5GB vs 2.2GB), and most of the disadvantages of
shallow clones are addressed: the just-in-time fetching can typically
work quite naturally, there are no "lies" in the history, nor are
there scenarios where you suddenly fetch an extra 10GB of history
without wanting/expecting to.
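
In our case that initial clone is something along these lines
(placeholder URL again):

  # full history and all trees; blobs are fetched lazily, starting
  # with the ones needed to check out HEAD
  git clone --filter=blob:none https://git.example.com/big-repo.git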

With the filtered clone there are still small edge cases that might
motivate a user to "bite the bullet" and unfilter their clone,
however. The most obvious one I've found so far is "git blame": it
issues fetch requests serially until it bottoms out, which on an
older, poorly factored file (hundreds or thousands of commits, each
touching different bits of the file) will effectively never complete
at 10s per fetch. And depending on the UI tooling the user has, they
may have almost no visibility into why this "git blame" (or
"annotate", or whatever the given UI calls it) seems to hang forever.

In the case of our repo, you can work around this "git blame" issue
in *most* situations by using a different initial filter spec, eg
"--filter=blob:limit=200k", which only costs an extra 1GB or so. But
then you still have outliers - and in fact, the most "blameable" files
tend to be the larger ones... :)

My working theory is that we should explain all the following to users:
* Your initial download is a nice compromise between functionality and
download delay
* You have almost all the useful history, and you have it within less
than an hour
* If you try to use "git blame" (or hit some other as-yet-undiscovered
scenario) on a larger file, it may hang. In that case cancel, run a
magic command we provide which fetches all the blobs in that specific
file's history, and try again. (the magic command is a path-filtered
rev-list looking for missing objects, passed into fetch - see the
sketch after this list)
* If you ever get tired of the rare weird hangs, you have the option
of running *some process* that "unfilters" the repo, paying down that
initial compromise (and taking up a bit more HD space), eg overnight
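
The magic command I have in mind is roughly the following sketch - it
assumes the promisor remote allows requests for arbitrary object IDs
(ie uploadpack.allowAnySHA1InWant on the server side), and the path is
of course just an example:

  # List the objects in this path's history that are missing locally
  # (rev-list prints them prefixed with "?"), strip the "?", and hand
  # the IDs to fetch to backfill them.
  git rev-list --objects --missing=print HEAD -- path/to/big-file \
    | sed -n 's/^?//p' \
    | xargs git fetch origin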

This explanation is a little awkward, but less awkward than the
previous "'git blame' lies to you - it blames completely the wrong
person for the bulk of the history for the bulk of the files;
unshallow overnight if this bothers you", which is the current story
with shallow clone.

Of course, if unfiltering a large repo is impractical (and if it will
remain so), then we will probably need to err on the side of
generosity in the original clone - eg 1M instead of 200k as the blob
filter, 3GB vs 2.5GB as the initial download - and remove the last
point of the explanation! If unfiltering, or refiltering, were
practical, then we would probably err on the side of
less-blobs-by-default to optimize the first download.
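
(For what it's worth, I'd expect the brute-force version of that
unfiltering process to be roughly the same sketch as above without the
path limit, something like:

  # Backfill every object the partial clone is still missing; on a
  # big repo xargs will split this into many fetch invocations.
  git rev-list --objects --all --missing=print \
    | sed -n 's/^?//p' \
    | xargs git fetch origin

though as far as I can tell that only backfills the objects - it does
not clear remote.origin.promisor / remote.origin.partialclonefilter,
so the repo is still configured as a partial clone afterwards.)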

Over time, as we refactor the project itself to reduce the incidence
of megafiles, I would expect to be able to drop the
standard/recommended blob-size-limit too.

Sorry about the wall-of-text, hopefully I've answered the question!

Thanks,
Tao


