On 3/15/2018 4:36 AM, Ævar Arnfjörð Bjarmason wrote:
> On Thu, Mar 15 2018, Junio C. Hamano jotted:
>
>> * nd/repack-keep-pack (2018-03-07) 6 commits
>>  - SQUASH???
>>  - pack-objects: display progress in get_object_details()
>>  - pack-objects: show some progress when counting kept objects
>>  - gc --auto: exclude base pack if not enough mem to "repack -ad"
>>  - repack: add --keep-pack option
>>  - t7700: have closing quote of a test at the beginning of line
>>
>>  "git gc" in a large repository takes a lot of time as it considers
>>  to repack all objects into one pack by default. The command has
>>  been taught to pretend as if the largest existing packfile is
>>  marked with ".keep" so that it is left untouched while objects in
>>  other packs and loose ones are repacked.
>>
>>  Expecting a reroll.
>>  cf. <CACsJy8BW_EtxQvgL=YrCXCQY7cEWCQxgfkeH=Gd=X=uVYhPJcw@xxxxxxxxxxxxxx>
>>
>>  Except for final finishing touches, this looked more-or-less ready
>>  for 'next'.
>
> As I noted in 87a7vdqegi.fsf@xxxxxxxxxxxxxxxxxxx and
> 877eqhq7ha.fsf@xxxxxxxxxxxxxxxxxxx (both at:
> https://public-inbox.org/git/?q=87a7vdqegi.fsf%40evledraar.gmail.com) I
> think we should change the too-specific behavior here to be more generic
> (and am happy to do the work pending feedback from Duy on what he thinks
> about it).
>
> I'm also interested to know from those at Microsoft (CC'd some) if the
> mechanism I've proposed is something closer to what they could
> eventually use to gc windows.git.
Sorry that I couldn't get to this message sooner; I was traveling. While
I was gone, the others you CC'd volunteered me as the best person to
respond ;)

In the interest of full disclosure, and hopefully to start an
interesting discussion, I want to share as much detail as possible, as
well as a few future directions that can inform our actions now. Here
are some rough ideas that we are thinking about in this space:
I. Use the multi-pack index (MIDX) [1] to track "packfile state" so
we can do GC/repack incrementally.
II. Replace our prefetch packs model with partial clone [2].
III. Stop including all trees and focus the fetch down a cone of the
working directory.
IV. Provide a way to defer certain read-only commands to a remote when
the local repo doesn't have sufficient data.
> I know that it doesn't GC now, and they have some side-channel
> mechanism for pre-deploying large (daily?) packs to clients; if it's
> adjusted as I suggest, gc could be told not to touch packs of that
> size, leaving only stray small packs from "git pull" and loose objects
> to GC. I may also have entirely misunderstood how it works; this is
> from brief in-person conversations at Git Merge.
>
> But as far as mainlining some of that eventually, I think it would be
> good to get feedback on whether the mechanism I proposed would get in
> their way more or less than what Duy has, or be entirely irrelevant
> because they need something else.
>
> Thanks!
The GVFS cache servers pre-compute daily packfiles filled with every
commit and tree introduced that day. When a client calls 'git fetch', the
GVFS hook runs a "prefetch" command to get these daily packs from the
cache servers and place them in an alternate we call the "shared object
cache". GVFS also disables the "receive-pack" portion of the fetch. The
MIDX is updated to cover the new packs.
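
For concreteness, the shared object cache is just an alternate object
directory; a minimal sketch of the wiring (the path is illustrative, not
the actual GVFS layout):

    # Shared cache directory that holds the prefetched daily packs.
    SHARED=/gvfs/shared-object-cache
    mkdir -p "$SHARED/pack"

    # Make the enlistment's repository read objects from the shared
    # cache: every pack (with its .idx) dropped into $SHARED/pack then
    # becomes visible to git.
    echo "$SHARED" >> .git/objects/info/alternates
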
Something we are going to enable soon is "hourly packs": the cache
servers keep a list of up to 24 hourly packs, and on prefetch the client
receives an extra pack that concatenates that day's objects up to that
point. We are doing this because the refs being fetched are far enough
ahead of the daily snapshots that checking them out triggers a decent
amount of loose-object downloads. Since the next day will create a new
daily pack, we can delete that hourly pack (I don't think we do this
currently); at the very least, the MIDX prefers the duplicate copies in
the new packfile.
We do not GC in GVFS-enabled repos. This hasn't destroyed clients' disks
so far because the prefetch packs do not contain blobs, and blobs are a
large portion of the full repo size. As the repo grows, we are
re-examining how to make GC and repack work in this environment. An
important part of any solution will be to make it incremental: we cannot
afford to create a second copy of the repo, and we can't take the repo
offline for a significant amount of time.
I'm not sure that we can use Duy's solution out-of-the-box, in
particular because we haven't integrated our prefetch packs with the
"promisor" concept from partial clone, so GC will fail with missing
objects. I think the core idea is good: perform GC incrementally and
don't touch previously-GC'd packs. Since it is rare for an object to be
reachable for a long time and then become unreachable, performing GC
once and keeping the resulting pack forever is a good idea, especially
when paired with the MIDX so that the number of packfiles matters less.
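
As a strawman, today's ".keep" mechanism already expresses most of "GC
once, then never touch that pack again"; a minimal sketch (the loop just
marks whatever packs exist after the one-time repack):

    # One-time full repack of the existing history into a single pack.
    git repack -a -d

    # Mark the resulting pack(s) as kept; later 'git repack -a -d' or
    # 'git gc' runs leave them alone and only rewrite newer packs and
    # loose objects.
    for p in .git/objects/pack/pack-*.pack; do
        touch "${p%.pack}.keep"
    done

The --keep-pack option from Duy's series would let gc do the same thing
without leaving .keep files on disk.
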
I.
One of our thoughts is to use the MIDX to mark the packfiles with
metadata. For instance, we could mark new packfiles as "raw", and after
enough time we could run a GC on just one packfile (or a batch of
packfiles) to remove unreachable objects and mark the result as "clean".
In our situation, we don't want to GC the daily prefetch pack
immediately, because a large portion of its objects will become
reachable within a few days due to branch integrations.
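
To make the "raw"/"clean" idea concrete, here is a hand-wavy sketch of
cleaning a single pack with existing plumbing; the MIDX metadata itself
does not exist yet, the pack name is illustrative, and a real
implementation would not walk all refs per pack:

    old=.git/objects/pack/pack-raw1234        # the "raw" pack to clean

    # Intersect the pack's contents with the reachable object set and
    # write the survivors out as a "clean" pack.
    join <(git show-index <"$old.idx" | cut -d' ' -f2 | sort) \
         <(git rev-list --objects --all | cut -d' ' -f1 | sort -u) |
        git pack-objects .git/objects/pack/pack-clean

    # Drop the raw pack; only its unreachable objects are lost.
    rm "$old.pack" "$old.idx"
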
II.
Our long-term plan is to remove these GVFS-specific fetch patterns and
replace them with partial clone logic. We still see the cache servers as
an important part of this process, but they are essentially read-only
copies of the main server object database (no refs). We haven't done the
work to see how expensive it will be to replace the precomputed prefetch
packs with a partial clone negotiation. Then we will need to see how to
make the shared object cache transparent to the user.
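
As a rough sketch of that direction, using the filter options from the
partial clone series [2] (the URL and paths are illustrative, and
server-side filter support is assumed):

    # Clone commits and trees up front, but defer all blobs.
    git clone --filter=blob:none https://example.com/big/repo.git enlistment
    cd enlistment

    # Commit/tree-only operations need no extra downloads...
    git log --oneline -20

    # ...while blob contents are fetched on demand from the promisor remote.
    git show HEAD:README.md
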
An aside: it would be good to investigate how Git could provide
cache-server-style replication with minimal setup, including minimal
client configuration. The GVFS protocol has a "gvfs/config" endpoint
that can provide a list of cache server URLs, including a default. (For
the Windows repo, these URLs point to load balancers that pass through
to an array of machines. I imagine most users will only need one server
per location.) Does Git already have a way to redirect the call to
upload-pack?
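
For comparison, client-side URL rewriting already gets partway there,
although it is exactly the kind of per-client configuration we would
like to avoid; a sketch with illustrative URLs:

    # Send fetches (and therefore upload-pack calls) to the cache server
    # instead of the main server.
    git config --global \
        url."https://cache.example.com/".insteadOf "https://main.example.com/"

    # Pushes can keep going to the main server via url.<base>.pushInsteadOf.
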
III.
Partial clone has options for filtering trees by paths [3]. We want to
take advantage of this functionality to reduce the number of objects
stored on disk. In a perfect world, we would use the virtualization
layer in GVFS to auto-detect the paths that are important to the user
and create a sparse filter for them that way. Partial clone doesn't have
a mechanism for that right now (we require a sparse-checkout
specification to be present on the remote), but perhaps it could be part
of a verb in protocol v2. We are working to collect sparse-checkout
files from our users to see if it is possible to cluster their usage
patterns into a reasonably small list of sparse-checkout specifications.
Alternatively, how large would the request be if we dynamically sent a
list of paths based on the current virtual projection? In either case,
we expect to include some amount of sibling paths (and therefore some
extra trees), since users tend to read nearby files or inspect history
for those paths.
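
A rough sketch of the sparse filter that exists today [3], run locally
against an illustrative spec (for a clone or fetch, the spec blob would
need to exist on the remote):

    # A sparse-checkout-style specification (paths are illustrative).
    printf '%s\n' '/src/component-a/' '/src/common/' >sparse-spec

    # Store the spec as a blob and count the objects the filter selects.
    oid=$(git hash-object -w sparse-spec)
    git rev-list --objects --filter=sparse:oid=$oid HEAD | wc -l
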
IV.
One big reason we started computing daily prefetch packs for GVFS is
that certain important Git commands became too slow when every object
had to come through the loose-object download. Simple commands like "git
log" would take forever. With all commits and trees present, we can run
"git log -- path" at the same speed as a vanilla repo. "git diff" can
still be slow because it needs blob contents, but hopefully the diff is
limited to a single commit and does not trigger too many downloads. "git
grep" could be a disaster. When we start restricting to sparse-checkout
specifications, not all "git log -- deep/paths/in/particular" commands
will have enough data locally to run successfully.
One approach is to have a way to pass through to a remote: certain
commands could be forwarded (via a protocol v2 verb) to the server. The
server can compute the answer and send a (paged) response faster than
the client can download the objects and compute the result locally. This
also prevents users from ending up with a bloated object database full
of loose blobs they don't need.
At the very least, we will want to detect that we are missing many
objects and ask the user something like "Did you really want to do that?
It will cause 1000+ objects to be downloaded. [Y/n]"
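
A very rough sketch of such a check, using the --missing option from the
partial clone work (the wrapper, the threshold, and the prompt wording
are hypothetical):

    # Count objects reachable from HEAD that are not present locally.
    missing=$(git rev-list --objects --missing=print HEAD | grep -c '^?')

    if [ "$missing" -gt 1000 ]; then
        printf 'Did you really want to do that? It will cause %s+ objects to be downloaded. [Y/n] ' "$missing"
        read -r answer
        case "$answer" in
            [nN]*) exit 1 ;;
        esac
    fi
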
I hope this combo of information and half-baked ideas is helpful. We are
definitely approaching a lot of interesting scale limits with many
different possible solutions.
Thanks,
-Stolee
[1] https://public-inbox.org/git/20180107181459.222909-1-dstolee@xxxxxxxxxxxxx/T/#u
    "[RFC PATCH 00/18] Multi-pack index (MIDX)"

[2] https://public-inbox.org/git/20171214152404.35708-1-git@xxxxxxxxxxxxxxxxx/T/#u
    "[PATCH v2] Partial clone design document"

[3] https://github.com/git/git/blob/0afbf6caa5b16dcfa3074982e5b48e27d452dbbb/Documentation/rev-list-options.txt#L727-L733
    rev-list-options.txt documentation for "--filter=sparse:*"