Re: What's cooking in git.git (Mar 2018, #03; Wed, 14)

On 3/15/2018 4:36 AM, Ævar Arnfjörð Bjarmason wrote:
> On Thu, Mar 15 2018, Junio C. Hamano jotted:
>
>> * nd/repack-keep-pack (2018-03-07) 6 commits
>>   - SQUASH???
>>   - pack-objects: display progress in get_object_details()
>>   - pack-objects: show some progress when counting kept objects
>>   - gc --auto: exclude base pack if not enough mem to "repack -ad"
>>   - repack: add --keep-pack option
>>   - t7700: have closing quote of a test at the beginning of line
>>
>>   "git gc" in a large repository takes a lot of time as it considers
>>   to repack all objects into one pack by default.  The command has
>>   been taught to pretend as if the largest existing packfile is
>>   marked with ".keep" so that it is left untouched while objects in
>>   other packs and loose ones are repacked.
>>
>>   Expecting a reroll.
>>   cf. <CACsJy8BW_EtxQvgL=YrCXCQY7cEWCQxgfkeH=Gd=X=uVYhPJcw@xxxxxxxxxxxxxx>
>>   Except for final finishing touches, this looked more-or-less ready
>>   for 'next'.
>
> As I noted in 87a7vdqegi.fsf@xxxxxxxxxxxxxxxxxxx and
> 877eqhq7ha.fsf@xxxxxxxxxxxxxxxxxxx (both at:
> https://public-inbox.org/git/?q=87a7vdqegi.fsf%40evledraar.gmail.com) I
> think we should change the too-specific behavior here to be more generic
> (and am happy to do the work pending feedback from Duy on what he thinks
> about it).
>
> I'm also interested to know from those at Microsoft (CC'd some) if the
> mechanism I've proposed is something closer to what they could
> eventually use to gc windows.git.

Sorry that I couldn't get to this message sooner; I was traveling. While I was gone, the others who you CC'd volunteered me as the best person to respond ;)

In the interest of full disclosure and hopefully starting an interesting discussion, I want to share as much detail as possible as well as a few future directions that can inform our actions now. Here are some rough ideas that we are thinking about in this space:

  I. Use the multi-pack index (MIDX) [1] to track "packfile state" so we can do GC/repack incrementally.
 II. Replace our prefetch packs model with partial clone [2].
III. Stop including all trees and focus the fetch down a cone of the working directory.
 IV. Provide a way to defer certain read-only commands to a remote when the local repo doesn't have sufficient data.

> I know that now it doesn't GC now, and they have some side-channel
> mechanism for pre-deploying large (daily?) packs to clients, if it's
> adjusted as I suggest gc could be told not to touch packs of that size,
> leaving only stray small packs from "git pull" and loose objects to GC.
>
> I may also have entirely misunderstood how it works, this is from brief
> in-person conversations at Git Merge.
>
> But as far as mainlining some of that eventually I think it would be
> good to get feedback on whether the mechanism I proposed would get in
> their way more or less than what Duy has, or be entirely irrelevant
> because they need something else.

Thanks!

The GVFS cache servers pre-compute daily packfiles filled with every commit and tree introduced that day. When a client calls 'git fetch', the GVFS hook runs a "prefetch" command to get these daily packs from the cache servers and place them in an alternate we call the "shared object cache". GVFS also disables the pack-download portion of the fetch, so a fetch only updates refs. The MIDX is updated to cover the new packs.
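
For anyone not familiar with how the shared object cache plugs in: it is just an ordinary alternate object directory. A rough sketch (the directory path and pack names below are made up for illustration; GVFS sets this up automatically):

    # The enlistment's .git directory points at the shared cache via the
    # standard alternates mechanism:
    $ cat .git/objects/info/alternates
    /path/to/shared-object-cache

    # Prefetch packs downloaded from the cache servers land in that
    # directory's pack/ subdirectory and are visible to every enlistment
    # that lists it as an alternate:
    $ ls /path/to/shared-object-cache/pack
    prefetch-2018-03-14.pack
    prefetch-2018-03-14.idx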

Something we are going to enable soon is the addition of "hourly packs": the cache servers keep a list of up to 24 hourly packs, and on prefetch the client receives one extra pack that concatenates the current day's hourly packs up to that point. We are doing this because the refs clients fetch are far enough ahead of the daily snapshots that checking them out triggers a decent number of loose-object downloads. Since the next day will produce a new daily pack, we can delete that hourly pack afterwards (I don't think we do this currently). At the very least, when an object appears in both packs, the MIDX prefers the copy in the newer packfile.

We do not GC in GVFS-enabled repos. So far this has not overwhelmed clients' disks because the prefetch packs do not contain blobs, and blobs are a large portion of the full repo size. As the repo grows, we are re-examining how to make GC and repack work in this environment. An important part of any solution will be to make it incremental: we cannot afford to create a second copy of the repo, and we can't take the repo offline for a significant amount of time.

I'm not sure that we can use Duy's solution out-of-the-box, in particular because we haven't integrated our prefetch packs with the "promisor" concept from partial clone, so GC would fail on missing objects. I think the core idea is good: perform GC incrementally and don't touch previously-GC'd packs. Since it is rare for an object to be reachable for a long time and then become unreachable, performing GC and then keeping the resulting pack forever is a good idea, especially when paired with the MIDX, which makes the number of packfiles much less of a concern.
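
To make the "don't touch previously-GC'd packs" idea concrete with what exists today (just a sketch; the pack name is invented): a pack can already be pinned with a .keep marker so that repack leaves it alone, and Duy's series adds a --keep-pack option that does the same without a marker file on disk.

    # Pin an already-GC'd pack so "git repack -a -d" leaves it untouched:
    $ touch .git/objects/pack/pack-1234abcd.keep

    # With the proposed nd/repack-keep-pack series (still under review,
    # the option name/behavior may change), the same effect without a
    # .keep file:
    $ git repack -a -d --keep-pack=pack-1234abcd.pack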

I.

One of our thoughts is to use the MIDX to mark the packfiles with metadata. For instance, we could mark new packfiles as "raw", and after enough time run a GC on just one packfile (or a batch of packfiles) to remove unreachable objects and mark the result as "clean". In our situation, we don't want to immediately GC the daily prefetch pack, because a large portion of those objects are not yet reachable from our refs but will become reachable within a few days as branches are integrated.
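
As a sketch of what "GC just one packfile" could look like with existing plumbing (the raw/clean bookkeeping itself has no representation in Git today, and the pack name is invented):

    # 1. List the objects in one "raw" pack.
    $ git show-index < .git/objects/pack/pack-1234abcd.idx |
        awk '{print $2}' | sort > pack-objs

    # 2. Keep only the ones still reachable from our refs.
    $ git rev-list --objects --all | cut -d' ' -f1 | sort > reachable
    $ comm -12 pack-objs reachable > keep-objs

    # 3. Write them into a new pack; the old pack could then be deleted
    #    and the new one marked "clean" in the MIDX metadata.
    $ git pack-objects .git/objects/pack/pack-clean < keep-objs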

II.

Our long-term plan is to remove these GVFS-specific fetch patterns and replace them with partial clone logic. We still see the cache servers as an important part of this process, but they are essentially read-only copies of the main server's object database (no refs). We haven't done the work to see how expensive it will be to replace the precomputed prefetch packs with a partial clone negotiation. Then we will need to see how to make the shared object cache transparent to the user.
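
The shape we would be targeting is roughly the following, assuming the filter options from the partial clone work in [2] and [3] (the URL and cache path are placeholders):

    # Clone without blobs; commits and trees are transferred up front,
    # blobs are fetched on demand from the promisor remote:
    $ git clone --filter=blob:none https://server.example/fabrikam repo

    # The shared object cache would stay in place as an alternate, so
    # on-demand downloads can still be shared between enlistments:
    $ cat repo/.git/objects/info/alternates
    /path/to/shared-object-cache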

An aside: it would be good to investigate how Git could provide cache-server-style replication with minimal setup, including the client configuration. The GVFS protocol has a "gvfs/config" endpoint that can provide a list of cache server URLs, including a default. (For the Windows repo, these URLs point to load balancers that pass through to an array of machines. I imagine most users will only need one server per location.) Does Git already have a way to redirect the call to upload-pack?
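
Not an answer to the gvfs/config part (that is server-advertised, while this is purely client-side config), but the closest existing knob I know of is URL rewriting, which redirects all protocol traffic for a matching prefix rather than just upload-pack. Hostnames below are placeholders:

    # Send fetch traffic for the main server to a nearby cache server:
    $ git config --global \
        url."https://cache-eastus.example.com/".insteadOf \
        "https://origin.example.com/"

    # Keep pushes going to the main server (a matching pushInsteadOf
    # takes precedence over insteadOf when pushing):
    $ git config --global \
        url."https://origin.example.com/".pushInsteadOf \
        "https://origin.example.com/"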

III.

Partial clone has options for filtering trees by paths [3]. We want to take advantage of this functionality to reduce the number of objects stored on disk. In a perfect world, we would use the virtualization layer in GVFS to auto-detect the paths that are important to the user and build a sparse filter from them. Partial clone doesn't have a mechanism for that right now (it requires a sparse-checkout specification to be present on the remote), but perhaps it could become a verb in protocol v2.

We are working to collect sparse-checkout files from our users to see whether their usage patterns cluster into a reasonably small list of sparse-checkout specifications. Alternatively, how large would the request be if we dynamically sent a list of paths based on the current virtual projection? In either case, we expect to include some extra trees by adding a number of sibling paths, since users tend to read nearby files or inspect the history of those paths.
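
For illustration, here is roughly how a checked-in specification could drive the sparse filter from [3] (the .sparse/devdiv path and its contents are invented):

    # A sparse-checkout style specification stored in the repository:
    $ git show origin/master:.sparse/devdiv
    /tools/
    /src/componentA/
    /src/shared/

    # Limit an object walk (and eventually a clone/fetch) to the trees
    # and blobs needed for those paths:
    $ git rev-list --objects \
        --filter=sparse:oid=origin/master:.sparse/devdiv HEAD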

IV.

One big reason we started computing daily prefetch packs for GVFS is that certain important Git commands became too slow when every object had to be obtained through the loose-object download. Simple commands like "git log" would take forever. With all commits and trees present, we can run "git log -- path" at the same speed as a vanilla repo. "git diff" can still be slow because it needs blob contents, but hopefully the diff covers a single commit and does not require too many downloads. "git grep" could be a disaster.

When we start restricting to sparse-checkout specifications, not all "git log -- deep/paths/in/particular" commands will have enough data locally to run successfully.

One approach is to have a way to pass certain commands through to a remote. A command could be forwarded (via a protocol v2 verb) to the server, which can compute the answer and send a (paged) response faster than the client can download the objects and compute it locally. This also prevents users from ending up with a bloated object database full of loose blobs they don't need.

At the very least, we will want to detect that we are missing many objects and ask the user something like "Did you really want to do that? It will cause 1000+ objects to be downloaded. [Y/n]"
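
As a sketch of how the detection side could work with what partial clone already provides (the threshold and prompt text are invented):

    # Count the reachable objects that are not present locally before
    # walking them for real:
    $ git rev-list --objects --missing=print HEAD | grep -c '^?'

    # A porcelain command could compare that count against a threshold
    # and prompt:
    #   "This will download 1000+ objects from the server. Continue? [Y/n]"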


I hope this combo of information and half-baked ideas is helpful. We are definitely approaching a lot of interesting scale limits with many different possible solutions.

Thanks,
-Stolee

[1] https://public-inbox.org/git/20180107181459.222909-1-dstolee@xxxxxxxxxxxxx/T/#u
    [RFC PATCH 00/18] Multi-pack index (MIDX)

[2] https://public-inbox.org/git/20171214152404.35708-1-git@xxxxxxxxxxxxxxxxx/T/#u
    [PATCH v2] Partial clone design document

[3] https://github.com/git/git/blob/0afbf6caa5b16dcfa3074982e5b48e27d452dbbb/Documentation/rev-list-options.txt#L727-L733
    rev-list-options.txt Documentation for "--filter=sparse:*"


