Re: [PATCH] repack: add `repack.honorpackkeep` config var

Shawn Pearce <spearce@xxxxxxxxxxx> · Mon, 3 Mar 2014 11:12:29 -0800

On Fri, Feb 28, 2014 at 10:05 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Fri, Feb 28, 2014 at 10:09:08AM -0700, Nasser Grainawi wrote:
>
>> > Exactly. The two features (bitmaps and .keep) are not compatible with
>> > each other, so you have to prioritize one. If you are using static .keep
>> > files, you might want them to continue being respected at the expense of
>> > using bitmaps for that repo. So I think you want a separate option from
>> > --write-bitmap-index to allow the appropriate flexibility.
>>
>> Has anyone thought about how to make them compatible?
>
> Yes, but it's complicated and not likely to happen soon.
>
> Having .keep files means that you are not including some objects in the
> newly created pack. Each bit in a commit's bitmap corresponds to one
> object in the pack, and whether it is reachable from that commit. The
> bitmap is only useful if we can calculate the full reachability from it,
> and it has no way to specify objects outside of the pack.
>
> To fix this, you would need to change the on-disk format of the bitmaps
> to somehow reference objects outside of the pack. Either by having the
> bitmaps index a repo-global set of objects, or by permitting a list of
> "edge" objects that are referenced from the pack, but not included (and
> then when assembling the full reachable list, you would have to recurse
> across "edge" objects to find their reachable list in another pack,
> etc).
>
> So it's possible, but it would complicate the scheme quite a bit, and
> would not be backwards compatible with either JGit or C Git.

Colby Ranger always wanted to add this to the bitmap scheme. Construct
a partial pack bitmap on a partial pack of "recent" objects, with edge
pointers naming objects that are not in this pack but whose closures
need to be considered part of the bitmap. Its complicated in-memory
because you need to fuse together two or more bitmaps (the partial
pack one, and the larger historical kept pack) before running the
"want AND NOT have" computation.

Colby did not find time to work on this in JGit, so it just didn't get
implemented. But we did consider it, as the servers at Google we built
bitmap for use a multi-level pack scheme and don't want to rebuild
packs all of the time.

>> We're using Martin Fick's git-exproll script which makes heavy use of
>> keeps to reduce pack file churn. In addition to the on-disk benefits
>> we get there, the driving factor behind creating exproll was to
>> prevent Gerrit from having two large (30GB+) mostly duplicated pack
>> files open in memory at the same time. Repacking in JGit would help in
>> a single-master environment, but we'd be back to having this problem
>> once we go to a multi-master setup.
>>
>> Perhaps the solution here is actually something in JGit where it could
>> aggressively try to close references to pack files
>
> In C git we don't worry about this too much, because our programs tend
> to be short-lived, and references to the old pack will go away quickly.
> Plus it is all mmap'd, so as we simply stop accessing the pages of the
> old pack, they should eventually be dropped if there is memory pressure.
>
> I seem to recall that JGit does not mmap its packfiles. Does it pread?

JGit does not mmap because you can't munmap() until the Java GC gets
around to freeing the tiny little header object that contains the
memory address of the start of the mmap segment. This can take ages,
to the point where you run out of virtual address space in the process
and s**t starts to fail left and right inside of the JVM. The GC is
just unable to prioritize finding those tiny headers and getting them
out of the heap so the munmap can take place safely.

So yea, JGit does pread() for the blocks but it holds those in its own
buffer cache inside of the Java heap. Where a 4K disk block is a 4K
memory array that puts pressure on the GC to actually wake up and free
resources that are unused. What Nasser is talking about is JGit may
take a long time to realize one pack is unused and start kicking those
blocks out of its buffer cache. Those blocks are reference counted and
the file descriptor JGit preads from is held open so long as at least
one block is in the buffer cache. By keeping the file open we force
the filesystem to keep the inode alive a lot longer, which means the
disk needs a huge amount of free space to store the unlinked but still
open 30G pack files from prior GC generations.

> In that case, I'd expect unused bits from the duplicated packfile to get
> dropped from the disk cache over time. If it loads whole packfiles into
> memory, then yes, it should probably close more aggressively.

Its more than that, its the inode being kept alive by the open file
descriptor...
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html