Re: [PATCH v4 00/32] Lockfile correctness and refactoring

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Fri, 12 Sep 2014 16:21:09 +0200

On 09/10/2014 10:13 AM, Jeff King wrote:
> [...]
> I did coincidentally have an interesting experience with our lockfile
> code earlier today, which I'd like to relate.
> 
> I was running pack-refs on a repository with a very large number of
> loose refs (about 1.8 million). [...] Each call to lock_ref_sha1_basic
> allocates a "struct lock_file", which then gets added to the global
> lock_file list. Each one contains a fixed PATH_MAX buffer (4K on this
> machine). After we're done updating the ref, we leak the lock_file
> struct, since there's no way to remove it from the list.
> 
> As a result, git tried to allocate 7G of RAM and got OOM-killed (the
> machine had only 8G). In addition to thrashing the disk even harder,
> since there was no room left for disk cache while we touched millions of
> loose refs. :)
> 
> Your change in this series to use a strbuf would make this a lot better.
> But I still wonder how hard it would be to just remove lock_file structs
> from the global list when they are committed or rolled back. That would
> presumably also make the "avoid transitory valid states" patch from your
> series a bit easier, too (you could prepare the lockfile in peace, and
> then link it in fully formed, and do the opposite when removing it).
> 
> I think with your strbuf patch, this leak at least becomes reasonable.
> So maybe it's not worth going further. But I'd be interested to hear
> your thoughts since you've been touching the area recently.

I've thought about this too, but it didn't seem to be worth the effort.
(Though your use case certainly adds a bit of justification.)

To make that change, we would have to remove entries from the list of
lock_file objects in a way that the code can be interrupted at any time
by a signal while leaving it in a state that is traversable by the
signal handler.

I think that can be done pretty easily with a singly-linked list. But
with a singly-linked list, we would have to iterate through the list to
find the node that needs to be removed. This could get expensive if
there are a lot of nodes in the list (see below).

So we would probably want to switch to using a doubly-linked list. I
think this would also be fairly simple, given that the signal handler
only needs to iterate through the list in a single direction. You'd just
have to be careful about adjusting the pointers in the right order to
let (say) a forwards traversal always work.

Then the callers who use heap-allocated lock_file objects would have to
be changed to free them when they're done with them, probably using a
special function that releases the strbufs, too. Callers using
statically-allocated lock_file objects would probably not have to be
changed.

But...

The ref-transaction code is, I think, moving in the direction of
updating all references in a single transaction. This means that we
would need to hold locks for all of the references at once anyway. So it
might be all for naught.

Michael

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html