On Wed, 2015-06-24 at 05:14 -0400, Jeff King wrote:
> On Tue, Jun 23, 2015 at 02:18:36PM -0400, David Turner wrote:
>
> > > Can you describe a bit more about the reflog handling?
> > >
> > > One of the problems we've had with large-ref repos is that the reflog
> > > storage is quite inefficient. You can pack all the refs, but you may
> > > still be stuck with a bunch of reflog files with one entry, wasting a
> > > whole inode. Doing a "git repack" when you have a million of those has
> > > horrible cold-cache performance. Basically anything that isn't
> > > one-file-per-reflog would be a welcome change. :)
> >
> > Reflogs are stored in the database as well. There is one header entry
> > per ref to indicate that a reflog is present, and then one database
> > entry per reflog entry; the entries are stored consecutively and
> > immediately following the header so that it's fast to iterate over them.
>
> OK, that make sense. I did notice that the storage for the refdb grows
> rapidly. If I add a millions refs (like refs/tags/$i) with a simple
> reflog message "foo", I ended up with a 500MB database file.
>
> That's _probably_ OK, because a million is getting into crazy
> territory[1]. But it's 500 bytes per ref, each with one reflog entry.
> Our ideal lower bound is probably something like 100 bytes per reflog
> entry:
>
>   - 20 bytes for old sha1
>   - 20 bytes for new sha1
>   - ~50 bytes for name, email, timestamp
>   - ~6 bytes for refname (1000000 is the longest unique part)
>
> That assumes we store binary[2] (and not just the raw reflog lines), and
> reconstruct the reflog lines on the fly. It also assumes we use some
> kind of trie-like storage (where we can amortize the cost of storing
> "refs/tags/" across all of the entries).
>
> Of course that neglects lmdb's overhead, and the storage of the ref tip
> itself. But it would hopefully give us a ballpark for an optimal
> solution. We don't have to hit that, of course, but it's food for
> thought.
>
> [1] The homebrew/homebrew repository on GitHub has almost half a million
>     ref updates. Since this is storing not just refs but all ref
>     updates, that's actually the interesting number (and optimizing the
>     per-reflog-entry size is more interesting than the per-ref size).
>
> [2] I'm hesitant to suggest binary formats in general, but given that
>     this is a blob embedded inside lmdb, I think it's OK. If we were to
>     pursue the log-structured idea I suggested earlier, I'm torn on
>     whether it should be binary or not.

I could try a binary format. I was optimizing for simplicity,
debuggability, recoverability, and compatibility with the choice of the
text format, but I wouldn't have to. I don't know how much this will
save.

Unfortunately, given the way LMDB works, a trie-like storage to save
refs/tags does not seem possible. (Of course, we could hard-code some
hacks like \001=refs/tags, \002=refs/heads, etc., but that is a
micro-optimization that might not be worth it.)

Also, the reflog header has some overhead (it's an entire extra record
per ref). The header exists to implement reflog creation/existence
checking. I didn't really try to understand why we have the distinction
between empty and nonexistent reflogs; I just copied it. If we didn't
have that distinction, we could eliminate that overhead.

> > Thanks, that's valuable. For the refs backend, opening the LMDB
> > database for writing is sufficient to block other writers. Do you think
> > it would be valuable to provide a git hold-ref-lock command that simply
> > reads refs from stdin and keeps them locked until it reads EOF from
> > stdin? That would allow cross-backend ref locking.
>
> I'm not sure what you would use it for. If you want to update the refs,
> then you can specify a whole transaction with "git update-ref --stdin",
> and that should work whatever backend you choose. Is there some other
> operation you want where you hold the lock for a longer period of time?

I'm sure I had a reason for this at the time I wrote it, but now I can't
think of what it was. Never mind!
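
For anyone who wants to put rough numbers on the text-versus-binary
question discussed earlier in this message, here is a minimal sketch in
Python. Everything in it is an assumption made for illustration: the
record layout (raw 20-byte SHA-1s, an 8-byte timestamp, a 2-byte timezone
offset in minutes, then length-prefixed identity and message), the
function names, and the one-byte prefix table are all hypothetical, and
none of this is what the LMDB backend actually stores. It only compares
a packed record against the usual one-line text reflog entry.

```python
import struct

# Hypothetical fixed header: old sha1, new sha1, timestamp, tz offset in
# minutes, identity length, message length. Big-endian, no padding.
HEADER = struct.Struct(">20s20sqhHH")

# Hypothetical hard-coded prefix table, in the spirit of \001=refs/tags.
PREFIXES = {b"refs/tags/": b"\x01", b"refs/heads/": b"\x02"}


def compress_refname(refname: bytes) -> bytes:
    """Replace a known prefix with its one-byte token, if any."""
    for prefix, token in PREFIXES.items():
        if refname.startswith(prefix):
            return token + refname[len(prefix):]
    return refname


def pack_entry(old, new, timestamp, tz_minutes, ident, message) -> bytes:
    """Pack one reflog entry into the assumed binary layout."""
    if len(old) != 20 or len(new) != 20:
        raise ValueError("expected raw 20-byte SHA-1 values")
    header = HEADER.pack(old, new, timestamp, tz_minutes,
                         len(ident), len(message))
    return header + ident + message


def unpack_entry(blob: bytes):
    """Inverse of pack_entry; returns the six original fields."""
    old, new, timestamp, tz_minutes, ilen, mlen = HEADER.unpack_from(blob)
    start = HEADER.size
    ident = blob[start:start + ilen]
    message = blob[start + ilen:start + ilen + mlen]
    return old, new, timestamp, tz_minutes, ident, message


def text_entry(old, new, timestamp, tz_minutes, ident, message) -> bytes:
    """Format the same entry as a one-line text reflog record."""
    sign = "-" if tz_minutes < 0 else "+"
    hours, minutes = divmod(abs(tz_minutes), 60)
    tz = f"{sign}{hours:02d}{minutes:02d}".encode()
    return b"%s %s %s %d %s\t%s\n" % (
        old.hex().encode(), new.hex().encode(),
        ident, timestamp, tz, message)


if __name__ == "__main__":
    fields = (bytes(20), bytes(range(20)), 1435137240, -240,
              b"A U Thor <author@example.com>", b"foo")
    refname = b"refs/tags/1000000"
    binary = len(compress_refname(refname)) + len(pack_entry(*fields))
    text = len(refname) + len(text_entry(*fields))
    print("binary bytes per entry:", binary)
    print("text bytes per entry:  ", text)
```

The saving comes almost entirely from storing the two SHA-1s raw instead
of as hex; the prefix token only trims a few bytes per key. Neither
figure includes LMDB's own per-record overhead or the extra header
record per ref.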