On Thu, Jul 13, 2017 at 05:11:52PM -0700, Shawn Pearce wrote:

> Yes, I was hoping for reader atomicity. But I may be OK foregoing
> that if the transaction either all goes through, or all fails. A
> partially stuck transaction because the process died in the middle
> of the commit step creates a mess for an administrator to undo. Does
> she rename "foo.lock" to "foo"? Or delete "foo.lock"?

Agreed, there's no real rollback or recovery process.

I do think shooting for reader atomicity is worth doing. Lack of
atomicity can cause odd things to happen with operations like pruning,
for example. If I'm trying to get a list of all of the reachable
objects, I might have to readdir() a bunch of directories (setting
aside that even a single readdir() is not necessarily atomic). If I
try to atomically move "refs/heads/z/foo" to "refs/heads/a/foo", there
is a reasonable chance that a reader sees only the deletion and not
the addition. I don't have any known cases of this biting anyone, but
it's somewhat scary.

> > But I'm not sure if you meant to contrast here a system where we
> > didn't use packed-refs at all (though of course such a system is
> > very much not atomic by the definition above).
>
> No, I really did mean the current system. Gerrit Code Review servers
> create a lot of references throughout the day. It's easy to
> accumulate a few thousand new loose references in a 24 hour period.
> Even if you have 600k existing refs in packed-refs, you still have
> 2k new/modified refs since the last nightly cron ran git gc.

I do think you'd be better served by just calling pack-refs more
frequently, then. Nightly is too infrequent for a busy repo. And under
something like reftables, you'd end up doing the equivalent of a
pack-refs every N updates anyway.

We actually pack refs quite aggressively at GitHub. Way more than I
would consider reasonable, but it's never been a big bottleneck, so
I've never looked into it. We don't do it for every update, but every
update triggers a "consider syncing objects into shared storage" job,
which will pack the refs. So in a hypothetical repo that's constantly
updating, we probably pack refs at least once a minute. But we're
generally on low-latency local disks. It sounds like you emphatically
are not.

> > Good goal. Just to play devil's advocate, the simplest way to do
> > that with the current code would be to gzip packed-refs (and/or
> > store sha1s as binary). That works against the "mmap and binary
> > search" plan, though. :)
>
> Yes it does. I tried to cover that later under "Alternatives
> considered > bzip". :)

Yeah, sorry, I read and responded to the document a bit out of order.
I agree it's a dead-end. :)

> >> Reference names should be encoded with UTF-8.
> >
> > Don't we usually treat refnames as byte sequences (subject to a
> > few rules, as in check_ref_format())? It seems like the encoding
> > should be out-of-scope for the storage format.
>
> True that git-core treats them as byte sequences, but JGit treats
> them as UTF-8.

I think we can probably stick with that, unless the UTF-8ness is
really important? I guess it might matter if it impacts the sorting
order.

> Yes. My "near constant time" claim was because I had my head buried
> thinking about disk IO operations when I wrote this, not the
> algorithm that happens on the CPU. One of the applications for
> reftable is on a slower-than-usual storage system, where reading a
> block of a file costs enough milliseconds that it doesn't matter how
> good or bad the CPU algorithm is.

OK, that explains a lot of the decisions better. If you're planning on
evolving this proposal document, I think it would make sense to talk
about that in the objectives section. (I say "if" because I am happy
for the mailing list discussion to serve as a rationale document).

> But you say this yourself below: you can do this using a geometric
> scheme or something to bound the cost of these rewrites such that
> you aren't frequently rewriting the whole database.

Right, but that sounds like math. I wanted you to spoonfeed me the
geometric algorithm (and its bound proof). ;)
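Not that I'd turn down being spoonfed, but just to check we mean the
same thing by "geometric", here's the flavor of policy I had in mind
(totally untested; tables_to_compact() is my strawman name, and the
"2x" factor is pulled out of thin air):

#include <stdint.h>

/*
 * sizes[] holds each table's size in bytes, oldest first; n is the
 * stack depth after appending a new table. Return how many entries
 * from the top of the stack to collapse into one table (returning 1
 * means "leave the stack alone").
 */
static int tables_to_compact(const uint64_t *sizes, int n)
{
        uint64_t sum = sizes[n - 1];    /* the newly appended table */
        int top = n - 1;

        /*
         * Absorb the next-oldest table while it is smaller than
         * twice the total of everything newer than it. Afterwards,
         * sizes at least double going down the stack, so the depth
         * stays logarithmic in the total size, and a given ref is
         * rewritten O(log n) times over n updates, like carries
         * rippling in a binary counter.
         */
        while (top > 0 && sizes[top - 1] < 2 * sum)
                sum += sizes[--top];

        return n - top;
}

So most updates collapse nothing, or only the small tables near the
top, and a whole-database rewrite happens only after the total size
has roughly doubled.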
> >> Compaction is similar to the update process, but an explicit
> >> temporary file must be used:
> >>
> >> 1. Atomically create `$GIT_DIR/reftable.lock`.
> >> 2. `readdir($GIT_DIR)` to determine the highest suffix
> >>    ordinal, `n`.
> >> 3. Compute the update transaction (e.g. compare expected values).
> >> 4. Select files from (2) to collapse as part of this transaction.
> >> 5. Create temp file by `mktemp("$GIT_DIR/.reftableXXXXXX")`.
> >> 6. Write modified and collapsed references to temp file.
> >> 7. Rename temp file to `reftable.${n + 1}`.
> >> 8. Delete collapsed files `reftable.${n}`, `reftable.${n - 1}`, ...
> >> 9. Delete `reftable.lock`.

I think the "stack" implementation is what makes me most uncomfortable
with this proposal. Atomicity with filesystem operations and
especially readdir() is one of the things I think is most flaky about
the current system.

Here's an idea for an alternative implementation:

  1. Write out reftables to files named after the hash of their
     content (e.g., $GIT_DIR/reftables/1234abcd...).

  2. The first block of each reftable has a backpointer to the
     previous table.

  3. There's a well-known name (e.g., $GIT_DIR/reftable) that
     represents the current state. We update it with the usual
     .lock/rename dance.

That gives readers an easy atomic view; once they've called open() on
"reftable", they follow back-pointers instead of computing the names
(n-1, n-2, etc). They may still find that a compaction has removed a
file they need, but:

  - they only have to restart due to an actual compaction. They'll
    never be tricked by a regular update.

  - you can compact without immediately deleting the old reftables.
    So you might compact, and then delete the reftables N seconds
    later. Any reader which completes the read within N seconds
    wouldn't have to restart.
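To make the reader side of that concrete, I'd imagine something like
this (untested; read_backpointer() is a hypothetical helper that
parses the first block of an open table and yields the name of its
parent, returning 0 at the bottom of the chain):

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

extern int read_backpointer(int fd, char *buf, size_t len);

static int open_reftable_chain(int *fds, int max)
{
        int n;

retry:
        n = 0;
        /* the well-known name; writers update it by atomic rename */
        fds[n] = open("reftable", O_RDONLY);
        if (fds[n] < 0)
                return -1;

        while (++n < max) {
                char name[256];

                if (!read_backpointer(fds[n - 1], name, sizeof(name)))
                        return n;       /* whole chain is open */
                fds[n] = open(name, O_RDONLY);
                if (fds[n] < 0) {
                        int saved_errno = errno;
                        while (n--)
                                close(fds[n]);
                        if (saved_errno == ENOENT)
                                goto retry; /* raced a compaction */
                        return -1;
                }
        }
        return -1;      /* chain deeper than max; give up */
}

The retry only ever fires when a compaction actually removed a table
out from under us, and if the compactor holds off deleting old tables
for N seconds, most readers never hit it at all.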
I think I can anticipate your answer, though. If you have a system
where the latency to open and read a file is high, then you've just
serialized the latencies as you walk the chain (the loop above does
exactly that). Whereas with predictable names, you can pre-fetch
reftable.i through reftable.j in parallel.

How deep would you anticipate stacks getting? Would it be feasible for
the tip to contain the names of the tables in the entire chain? If
we're talking about 20 (or even 32) bytes per name, you could still
fit over a hundred names in a 4K inode.

It doesn't escape me that I'm basically reinventing RefTree here, with
reftables instead of tree objects. But I think breaking away from
using real Git objects opens up a lot of efficiency tricks (like the
prefix compression, and the parallel-fetch thing above). And it
removes a lot of the gc complexity.

> I think I was imagining updates are less frequent than reads, and a
> reader is going to readdir(), and then immediately open every file
> in the stack to set up the merge-join iteration. If the reader
> retains the file descriptor, the reader can keep that file in their
> stack.
>
> There is risk of a reader live-locking; the reader might have done a
> readdir, starts opening the stack, sees ENOENT. In which case the
> reader starts over. If an updater is performing compactions faster
> than a reader can readdir and open paths, it's live-lock for the
> reader. That certainly is one motivation to not always perform a
> compaction.

I guess I don't have much faith in the atomicity of readdir(). And it
would be nice if we could tell the difference between "oops,
reftable.5 was racily deleted" and "it is not supposed to be there due
to a previous compaction". So I foresee always walking back to 0,
stopping at the first ENOENT, and then doing a final readdir() to see
if any new items appeared. If one has, we can't tell if it's a
compaction in progress or a regular update, and we have to restart. So
I'm worried about live-locking with a regular updater, not even a
compacting one.

> > I liked a lot of what I saw in the rest of it (e.g., handling
> > symrefs, which packed-refs does not). Some bits seemed
> > complicated. E.g., I actually wonder how much restarts help in
> > practice if you have reasonably-sized blocks (and they complicate
> > things a lot).
>
> My implementation lets me tweak the restart table with a command
> line option, so I was able to run a number of experiments for you.
> A 64k block doesn't fit more than a few thousand references, so
> restart 4000 effectively disables restarts.
>
>   block  restart  size   lookup
>   4k     16       29M     90.2 usec
>   8k     16       29M     76.4 usec
>
>   64k    16       29M    147.7 usec
>   64k    64       28M    134.3 usec
>   64k    256      27M    143.4 usec
>
>   4k     4000     28M    104.0 usec
>   8k     4000     28M    117.1 usec
>   64k    4000     27M    288.5 usec

Thanks for these numbers. I was really thinking that blocks would be
on the order of 4K (where you can see that the restarts help very
little). For local disk that's a pretty reasonable size. For
high-latency fetches to a specialized database, maybe not.

> For $REASONS, I favor a larger block size in some cases, even though
> the lookup times get worse. For example, being able to use 64k/64
> may be a sweet spot for that particular IO system I mentioned above.

Right, that makes a lot more sense.

> > as long as you're still wasting only 4K when you store 200
> > megabytes.
>
> I don't think this is fair. It's better thought of as a ratio.
> [...]
> 5 refs in reftable is ~204 bytes, because of the optional features
> being disabled on small files. If reftable was forced to fill out to
> a full 4K block, that is a penalty of 1907%. This might seem like
> nothing, but for cases where the table has to travel on the network,
> or is being stored in a tail-block-optimized filesystem, it's a huge
> waste to pad the file out.

Yeah, my assumption was that anything under 4K is basically going to
take 4K. But I agree it's very dependent on the underlying storage
mechanism.

> > I also realize that beggars can't be choosers. If you have a
> > working system that performs well, I should consider shutting
> > up. :)
>
> I have it in Java for JGit; I don't yet have a C implementation.

I'm a beggar; I'll take even a well-developed plan. :) The
implementation of this doesn't seem overly complex. My main concerns
are what we're asking from the filesystem in terms of atomicity, and
what possible races there are.

> > One thing I didn't see is reflogs. They don't strictly have to be
> > part of a ref-storage solution. But they do still consume at least
> > one inode per ref in the current system, if you reflog everything
> > (which in my opinion you should). Having an audit trail of ref
> > updates is very useful for debugging (both bugs in Git, and trying
> > to figure out what went wrong when a user goes berserk with "git
> > push --prune").
>
> Yea, I glossed over that and ignored them. Mostly because one system
> where I want to use reftable already has a separate database
> handling the reflogs. In another (Gerrit Code Review), we disable
> reflogs for the insane refs/changes/ namespace, as nearly every
> reference is created once, and never modified.

Even for created-once refs, I've found an audit trail of created when,
by whom, using what program to be quite valuable. Long ago we tried to
use reflogs for that, but these days we literally just write the
refname and its reflog entry to an "audit_log" file. It's not used for
reachability and it's never pruned. It just grows forever without
bound.

I think some variant of that could work for reflog storage (with
reachability and whole-file-rewrite expiration, obviously). The
biggest drawback is that traversing the reflogs for one ref requires
walking over the entries for all refs. But I'm not sure how much that
would hurt in practice. Many reflog operations look at all the reflogs
anyway (e.g., reachability, expiration). And finding the entry for a
single ref (e.g., ref@{1.day.ago}) is bounded in how far back it has
to walk.

> One could abuse the reftable format to store a reflog. If the
> refname field is just a sequence of bytes, one could store keys like
> refname + '\0' + timestamp, and reuse the symbolic ref value format
> to store the old/new/message, as it's just a length-delimited
> string.

Gross, but it could work. I actually think David's LMDB proposal did
something similar (encoding the entries in the keyname), but I'd have
to double-check.
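If we did that, I think you'd want the timestamp fixed-width,
big-endian, and inverted, so that a forward scan of one refname's
entries goes newest to oldest. Untested, and reflog_key() is just me
thinking out loud:

#include <stdint.h>
#include <string.h>

/* build "refname \0 ~timestamp" as a sortable byte string;
 * returns the key length, or 0 if the buffer is too small */
static size_t reflog_key(char *buf, size_t len,
                         const char *refname, uint64_t ts)
{
        size_t n = strlen(refname);
        int i;

        if (n + 1 + 8 > len)
                return 0;

        memcpy(buf, refname, n);
        buf[n++] = '\0';

        ts = ~ts;       /* newer entries get smaller keys */
        for (i = 7; i >= 0; i--) {
                buf[n + i] = ts & 0xff;
                ts >>= 8;
        }
        return n + 8;
}

Then looking up ref@{1.day.ago} is just a seek to the refname plus the
inverted cutoff, and the first entry at or after that key is the one
you want.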
> I'm loath to allocate more bits to denote a reflog entry vs. ref
> entry in the same file, but I could see some advantages to that.
> Commits would only have to write one new reftable for the combined
> update + log record.

Yes, but I'd worry that the reflog entries (which would have to hang
around in the reftable until being expired) would slow performance for
normal ref lookups. In my experience ref lookups are very frequent,
and reflog lookups are not. It's OK to segment them and have worse
performance for the reflogs.

-Peff