Junio C Hamano wrote: > Han-wen wrote: >> From: Jonathan Nieder <jrnieder@xxxxxxxxx> >> >> Shawn Pearce explains: >> >> Some repositories contain a lot of references (e.g. android at 866k, >> rails at 31k). The reftable format provides: >> >> - Near constant time lookup for any single reference, even when the >> repository is cold and not in process or kernel cache. >> - Near constant time verification a SHA-1 is referred to by at least >> one reference (for allow-tip-sha1-in-want). > > Not quite grammatical sentence? Perhaps "if" after "verification? Good catch, thanks. [...] >> using pandoc 2.2.1. The result required the following additional >> minor changes: >> >> - removed the [TOC] directive to add a table of contents, since >> asciidoc does not support it >> - replaced git-scm.com/docs links with linkgit: directives that link >> to other pages within Git's documentation > > There are many > > ’ > > funny-quotes where we would prefer to place vanilla single quotes, > which may also need to be corrected in the conversion toolchain. Looks like Han-Wen is taking care of this (thanks!). > Typoes pointed out below may probably be from the original where > they should be corrected. I'm happy to do one final update the doc in JGit to match what we end up with and then replace it with a pointer to Git's copy once that lands. [...] >> +Repositories with many loose references occupy a large number of disk >> +blocks from the local file system, as each reference is its own file >> +storing 41 bytes (and another file for the corresponding reflog). This >> +negatively affects the number of inodes available when a large number of >> +repositories are stored on the same filesystem. Readers can be penalized >> +due to the larger number of syscalls required to traverse and read the >> +`$GIT_DIR/refs` directory. > > Another downside is that we cannot arrange atomic updates to > multiple refs over loose refs, even though the "lookup of a single > reference does not require linear scan" unlike packed-refs, (as long > as the filesystem does its job). Worth mentioning? Yes, this was another major part of the motivation (avoiding the complication of the "atomic" multi-ref updates to packed-refs that Git and JGit had to learn). [...] >> +References stored in a reftable are peeled, a record for an annotated >> +(or signed) tag records both the tag object, and the object it refers >> +to. > > OK. Peeled results are recorded in packed-refs file because quite > often when we use a tag object, what we actually want to access is > the commit object it points at. We do so here for the same reason? > > Not a rhetorical question, but if it invites a question from a > reader, it may deserve to be described before readers ask it. For a single tag ref, peeling to a commit is not very expensive. But for batch lookups e.g. when serving a response to an ls-remote request, it adds up, and having the peeled results recorded helps. [...] >> +Directory/file conflicts >> +^^^^^^^^^^^^^^^^^^^^^^^^ >> + >> +The reftable format accepts both `refs/heads/foo` and >> +`refs/heads/foo/bar` as distinct references. >> + >> +This property is useful for retaining log records in reftable, but may >> +confuse versions of Git using `$GIT_DIR/refs` directory tree to maintain >> +references. Users of reftable may choose to continue to reject `foo` and >> +`foo/bar` type conflicts to prevent problems for peers. [...] > "users ... may choose" implies that it is up to the implementation > of reftable user which one to show, so given a single repository, > "jgit" may show "refs/heads/foo" while "libgit2" may choose to show > the other one. > > I am not sure if that is desirable---I suspect that we want to > record which one needs to be chosen so that these "D/F conflicts > disallowing" users can make consistent choices, but I dunno. Yes, I think it would be better to explicitly say that Git will continue to reject D/F conflicts for refs (*not* reflogs) even though the format can support them in principle. If we choose to permit them some day in the future, I believe that would be a separate repository format extension and protocol capability to avoid confusing old versions of Git. [...] >> +Symbolic references use `0x3`, followed by the complete name of the >> +reference target. No compression is applied to the target name. > > Is there a place in the file format where an incomplete name can be > stored? If not, I think it makes it easier to read if we drop > "complete" from the sentence. The sentence about "no compression" covers the lack of prefix encoding, so I suppose I agree. Might make sense to say "full name" to convey that we're talking about rev-parse --symbolic-full-name, not a relative path like symlinks support. [...] >> +Log block format >> +^^^^^^^^^^^^^^^^ >> + >> +Unlike ref and obj blocks, log blocks are always unaligned. >> + >> +Log blocks are variable in size, and do not match the `block_size` >> +specified in the file header or footer. Writers should choose an >> +appropriate buffer size to prepare a log block for deflation, such as >> +`2 * block_size`. > > I can guess the reason behind this design decision, but the readers > may not be able to. Should we write it down here, or would it make > too much irrelevant details? I don't have a strong opinion. It sounds like Han-Wen sees something to explain there, so I suppose it would be nice to spell out. (My take: reflog lookups are not on the critical path for most operations; especially, random accesses do not need to be fast. From a performance perspective, the best we can do is to compress them well to decrease I/O cost, hence there's not much value to alignment.) [...] > This is a tangent but in a repository at hosting provider, whose > primary (and often the only) source of updates are by end-user > pushing into it, if reflogs are enabled, whose name and email are > recorded in the logs? The committer or tagger of the object that > sits at the tip of the ref after the update? What happens when a > blob is pushed to update a ref? Or would it be just a single "user" > that represents the "server operator"? The latter, "server operator" (GIT_COMMITTER_IDENT at the server). Committer in commit objects is forgeable, hence wouldn't be very useful here. > We know in a non-bare repository an individual contributor works on > typically records only one <name, email> in the reflog: the user who > works in it. > > What I am trying to get at is if it makes more sense to have a small > table of unique <name, email> pairs used in the file and have > log_data record a single varint that is the index into that > "committer ident" table. I would suspect that it would give us > significantly more gain than mere <> two bytes per log_data entry. That's true, and a good idea for the next rev of the format. [...] >> +A 68-byte footer appears at the end: >> + >> +.... >> + 'REFT' >> + uint8( version_number = 1 ) >> + uint24( block_size ) >> + uint64( min_update_index ) >> + uint64( max_update_index ) >> + >> + uint64( ref_index_position ) >> + uint64( (obj_position << 5) | obj_id_len ) [...] >> +* `obj_id_len`: number of bytes used to abbreviate object identifiers in >> +obj blocks. > > Should we write "this can be up to 31" somewhere? It is more than > enough for SHA-1 and not quite sufficient for SHA-256 (unless we say > "we store obj_id_len-1 here")? Oh! I'll take a closer look and then follow up. Thanks for looking it over, Jonathan