Re: [PATCH v13 04/13] reftable: file format documentation

Jonathan Nieder <jrnieder@xxxxxxxxx> · Wed, 20 May 2020 11:52:02 -0700

Junio C Hamano wrote:
> Han-wen wrote:

>> From: Jonathan Nieder <jrnieder@xxxxxxxxx>
>>
>> Shawn Pearce explains:
>>
>> Some repositories contain a lot of references (e.g. android at 866k,
>> rails at 31k). The reftable format provides:
>>
>> - Near constant time lookup for any single reference, even when the
>>   repository is cold and not in process or kernel cache.
>> - Near constant time verification a SHA-1 is referred to by at least
>>   one reference (for allow-tip-sha1-in-want).
>
> Not quite grammatical sentence?  Perhaps "if" after "verification"?

Good catch, thanks.

[...]
>> using pandoc 2.2.1.  The result required the following additional
>> minor changes:
>>
>> - removed the [TOC] directive to add a table of contents, since
>>   asciidoc does not support it
>> - replaced git-scm.com/docs links with linkgit: directives that link
>>   to other pages within Git's documentation
>
> There are many
>
> 	’
>
> funny-quotes where we would prefer to place vanilla single quotes,
> which may also need to be corrected in the conversion toolchain.

Looks like Han-Wen is taking care of this (thanks!).

> Typos pointed out below probably come from the original, where
> they should be corrected.

I'm happy to do one final update to the doc in JGit to match what we
end up with and then replace it with a pointer to Git's copy once that
lands.

[...]
>> +Repositories with many loose references occupy a large number of disk
>> +blocks from the local file system, as each reference is its own file
>> +storing 41 bytes (and another file for the corresponding reflog). This
>> +negatively affects the number of inodes available when a large number of
>> +repositories are stored on the same filesystem. Readers can be penalized
>> +due to the larger number of syscalls required to traverse and read the
>> +`$GIT_DIR/refs` directory.
>
> Another downside is that we cannot arrange atomic updates to
> multiple refs over loose refs, even though the "lookup of a single
> reference does not require linear scan" unlike packed-refs, (as long
> as the filesystem does its job).  Worth mentioning?

Yes, this was another major part of the motivation (avoiding the
complication of the "atomic" multi-ref updates to packed-refs that Git
and JGit had to learn).

[...]
>> +References stored in a reftable are peeled, a record for an annotated
>> +(or signed) tag records both the tag object, and the object it refers
>> +to.
>
> OK.  Peeled results are recorded in packed-refs file because quite
> often when we use a tag object, what we actually want to access is
> the commit object it points at.  We do so here for the same reason?
>
> Not a rhetorical question, but if it invites a question from a
> reader, it may deserve to be described before readers ask it.

For a single tag ref, peeling to a commit is not very expensive.  But
for batch lookups, e.g. when serving a response to an ls-remote
request, it adds up, and having the peeled results recorded helps.

[...]
>> +Directory/file conflicts
>> +^^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +The reftable format accepts both `refs/heads/foo` and
>> +`refs/heads/foo/bar` as distinct references.
>> +
>> +This property is useful for retaining log records in reftable, but may
>> +confuse versions of Git using `$GIT_DIR/refs` directory tree to maintain
>> +references. Users of reftable may choose to continue to reject `foo` and
>> +`foo/bar` type conflicts to prevent problems for peers.
[...]
> "users ... may choose" implies that it is up to the implementation
> of reftable user which one to show, so given a single repository,
> "jgit" may show "refs/heads/foo" while "libgit2" may choose to show
> the other one.
>
> I am not sure if that is desirable---I suspect that we want to
> record which one needs to be chosen so that these "D/F conflicts
> disallowing" users can make consistent choices, but I dunno.

Yes, I think it would be better to explicitly say that Git will continue
to reject D/F conflicts for refs (*not* reflogs) even though the format
can support them in principle.

If we choose to permit them some day in the future, I believe that would
be a separate repository format extension and protocol capability to
avoid confusing old versions of Git.
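
For illustration, the check Git would keep applying to refs could look
roughly like the Python sketch below.  The function name
has_df_conflict and the in-memory set of ref names are assumptions
made up for this example; this is not code from Git or from any
reftable implementation.

```python
def has_df_conflict(existing_refs, new_ref):
    """Return True if new_ref and an existing ref would need the same
    path to be both a file and a directory under $GIT_DIR/refs."""
    for ref in existing_refs:
        if ref == new_ref:
            continue  # updating an existing ref is not a conflict
        # new_ref would have to be a directory containing ref ...
        if ref.startswith(new_ref + "/"):
            return True
        # ... or ref would have to be a directory containing new_ref.
        if new_ref.startswith(ref + "/"):
            return True
    return False


refs = {"refs/heads/foo", "refs/heads/main"}
print(has_df_conflict(refs, "refs/heads/foo/bar"))
print(has_df_conflict(refs, "refs/heads/foobar"))
```

Comparing against `name + "/"` rather than the bare name is what keeps
`refs/heads/foobar` from being treated as a conflict with
`refs/heads/foo`.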

[...]
>> +Symbolic references use `0x3`, followed by the complete name of the
>> +reference target. No compression is applied to the target name.
>
> Is there a place in the file format where an incomplete name can be
> stored?  If not, I think it makes it easier to read if we drop
> "complete" from the sentence.

The sentence about "no compression" covers the lack of prefix encoding,
so I suppose I agree.

Might make sense to say "full name" to convey that we're talking about
rev-parse --symbolic-full-name, not a relative path like symlinks
support.

[...]
>> +Log block format
>> +^^^^^^^^^^^^^^^^
>> +
>> +Unlike ref and obj blocks, log blocks are always unaligned.
>> +
>> +Log blocks are variable in size, and do not match the `block_size`
>> +specified in the file header or footer. Writers should choose an
>> +appropriate buffer size to prepare a log block for deflation, such as
>> +`2 * block_size`.
>
> I can guess the reason behind this design decision, but the readers
> may not be able to.  Should we write it down here, or would it make
> too much irrelevant details?

I don't have a strong opinion.  It sounds like Han-Wen sees something to
explain there, so I suppose it would be nice to spell out.

(My take: reflog lookups are not on the critical path for most
operations; especially, random accesses do not need to be fast.  From a
performance perspective, the best we can do is to compress them well to
decrease I/O cost, hence there's not much value to alignment.)

[...]
> This is a tangent but in a repository at hosting provider, whose
> primary (and often the only) source of updates are by end-user
> pushing into it, if reflogs are enabled, whose name and email are
> recorded in the logs?  The committer or tagger of the object that
> sits at the tip of the ref after the update?  What happens when a
> blob is pushed to update a ref?  Or would it be just a single "user"
> that represents the "server operator"?

The latter, "server operator" (GIT_COMMITTER_IDENT at the server).

Committer in commit objects is forgeable, hence wouldn't be very
useful here.

> We know in a non-bare repository an individual contributor works on
> typically records only one <name, email> in the reflog: the user who
> works in it.
>
> What I am trying to get at is if it makes more sense to have a small
> table of unique <name, email> pairs used in the file and have
> log_data record a single varint that is the index into that
> "committer ident" table.  I would suspect that it would give us
> significantly more gain than mere <> two bytes per log_data entry.

That's true, and a good idea for the next rev of the format.
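
As a rough sketch of the idea (nothing like this exists in the format
today): collect the unique <name, email> pairs into a table and store
only an index per log entry.  The function names and the LEB128-style
varint below are assumptions for illustration; the format's own varint
encoding may differ.

```python
def encode_varint(n):
    """LEB128-style varint, used here only to estimate sizes
    (an assumption; not necessarily reftable's varint)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def build_ident_table(entries):
    """Return (table, indexes): the unique (name, email) pairs in
    first-seen order, and for each entry the index of its pair."""
    table, positions, indexes = [], {}, []
    for ident in entries:
        if ident not in positions:
            positions[ident] = len(table)
            table.append(ident)
        indexes.append(positions[ident])
    return table, indexes


def ident_size(ident):
    name, email = ident
    return len(name.encode()) + len(email.encode())


# A hosting server typically logs a single "server operator" identity.
entries = [("Server Operator", "git@example.com")] * 1000
table, indexes = build_ident_table(entries)

inline = sum(ident_size(i) for i in entries)
indexed = (sum(ident_size(i) for i in table)
           + sum(len(encode_varint(i)) for i in indexes))
print(len(table), inline, indexed)
```

This only counts uncompressed bytes.  Log blocks are deflated, which
presumably already squeezes repeated identities, so the real gain would
need measuring.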

[...]
>> +A 68-byte footer appears at the end:
>> +
>> +....
>> +    'REFT'
>> +    uint8( version_number = 1 )
>> +    uint24( block_size )
>> +    uint64( min_update_index )
>> +    uint64( max_update_index )
>> +
>> +    uint64( ref_index_position )
>> +    uint64( (obj_position << 5) | obj_id_len )
[...]
>> +* `obj_id_len`: number of bytes used to abbreviate object identifiers in
>> +obj blocks.
>
> Should we write "this can be up to 31" somewhere?  It is more than
> enough for SHA-1 and not quite sufficient for SHA-256 (unless we say
> "we store obj_id_len-1 here")?

Oh!  I'll take a closer look and then follow up.
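
To make the limit concrete, here is a small Python sketch of the
packing in that footer field.  The helper names are made up for the
example, and the "store obj_id_len - 1" variant at the end is only the
hypothetical raised above, not something the format specifies.

```python
OBJ_ID_LEN_BITS = 5
OBJ_ID_LEN_MASK = (1 << OBJ_ID_LEN_BITS) - 1  # 31


def pack_obj_field(obj_position, obj_id_len):
    """Pack as uint64((obj_position << 5) | obj_id_len)."""
    if not 0 <= obj_id_len <= OBJ_ID_LEN_MASK:
        raise ValueError("obj_id_len %d does not fit in 5 bits" % obj_id_len)
    if not 0 <= obj_position < 1 << (64 - OBJ_ID_LEN_BITS):
        raise ValueError("obj_position does not fit in 59 bits")
    return (obj_position << OBJ_ID_LEN_BITS) | obj_id_len


def unpack_obj_field(field):
    """Inverse of pack_obj_field: returns (obj_position, obj_id_len)."""
    return field >> OBJ_ID_LEN_BITS, field & OBJ_ID_LEN_MASK


# A full SHA-1 (20 bytes) fits; a full SHA-256 (32 bytes) does not.
print(unpack_obj_field(pack_obj_field(4096, 20)))
try:
    pack_obj_field(4096, 32)
except ValueError as err:
    print(err)

# Hypothetical "store obj_id_len - 1" variant: covers lengths 1..32.
print(unpack_obj_field(pack_obj_field(4096, 32 - 1))[1] + 1)
```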

Thanks for looking it over,
Jonathan