Re: reftable [v2]: new ref storage format

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 17 Jul 2017 12:51:42 -0700

Shawn Pearce <spearce@xxxxxxxxxxx> writes:

> This is an updated draft after discussion on list with Peff, Michael
> Haggerty, and Dave Borowitz.
>
> You can read a rendered version of this here:
> https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md
>
> Biggest changes from the first proposal are:
>
> - reflog is now integrated into reftable
> - block headers slightly different
> - Peff's stack management idea is used
> - Michael's compaction idea is used

Just a few comments.

> A variable number of 4-byte `restart_offset` values follows the
> records.  Offsets are relative to the start of the block (0 in first
> block to include file header) and refer to the first byte of any
> `ref_record` whose name has not been prefixed compressed.  Readers can
> start linear scans from any of these records.

It is unclear what "0 in first block to include file header" wants
to say.  Do I write "8" if I want to express the offset of the first
record in the first block, or do I write "0"?

> The 2-byte `number_of_restarts + 1` stores the number of entries in
> the `restart_offset` list.

It is unclear whose responsibility it is to add this "1".  Does this
mean a reader thinks there is one entry in the restart table when it
sees "0" in the number_of_restarts field (hence you can have max
65536 entries in total)?

> Readers can use the restart count to binary search between restarts
> before starting a linear scan.  The `number_of_restarts` field must be
> the last 2 bytes of the block as specified by `block_len` from the
> block header.

Does the new term "restart count" mean the same thing as
number_of_restarts?

> ### Log block format
>
> A log block is written as:
>
>     'g'
>     uint24( block_len )
>     zlib_deflate {
>       log_record*
>       int32( restart_offset )*
>       int16( number_of_restarts )
>     }
>
> Log blocks look similar to ref blocks, except `block_type = 'g'`.

Is there a general recommended strategy for writers to choose how
many entries to include in a single physical block?  I understand
that the deflated whole must fit in the physical block whose length
is defined in the footer of the whole file, and in general you would
not know how small the data deflates down to before compressing,
right?

> Log record keys are structured as:
>
>     ref_name '\0' reverse_int32( time_sec )
>
> where `time_sec` is the update time in seconds since the epoch.  The
> `reverse_int32` function inverses the value so lexographical ordering
> the network byte order time sorts more recent records first:
>
>     reverse_int(int32 t) {
>       return 0xffffffff - t;
>     }

Is 2038 an issue, or by that time we'd all be retired together with
this file format and it won't be our problem?

As the file format uses delta compression with restarts, a reader
needs to sequencially scan some bounded number of entries to get the
contents of a specific entry anyway, so I am wondering if it is
worth storing a longer timestamp in varint() for an restart entry
and express the timestamp on delta entries as difference to the
previous entry.

> ### Log index
>
> The log index stores the log key (`refname \0 reverse_int32(time_sec)`)
> for the last log record of every log block in the file, supporting
> bounded-time lookup.

This assumes that timestamps never wildly rewind in the reflog,
doesn't it?  Is that a sensible assumption?

Or does "the last log record" in the above mean "the log record with
largest timestamp?  ... ah, not that is still not sufficient.  You'd
still need to assume that timestamps on entries in one log block must
be all older than the ones on entries in later log blocks.  Hmph...

Also it is not clear to me how reflogs for two refs would be
intermixed in the log blocks, and what log keys for the entries must
be recorded in the log index, to facilitate efficient lookup.  Is it
assumed that a consecutive sequence of log blocks record reflogs for
the same ref, before the sequence of log blocks switch to record
reflogs for another ref, or something?

> A log index block must be written if 2 or more log blocks are written
> to the file.  If present, the log index appears after the first log
> block.  There is no padding used to align the log index to block
> alignment.
>
> Log index format is identical to ref index, except the keys are 5
> bytes longer to include `'\0'` and the 4-byte `reverse_int32(time)`.
> Records use `block_offset` to refer to the start of a log block.

I am assuming that we do not care about being able to quickly
determine master@{24028}; I would imagine that it wouldn't be too
hard to add an index to help such query, but I offhand would not
know the details until I figure out how the format handles reflog
entries for multiple refs first.