Re: reftable [v6]: new ref storage format

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Mon, 14 Aug 2017 14:13:23 +0200

On 08/07/2017 03:47 AM, Shawn Pearce wrote:
> 6th iteration of the reftable storage format.

Thanks!

> Changes from v5:
> - extensions.refStorage = reftable is used to select this format.
> 
> - Log records can be explicitly deleted (for refs/stash).
> - Log records may use Michael Haggerty's chained idea to compress before zlib.
>   This saved ~5.8% on one of my example repositories.

Meh. Do you think that's worth the complexity? The percentage savings
will presumably be even lower for repositories that store significant
information in their reflog messages.

> [...]
> #### ref record
> 
> A `ref_record` describes a single reference, storing both the name and
> its value(s). Records are formatted as:
> 
>     varint( prefix_length )
>     varint( (suffix_length << 3) | value_type )
>     suffix
>     value?
> 
> [...]
> - `0x0`: deletion; no value data (see transactions, below)
> - `0x1`: one 20-byte object id; value of the ref
> - `0x2`: two 20-byte object ids; value of the ref, peeled target
> - `0x3`: symref and text: `varint( text_len ) text`
> 
> Symbolic references use `0x3` with a `text` string starting with `"ref: "`,
> followed by the complete name of the reference target.  No
> compression is applied to the target name.  Other types of contents
> that are also reference like, such as `FETCH_HEAD` and `MERGE_HEAD`,
> may also be stored using type `0x3`.

I'm still relatively negative on storing "other" references (except
`HEAD`) in reftable. Here are my thoughts:

* "Other" references are not considered for reachability, so there
  is no need for their modification to be done atomically.

* "Other" references don't have or need reflogs.

* The refs API would have to provide a way for other Git code to
  read and write "other" references including their extra
  information, and the users of that information would have to
  be rewritten to use the new API.

* Presumably there are other programs in the wild (e.g., scripts)
  that want to read that information. They wouldn't be able to
  extract it from reftable files themselves, so we would also have
  to provide a command-line tool to read (and write?) such files.

> Types `0x4..0x7` are reserved for future use.

Regardless, I suggest allocating separate `value_type`s for generic
symrefs (which then wouldn't need a `ref: ` prefix) vs. for "other"
references.

> [...]
> ### Ref index

It wasn't clear to me whether (in the case of a multi-level index) ref
index blocks have to be aligned in `block_size` blocks (both their
maximum size and their alignment). I don't see a reason for that to be
required, though of course a compactor implementation might choose to
block-align these blocks based on the filesystem that is in use.

For that matter, I don't see an intrinsic reason that object blocks or
object index blocks need to be block aligned.

In fact, the only way I can see that the current reftable proposal makes
use of `block_size` is so that `obj_record`s can record `block_delta` in
units of `block_size` rather than in units of bytes. (And given that I'm
skeptical about the value of the object index, that justification seems
thin.)

I totally accept that *you* want to align your blocks, and I'm totally
supportive of a format that permits a reftable compactor to write
reftables that are block-aligned. It just still seems to me that it
imposes more complexity than necessary on *other* reftable compactor
implementations that don't care about block alignment.

Aside from the object index, I think it would be straightforward to
write a reftable reader that is totally ignorant of `block_size`.

So unless I've overlooked something, I think the following plan wouldn't
cause you any extra trouble, but would make it easier to implement a
compactor that doesn't care about block sizes or object indexes:

If a reftable has an object index, then `block_size` must be specified,
and ref blocks *must* be aligned to start at multiples of `block_size`.

However, if a reftable has no object index, then its `block_size` is
only a hint about the typical block size; e.g., "if you want to read a
full block, then try reading `block_size` and you'll probably get the
whole thing". And if `block_size` is zero, then readers get no guidance
about the typical block size (which would be just fine for an mmap-based
reader).

Essentially, choices about block alignment would become a
quality-of-implementation issue for reftable compactors, and readers
would hardly need to care.

> [...]
> #### index record
> 
> An index record describes the last entry in another block.
> Index records are written as:
> 
>     varint( prefix_length )
>     varint( (suffix_length << 3) | 0 )
>     suffix
>     varint( block_position )
> 
> Index records use prefix compression exactly like `ref_record`.
> 
> Index records store `block_position` after the suffix, specifying the
> absolute position in bytes (from the start of the file) of the block
> that ends with this reference.

Is there a reason that the index lists the *last* refname that is
contained in a block rather than the *first* refname? I can't think of a
reason to choose one vs. the other, but your choice was initially
surprising. I don't think it matters either way; I was just curious.

Do I understand correctly that all `block_position`s are *byte*
addresses, even in the `ref_index` where they should all be multiples of
the block size (except the zeroth one)? I think that's OK, but note that
it will waste more than a byte per `ref_index` and `obj_index` record,
on average.

> Readers must examine the block header at `block_position` to determine
> if the next block is another level index block, or the leaf-level ref
> block.

For scanning through a whole namespace, like `refs/tags/`, I guess you
only need to use a binary search to find the beginning of the range.
Then you would read serially forwards from there, continuing from one
`ref_block` to the next, until you find a refname that doesn't start
with `refs/tags/`. In other words, there is no reason to binary search
to find the end of the namespace, correct? [1]

The same approach would be used to scan the reflog of a reference.

[1] I suppose binary searching to find the end of the namespace might be
useful for a high-latency filesystem, as you could initiate a pre-fetch
for all of the storage blocks that are expected to be needed rather than
initiating the read of the next block only after having processed the
previous block.

> [...]
> #### log record
> [...]
>     log_chained {
>       old_id
>       varint( time_seconds )
>       not_same_committer {
>         sint16( tz_offset )
>         varint( name_length    )  name
>         varint( email_length   )  email
>       }?
>       not_same_message {
>         varint( message_length )  message
>       }?
>     }
> ```
> 
> Log record entries use `log_type` to indicate what follows:
> 
> - `0x0`: deletion; no log data.
> - `0x1`: standard git reflog data using `log_data` above.
> - `0x2..0x3`: reserved for future use.
> - `0x4..0x7`: `log_chained`, with conditional members.
> 
> [...]
> For `log_type = 0x4..0x7` the `log_chained` section is used instead to
> compress information that already appeared in a prior log record.  The
> `log_chained` always includes `old_id` for this record, as `new_id` is
> implied by the prior (by file order, more recent) record's `old_id`.
> 
> The `not_same_committer` block appears if `log_type & 0x1` is true,
> `not_same_message` block appears if `log_type & 0x2` is true.  When
> one of these blocks is missing, its values are implied by the prior
> (more recent) log record.

Just to make sure that we are on the same page...

`old_id` and `new_id` in adjacent reflog entries are not always
identical. If you run `git reflog expire` or `git reflog delete` without
the `--rewrite` option, then the to-be-deleted entries are just dropped
without changing the neighboring entries to chain together again.

This would have to be supported in your proposal by writing a full
`log_data` record for the entry following such a gap.

> [...]
> #### Importing logs
> 
> When importing from `$GIT_DIR/logs` writers should globally order all
> log records roughly by timestamp while preserving file order, and
> assign unique, increasing `update_index` values for each log line.
> Newer log records get higher `update_index` values.
> 
> Although an import may write only a single reftable file, the reftable
> file must span many unique `update_index`, as each log line requires
> its own `update_index` to preserve semantics.

Thinking out loud here: A really high-quality importer might want to
group together, under the same `update_index`, ref updates that are
thought originally to have been done in the same transaction.

* Only group entries with the same timestamps and log messages
  should be grouped together.

* There should not be more than one update to a particular
  reference in a single transaction.

* Ideally, it would avoid creating states between `update-index`es
  where references D/F-conflict with each other. (Given that reflogs
  can be expired, this is not always possible in the general case.)

* It could theoretically even reconstruct "branch rename" operations
  that required temporary reference names into a single `update_index`.

But I doubt that it is worth the effort. (The whole idea gives me nasty
flashbacks from working on cvs2svn/cvs2git.)

> [...]
> ### Layout
> 
> The `$GIT_DIR/refs` path is a file when reftable is configured, not a
> directory.  This prevents loose references from being stored.
> 
> A collection of reftable files are stored in the `$GIT_DIR/reftable/`
> directory:
> 
>     00000001_UF4paF
>     00000002_bUVgy4
> 
> where reftable files are named by a unique name such as produced by
> the function:
> 
>     mktemp "${update_index}_XXXXXX"

Please note that if reflogs are compacted into a separate "reflog-only"
file, then the same `update_index` might appear in the filename of both
a "reflog-only" file and a value-only file. I don't think that this is a
problem, but we shouldn't bake assumptions about uniqueness into the system.

I'm a little bit worried that users might automatically think that a
filename that includes a string like `UF4paF` is a temporary file and
"clean it up" (with disastrous consequences). It might be prudent to
give the files names that don't look so garbagy.

And wouldn't it be nice to tack a filename extension onto the end of
these filenames to make them more easily recognizable?

> The stack ordering file is `$GIT_DIR/refs` and lists the current
> files, one per line, in order, from oldest (base) to newest (most
> recent):
> 
>     $ cat .git/refs
>     00000001_UF4paF
>     00000002_bUVgy4
> 
> Readers must read `$GIT_DIR/refs` to determine which files are
> relevant right now, and search through the stack in reverse order
> (last reftable is examined first).
> 
> Reftable files not listed in `refs` may be new (and about to be added
> to the stack by the active writer), or ancient and ready to be pruned.

It might be good to think about how readers should work. The easy
implementation would be:

1. Open and read the `refs` file
2. Open each of the reftable files that it mentions
3. If any of the files is missing, goto 1 (with some checks to avoid
   infinite loops).
4. Read from the now-open files as long as you need to.

This would give you a self-consistent snapshot of the global reference
state. However, a long-running program (especially one dealing with
reachability) might want to check at strategic moments that the `refs`
file hasn't changed out from under it while it was running, or even lock
the `refs` file during critical operations.

It would be possible to avoid opening all of the reftable files right
away in the hope that the reference that you seek is in one of the top
few files. But this quickly gets tricky because you might read some
references from the top reftable, then need to dig deeper for another
reference only to find out that one of the deeper files is no longer in
the stack. So in one program run you might end up seeing reference
values from two different snapshots.

> [...]

Michael