On 08/07/2017 03:47 AM, Shawn Pearce wrote:
> 6th iteration of the reftable storage format.

Thanks!

> Changes from v5:
> - extensions.refStorage = reftable is used to select this format.
>
> - Log records can be explicitly deleted (for refs/stash).
> - Log records may use Michael Haggerty's chained idea to compress
>   before zlib. This saved ~5.8% on one of my example repositories.

Meh. Do you think that's worth the complexity? The percentage savings will presumably be even lower for repositories that store significant information in their reflog messages.

> [...]
> #### ref record
>
> A `ref_record` describes a single reference, storing both the name and
> its value(s). Records are formatted as:
>
>     varint( prefix_length )
>     varint( (suffix_length << 3) | value_type )
>     suffix
>     value?
>
> [...]
> - `0x0`: deletion; no value data (see transactions, below)
> - `0x1`: one 20-byte object id; value of the ref
> - `0x2`: two 20-byte object ids; value of the ref, peeled target
> - `0x3`: symref and text: `varint( text_len ) text`
>
> Symbolic references use `0x3` with a `text` string starting with
> `"ref: "`, followed by the complete name of the reference target. No
> compression is applied to the target name. Other types of contents
> that are also reference-like, such as `FETCH_HEAD` and `MERGE_HEAD`,
> may also be stored using type `0x3`.

I'm still relatively negative on storing "other" references (except `HEAD`) in reftable. Here are my thoughts:

* "Other" references are not considered for reachability, so there is no need for their modification to be done atomically.

* "Other" references don't have or need reflogs.

* The refs API would have to provide a way for other Git code to read and write "other" references including their extra information, and the users of that information would have to be rewritten to use the new API.

* Presumably there are other programs in the wild (e.g., scripts) that want to read that information.
  They wouldn't be able to extract it from reftable files themselves, so we would also have to provide a command-line tool to read (and write?) such files.

> Types `0x4..0x7` are reserved for future use.

Regardless, I suggest allocating separate `value_type`s for generic symrefs (which then wouldn't need a `ref: ` prefix) vs. for "other" references.

> [...]
> ### Ref index

It wasn't clear to me whether (in the case of a multi-level index) ref index blocks have to be constrained by `block_size` (both in their maximum size and in their alignment). I don't see a reason for that to be required, though of course a compactor implementation might choose to block-align these blocks based on the filesystem that is in use.

For that matter, I don't see an intrinsic reason that object blocks or object index blocks need to be block-aligned. In fact, the only way I can see that the current reftable proposal makes use of `block_size` is so that `obj_record`s can record `block_delta` in units of `block_size` rather than in units of bytes. (And given that I'm skeptical about the value of the object index, that justification seems thin.)

I totally accept that *you* want to align your blocks, and I'm totally supportive of a format that permits a reftable compactor to write reftables that are block-aligned. It just still seems to me that it imposes more complexity than necessary on *other* reftable compactor implementations that don't care about block alignment.

Aside from the object index, I think it would be straightforward to write a reftable reader that is totally ignorant of `block_size`. So unless I've overlooked something, I think the following plan wouldn't cause you any extra trouble, but would make it easier to implement a compactor that doesn't care about block sizes or object indexes:

If a reftable has an object index, then `block_size` must be specified, and ref blocks *must* be aligned to start at multiples of `block_size`.
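To make that rule concrete, here is a tiny sketch (hypothetical helper names, not part of the format) of the invariant a writer would have to satisfy in the object-index case, and of the only way `block_size` is used by a reader, namely resolving an `obj_record`'s `block_delta`:

```python
def check_alignment(has_obj_index, block_size, ref_block_offsets):
    """Proposed rule: if an object index is present, block_size must be
    non-zero and every ref block must start at a multiple of it.
    Without an object index, block_size is only a hint (possibly 0)."""
    if not has_obj_index:
        return True
    if block_size == 0:
        return False
    return all(off % block_size == 0 for off in ref_block_offsets)

def obj_block_offset(block_delta, block_size):
    """An obj_record stores block_delta in units of block_size,
    so a reader converts it to a byte offset by multiplying."""
    return block_delta * block_size
```

(Under this plan a reader that never consults the object index could ignore `block_size` entirely.)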
However, if a reftable has no object index, then its `block_size` is only a hint about the typical block size; e.g., "if you want to read a full block, then try reading `block_size` bytes and you'll probably get the whole thing". And if `block_size` is zero, then readers get no guidance about the typical block size (which would be just fine for an mmap-based reader).

Essentially, choices about block alignment would become a quality-of-implementation issue for reftable compactors, and readers would hardly need to care.

> [...]
> #### index record
>
> An index record describes the last entry in another block.
> Index records are written as:
>
>     varint( prefix_length )
>     varint( (suffix_length << 3) | 0 )
>     suffix
>     varint( block_position )
>
> Index records use prefix compression exactly like `ref_record`.
>
> Index records store `block_position` after the suffix, specifying the
> absolute position in bytes (from the start of the file) of the block
> that ends with this reference.

Is there a reason that the index lists the *last* refname that is contained in a block rather than the *first* refname? I can't think of a reason to choose one vs. the other, but your choice was initially surprising. I don't think it matters either way; I was just curious.

Do I understand correctly that all `block_position`s are *byte* addresses, even in the `ref_index`, where they should all be multiples of the block size (except the zeroth one)? I think that's OK, but note that it will waste more than a byte per `ref_index` and `obj_index` record, on average.

> Readers must examine the block header at `block_position` to determine
> if the next block is another level index block, or the leaf-level ref
> block.

For scanning through a whole namespace, like `refs/tags/`, I guess you only need to use a binary search to find the beginning of the range.
Then you would read serially forwards from there, continuing from one `ref_block` to the next, until you find a refname that doesn't start with `refs/tags/`. In other words, there is no reason to do a binary search to find the end of the namespace, correct? [1] The same approach would be used to scan the reflog of a reference.

[1] I suppose binary searching to find the end of the namespace might be useful for a high-latency filesystem, as you could initiate a pre-fetch for all of the storage blocks that are expected to be needed, rather than initiating the read of the next block only after having processed the previous block.

> [...]
> #### log record
> [...]
>     log_chained {
>       old_id
>       varint( time_seconds )
>       not_same_committer {
>         sint16( tz_offset )
>         varint( name_length ) name
>         varint( email_length ) email
>       }?
>       not_same_message {
>         varint( message_length ) message
>       }?
>     }
> ```
>
> Log record entries use `log_type` to indicate what follows:
>
> - `0x0`: deletion; no log data.
> - `0x1`: standard git reflog data using `log_data` above.
> - `0x2..0x3`: reserved for future use.
> - `0x4..0x7`: `log_chained`, with conditional members.
>
> [...]
> For `log_type = 0x4..0x7` the `log_chained` section is used instead to
> compress information that already appeared in a prior log record. The
> `log_chained` always includes `old_id` for this record, as `new_id` is
> implied by the prior (by file order, more recent) record's `old_id`.
>
> The `not_same_committer` block appears if `log_type & 0x1` is true,
> `not_same_message` block appears if `log_type & 0x2` is true. When
> one of these blocks is missing, its values are implied by the prior
> (more recent) log record.

Just to make sure that we are on the same page... `old_id` and `new_id` in adjacent reflog entries are not always identical.
If you run `git reflog expire` or `git reflog delete` without the `--rewrite` option, then the to-be-deleted entries are just dropped, without changing the neighboring entries to chain together again. This would have to be supported in your proposal by writing a full `log_data` record for the entry following such a gap.

> [...]
> #### Importing logs
>
> When importing from `$GIT_DIR/logs` writers should globally order all
> log records roughly by timestamp while preserving file order, and
> assign unique, increasing `update_index` values for each log line.
> Newer log records get higher `update_index` values.
>
> Although an import may write only a single reftable file, the reftable
> file must span many unique `update_index`, as each log line requires
> its own `update_index` to preserve semantics.

Thinking out loud here: a really high-quality importer might want to group together, under the same `update_index`, ref updates that are thought originally to have been done in the same transaction.

* Only entries with the same timestamps and log messages should be grouped together.

* There should not be more than one update to a particular reference in a single transaction.

* Ideally, it would avoid creating states between `update_index`es where references D/F-conflict with each other. (Given that reflogs can be expired, this is not always possible in the general case.)

* It could theoretically even reconstruct "branch rename" operations that required temporary reference names into a single `update_index`.

But I doubt that it is worth the effort. (The whole idea gives me nasty flashbacks from working on cvs2svn/cvs2git.)

> [...]
> ### Layout
>
> The `$GIT_DIR/refs` path is a file when reftable is configured, not a
> directory. This prevents loose references from being stored.
>
> A collection of reftable files are stored in the `$GIT_DIR/reftable/`
> directory:
>
>     00000001_UF4paF
>     00000002_bUVgy4
>
> where reftable files are named by a unique name such as produced by
> the function:
>
>     mktemp "${update_index}_XXXXXX"

Please note that if reflogs are compacted into a separate "reflog-only" file, then the same `update_index` might appear in the filename of both a "reflog-only" file and a value-only file. I don't think that this is a problem, but we shouldn't bake assumptions about uniqueness into the system.

I'm a little bit worried that users might automatically think that a filename that includes a string like `UF4paF` is a temporary file and "clean it up" (with disastrous consequences). It might be prudent to give the files names that don't look so garbagy. And wouldn't it be nice to tack a filename extension onto the end of these filenames to make them more easily recognizable?

> The stack ordering file is `$GIT_DIR/refs` and lists the current
> files, one per line, in order, from oldest (base) to newest (most
> recent):
>
>     $ cat .git/refs
>     00000001_UF4paF
>     00000002_bUVgy4
>
> Readers must read `$GIT_DIR/refs` to determine which files are
> relevant right now, and search through the stack in reverse order
> (last reftable is examined first).
>
> Reftable files not listed in `refs` may be new (and about to be added
> to the stack by the active writer), or ancient and ready to be pruned.

It might be good to think about how readers should work. The easy implementation would be:

1. Open and read the `refs` file.
2. Open each of the reftable files that it mentions.
3. If any of the files is missing, go to 1 (with some checks to avoid infinite loops).
4. Read from the now-open files as long as you need to.

This would give you a self-consistent snapshot of the global reference state.
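The steps above could be sketched roughly like this (a hypothetical illustration only; the function name and retry limit are invented, and error handling is minimal):

```python
import os

def open_stack(git_dir, max_retries=5):
    """Open a self-consistent snapshot of the reftable stack.

    Re-reads $GIT_DIR/refs and starts over if any listed file has
    disappeared in the meantime (e.g., a compaction raced with us),
    with a retry limit to avoid looping forever.
    """
    for _ in range(max_retries):
        with open(os.path.join(git_dir, "refs")) as f:
            names = f.read().split()
        handles = []
        try:
            # Search order is newest-first, so open in reverse order.
            for name in reversed(names):
                handles.append(
                    open(os.path.join(git_dir, "reftable", name), "rb"))
        except FileNotFoundError:
            for h in handles:
                h.close()
            continue  # the stack changed under us; re-read `refs`
        return handles
    raise RuntimeError("reftable stack kept changing; giving up")
```

Once `open_stack()` returns, the open file handles pin a snapshot (on POSIX systems, even files unlinked by a later compaction remain readable through them).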
However, a long-running program (especially one dealing with reachability) might want to check at strategic moments that the `refs` file hasn't changed out from under it while it was running, or even lock the `refs` file during critical operations.

It would be possible to avoid opening all of the reftable files right away, in the hope that the reference that you seek is in one of the top few files. But this quickly gets tricky, because you might read some references from the top reftable, then need to dig deeper for another reference, only to find out that one of the deeper files is no longer in the stack. So in one program run you might end up seeing reference values from two different snapshots.

> [...]

Michael