Re: The design of refs backends, linked worktrees and submodules

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Thu, 19 Jan 2017 14:30:20 +0100

On 01/19/2017 12:55 PM, Duy Nguyen wrote:
> I've started working on fixing the "git gc" issue with multiple
> worktrees, which brings me back to this. Just some thoughts. Comments
> are really appreciated.
> 
> In the current code, files backend has special cases for both
> submodules (explicitly) and linked worktrees (hidden behind git_path).

There is another terrible hack also needed to implement linked
worktrees, namely that the `refs/bisect/` hierarchy is manually inserted
into the `ref_cache`, because otherwise it wouldn't be noticed when
iterating over loose references via `readdir()`.

Other similar hacks would be required if other reference subtrees are
declared to be per-worktree.

> But if a backend has to handle this stuff, all future backends have to
> too. Which does not sound great. Imagine we have "something" in
> addition to worktrees and submodules in future, then all backends have
> to learn about it.

Agreed, the status quo is not pretty.

I kind of think that it would have been a better design to store the
references for all linked worktrees in the main repository's ref-store.
For example, the "bisect" refs for a worktree named "<name>" could have
been stored under "refs/worktrees/<name>/bisect/*". Then either:

* teach the associated tools to read/write references there directly
(probably with DWIM rules to make command-line use easier), or
* treat these references as if they were actually at a standard place
like `refs/worktree/bisect/*`; i.e., users would need to know that they
were per-worktree references, but wouldn't need to worry about the true
locations, or
* treat these references as if they were actually in their traditional
locations (though it is not obvious how this scheme could be expanded to
cover new per-worktree references).
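
To make the third option concrete, here is a rough sketch of the name
translation a front end would have to do. It is only an illustration:
`map_worktree_refname()` and the hard-coded `refs/bisect/` prefix are
made up for this example and do not exist in git. The hard-coded prefix
also shows the weakness mentioned above, since every new per-worktree
subtree would need another entry.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of git): translate a per-worktree
 * refname in its traditional location, e.g. "refs/bisect/bad", into
 * the name it would have in the main repository's ref-store under
 * the layout sketched above, "refs/worktrees/<name>/bisect/bad".
 *
 * Returns 0 on success, or -1 if refname is not a per-worktree ref
 * or the result does not fit in out.
 */
int map_worktree_refname(const char *worktree, const char *refname,
			 char *out, size_t outlen)
{
	static const char prefix[] = "refs/bisect/";
	size_t prefix_len = sizeof(prefix) - 1;
	int n;

	if (strncmp(refname, prefix, prefix_len))
		return -1; /* shared ref; nothing to translate */

	n = snprintf(out, outlen, "refs/worktrees/%s/bisect/%s",
		     worktree, refname + prefix_len);
	if (n < 0 || (size_t)n >= outlen)
		return -1; /* output would have been truncated */
	return 0;
}
```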

> So how about we move worktree and submodule support back to front-end?
> 
> The backend deals with refs, period. The file-based backend will be
> given a directory where refs live in and it work on that. Backends do
> not use git_path(). Backends do not care about $GIT_DIR. Access to odb
> (e.g. sha-1 validation) if needed is abstracted out via a set of
> callbacks. This allows submodules to give access to submodule's
> separate odb. And it's getting close to the "struct repository"
> mentioned somewhere in refs "TODO" comments, even though we are
> nowhere close to introducing that struct.

This is a topic that I have thought a lot about. I definitely like this
direction. In fact I've dabbled around with some first steps; see branch
`submodule-hash` in my fork on GitHub [1]. That branch associates a
`ref_store` more closely with the directory where the references are
stored, as opposed to having a 1:1 relationship between `ref_store`s and
submodules.

I would like to see "iterate over all reachability roots" split out as a
concept that is independent of other ref iteration. Such an iteration
wouldn't have to report reference names, except for use in error
messages. So for linked worktrees, in place of the reference name
it might emit a string like "<worktree>:<refname>". (Of course it would
get its information by iterating over all of the linked reference stores
using their reference iteration APIs.)
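
A minimal sketch of what such an iteration could look like follows. All
of the names (`toy_ref`, `toy_ref_store`, `for_each_reachability_root()`,
`record_root()`) are invented for illustration, and the array-based
"stores" stand in for real ref-stores that would be walked via their
iteration APIs.

```c
#include <stdio.h>

/* Hypothetical, simplified stand-ins; not git's real structures. */
struct toy_ref {
	const char *refname;
	const char *oid;	/* object name as a hex string */
};

struct toy_ref_store {
	const char *worktree;	/* NULL for the shared (main) store */
	const struct toy_ref *refs;
	size_t nr;
};

/* The label is only meant for error messages, not for lookups. */
typedef int (*root_fn)(const char *oid, const char *label, void *cb_data);

/* Visit every root in every store; stop early if fn returns nonzero. */
int for_each_reachability_root(const struct toy_ref_store *stores,
			       size_t nr_stores, root_fn fn, void *cb_data)
{
	char label[256];
	size_t i, j;

	for (i = 0; i < nr_stores; i++) {
		const struct toy_ref_store *s = &stores[i];

		for (j = 0; j < s->nr; j++) {
			int ret;

			if (s->worktree)
				snprintf(label, sizeof(label), "%s:%s",
					 s->worktree, s->refs[j].refname);
			else
				snprintf(label, sizeof(label), "%s",
					 s->refs[j].refname);
			ret = fn(s->refs[j].oid, label, cb_data);
			if (ret)
				return ret;
		}
	}
	return 0;
}

/* Example callback: count the roots and remember the last label. */
struct root_stats {
	size_t count;
	char last_label[256];
};

int record_root(const char *oid, const char *label, void *cb_data)
{
	struct root_stats *stats = cb_data;

	(void)oid; /* a real caller would mark this object as reachable */
	stats->count++;
	snprintf(stats->last_label, sizeof(stats->last_label), "%s", label);
	return 0;
}
```

A caller such as "git gc" would only care about the object names; the
label exists so that a complaint about a broken root can say which
worktree it came from.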

> For that to work, I'll probably need to add a "composite" ref_store
> that combines two file-based backends (for per-repo and per-worktree
> refs) to represent a unified ref store. I think your work on ref
> iterator makes way for that. A bit worried about transactions though,
> because I think per-repo and per-worktree updates will be separated in
> two transactions. But that's probably ok because future backends, like
> lmdb, will have to go through the same route anyway.

Yes, that was the main motivation for the ref-iterator work.

Regarding transactions, the commit pointed at by branch
`split-transaction` in my fork shows how I think the
`transaction_commit()` method could be split into two parts,
`transaction_prepare()` and `transaction_finish()`. The idea would be
that the driver function, `ref_transaction_commit()`, calls
`transaction_prepare()` for each `ref_store` involved in the
transaction, passing each one the reference updates for references that
live in that reference store. Those methods would verify that the part
of the transaction that lives in that ref-store "should" go through,
without actually committing anything. Then `transaction_finish()` would
be called on each ref store, and that method would commit the updates.
You probably couldn't get a bulletproof kind of compound transaction out
of this (e.g., if the computer's power goes out, one `ref_store`'s
updates might be committed but another one's not). But it would probably
be good enough to cover everyday reasons for transaction failures, like
the expected old values (pre-checksums) not matching up.
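
Roughly, the driver would have the following shape. This is a toy model
rather than the code on the branch: `toy_store` and the `toy_*`
functions are invented for illustration, and the abort step is an
assumption of this sketch (something has to release whatever an
already-prepared store is holding when a later store refuses).

```c
#include <stddef.h>

/* Hypothetical, simplified ref-store; not git's real interface. */
struct toy_store {
	int fail_prepare;	/* simulate e.g. an old-value mismatch */
	int prepared;
	int committed;
};

/* Verify that this store's part "should" go through; write nothing. */
int toy_transaction_prepare(struct toy_store *s)
{
	if (s->fail_prepare)
		return -1;
	s->prepared = 1;
	return 0;
}

/* Commit the updates that were verified by the prepare step. */
void toy_transaction_finish(struct toy_store *s)
{
	s->committed = 1;
	s->prepared = 0;
}

/* Assumed helper: release whatever the prepare step was holding. */
void toy_transaction_abort(struct toy_store *s)
{
	s->prepared = 0;
}

int toy_transaction_commit(struct toy_store **stores, size_t nr)
{
	size_t i;

	/* Phase 1: every store must agree before anything is written. */
	for (i = 0; i < nr; i++) {
		if (toy_transaction_prepare(stores[i])) {
			/* Undo the stores that had already prepared. */
			while (i-- > 0)
				toy_transaction_abort(stores[i]);
			return -1;
		}
	}

	/*
	 * Phase 2: not atomic across stores. A crash in this loop can
	 * leave some stores committed and others not.
	 */
	for (i = 0; i < nr; i++)
		toy_transaction_finish(stores[i]);
	return 0;
}
```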

Let me braindump some more information about this topic. A files backend
for a repository (ignoring submodules for the moment) currently consists
of five interacting parts, each of which looks a lot like a ref-store
itself:

* A loose reference ref-store for the main repo
* A loose reference ref-store for the per-worktree references
* A ref_cache in front of the two loose reference stores
* A packed ref-store
* A second ref_cache in front of the packed ref-store

But these ref-stores are currently coupled very tightly and have
peculiarities:

* The caching code is tightly coupled to the ref-store behind it.
* It is hard to imagine a packed ref-store that doesn't have some kind
of caching in front of it.
* There are tricky ordering constraints between writes to packed and
loose references to avoid races.
* The packed ref-store can't store symbolic refs, nor can it store
reflogs. It currently relies on the loose ref-store for those things.
* There is no packed-refs ref-store associated with per-worktree refs.
* Packed references are currently locked via `*.lock` files located near
the corresponding loose references.
* There are constraints that span ref-stores. For example, you aren't
allowed to create a packed ref that D/F conflicts with a loose ref or
vice versa.
* Symrefs, which are loose, can point at packed references.
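
To spell out the D/F (directory/file) constraint: two refnames conflict
when one of them would have to be a directory on the path to the other,
as with `refs/heads/foo` and `refs/heads/foo/bar`. Because one name may
be packed and the other loose, the check has to look across both
stores. A minimal sketch of the name comparison, with function names
made up for this example:

```c
#include <string.h>

/* Return 1 if path lives under dir, i.e. path is dir + "/" + more. */
static int is_dir_prefix(const char *dir, const char *path)
{
	size_t len = strlen(dir);

	return !strncmp(dir, path, len) && path[len] == '/';
}

/*
 * Hypothetical helper (not git's real function): return 1 if the two
 * refnames cannot coexist because one would need to be both a file
 * and a directory.
 */
int refnames_df_conflict(const char *a, const char *b)
{
	return is_dir_prefix(a, b) || is_dir_prefix(b, a);
}
```

Note that `refs/heads/foo` and `refs/heads/foobar` do not conflict; the
shared prefix has to end at a `/` boundary.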

I've taken some stabs at picking these apart into separate ref stores,
but haven't had time to make very satisfying progress. By the time of
GitMerge I might have a better feeling for whether I can devote some
time to this project.

Michael

[1] https://github.com/mhagger/git