After some discussion in [1] (in particular, about preserving the
functionality of the connectivity check as much as possible) and some
in-office discussion, here's an updated design.

Overview
========

This is an update of the design in [1].

The main difference between this and other related work [1] [2] [3] is
that we can still check connectivity between locally-created objects
without having to consult a remote server for any information.

In addition, the object loader writes to an incomplete packfile. This
(i) ensures that Git has immediate access to the object, (ii) ensures
that not too many files are written during a single Git invocation, and
(iii) prevents some unnecessary copies (compared to, for example,
transmitting entire objects through the protocol).

Local repo layout
=================

Objects in the local repo are further divided into "homegrown" and
"imported" objects.

"Imported" objects must be in a packfile that has a "<pack name>.remote"
file with arbitrary text (similar to the ".keep" file). They come from
clones, fetches, and the object loader (see below).

"Homegrown" objects are every other object.

Object loader
=============

The object loader is a process that can obtain objects from elsewhere,
given their hashes, and write their packed representation to a
client-given file.

The first time a missing object is needed during an invocation of Git,
Git creates a temporary packfile and writes the header with a
placeholder number of objects. Then, it starts the object loader,
passing in the name of that temporary packfile.

Whenever a missing object is needed, Git sends the hash of the missing
object and expects the loader to append (with O_APPEND) the object to
that packfile. Git keeps track of the object offsets as it goes, and Git
can use the contents of that incomplete packfile. This is similar to
what "git fast-import" does.

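As a rough illustration of the temporary packfile and the offset
bookkeeping, here is a minimal Python sketch (not Git's actual code).
The 12-byte header layout ("PACK", version 2, object count, all in
network byte order) is the standard pack format; the placeholder value,
the helper names, and the "loader" callback are assumptions made up for
this example.

```python
import os
import struct
import tempfile

# Assumption: the design does not say which placeholder value is used.
PLACEHOLDER_COUNT = 0


def start_temp_pack(directory):
    """Create a temporary packfile that holds only the 12-byte header."""
    fd, path = tempfile.mkstemp(suffix=".pack", dir=directory)
    with os.fdopen(fd, "wb") as f:
        # Signature "PACK", version 2, object count (network byte order).
        f.write(struct.pack(">4sII", b"PACK", 2, PLACEHOLDER_COUNT))
    return path


def fetch_missing(path, offsets, oid, loader):
    """Record where the next object will start, then have it appended."""
    # The loader only ever appends, so the current file size is the
    # offset at which the requested object will begin.
    offsets[oid] = os.path.getsize(path)
    # Hypothetical loader interface: it is expected to open `path` with
    # O_APPEND and write the packed representation of `oid`.
    loader(path, oid)
```

Here "offsets" is an ordinary dict from object ID to byte offset, which
is the information a later .idx file would need.
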
When Git exits, it writes the number of objects in the header, writes
the packfile checksum, moves the packfile to its final location, and
writes a .idx and a .remote file.

Connectivity check
==================

An object walk is performed as usual from the tips (see the
documentation for fsck etc. for which tips they use).

A "homegrown" object is valid if each object it references:
 1. is a "homegrown" object,
 2. is an "imported" object, or
 3. is referenced by an "imported" object.

The references of an "imported" object are not checked.

Performance notes
-----------------

Because of rule 3 above, iteration through every "imported" object (or,
at least, every "imported" object of a certain type) is sometimes
required.

For fsck, this should be fine because (i) this is not a regression,
since currently all objects must be iterated through anyway, and (ii)
fsck prioritizes correctness over speed.

For fetch, the speed of the connectivity check is immaterial; the
connectivity check no longer needs to be performed because all objects
obtained from the remote are, by definition, "imported" objects.

There might be connectivity checks run during other commands like
"receive-pack". I don't expect partial clones to use these often. These
commands will still work, but their performance is a secondary concern
in this design.

Impact on other tools
=====================

"git gc" must not do anything to an "imported" object, even an
unreachable one, without first ensuring that the connectivity check will
succeed in that object's absence. (Pay special attention to rule 3 under
"Connectivity check".)

If this design stands, the initial patch set will probably have "git gc"
not touch "imported" packs at all, trivially satisfying the above.

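To make the rule 3 hazard for "git gc" concrete, here is a minimal
Python sketch of the validity rule from "Connectivity check". It works
on a toy in-memory model: the "refs" mapping, the set arguments, and the
function name are all assumptions for illustration, not Git data
structures.

```python
def is_valid_homegrown(obj, refs, homegrown, imported):
    """Apply rules 1-3 to a single "homegrown" object.

    refs maps an object ID to the IDs it references (toy model).
    homegrown and imported are sets of object IDs present locally.
    """
    # Rule 3 needs everything referenced by any "imported" object,
    # which is why every "imported" object may have to be visited.
    referenced_by_imported = set()
    for imp in imported:
        referenced_by_imported.update(refs.get(imp, ()))

    return all(
        ref in homegrown                  # rule 1
        or ref in imported                # rule 2
        or ref in referenced_by_imported  # rule 3
        for ref in refs.get(obj, ())
    )
```

For example, with refs = {"C": ["T"], "I": ["T"]}, a homegrown commit
"C" whose tree "T" is missing locally is valid as long as the imported
commit "I" is present, and stops being valid once "I" is removed from
the imported set. That is the situation "git gc" has to avoid creating.
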
In the future, "git gc" will either need to expel such objects into
loose objects (like what is currently done for normal packs), treating
them like a "homegrown" object (unreachable, so it won't interfere with
future connectivity checks), or delete them outright - but there may be
race conditions to think of.

"git repack" will need to differentiate between packs with ".remote" and
packs without.

[1] https://public-inbox.org/git/cover.1501532294.git.jonathantanmy@xxxxxxxxxx/
[2] https://public-inbox.org/git/20170714132651.170708-1-benpeart@xxxxxxxxxxxxx/
[3] https://public-inbox.org/git/20170803091926.1755-1-chriscool@xxxxxxxxxxxxx/