On 3/8/2017 1:50 PM, git@xxxxxxxxxxxxxxxxx wrote:
> From: Jeff Hostetler <jeffhost@xxxxxxxxxxxxx>
>
> [RFC] Partial Clone and Fetch
> =============================
> [...]
>
> E. Unresolved Thoughts
> ======================
>
> *TODO* The server should optionally return (in a side-band?) a
> list of the blobs that it omitted from the packfile (and possibly
> the sizes or sha1_object_info() data for them) during the
> fetch-pack/upload-pack operation. This would allow the client to
> distinguish between invalid SHAs and missing ones. Size
> information would allow the client to choose between various
> servers.

Since I first posted this, Jonathan Tan has started a related
discussion on missing blob support.

https://public-inbox.org/git/CAGf8dgK05+f4uX-8+iMFvQd0n2JP6YxJ18ag8uDaEH6qc6SgVQ@xxxxxxxxxxxxxx/T/

I want to respond to both of these threads here.

-------------------------------------------------
Missing-Blob Support
====================

Let me offer up an alternative idea for representing missing blobs;
it differs from both of our previous proposals. (I don't have any
code for this new proposal; I just want to think out loud a bit and
see if this is a direction worth pursuing -- or a complete
non-starter.)

Both proposals talk about detecting and adapting to a missing blob
and ways to recover when we fail to find a blob. Comments on the
thread asked about:

() being able to detect missing blobs vs corrupt repos
() being unable to detect duplicate blobs
() the expense of blob search

Suppose we store "positive" information about missing blobs? This
would let us know that a blob is intentionally missing and possibly
record some meta-data about it.

1. Suppose we update the .pack file format slightly.
   () We use the value 5 in "enum object_type" to mean a
      "missing-blob".
   () We update git-pack-objects as I did in my RFC, but have it
      create type 5 entries for the blobs that are omitted, rather
      than nothing.
   () Hopefully, the same logic that currently keeps pack-objects
      from sending unnecessary blobs on subsequent fetches can also
      be used to keep it from sending unnecessary missing-blob
      entries.
   () The type 5 missing-blob entry would contain the SHA-1 of the
      blob and some meta-data to be explained later.

2. Make a similar change in the .idx format and git-index-pack to
   include missing-blob entries there. Then blob lookup operations
   could definitively determine that a blob exists and is just not
   present locally.

3. With this, packfile-based blob-lookup operations can get a
   "missing-blob" result (see the lookup sketch just after this
   list).
   () It should be possible to short-cut searching in other
      packfiles (because we don't have to assume that the blob was
      just misplaced in another packfile).
   () Lookup can still look for the corresponding loose blob (in
      case a previous lookup already "faulted it in").

4. We can then think about dynamically fetching the missing blob.
   () Several techniques for this are currently being discussed on
      the mailing list in other threads, so I won't go into this
      here.
   () There has also been debate about whether this should yield a
      loose blob or a new packfile. I think both forms have merit,
      and the choice depends on whether we are limited to asking
      for a single blob or can make a batch request.
   () A dynamically-fetched loose blob is placed in the normal
      loose blob directory hierarchy so that subsequent lookups can
      find it, as mentioned above.
   () A dynamically-fetched packfile (with one or more blobs) is
      written to the ODB and then the lookup operation completes.
      {} I want to isolate these packfiles from the main packfiles,
         so that they behave like a second-stage lookup and don't
         affect the caching/LRU nature of the existing first-stage
         packfile lookup.
      {} I also don't want the ambiguity of having 2 primary
         packfiles with a blob marked as missing in one and present
         in the other.

5. git-repack should be updated to "do the right thing" and squash
   missing-blob entries.

6. And so on.
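To make steps 2-4 concrete, here is a minimal sketch of the lookup
flow in C. None of this is git's actual object-store API: the
helpers (find_pack_entry_type(), has_loose_object(),
dynamic_fetch_blob()) and the result enum are hypothetical
placeholders, and folding the dynamic fetch into the lookup itself
is just one possible arrangement.

    /* Distinguish "intentionally omitted" from "possibly corrupt". */
    enum lookup_result {
        LOOKUP_FOUND,        /* object data is available locally     */
        LOOKUP_MISSING_BLOB, /* type-5 entry: intentionally omitted  */
        LOOKUP_NOT_FOUND,    /* no entry at all: possible corruption */
    };

    /* Hypothetical helpers standing in for real object-store queries. */
    extern int find_pack_entry_type(const unsigned char *sha1); /* -1 if absent */
    extern int has_loose_object(const unsigned char *sha1);     /* 1 if present */
    extern int dynamic_fetch_blob(const unsigned char *sha1);   /* 0 on success */

    enum lookup_result lookup_blob(const unsigned char *sha1)
    {
        int type = find_pack_entry_type(sha1);

        if (type >= 0 && type != 5)
            return LOOKUP_FOUND;    /* ordinary packed object */

        if (type == 5) {
            /*
             * A type-5 entry says the blob is *intentionally*
             * missing, so we can skip scanning other primary
             * packfiles.  A prior dynamic fetch may already have
             * faulted it in as a loose blob, so check there before
             * going over the wire.
             */
            if (has_loose_object(sha1))
                return LOOKUP_FOUND;
            if (!dynamic_fetch_blob(sha1))
                return LOOKUP_FOUND;
            return LOOKUP_MISSING_BLOB;
        }

        /*
         * No pack entry at all: this cannot be explained as
         * "intentionally missing", so a miss here suggests a
         * corrupt or incomplete repository.
         */
        return has_loose_object(sha1) ? LOOKUP_FOUND : LOOKUP_NOT_FOUND;
    }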
Missing-Blob Entry Data
=======================

A missing-blob entry needs to contain the SHA-1 value of the blob
(obviously). Other fields are nice to have, but are not strictly
necessary. Here are a few fields to consider (a byte-layout sketch
follows the list).

A. The SHA-1 (20 bytes)

B. The raw size of the blob (5? bytes)
   () This is the cleaned size of the file as stored. The server
      does not (and should not) have any knowledge of the smudging
      that may happen.
   () This may be useful if whatever dynamic-fetch hook wants to
      customize its behavior, such as individually fetching large
      blobs and batch-fetching smaller ones from the same server.
   () GVFS found it necessary to create a custom server end-point
      to get blob size data so that "ls -l" could show file sizes
      for non-present virtualized files.
   () 5 bytes (uint:40) should be more than enough for this.

C. A server "hint" (20 bytes)
   () Instructions to help the client fetch the blob.
   () If I have multiple remotes configured, a missing blob should
      be fetched from the same server that created the missing-blob
      entry (since it may be the only one that has it).
   () If a blob is very large (and was omitted for that reason),
      the server may want to redirect the client to a
      geographically closer CDN.
   () This is the SHA-1 of a file in the repository containing a
      hook (or a set of parameters to be used by a hook).
      {} This is a bit of *hand-wave* right now, but the idea is
         that you can use the information here to individually
         fetch a blob or batch-fetch a set of blobs that have the
         same hint.
      {} Yes, there are security concerns here, so perhaps the hint
         file should just contain parameters for a stock
         git-fetch-pack or git-fetch-blob-pack or curl command (or
         wrapper script) that "does the right thing".
      {} I thought this would be more compact than listing detailed
         fetch data per blob. And we don't have to define yet
         another syntax. For example, we can let the SHA-1 point to
         an administrator-configured shell script and be done.
   () We assume that the SHA-1 file is present locally (not
      missing). This might refer to a pinned file in a special
      ".git*" file (that we never omit) in HEAD. Or it might be in
      a branch that all clients are assumed to have.
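Putting A, B, and C together gives a fixed 45-byte payload for a
type-5 entry. Here is a minimal sketch of that layout in C; the
struct, the function names, and the big-endian encoding of the
uint:40 size are all my assumptions rather than a proposed format.

    #include <stdint.h>
    #include <string.h>

    /*
     * Hypothetical payload of a type-5 "missing-blob" entry:
     * 20-byte blob SHA-1, 5-byte (uint:40) raw blob size, and the
     * 20-byte SHA-1 of the hint file -- 45 bytes on disk.
     */
    struct missing_blob_entry {
        unsigned char sha1[20];      /* A. SHA-1 of the omitted blob */
        uint64_t size;               /* B. raw (cleaned) size < 2^40 */
        unsigned char hint_sha1[20]; /* C. SHA-1 of the hint file    */
    };

    /* Encode an entry into its fixed 45-byte on-disk form. */
    static void encode_missing_blob(const struct missing_blob_entry *e,
                                    unsigned char buf[45])
    {
        int i;
        memcpy(buf, e->sha1, 20);
        for (i = 0; i < 5; i++)     /* size as big-endian uint:40 */
            buf[20 + i] = (e->size >> (8 * (4 - i))) & 0xff;
        memcpy(buf + 25, e->hint_sha1, 20);
    }

    /* Decode the 45-byte on-disk form back into the struct. */
    static void decode_missing_blob(const unsigned char buf[45],
                                    struct missing_blob_entry *e)
    {
        int i;
        memcpy(e->sha1, buf, 20);
        e->size = 0;
        for (i = 0; i < 5; i++)
            e->size = (e->size << 8) | buf[20 + i];
        memcpy(e->hint_sha1, buf + 25, 20);
    }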
Concluding Thoughts
===================

Combining the ideas here with the partial clone/fetch parameters
and the various blob back-filling proposals gives us the ability to
create and work with sparse repos.

() Filtering can be based upon blob size; this could be seen as an
   alternative solution to LFS for repos with large objects (see
   the sketch in the P.S. below).
() Filtering could also be based upon pathnames (such as a
   sparse-checkout filter) and greatly help performance on very
   large repos where developers only work with small areas of the
   tree.

Thanks
Jeff
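P.S. To make those two filtering criteria concrete, here is a
minimal sketch of a server-side omission predicate in C. The
struct, the names, and the parameters are illustrative assumptions
only, not part of any proposed interface.

    #include <stdint.h>
    #include <string.h>

    /* Illustrative filter settings: a size limit, a path prefix, or both. */
    struct omit_filter {
        uint64_t size_limit;     /* omit blobs >= this size; 0 = disabled */
        const char *path_prefix; /* keep only paths under this prefix;
                                    NULL = keep all paths                 */
    };

    /*
     * Return 1 if the blob at 'path' with raw size 'size' should be
     * omitted from the packfile (and replaced by a missing-blob
     * entry), 0 if it should be sent as usual.
     */
    static int should_omit_blob(const struct omit_filter *f,
                                const char *path, uint64_t size)
    {
        if (f->size_limit && size >= f->size_limit)
            return 1;   /* large-object filtering (LFS-like) */
        if (f->path_prefix &&
            strncmp(path, f->path_prefix, strlen(f->path_prefix)))
            return 1;   /* outside the client's sparse area  */
        return 0;
    }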