On 05/03/2017 09:38 AM, Jeff Hostetler wrote:
> On 3/8/2017 1:50 PM, git@xxxxxxxxxxxxxxxxx wrote:
>> From: Jeff Hostetler <jeffhost@xxxxxxxxxxxxx>
>>
>> [RFC] Partial Clone and Fetch
>> =============================
>> [...]
>>
>> E. Unresolved Thoughts
>> ======================
>>
>> *TODO* The server should optionally return (in a side-band?) a list
>> of the blobs that it omitted from the packfile (and possibly the sizes
>> or sha1_object_info() data for them) during the fetch-pack/upload-pack
>> operation. This would allow the client to distinguish between invalid
>> SHAs and missing ones. Size information might also allow the client
>> to choose between various servers.
> Since I first posted this, Jonathan Tan has started a related
> discussion on missing blob support.
>
> https://public-inbox.org/git/CAGf8dgK05+f4uX-8+iMFvQd0n2JP6YxJ18ag8uDaEH6qc6SgVQ@xxxxxxxxxxxxxx/T/
>
> I want to respond to both of these threads here.
> -------------------------------------------------
Thanks for your input. I see that you have explained both "storing
'positive' information about missing blobs" and "what to store with
that positive information"; I'll just comment on the former for now.
> Missing-Blob Support
> ====================
>
> Let me offer up an alternative idea for representing
> missing blobs. This differs from both of our previous
> proposals. (I don't have any code for this new proposal;
> I just want to think out loud a bit and see if this is a
> direction worth pursuing -- or a complete non-starter.)
>
> Both proposals talk about detecting and adapting to a missing
> blob and ways to recover when we fail to find a blob.
>
> Comments on the thread asked about:
> () being able to detect missing blobs vs corrupt repos
> () being unable to detect duplicate blobs
> () the expense of blob searches.
>
> Suppose we store "positive" information about missing blobs?
> This would let us know that a blob is intentionally missing,
> and possibly some meta-data about it.
I thought about this (see "Some alternative designs" in [1]), listing
some similar benefits, but concluded that "it is difficult to scale to
large repos".
Firstly, to be clear, by large repos I meant (and mean) the svn-style
"monorepos" that Jonathan Nieder mentions as use case "A" [2].
My concern is that such lists (whether in separate file(s) or in .idx
files) would be too unwieldy to manipulate. Even if we design things to
avoid modifying such lists (for example, by adding a new list whenever
we fetch instead of trying to modify an existing one), we would at least
need to sort their contents (for example, when generating an .idx in the
first place). For a repo with 10M-100M blobs [3], this might be doable
on today's computers, but I would be concerned if a repo would exceed
such numbers.
[1] <20170426221346.25337-1-jonathantanmy@xxxxxxxxxx>
[2] <20170503182725.GC28740@xxxxxxxxxxxxxxxxxxxxxxxxx>
[3] In Microsoft's announcement of Git Virtual File System [4], they
mentioned "over 3.5 million files" in the Windows codebase. I'm not sure
if this refers to files in a snapshot (that is, working copy) or all
historical versions.
[4] https://blogs.msdn.microsoft.com/visualstudioalm/2017/02/03/announcing-gvfs-git-virtual-file-system/
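To put rough numbers on the concern above, here is a quick back-of-the-envelope calculation (the blob counts are the hypothetical 10M-100M figures from [3]; the arithmetic is mine):

```python
# Raw storage needed for a sorted list of 20-byte SHA-1 hashes at the
# repo scales discussed above (illustrative arithmetic only).
SHA1_BYTES = 20

for blobs in (10_000_000, 100_000_000):
    mib = blobs * SHA1_BYTES / (1024 * 1024)
    print(f"{blobs:>11,} blobs -> about {mib:,.0f} MiB of raw hashes")
```

Sorting and merging files of this size is feasible on today's machines, but no longer trivial, which is the core of the scalability worry.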
> 1. Suppose we update the .pack file format slightly.
>    () We use the value 5 in "enum object_type" to mean a
>       "missing-blob".
>    () We update git-pack-objects as I did in my RFC, but have it
>       create type 5 entries for the blobs that are omitted,
>       rather than nothing.
>    () Hopefully, the same logic that currently keeps pack-objects
>       from sending unnecessary blobs on subsequent fetches can
>       also be used to keep it from sending unnecessary missing-blob
>       entries.
>    () The type 5 missing-blob entry would contain the SHA-1 of the
>       blob and some meta-data to be explained later.
My original idea was to have sorted list(s) of hashes in separate
file(s) much like the currently existing shallow file; it would have the
semantics of "a hash here might be present or absent; if it is absent,
use the hook". (Initially I thought that one list would be sufficient,
but after reading your idea and considering it some more, multiple lists
might be better.)
Your idea of storing them in an .idx (and possibly corresponding .pack
file) is similar, I think. Although mine is probably simpler - at least,
we wouldn't need a new object_type.
As described above, I don't think this list-of-hashes idea will work
(because of the large numbers of blobs involved), but I'll compare it to
yours anyway just in case we end up being convinced that this general
idea works.
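For concreteness, a minimal sketch of the list-of-hashes idea as described above (all names are made up for illustration; the real lists would live in sorted on-disk files alongside the shallow file):

```python
import bisect

class MissingBlobList:
    """A sorted list of hashes known to be intentionally omitted."""

    def __init__(self, hashes):
        # On disk this would already be sorted; sorting here keeps the
        # sketch self-contained.
        self.hashes = sorted(hashes)

    def __contains__(self, sha1):
        # Binary search: membership in O(log n) comparisons.
        i = bisect.bisect_left(self.hashes, sha1)
        return i < len(self.hashes) and self.hashes[i] == sha1

# Each fetch could add a new list rather than rewriting an old one,
# so a repo would consult several such lists in turn.
```

A hash found in such a list carries the semantics given earlier: it might be present or absent, and if absent, the hook is used.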
> 2. Make a similar change in the .idx format and git-index-pack
>    to include them there. Then blob lookup operations could
>    definitively determine that a blob exists and is just not
>    present locally.
>
> 3. With this, packfile-based blob-lookup operations can get a
>    "missing-blob" result.
>    () It should be possible to short-cut searching in other
>       packfiles (because we don't have to assume that the blob
>       was just misplaced in another packfile).
>    () Lookup can still look for the corresponding loose blob
>       (in case a previous lookup already "faulted it in").
The binary search to lookup a packfile offset from a .idx file (which
involves disk reads) would take longer for all lookups (not just lookups
for missing blobs) - I think I prefer keeping the lists separate, to
avoid pessimizing the (likely) usual case where the relevant blobs are
all already in local repo storage.
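To illustrate the trade-off (my own purely illustrative figures): a binary search probes roughly log2(N) entries, so folding missing-blob entries into the same .idx lengthens every lookup, not just the lookups that end in a miss:

```python
import math

def probes(n_entries):
    # Approximate number of comparisons for a binary search
    # over n_entries sorted keys.
    return math.ceil(math.log2(n_entries))

packed = 1_000_000            # objects actually present in the pack
missing = 50_000_000          # hypothetical missing-blob entries
print(probes(packed))             # search over present objects only
print(probes(packed + missing))   # search if both share one .idx
```

With separate lists, the common all-blobs-present case pays only the first, smaller cost.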
> 4. We can then think about dynamically fetching it.
>    () Several techniques for this are currently being
>       discussed on the mailing list in other threads,
>       so I won't go into this here.
>    () There has also been debate about whether this should
>       yield a loose blob or a new packfile. I think both
>       forms have merit; the choice depends on whether we are
>       limited to asking for a single blob or can make a batch
>       request.
>    () A dynamically-fetched loose blob is placed in the normal
>       loose blob directory hierarchy so that subsequent
>       lookups can find it as mentioned above.
>    () A dynamically-fetched packfile (with one or more blobs)
>       is written to the ODB and then the lookup operation
>       completes.
>       {} I want to isolate these packfiles from the main
>          packfiles, so that they behave like a second-stage
>          lookup and don't affect the caching/LRU nature of
>          the existing first-stage packfile lookup.
>       {} I also don't want the ambiguity of having 2 primary
>          packfiles with a blob marked as missing in 1 and
>          present in the other.
With my idea, the second-stage lookup is done on the list of missing
hashes; there is no division between packfiles.
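A sketch of the lookup order under this separate-list scheme (hypothetical helper names, not git's actual code paths):

```python
def lookup(sha1, packs, loose, missing_lists, fetch_hook):
    # First stage: the normal object stores, unchanged and unpenalized.
    for pack in packs:
        obj = pack.get(sha1)
        if obj is not None:
            return obj
    if sha1 in loose:
        return loose[sha1]
    # Second stage: only on a miss do we consult the missing-hash lists.
    for mlist in missing_lists:
        if sha1 in mlist:
            return fetch_hook(sha1)   # dynamically fetch, e.g. via a hook
    raise KeyError("object not found; repository may be corrupt")
```

A hit in the second stage tells us the blob is intentionally absent rather than a sign of corruption.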
> 5. git-repack should be updated to "do the right thing" and
>    squash missing-blob entries.
>
> 6. And etc.