On Fri, 21 Jul 2017 12:24:52 -0400 Ben Peart <peartben@xxxxxxxxx> wrote: > Today we have 3.5 million objects * 30 bytes per entry = 105 MB of > promises. Given the average developer only hydrates 56K files (2 MB > promises) that is 103 MB to download that no one will ever need. We > would like to avoid that if possible as this would be a significant > regression in clone times from where we are today. > > I'm also concerned about the performance of merging in promises given we > have 100M objects today and growing so the number of promises over time > could get pretty large. After some thought, maybe a hybrid solution is best, in which it is permissible but optional for some missing objects to have promises. In that case, it is more of a "size cache" (which stores the type as well) rather than a true promise. When fetching, the client can optionally request for the sizes and types of missing objects. This is good for the large-blob case, in which we can always have size information of missing blobs, and we can subsequently add blob-size filtering (as a parameter) to "git log -S" and friends to avoid needing to resolve a missing object. And this is, as far as I can tell, also good for the many-blob case - just have an empty size cache all the time. (And in the future, use cases could come up that desire non-empty but non-comprehensive caches - for example, a directory lister working on a partial clone that only needs to cache the sizes of frequently accessed directories.) Another option is to have a repo-wide option that toggles between mandatory entries in the "size cache" and prohibited entries. Switching to mandatory provides stricter fsck and negative lookups, but I think it's not worth it for both the developers and users of Git to have to know about these two modes. > >> I think we should have a flag (off by default) that enables someone to > >> say that promised objects are optional. If the flag is set, > >> "is_promised_object" will return success and pass the OBJ_ANY type and a > >> size of -1. > >> > >> Nothing today is using the size and in the two places where the object > >> type is being checked for consistency (fsck_cache_tree and > >> fsck_handle_ref) the test can add a test for OBJ_ANY as well. > >> > >> This will enable very large numbers of objects to be omitted from the > >> clone without triggering a download of the corresponding number of > >> promised objects. > > > > Eventually I plan to use the size when implementing parameters for > > history-searching commands (e.g. "git log -S"), but it's true that > > that's in the future. > > > > Allowing promised objects to be optional would indeed solve the issue of > > downloading too many promises. It would make the code more complicated, > > but I'm not sure by how much. > > > > For example, in this fsck patch, the easiest way I could think of to > > have promised objects was to introduce a 3rd state, called "promised", > > of "struct object" - one in which the type is known, but we don't have > > access to the full "struct commit" or equivalent. And thus fsck could > > assume that if the "struct object" is "parsed" or "promised", the type > > is known. Having optional promised objects would require that we let > > this "promised" state have a type of OBJ_UNKNOWN (or something like > > that) - maybe that would be fine, but I haven't looked into this in > > detail. > > > > Caveats apply as I only did a quick look but I only found the two > locations that were checking the object type for consistency. I haven't looked into detail, but you are probably right.