Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes: > In order to determine the code changes in sha1_file.c necessary, I > investigated the following: > (1) functions in sha1_file that take in a hash, without the user > regarding how the object is stored (loose or packed) > (2) functions in sha1_file that operate on packed objects (because I > need to check callers that know about the loose/packed distinction > and operate on both differently, and ensure that they can handle > the concept of objects that are neither loose nor packed) > > For (1), I looked through all non-static functions in sha1_file.c that > take in an unsigned char * parameter. The ones that are relevant, and my > modifications to them to resolve this problem, are: > - sha1_object_info_extended (fixed in this commit) > - sha1_object_info (auto-fixed by sha1_object_info_extended) > - read_sha1_file_extended (fixed by fixing read_object) > - read_object_with_reference (auto-fixed by read_sha1_file_extended) > - force_object_loose (only called from builtin/pack-objects.c, which > already knows that at least one pack contains this object) > - has_sha1_file_with_flags (fixed in this commit) > - assert_sha1_type (auto-fixed by sha1_object_info) > > As described in the list above, several changes have been included in > this commit to fix the necessary functions. > > For (2), I looked through the same functions as in (1) and also > for_each_packed_object. The ones that are relevant are: > - parse_pack_index > - http - indirectly from http_get_info_packs > - find_pack_entry_one > - this searches a single pack that is provided as an argument; the > caller already knows (through other means) that the sought object > is in a specific pack > - find_sha1_pack > - fast-import - appears to be an optimization to not store a > file if it is already in a pack > - http-walker - to search through a struct alt_base > - http-push - to search through remote packs > - has_sha1_pack > - builtin/fsck - fixed in this commit > - builtin/count-objects - informational purposes only (check if loose > object is also packed) > - builtin/prune-packed - check if object to be pruned is packed (if > not, don't prune it) > - revision - used to exclude packed objects if requested by user > - diff - just for optimization > - for_each_packed_object > - reachable - only to find recent objects > - builtin/fsck - fixed in this commit > - builtin/cat-file - see below > > As described in the list above, builtin/fsck has been updated. I have > left builtin/cat-file alone; this means that cat-file > --batch-all-objects will only operate on objects physically in the repo. One thing I wonder is what the performance impact of a change like this to the codepath that wants to see if an object does _not_ exist in the repository. When creating a new object by hashing raw data, we see if an object with the same name already exists before writing the compressed loose object out (or comparing the payload to detect hash collision). With a "missing blob" support, we'd essentially spawn an extra process every time we want to create a new blob locally, and most of the time that is done only to hear the external command to say "no, we've never heard of such an object", with a possibly large latency. If we do not have to worry about that (or if it is no use to worry about it, because we cannot avoid it if we wanted to do the lazy loading of objects from elsewhere), then the patch presented here looked like a sensible first step towards the stated goal. Thanks.