On Fri, Oct 26 2018, Junio C Hamano wrote: > "Jansen, Geert" <gerardu@xxxxxxxxxx> writes: > >> The index-pack command determines if a sha1 collision test is needed by >> checking the existence of a loose object with the given hash. In my tests, I >> can improve performance of “git clone” on Amazon EFS by 8x when used with a >> non-default mount option (lookupcache=pos) that's required for a Gitlab HA >> setup. My assumption is that this check is unnecessary when cloning into a new >> repository because the repository will be empty. > > My knee-jerk reaction is that your insight that we can skip the "dup > check" when starting from emptiness is probably correct, but your > use of .cloning flag as an approximation of "are we starting from > emptiness?" is probably wrong, at least for two reasons. > > - "git clone --reference=..." does not strictly start from > emptiness, and would still have to make sure that incoming pack > does not try to inject an object with different contents but with > the same name. > > - "git init && git fetch ..." starts from emptiness and would want > to benefit from the same optimization as you are implementing > here. > > As to the implementation, I think the patch adjusts the right "if()" > condition to skip the collision test here. > >> - if (startup_info->have_repository) { >> + if (startup_info->have_repository && !cloning) { >> read_lock(); >> collision_test_needed = >> has_sha1_file_with_flags(oid->hash, OBJECT_INFO_QUICK); > > I just do not think !cloning is quite correct. Geert: Thanks for working on this. A GitLab instance I'm involved in managing at work has suffered from this issue, e.g. with "fork" being a "clone" under the hood on GitLab, and taking ages on the NetApp NFS filer due to this issue, so I'm very interested in this moving forward. But as Junio notes the devil's in the details, another one I thought of is: GIT_OBJECT_DIRECTORY=/some/other/repository git clone ... It seems to me that it's better to stick this into setup_git_directory_gently() in setup.c and check various edge cases there. I.e. make this a new "have_object_already" member of the startup_info struct. That would be set depending on whether we find objects/packs in the objects dir, and would know about GIT_OBJECT_DIRECTORY (either just punting, or looking at those too). It would also need to know about read_info_alternates(), depending on when that's checked it would handle git clone --reference too since it just sets it up via add_to_alternates_file(). Then you'd change sha1_object() to just check startup_info->have_objects_already instead of startup_info->have_repository, and since we'd checked "did we have objects already?" it would work for the init && fetch case Junio mentioned. It would also be very nice to have a test for this, even if it's something OS-specific that only works on Linux after we probe for strace(1). Testing (without your patch, because git-am barfs on it , seems to dislake the base64 encoding): rm -rf /tmp/df; strace -f git clone --bare git@xxxxxxxxxx:git/git.git /tmp/df 2>&1 | grep '".*ENOENT' 2>&1|perl -pe 's/^.*?"([^"]+)".*/$1/'|sort|uniq -c|sort -nr|less I see we also check if packed-refs exists ~2800 times, and check for each ref we find on the remote. Those are obviously less of a performance issue when cloning in the common case, but perhaps another place where we can insert a "don't check, we don't have anything already" condition.