On 15.06.2017 at 15:25, Jeff King wrote:
> On Thu, Jun 15, 2017 at 01:33:34PM +0200, René Scharfe wrote:
>>> I'm not really sure how, though, short of caching the directory
>>> contents. That opens up questions of whether and when to invalidate the
>>> cache. If the cache were _just_ about short hashes, it might be OK to
>>> just assume that it remains valid through the length of the program (so
>>> worst case, a simultaneous write might mean that we generate a sha1
>>> which just became ambiguous, but that's generally going to be racy
>>> anyway).
>>>
>>> The other downside of course is that we'd spend RAM on it. We could
>>> bound the size of the cache, I suppose.
>>
>> You mean like an in-memory pack index for loose objects? In order to
>> avoid the readdir call in sha1_name.c::find_short_object_filename()?
>> We'd only need to keep the hashes of found objects. An oid_array
>> would be quite compact.
>
> Yes, that's what I was thinking of.

An oid_array spends ca. 30 bytes per entry (20 bytes for a hash times a
factor of 1.5 from alloc_nr). How many loose objects do we have to
expect? 30 MB for a million of them doesn't sound too bad, but can
there realistically be orders of magnitude more?

>> Non-racy writes inside a process should not be ignored (write, read
>> later) -- e.g. checkout does something like that.
>
> Ideally, yes. Though thinking on this more, it's racy even today,
> because we rely on the in-memory packed_git list. We usually re-check
> the directory only when we look for an object and it's missing. So any
> new packs which have been added while the process runs will be missed
> when doing short-sha1 lookups (and actually, find_short_packed_object
> does not even seem to do the usual retry-on-miss).
>
> A normal process like "checkout" isn't writing new packs, but a
> simultaneous repack could be migrating its new objects to a pack behind
> its back. (It also _can_ write packs, but only for large blobs).
>
> Given that we are talking about finding "unique" abbreviations here,
> and that those abbreviations can become invalidated immediately anyway
> as more objects are added (even by the same process), I'm not sure we
> need to strive for absolute accuracy.

Agreed. And processes that add objects and use them later probably use
full hashes (at least checkout does).

So here's a patch for caching loose object hashes in an oid_array
without a way to invalidate or release the cache. It uses a single
oid_array for simplicity, but reads each subdirectory individually and
remembers that fact using a bitmap.
---
 cache.h     |  4 ++++
 sha1_name.c | 69 +++++++++++++++++++++++++++++++++++++++++++------------------
 2 files changed, 53 insertions(+), 20 deletions(-)

diff --git a/cache.h b/cache.h
index 4d92aae0e8..8c6748907b 100644
--- a/cache.h
+++ b/cache.h
@@ -11,6 +11,7 @@
 #include "string-list.h"
 #include "pack-revindex.h"
 #include "hash.h"
+#include "sha1-array.h"
 
 #ifndef platform_SHA_CTX
 /*
@@ -1586,6 +1587,9 @@ extern struct alternate_object_database {
 	struct strbuf scratch;
 	size_t base_len;
 
+	uint32_t loose_objects_subdir_bitmap[8];
+	struct oid_array loose_objects_cache;
+
 	char path[FLEX_ARRAY];
 } *alt_odb_list;
 extern void prepare_alt_odb(void);
diff --git a/sha1_name.c b/sha1_name.c
index 5126853bb5..ae6a5c3abe 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -77,10 +77,46 @@ static void update_candidates(struct disambiguate_state *ds, const struct object
 	/* otherwise, current can be discarded and candidate is still good */
 }
 
+static void read_loose_object_subdir(struct alternate_object_database *alt,
+				     int subdir_nr)
+{
+	struct strbuf *buf;
+	char hex[GIT_MAX_HEXSZ];
+	DIR *dir;
+	struct dirent *de;
+	size_t bitmap_index = subdir_nr / 32;
+	uint32_t bitmap_mask = 1 << (subdir_nr % 32);
+
+	if (alt->loose_objects_subdir_bitmap[bitmap_index] & bitmap_mask)
+		return;
+	alt->loose_objects_subdir_bitmap[bitmap_index] |= bitmap_mask;
+
+	buf = alt_scratch_buf(alt);
+	strbuf_addf(buf, "%02x/", subdir_nr);
+	xsnprintf(hex, sizeof(hex), "%02x", subdir_nr);
+
+	dir = opendir(buf->buf);
+	if (!dir)
+		return;
+
+	while ((de = readdir(dir)) != NULL) {
+		struct object_id oid;
+
+		if (strlen(de->d_name) != GIT_SHA1_HEXSZ - 2)
+			continue;
+		memcpy(hex + 2, de->d_name, GIT_SHA1_HEXSZ - 2);
+		if (!get_oid_hex(hex, &oid))
+			oid_array_append(&alt->loose_objects_cache, &oid);
+	}
+	closedir(dir);
+}
+
+static int match_sha(unsigned, const unsigned char *, const unsigned char *);
+
 static void find_short_object_filename(struct disambiguate_state *ds)
 {
+	int subdir_nr = ds->bin_pfx.hash[0];
 	struct alternate_object_database *alt;
-	char hex[GIT_MAX_HEXSZ];
 	static struct alternate_object_database *fakeent;
 
 	if (!fakeent) {
@@ -95,29 +131,22 @@ static void find_short_object_filename(struct disambiguate_state *ds)
 	}
 	fakeent->next = alt_odb_list;
 
-	xsnprintf(hex, sizeof(hex), "%.2s", ds->hex_pfx);
 	for (alt = fakeent; alt && !ds->ambiguous; alt = alt->next) {
-		struct strbuf *buf = alt_scratch_buf(alt);
-		struct dirent *de;
-		DIR *dir;
-
-		strbuf_addf(buf, "%.2s/", ds->hex_pfx);
-		dir = opendir(buf->buf);
-		if (!dir)
-			continue;
+		int pos;
 
-		while (!ds->ambiguous && (de = readdir(dir)) != NULL) {
-			struct object_id oid;
+		read_loose_object_subdir(alt, subdir_nr);
 
-			if (strlen(de->d_name) != GIT_SHA1_HEXSZ - 2)
-				continue;
-			if (memcmp(de->d_name, ds->hex_pfx + 2, ds->len - 2))
-				continue;
-			memcpy(hex + 2, de->d_name, GIT_SHA1_HEXSZ - 2);
-			if (!get_oid_hex(hex, &oid))
-				update_candidates(ds, &oid);
+		pos = oid_array_lookup(&alt->loose_objects_cache, &ds->bin_pfx);
+		if (pos < 0)
+			pos = -1 - pos;
+		while (!ds->ambiguous && pos < alt->loose_objects_cache.nr) {
+			const struct object_id *oid;
+			oid = alt->loose_objects_cache.oid + pos;
+			if (!match_sha(ds->len, ds->bin_pfx.hash, oid->hash))
+				break;
+			update_candidates(ds, oid);
+			pos++;
 		}
-		closedir(dir);
 	}
 }
-- 
2.13.0