Hi Matheus,

Thanks for the quick response.

On Thu, Sep 24, 2020 at 10:53 PM Matheus Tavares
<matheus.bernardino@xxxxxx> wrote:
>
> Hi, Phil
>
> Thanks for the bug report and for the crash logs.
>
> On Thu, Sep 24, 2020 at 11:36 PM Phil Hord <phil.hord@xxxxxxxxx> wrote:
> >
> > Hi list.  It's been a while.
> >
> > I can reliably reproduce a segfault in git grep when searching the
> > object DB.  It is caused by a race when threads > 2.
> >
> > I have a patch that fixes the problem, but I don't know exactly why.
> > Can someone else explain it and/or offer a better solution?
> >
> > ====
> >
> > diff --git packfile.c packfile.c
> > index e69012e7f2..52b7b54aeb 100644
> > --- packfile.c
> > +++ packfile.c
> > @@ -1812,7 +1812,9 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
> >  			if (!base)
> >  				continue;
> >
> > +			obj_read_lock();
> >  			delta_data = unpack_compressed_entry(p, &w_curs,
> >  							     curpos, delta_size);
> > +			obj_read_unlock();
> >
> >  			if (!delta_data) {
> >  				error("failed to unpack compressed delta "
> >
> > ====
>
> Hmm, obj_read_mutex is a recursive lock, so by adding the
> obj_read_lock() call here, we are incrementing the lock value to [at
> least] 2 (because oid_object_info_extended() had already incremented it
> once). Therefore, when unpack_compressed_entry() calls obj_read_unlock()
> before inflating the entry, the lock is not actually released, only
> decremented. So the effect of this change is that phase III of
> unpack_entry() is performed entirely without releasing the lock.

Ah, there's the piece I was missing.  I didn't realize we temporarily
drop the lock inside unpack_compressed_entry().  And obviously I also
didn't notice that we were already holding the lock before entering
unpack_entry().  Thanks for the detailed explanation.

> Your crash log shows us that the segfault happened when trying to
> memcpy() the string `base` (in unpack_entry()).
> I.e., the same string
> that we had previously added to the delta base cache, right before
> calling unpack_compressed_entry() and releasing the lock. The
> problematic sequence is:
>
>   add `base` to the cache -> release lock -> inflate ->
>   acquire lock -> try to use `base`
>
> Maybe, when a thread X releases the lock to perform decompression,
> thread Y acquires the lock and tries to add another base to the cache.
> The cache is full, so Y has to remove entries, and it ends up free()'ing
> the base that was just inserted by X. Later, X tries to dereference
> `base` in patch_entry(), which would cause a segfault. It would also
> explain why your change solves the segfault.
>
> I'm not sure, though, why this entry would be removed, since it was just
> added... Maybe Y was adding a huge base to the cache, which required
> removing all entries?

I think your theory is correct here on both counts.  The repo I am
searching has some sinfully large objects, and when there is a cache
size limit, they are likely to exceed it.

> Anyways, you mentioned you can reproduce the failure quite reliably in
> your repo with 20 threads, right? Could you please check with the
> following patch applied? It adds a copy of `base` to the cache instead
> of the original string, allowing us to keep running decompression in
> parallel but without the risk of having another thread free()'ing
> `base` in the meantime.

Yes, your patch reliably solves the problem, too.  The performance is
only slightly improved over holding the lock during inflate in my case,
but it does save a bit of time.

Surprisingly, performance seems to peak between 5 and 10 threads for me
in both cases.  It reliably gets slightly worse at 20 threads and at 40
threads.

In all cases when threads > 2, this search appears to be 10%-20% slower
than running without this change.  Even when threads=1, though, the
result is about 10% slower.  Perhaps it is worth avoiding the base_dup
when obj_read_use_lock is false.
> diff --git a/packfile.c b/packfile.c
> index 9ef27508f2..79f83b6034 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -1772,14 +1772,15 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
>  	while (delta_stack_nr) {
>  		void *delta_data;
>  		void *base = data;
> -		void *external_base = NULL;
>  		unsigned long delta_size, base_size = size;
>  		int i;
>
>  		data = NULL;
>
> -		if (base)
> -			add_delta_base_cache(p, obj_offset, base, base_size, type);
> +		if (base) {
> +			char *base_dup = memcpy(xmalloc(base_size), base, base_size);
> +			add_delta_base_cache(p, obj_offset, base_dup, base_size, type);
> +		}
>
>  		if (!base) {
>  			/*
> @@ -1799,7 +1800,6 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
>  					      p->pack_name);
>  				mark_bad_packed_object(p, base_oid.hash);
>  				base = read_object(r, &base_oid, &type, &base_size);
> -				external_base = base;
>  			}
>  		}
>
> @@ -1818,7 +1818,7 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
>  			      "at offset %"PRIuMAX" from %s",
>  			      (uintmax_t)curpos, p->pack_name);
>  			data = NULL;
> -			free(external_base);
> +			free(base);
>  			continue;
>  		}
>
> @@ -1838,7 +1838,7 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
>  			error("failed to apply delta");
>
>  		free(delta_data);
> -		free(external_base);
> +		free(base);
>  	}
>
>  	if (final_type)
--