Patrick Steinhardt <ps@xxxxxx> writes: > On Thu, Aug 22, 2024 at 02:47:43AM -0700, karthik nayak wrote: >> Patrick Steinhardt <ps@xxxxxx> writes: >> >> > It was recently reported that concurrent reads and writes may cause the >> > reftable backend to segfault. The root cause of this is that we do not >> > properly keep track of reftable readers across reloads. >> > >> > Suppose that you have a reftable iterator and then decide to reload the >> > stack while iterating through the iterator. When the stack has been >> > rewritten since we have created the iterator, then we would end up >> > discarding a subset of readers that may still be in use by the iterator. >> > The consequence is that we now try to reference deallocated memory, >> > which of course segfaults. >> > >> > One way to trigger this is in t5616, where some background maintenance >> > jobs have been leaking from one test into another. This leads to stack >> > traces like the following one: >> > >> > + git -c protocol.version=0 -C pc1 fetch --filter=blob:limit=29999 --refetch origin >> > AddressSanitizer:DEADLYSIGNAL >> > ================================================================= >> > ==657994==ERROR: AddressSanitizer: SEGV on unknown address 0x7fa0f0ec6089 (pc 0x55f23e52ddf9 bp >> > 0x7ffe7bfa1700 sp 0x7ffe7bfa1700 T0) >> > ==657994==The signal is caused by a READ memory access. >> > #0 0x55f23e52ddf9 in get_var_int reftable/record.c:29 >> > #1 0x55f23e53295e in reftable_decode_keylen reftable/record.c:170 >> > #2 0x55f23e532cc0 in reftable_decode_key reftable/record.c:194 >> > #3 0x55f23e54e72e in block_iter_next reftable/block.c:398 >> > #4 0x55f23e5573dc in table_iter_next_in_block reftable/reader.c:240 >> > #5 0x55f23e5573dc in table_iter_next reftable/reader.c:355 >> > #6 0x55f23e5573dc in table_iter_next reftable/reader.c:339 >> > #7 0x55f23e551283 in merged_iter_advance_subiter reftable/merged.c:69 >> > #8 0x55f23e55169e in merged_iter_next_entry reftable/merged.c:123 >> > #9 0x55f23e55169e in merged_iter_next_void reftable/merged.c:172 >> > #10 0x55f23e537625 in reftable_iterator_next_ref reftable/generic.c:175 >> > #11 0x55f23e2cf9c6 in reftable_ref_iterator_advance refs/reftable-backend.c:464 >> > #12 0x55f23e2d996e in ref_iterator_advance refs/iterator.c:13 >> > #13 0x55f23e2d996e in do_for_each_ref_iterator refs/iterator.c:452 >> > #14 0x55f23dca6767 in get_ref_map builtin/fetch.c:623 >> > #15 0x55f23dca6767 in do_fetch builtin/fetch.c:1659 >> > #16 0x55f23dca6767 in fetch_one builtin/fetch.c:2133 >> > #17 0x55f23dca6767 in cmd_fetch builtin/fetch.c:2432 >> > #18 0x55f23dba7764 in run_builtin git.c:484 >> > #19 0x55f23dba7764 in handle_builtin git.c:741 >> > #20 0x55f23dbab61e in run_argv git.c:805 >> > #21 0x55f23dbab61e in cmd_main git.c:1000 >> > #22 0x55f23dba4781 in main common-main.c:64 >> > #23 0x7fa0f063fc89 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 >> > #24 0x7fa0f063fd44 in __libc_start_main_impl ../csu/libc-start.c:360 >> > #25 0x55f23dba6ad0 in _start (git+0xadfad0) (BuildId: 803b2b7f59beb03d7849fb8294a8e2145dd4aa27) >> > >> >> The stacktrace is for iterating over refs, what I don't understand is >> where in this flow do we actually reload the stack. > > Basically, whenever you call into the reftable backend we check whether > we need to reload the stack. So, when creating a reftable iterator, > reading a single ref, writing refs and so on. So in the above code flow > we had a ref iterator, but during iteration we ended up reading other > refs, as well. > Yes, makes sense. Thanks for clarifying. >> > While it is somewhat awkward that the maintenance processes survive >> > tests in the first place, it is totally expected that reftables should >> > work alright with concurrent writers. Seemingly they don't. >> > >> > The only underlying resource that we need to care about in this context >> > is the reftable reader, which is responsible for reading a single table >> > from disk. These readers get discarded immediately (unless reused) when >> > calling `reftable_stack_reload()`, which is wrong. We can only close >> > them once we know that there are no iterators using them anymore. >> > >> > Prepare for a fix by converting the reftable readers to be refcounted. >> > >> >> Okay so my understanding is that `refcounted` refers to a reference >> count which keeps tracks of the stacks which are referring to the >> reader. The name is also used in `struct blame_origin` in blame.{c,h}. >> Makes a lot more sense now :) > > Yup. > >> > diff --git a/reftable/reader.h b/reftable/reader.h >> > index 88b4f3b421..3710ee09b4 100644 >> > --- a/reftable/reader.h >> > +++ b/reftable/reader.h >> > @@ -50,6 +50,8 @@ struct reftable_reader { >> > struct reftable_reader_offsets ref_offsets; >> > struct reftable_reader_offsets obj_offsets; >> > struct reftable_reader_offsets log_offsets; >> > + >> > + uint64_t refcount; >> >> Wonder if there is a chance that we decrement refcount from 0 and hence >> cause a wraparound. > > This should never happen in practice. And if it does, we would hit a > BUG(): > > void reftable_reader_decref(struct reftable_reader *r) > { > if (!r) > return; > if (!r->refcount) > BUG("cannot decrement ref counter of dead reader"); > if (--r->refcount) > return; > block_source_close(&r->source); > FREE_AND_NULL(r->name); > reftable_free(r); > } > > If the refcount is at zero already, we hit the bug. > > Patrick Yup, this seems correct. Thanks.
Attachment:
signature.asc
Description: PGP signature