On Wed, Jan 08, 2025 at 09:40:58AM -0800, Junio C Hamano wrote: > Patrick Steinhardt <ps@xxxxxx> writes: > > > Hm. The problem is when Git dies in the middle of a transaction: > > > > 1. We write the temporary table. > > 2. We compute the not-so-random suffix. > > 3. We write the temporary "tables.list" file. > > 4. We move the temporary table into place using the not-so-random > > suffix. > > 5. Git dies before updating "tables.list". > > > > Now we have the temporary table moved into place, but "tables.list" > > hasn't been updated yet. When the next Git process comes along and wants > > to update the table it would result in an error if it computed the same > > suffix. > > Here, I hear that we _do_ depend on the suffix being relatively > unique. Once our random number generator decides to give the same > number twice to cause collision, the reftable data gets corrupt? No, there is no corruption. We may fail to update the stack when there are colliding files, but that's it. > > The reftable library knows to clean up such stale tables when not > > referenced by the "tables.list" file, but it doesn't do so on every > > write. So this would likely still cause issues in practice. > > > > I already though about this scenario when writing my mail, but didn't > > really think about it as "correctness". But I guess it is. > > Hmph. I am not sure how I should feel about this. Our reliance on > hash functions (which can be made to collide) not colliding is one > thing, but is it sensibly safe to rely on a cryptographically > unpredictable random generator not to yield the same suffix twice > during the lifetime of an previous invocation for correctness? This is why I've been hesistant to call it a "correctness" issue, as there is no corruption involved here. It's more of a denial of service as you may not be able to update the stack anymore until you remove the occluding file. But turns out I misremembered from 9abda98149 (reftable/stack: fix use of unseeded randomness, 2023-12-11): things indeed work alright. To demonstrate, let's update `format_name()` like this: diff --git a/reftable/stack.c b/reftable/stack.c index 531660a49f..b7422679df 100644 --- a/reftable/stack.c +++ b/reftable/stack.c @@ -659,7 +659,7 @@ int reftable_stack_add(struct reftable_stack *st, static int format_name(struct reftable_buf *dest, uint64_t min, uint64_t max) { char buf[100]; - uint32_t rnd = (uint32_t)git_rand(); + uint32_t rnd = 123; snprintf(buf, sizeof(buf), "0x%012" PRIx64 "-0x%012" PRIx64 "-%08x", min, max, rnd); reftable_buf_reset(dest); And then we create an occluding file and try to update: $ ~/Development/git/build/bin-wrappers/git init repo --ref-format=reftable Initialized empty Git repository in /tmp/repo/.git/ $ cd repo/ $ ls .git/reftable/ 0x000000000001-0x000000000001-0000007b.ref tables.list $ touch .git/reftable/0x000000000002-0x000000000002-0000007b.ref $ ~/Development/git/build/git commit --allow-empty -mx [main (root-commit) 08d02ef] x $ ls .git/reftable/ 0x000000000001-0x000000000002-0000007b.ref tables.list $ git show commit 08d02efa88f085f02c31285bf909d8d3a25c70dd (HEAD -> main) Author: Patrick Steinhardt <ps@xxxxxx> Date: Wed Jan 8 19:03:46 2025 +0100 x So the stack gets updated as expected, no corruption there. But this _can_ be a problem on Windows, where the file cannot be deleted in case it was still open. This is also documented as the reason in "Documentation/techincal/reftable.txt". That should've gotten better though now with the improvements I made to our rename-emulation, as it now uses POSIX semantics. That being said, I still don't think that swapping out `git_rand()` for `rand()` is the right thing to do. It does not solve the issue, but only a symptom thereof. Git can still die whenever the OpenSSL CSPRNG fails as there are other uses of `git_rand()` or `csprng_bytes()` in the codebase. So this change would regress something that works everywhere but on NonStop and make the suffixes predictable. The ia64 machine in question is being EOLd end of 2025. And the fact that this has never been a problem before v2.48.0-rc2, and that the machine was rebooted for maintenance immediately before running the tests, indicates to me that something fishy is going on on that platform. Patrick