[PATCH v3 00/12] fix "git reflog expire" race & get rid of EISDIR in refs API

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



A re-roll of
https://lore.kernel.org/git/cover-00.11-00000000000-20210716T140631Z-avarab@xxxxxxxxx/;
hopefully addressing all the comments by Peff & Junio, thanks
both!

Changes:

 * Lots of commit message/comment changes, see range-diff

 * I referred to an assert() I forgot to add, the series now juggles
   an assert() of thei oid back and forth, for clarity of what we're
   actually doing.

   As noted in v2 I've got a series prepped for this one to finally
   clean up those loose ends/dead code, but let's focus on the
   behavior changes for now.

 * There's now a 12th patch to remove the "ENOTDIR" case Peff noted,

Ævar Arnfjörð Bjarmason (12):
  refs/packet: add missing BUG() invocations to reflog callbacks
  refs/files: remove unused REF_DELETING in lock_ref_oid_basic()
  refs/files: remove unused "extras/skip" in lock_ref_oid_basic()
  refs/files: remove unused "skip" in lock_raw_ref() too
  refs/debug: re-indent argument list for "prepare"
  refs API: pass the "lock OID" to reflog "prepare"
  refs: make repo_dwim_log() accept a NULL oid
  refs/files: add a comment about refs_reflog_exists() call
  reflog expire: don't lock reflogs using previously seen OID
  refs/files: remove unused "oid" in lock_ref_oid_basic()
  refs/files: remove unused "errno == EISDIR" code
  refs/files: remove unused "errno != ENOTDIR" condition

 builtin/reflog.c      |  17 +++---
 reflog-walk.c         |   3 +-
 refs.c                |   5 +-
 refs.h                |   4 +-
 refs/debug.c          |  10 ++--
 refs/files-backend.c  | 128 +++++++++++-------------------------------
 refs/packed-backend.c |   5 ++
 7 files changed, 58 insertions(+), 114 deletions(-)

Range-diff against v2:
 1:  30bd7679a5c =  1:  737d2d8c3d1 refs/packet: add missing BUG() invocations to reflog callbacks
 2:  033c0cec33d !  2:  dfb9e34076e refs/files: remove unused REF_DELETING in lock_ref_oid_basic()
    @@ Metadata
      ## Commit message ##
         refs/files: remove unused REF_DELETING in lock_ref_oid_basic()
     
    -    The lock_ref_oid_basic() function has gradually been replaced by use
    -    of the file transaction API, there are only 4 remaining callers of
    -    it.
    -
    -    None of those callers pass REF_DELETING, the last such caller went
    -    away in 8df4e511387 (struct ref_update: move "have_old" into "flags",
    -    2015-02-17). This is the start of even more removal of unused code in
    -    and around this function.
    +    The lock_ref_oid_basic() function has gradually been by most callers
    +    no longer performing a low-level "acquire lock, update and release",
    +    and instead using the ref transaction API. So there are only 4
    +    remaining callers of lock_ref_oid_basic().
    +
    +    None of those callers pass REF_DELETING anymore, the last caller went
    +    away in 92b1551b1d (refs: resolve symbolic refs first,
    +    2016-04-25).
    +
    +    Before that we'd refactored and moved this code in:
    +
    +     - 8df4e511387 (struct ref_update: move "have_old" into "flags",
    +       2015-02-17)
    +
    +     - 7bd9bcf372d (refs: split filesystem-based refs code into a new
    +       file, 2015-11-09)
    +
    +     - 165056b2fc (lock_ref_for_update(): new function, 2016-04-24)
    +
    +    We then finally stopped using it in 92b1551b1d (noted above). So let's
    +    remove the handling of this parameter.
    +
    +    By itself this change doesn't benefit us much, but it's the start of
    +    even more removal of unused code in and around this function in
    +    subsequent commits.
     
         Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
     
 3:  94ffcd8cfda =  3:  0f2262fec69 refs/files: remove unused "extras/skip" in lock_ref_oid_basic()
 4:  61f9e0fc864 =  4:  7fb7ff97491 refs/files: remove unused "skip" in lock_raw_ref() too
 5:  95101c322b7 =  5:  4e526c34aaa refs/debug: re-indent argument list for "prepare"
 6:  e93465f4137 !  6:  295594fe8ae refs API: pass the "lock OID" to reflog "prepare"
    @@ Commit message
         refs API: pass the "lock OID" to reflog "prepare"
     
         Don't pass the object ID we pass into reflog_expire() back to the
    -    caller, but rather our locked OID.
    +    caller, but rather our locked OID. As the assert shows these two were
    +    the same thing in practice as we'd exit earlier in this function if we
    +    couldn't lock the desired OID.
     
    -    As the assert shows these two were the same thing in practice as we'd
    -    exit earlier in this function if we couldn't lock the desired OID, but
    -    as part of removing the passing of the OID to other functions further
    -    on I'm splitting up these concerns.
    +    This is in preparation for a subsequent change of not having
    +    reflog_expire() pass in the OID at all, also remove the "const" from
    +    the parameter since the "struct ref_lock" does not have it on its
    +    "old_oid" member.
     
         As we'll see in a subsequent commit we don't actually want to assert
         that we locked a given OID, we want this API to do the locking and
    -    tell us what the OID is, but for now let's just setup the scaffolding
    -    for that.
    +    tell us what the OID is, but for now let's just setup the basic
    +    scaffolding for that.
     
         Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
     
    @@ refs/files-backend.c: static int files_reflog_expire(struct ref_store *ref_store
      	}
      
     -	(*prepare_fn)(refname, oid, cb.policy_cb);
    ++	assert(oideq(&lock->old_oid, oid));
     +	(*prepare_fn)(refname, &lock->old_oid, cb.policy_cb);
      	refs_for_each_reflog_ent(ref_store, refname, expire_reflog_ent, &cb);
      	(*cleanup_fn)(cb.policy_cb);
 7:  0fff2d32cfc =  7:  e45ec439db0 refs: make repo_dwim_log() accept a NULL oid
 8:  1e25b7c59c5 !  8:  8ae8e5ac029 refs/files: add a comment about refs_reflog_exists() call
    @@ refs/files-backend.c: static int files_reflog_expire(struct ref_store *ref_store
      	}
     +
     +	/*
    -+	 * When refs are deleted their reflog is deleted before the
    -+	 * ref itself deleted. This race happens because there's no
    -+	 * such thing as a lock on the reflog, instead we always lock
    -+	 * the "loose ref" (even if packet) above with
    ++	 * When refs are deleted, their reflog is deleted before the
    ++	 * ref itself is deleted. This is because there is no separate
    ++	 * lock for reflog; instead we take a lock on the ref with
     +	 * lock_ref_oid_basic().
     +	 *
    -+	 * If race happens we've got nothing more to do, we were asked
    -+	 * to delete the reflog, and it's not there anymore. Great!
    ++	 * If a race happens and the reflog doesn't exist after we've
    ++	 * acquired the lock that's OK. We've got nothing more to do;
    ++	 * We were asked to delete the reflog, but someone else
    ++	 * deleted it! The caller doesn't care that we deleted it,
    ++	 * just that it is deleted. So we can return successfully.
     +	 */
      	if (!refs_reflog_exists(ref_store, refname)) {
      		unlock_ref(lock);
 9:  60d6cf342fc !  9:  1050743e27c reflog expire: don't lock reflogs using previously seen OID
    @@ Commit message
     
          1. Lookup the reflog name/OID via dwim_log()
          2. With that OID, lock the reflog
    -     3. Later in builtin/reflog.c we use the we looked as input to
    +     3. Later in builtin/reflog.c we use the OID we looked as input to
             lookup_commit_reference_gently(), assured that it's equal to the
             OID we got from dwim_log().
     
    -    What do I mean with "mostly" above? It mostly mitigates it because
    -    we'll still run into cases where the ref is locked and being updated
    -    as we want to expire it, and other git processes wanting to update the
    -    refs will in turn race with us as we expire the reflog.
    +    We can be sure that this change is safe to make because between
    +    dwim_log (step #1) and lock_ref_oid_basic (step #2) there was no other
    +    logic relevant to the OID or expiry run in the cmd_reflog_expire()
    +    caller.
     
    -    This remaining race can in turn be mitigated with the
    +    We can thus treat that code as a black box, before and after this
    +    change it would get an OID that's been locked, the only difference is
    +    that now we mostly won't be failing to get the lock due to the TOCTOU
    +    race[0]. That failure was purely an implementation detail in how the
    +    "current OID" was looked up, it was divorced from the locking
    +    mechanism.
    +
    +    What do we mean with "mostly"? It mostly mitigates it because we'll
    +    still run into cases where the ref is locked and being updated as we
    +    want to expire it, and other git processes wanting to update the refs
    +    will in turn race with us as we expire the reflog.
    +
    +    That remaining race can in turn be mitigated with the
         core.filesRefLockTimeout setting, see 4ff0f01cb7 ("refs: retry
         acquiring reference locks for 100ms", 2017-08-21). In practice if that
         value is high enough we'll probably never have ref updates or reflog
    @@ Commit message
     
         See [1] for an initial report of how this impacted "git gc" and a
         large discussion about this change in early 2019. In particular patch
    -    looked good to Michael Haggerty his[2], but that seems to not have
    -    made it to the ML archive, its content is quoted in full in my [3].
    +    looked good to Michael Haggerty, see his[2]. That message seems to not
    +    have made it to the ML archive, its content is quoted in full in my
    +    [3].
     
         I'm leaving behind now-unused code the refs API etc. that takes the
         now-NULL "oid" argument, and other code that can be simplified now
         that we never have on OID in that context, that'll be cleaned up in
         subsequent commits, but for now let's narrowly focus on fixing the
    -    "git gc" issue
    +    "git gc" issue. As the modified assert() shows we always pass a NULL
    +    oid to reflog_expire() now.
     
         Unfortunately this sort of probabilistic contention is hard to turn
         into a test. I've tested this by running the following three subshells
    @@ Commit message
             )
     
         Before this change the "reflog expire" would fail really quickly with
    -    a "but expected" error.
    +    the "but expected" error noted above.
     
         After this change both the "pull" and "reflog expire" will run for a
         while, but eventually fail because I get unlucky with
    @@ Commit message
         With core.filesRefLockTimeout set to 10 seconds (it can probably be a
         lot lower) I managed to run all four of these concurrently for about
         an hour, and accumulated ~125k commits, auto-gc's and all, and didn't
    -    have a single failure.
    +    have a single failure. The loops visibly stall while waiting for the
    +    lock, but that's expected and desired behavior.
     
    +    0. https://en.wikipedia.org/wiki/Time-of-check_to_time-of-use
         1. https://lore.kernel.org/git/87tvg7brlm.fsf@xxxxxxxxxxxxxxxxxxx/
         2. http://lore.kernel.org/git/b870a17d-2103-41b8-3cbc-7389d5fff33a@xxxxxxxxxxxx
         3. https://lore.kernel.org/git/87pnqkco8v.fsf@xxxxxxxxxxxxxxxxxxx/
    @@ refs/files-backend.c: static int files_reflog_expire(struct ref_store *ref_store
      				  REF_NO_DEREF,
      				  &type, &err);
      	if (!lock) {
    +@@ refs/files-backend.c: static int files_reflog_expire(struct ref_store *ref_store,
    + 		}
    + 	}
    + 
    +-	assert(oideq(&lock->old_oid, oid));
    ++	assert(!oid);
    + 	(*prepare_fn)(refname, &lock->old_oid, cb.policy_cb);
    + 	refs_for_each_reflog_ent(ref_store, refname, expire_reflog_ent, &cb);
    + 	(*cleanup_fn)(cb.policy_cb);
10:  09dd8836437 ! 10:  753c20f89bf refs/files: remove unused "oid" in lock_ref_oid_basic()
    @@ Commit message
         In the preceding commit the last caller that passed a non-NULL OID was
         changed to pass NULL to lock_ref_oid_basic(). As noted in preceding
         commits use of this API has been going away (we should use ref
    -    transactions, or lock_raw_rew()), so we're unlikely to gain new
    +    transactions, or lock_raw_ref()), so we're unlikely to gain new
         callers that want to pass the "oid".
     
         So let's remove it, doing so means we can remove the "mustexist"
         condition, and therefore anything except the "flags =
    -    RESOLVE_REF_NO_RECURSE" case. Furthermore, since the verify_lock()
    -    function we called did most of its work when the "oid" was passed (as
    -    "old_oid") we can inline the trivial part of it that remains in what's
    -    now its only caller.
    +    RESOLVE_REF_NO_RECURSE" case.
    +
    +    Furthermore, since the verify_lock() function we called did most of
    +    its work when the "oid" was passed (as "old_oid") we can inline the
    +    trivial part of it that remains in its only remaining caller. Without
    +    a NULL "oid" passed it was equivalent to calling refs_read_ref_full()
    +    followed by oidclr().
     
         Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
     
11:  96c3e5e9f79 = 11:  8a71bbef972 refs/files: remove unused "errno == EISDIR" code
 -:  ----------- > 12:  452253d597d refs/files: remove unused "errno != ENOTDIR" condition
-- 
2.32.0.874.ge7a9d58bfcf




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux