Re: [PATCH] bcache: fix insane -ESRCH error in bch_journal_replay

Coly Li <colyli@xxxxxxx> · Fri, 13 May 2022 23:22:32 +0800

On 2/15/22 4:28 PM, Minlan Wang wrote:
bch_keylist_empty is checked to make sure key insert is successful

Signed-off-by: Minlan Wang <wangminlan@xxxxxxxxxxxxxxx>
---
We've been using a pretty old version of bcache (from
linux-4.18-rc8) and experiencing occasional data corruption
after a new register session of bcache.  It turns out the
data corruption is caused by this code in our old version of
bcache:

in bch_journal_replay:
		 ...
                  for (k = i->j.start;
                       k < bset_bkey_last(&i->j);
                       k = bkey_next(k)) {
                          trace_bcache_journal_replay_key(k);

                          bch_keylist_init_single(&keylist, k);

                          ret = bch_btree_insert(s, &keylist, i->pin, NULL);
                          if (ret)
                                  goto err;
		 ...
bch_journal_replay returns -ESRCH, then in run_cache_set:
		 ...
                  if (j->version < BCACHE_JSET_VERSION_UUID)
                          __uuid_write(c);

                  bch_journal_replay(c, &journal);
		 ...
There's no message warning about this key insert error in
journal replay, nor does run_cache_set check for return
value of bch_journal_replay either, so later keys in jset
got lost silently and cause data corruption.

We checked the key which caused this failure in
bch_journal_replay, it is a key mapped to 2 btree node, and
the first btree node is full.
The code path causing this problem is:
1. k1 mapped to btree node n1, n2


Hi Minlan,


Could you please provide more details about how the key 'k1' mapped to 
btree node n1 and n2 both?

Thanks.


Coly Li



2. When mapping into n1, bch_btree_insert_node goes into
    split case, top half of k1, say k1.1, is inserted into
    splitted n1: say n1.1, n1.2. Since the bottom half of k1,
    say k1.2, is left in insert_keys, so bch_btree_insert_node
    returns -EINTR.
    This happens in this code:
    in bch_btree_insert_node:
         ...
         } else {
                 /* Invalidated all iterators */
                 int ret = btree_split(b, op, insert_keys, replace_key);

                 if (bch_keylist_empty(insert_keys))
                         return 0;
                 else if (!ret)
                         return -EINTR;
                 return ret;
         }
         ...
3. For return value -EINTR, another
    bch_btree_map_nodes_recurse in bcache_btree_root is
    triggered, with the wrong "from" key: which is
    START_KEY(&k1), instead of START_KEY(&k1.2)
4. n1.2 is revisted, since it has no overlapping range for
    k1.2, bch_btree_insert_keys sets
    op->insert_collision = true
    in the following code path:
    btree_insert_fn
    --> bch_btree_insert_node
      --> bch_btree_insert_keys
5. n2 is visisted and k1.2 is inserted here
6. bch_btree_insert detects op.op.insert_collision, and
    returns -ESRCH

In this case, the key is successfully inserted, though
bch_btree_insert returns -ESRCH.

We tried the latest version of bcache in linux mainline,
which has commit ce3e4cfb59cb (bcache: add failure check to
run_cache_set() for journal replay), the cache disk register
failed with this message:
bcache: replay journal failed, disabling caching

This is not what is expected, since the data in disk in
intact, journal replay should success.
I'm wondering if the checking of bch_btree_insert's return
value is really neccessary in bch_journal_replay, could we
just check bch_keylist_empty(&keylist) and make the replay
continue if that is empty?

We are using this patch for a work around for this issue now.

  drivers/md/bcache/journal.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 61bd79babf7a..35b8cbcd73c5 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -380,6 +380,8 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
  			bch_keylist_init_single(&keylist, k);
  
  			ret = bch_btree_insert(s, &keylist, i->pin, NULL);
+			if ((ret == -ESRCH) && (bch_keylist_empty(&keylist)))
+				ret = 0;
  			if (ret)
  				goto err;