On 2/15/22 4:28 PM, Minlan Wang wrote:
bch_keylist_empty is checked to make sure key insert is successful
Signed-off-by: Minlan Wang <wangminlan@xxxxxxxxxxxxxxx>
---
We've been using a pretty old version of bcache (from
linux-4.18-rc8) and experiencing occasional data corruption
after a new register session of bcache. It turns out the
data corruption is caused by this code in our old version of
bcache:
in bch_journal_replay:
...
for (k = i->j.start;
k < bset_bkey_last(&i->j);
k = bkey_next(k)) {
trace_bcache_journal_replay_key(k);
bch_keylist_init_single(&keylist, k);
ret = bch_btree_insert(s, &keylist, i->pin, NULL);
if (ret)
goto err;
...
bch_journal_replay returns -ESRCH, then in run_cache_set:
...
if (j->version < BCACHE_JSET_VERSION_UUID)
__uuid_write(c);
bch_journal_replay(c, &journal);
...
There's no message warning about this key insert error in
journal replay, nor does run_cache_set check for return
value of bch_journal_replay either, so later keys in jset
got lost silently and cause data corruption.
We checked the key which caused this failure in
bch_journal_replay, it is a key mapped to 2 btree node, and
the first btree node is full.
The code path causing this problem is:
1. k1 mapped to btree node n1, n2
Hi Minlan,
Could you please provide more details about how the key 'k1' mapped to
btree node n1 and n2 both?
Thanks.
Coly Li
2. When mapping into n1, bch_btree_insert_node goes into
split case, top half of k1, say k1.1, is inserted into
splitted n1: say n1.1, n1.2. Since the bottom half of k1,
say k1.2, is left in insert_keys, so bch_btree_insert_node
returns -EINTR.
This happens in this code:
in bch_btree_insert_node:
...
} else {
/* Invalidated all iterators */
int ret = btree_split(b, op, insert_keys, replace_key);
if (bch_keylist_empty(insert_keys))
return 0;
else if (!ret)
return -EINTR;
return ret;
}
...
3. For return value -EINTR, another
bch_btree_map_nodes_recurse in bcache_btree_root is
triggered, with the wrong "from" key: which is
START_KEY(&k1), instead of START_KEY(&k1.2)
4. n1.2 is revisted, since it has no overlapping range for
k1.2, bch_btree_insert_keys sets
op->insert_collision = true
in the following code path:
btree_insert_fn
--> bch_btree_insert_node
--> bch_btree_insert_keys
5. n2 is visisted and k1.2 is inserted here
6. bch_btree_insert detects op.op.insert_collision, and
returns -ESRCH
In this case, the key is successfully inserted, though
bch_btree_insert returns -ESRCH.
We tried the latest version of bcache in linux mainline,
which has commit ce3e4cfb59cb (bcache: add failure check to
run_cache_set() for journal replay), the cache disk register
failed with this message:
bcache: replay journal failed, disabling caching
This is not what is expected, since the data in disk in
intact, journal replay should success.
I'm wondering if the checking of bch_btree_insert's return
value is really neccessary in bch_journal_replay, could we
just check bch_keylist_empty(&keylist) and make the replay
continue if that is empty?
We are using this patch for a work around for this issue now.
drivers/md/bcache/journal.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 61bd79babf7a..35b8cbcd73c5 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -380,6 +380,8 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
bch_keylist_init_single(&keylist, k);
ret = bch_btree_insert(s, &keylist, i->pin, NULL);
+ if ((ret == -ESRCH) && (bch_keylist_empty(&keylist)))
+ ret = 0;
if (ret)
goto err;