bch_keylist_empty is checked to make sure key insert is successful Signed-off-by: Minlan Wang <wangminlan@xxxxxxxxxxxxxxx> --- We've been using a pretty old version of bcache (from linux-4.18-rc8) and experiencing occasional data corruption after a new register session of bcache. It turns out the data corruption is caused by this code in our old version of bcache: in bch_journal_replay: ... for (k = i->j.start; k < bset_bkey_last(&i->j); k = bkey_next(k)) { trace_bcache_journal_replay_key(k); bch_keylist_init_single(&keylist, k); ret = bch_btree_insert(s, &keylist, i->pin, NULL); if (ret) goto err; ... bch_journal_replay returns -ESRCH, then in run_cache_set: ... if (j->version < BCACHE_JSET_VERSION_UUID) __uuid_write(c); bch_journal_replay(c, &journal); ... There's no message warning about this key insert error in journal replay, nor does run_cache_set check for return value of bch_journal_replay either, so later keys in jset got lost silently and cause data corruption. We checked the key which caused this failure in bch_journal_replay, it is a key mapped to 2 btree node, and the first btree node is full. The code path causing this problem is: 1. k1 mapped to btree node n1, n2 2. When mapping into n1, bch_btree_insert_node goes into split case, top half of k1, say k1.1, is inserted into splitted n1: say n1.1, n1.2. Since the bottom half of k1, say k1.2, is left in insert_keys, so bch_btree_insert_node returns -EINTR. This happens in this code: in bch_btree_insert_node: ... } else { /* Invalidated all iterators */ int ret = btree_split(b, op, insert_keys, replace_key); if (bch_keylist_empty(insert_keys)) return 0; else if (!ret) return -EINTR; return ret; } ... 3. For return value -EINTR, another bch_btree_map_nodes_recurse in bcache_btree_root is triggered, with the wrong "from" key: which is START_KEY(&k1), instead of START_KEY(&k1.2) 4. n1.2 is revisted, since it has no overlapping range for k1.2, bch_btree_insert_keys sets op->insert_collision = true in the following code path: btree_insert_fn --> bch_btree_insert_node --> bch_btree_insert_keys 5. n2 is visisted and k1.2 is inserted here 6. bch_btree_insert detects op.op.insert_collision, and returns -ESRCH In this case, the key is successfully inserted, though bch_btree_insert returns -ESRCH. We tried the latest version of bcache in linux mainline, which has commit ce3e4cfb59cb (bcache: add failure check to run_cache_set() for journal replay), the cache disk register failed with this message: bcache: replay journal failed, disabling caching This is not what is expected, since the data in disk in intact, journal replay should success. I'm wondering if the checking of bch_btree_insert's return value is really neccessary in bch_journal_replay, could we just check bch_keylist_empty(&keylist) and make the replay continue if that is empty? We are using this patch for a work around for this issue now. drivers/md/bcache/journal.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index 61bd79babf7a..35b8cbcd73c5 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -380,6 +380,8 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list) bch_keylist_init_single(&keylist, k); ret = bch_btree_insert(s, &keylist, i->pin, NULL); + if ((ret == -ESRCH) && (bch_keylist_empty(&keylist))) + ret = 0; if (ret) goto err; -- 2.18.4