hello Kent long long no see, glad to hear you again. >> Then two steps: >> A) update k1 to k2 in btree node memory; >> bch_btree_insert_keys(b, op, insert_keys, replace_key) >> B) Write the bset(contains k2) to cache disk by a 30s delay work >> bch_btree_leaf_dirty(b, journal_ref). >> But before the 30s delay work write the bset to cache device, >> these things happend: >> A) GC works, and reclaim the bucket k2 point to; >> B) Allocator works, and invalidate the bucket k2 point to, >> and increase the gen of the bucket, and place it into free_inc >> fifo; >> C) Until now, the 30s delay work still does not finish work, >> so in the disk, the key still is k1, it is dirty and stale >> (its gen is smaller than the gen of the bucket). and then the >> machine power off suddenly happens; >> D) When the machine power on again, after the btree reconstruction, >> the stale dirty key appear. > Only prior to journal replay, right? Or did you uncover something more severe? No, it's after the journal replay, and in write_dirty_finish(), when replace a dirty key with a clean key by calling bch_btree_insert(), no journal will write. >> In bch_extent_bad(), when expensive_debug_checks is off, it would >> treat the dirty key as good even it is stale keys, and it would >> cause bellow probelms: >> A) In read_dirty() it would cause machine crash: >> BUG_ON(ptr_stale(dc->disk.c, &w->key, 0)); >> B) It could be worse when reads hits stale dirty keys, it would >> read old incorrect data. >Neither of these can happen until after journal replay is finished. Prior to >journal replay we expect to find stale dirty keys - if we find any after journal >replay then it's indicative of a real bug. As I said previous, since no journal writes after inserting a replace key in writeback, so this issue has nothing to do with journal. This is a real problem in my environment, after running IO sometimes, I turn off the power suddenly, then turn on the power, and the machine crash in read_dirty() due to the stale && dirty keys. Kent Overstreet <kent.overstreet@xxxxxxxxx> 于2018年12月18日周二 下午10:02写道: > > On Tue, Dec 18, 2018 at 02:37:14PM +0800, Junhui Tang wrote: > > From 8ca52771f1d3b83ff55dc80ba633901a081963bf Mon Sep 17 00:00:00 2001 > > From: Tang Junhui <tang.junhui.linux@xxxxxxxxx> > > Date: Tue, 18 Dec 2018 10:38:26 +0800 > > Subject: [PATCH] bcache: treat stale && dirty keys as bad keys > > > > Stale && dirty keys can be produced in the follow way: > > After writeback in write_dirty_finish(), dirty keys k1 will > > replace by clean keys k2 > > ==>ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key); > > ==>btree_insert_fn(struct btree_op *b_op, struct btree *b) > > ==>static int bch_btree_insert_node(struct btree *b, > > struct btree_op *op, > > struct keylist *insert_keys, > > atomic_t *journal_ref, > > Then two steps: > > A) update k1 to k2 in btree node memory; > > bch_btree_insert_keys(b, op, insert_keys, replace_key) > > B) Write the bset(contains k2) to cache disk by a 30s delay work > > bch_btree_leaf_dirty(b, journal_ref). > > But before the 30s delay work write the bset to cache device, > > these things happend: > > A) GC works, and reclaim the bucket k2 point to; > > B) Allocator works, and invalidate the bucket k2 point to, > > and increase the gen of the bucket, and place it into free_inc > > fifo; > > C) Until now, the 30s delay work still does not finish work, > > so in the disk, the key still is k1, it is dirty and stale > > (its gen is smaller than the gen of the bucket). and then the > > machine power off suddenly happens; > > D) When the machine power on again, after the btree reconstruction, > > the stale dirty key appear. > > Only prior to journal replay, right? Or did you uncover something more severe? > > > In bch_extent_bad(), when expensive_debug_checks is off, it would > > treat the dirty key as good even it is stale keys, and it would > > cause bellow probelms: > > A) In read_dirty() it would cause machine crash: > > BUG_ON(ptr_stale(dc->disk.c, &w->key, 0)); > > B) It could be worse when reads hits stale dirty keys, it would > > read old incorrect data. > > Neither of these can happen until after journal replay is finished. Prior to > journal replay we expect to find stale dirty keys - if we find any after journal > replay then it's indicative of a real bug. > > > > > This patch tolerate the existence of these stale && dirty keys, > > and treat them as bad key in bch_extent_bad(). > > > > Signed-off-by: Tang Junhui <tang.junhui.linux@xxxxxxxxx> > > --- > > drivers/md/bcache/extents.c | 14 ++++++++------ > > 1 file changed, 8 insertions(+), 6 deletions(-) > > > > diff --git a/drivers/md/bcache/extents.c b/drivers/md/bcache/extents.c > > index 1d09674..f9eb03c 100644 > > --- a/drivers/md/bcache/extents.c > > +++ b/drivers/md/bcache/extents.c > > @@ -535,6 +535,7 @@ static bool bch_extent_bad(struct btree_keys *bk, > > const struct bkey *k) > > { > > struct btree *b = container_of(bk, struct btree, keys); > > unsigned i, stale; > > + char buf[80]; > > > > if (!KEY_PTRS(k) || > > bch_extent_invalid(bk, k)) > > @@ -544,19 +545,20 @@ static bool bch_extent_bad(struct btree_keys > > *bk, const struct bkey *k) > > if (!ptr_available(b->c, k, i)) > > return true; > > > > - if (!expensive_debug_checks(b->c) && KEY_DIRTY(k)) > > - return false; > > - > > for (i = 0; i < KEY_PTRS(k); i++) { > > stale = ptr_stale(b->c, k, i); > > > > + if (stale && KEY_DIRTY(k)) { > > + bch_extent_to_text(buf, sizeof(buf), k); > > + pr_info("stale dirty pointer, stale %u, key: %s", > > + stale, > > + buf); > > + } > > + > > btree_bug_on(stale > 96, b, > > "key too stale: %i, need_gc %u", > > stale, b->c->need_gc); > > > > - btree_bug_on(stale && KEY_DIRTY(k) && KEY_SIZE(k), > > - b, "stale dirty pointer"); > > - > > if (stale) > > return true; > > > > -- > > 1.8.3.1