Re: [PATCH] bcache: treat stale && dirty keys as bad keys

Junhui Tang <tang.junhui.linux@xxxxxxxxx> · Wed, 19 Dec 2018 09:30:15 +0800

hello Kent
long long no see, glad to hear you again.
>> Then two steps:
>> A) update k1 to k2 in btree node memory;
>>    bch_btree_insert_keys(b, op, insert_keys, replace_key)
>> B) Write the bset(contains k2) to cache disk by a 30s delay work
>>    bch_btree_leaf_dirty(b, journal_ref).
>> But before the 30s delay work write the bset to cache device,
>> these things happend:
>> A) GC works, and reclaim the bucket k2 point to;
>> B) Allocator works, and invalidate the bucket k2 point to,
>>    and increase the gen of the bucket, and place it into free_inc
>>    fifo;
>> C) Until now, the 30s delay work still does not finish work,
>>    so in the disk, the key still is k1, it is dirty and stale
>>    (its gen is smaller than the gen of the bucket). and then the
>>    machine power off suddenly happens;
>> D) When the machine power on again, after the btree reconstruction,
>>    the stale dirty key appear.

> Only prior to journal replay, right? Or did you uncover something more severe?
No, it's after the journal replay, and in write_dirty_finish(), when
replace a dirty key with
a clean key by calling bch_btree_insert(), no journal will write.

>> In bch_extent_bad(), when expensive_debug_checks is off, it would
>> treat the dirty key as good even it is stale keys, and it would
>> cause bellow probelms:
>> A) In read_dirty() it would cause machine crash:
>>    BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
>> B) It could be worse when reads hits stale dirty keys, it would
>>    read old incorrect data.

>Neither of these can happen until after journal replay is finished. Prior to
>journal replay we expect to find stale dirty keys - if we find any after journal
>replay then it's indicative of a real bug.
As I said previous, since no journal writes after inserting a replace key in
writeback, so this issue has nothing to do with journal.

This is a real problem in my environment, after running IO sometimes, I turn off
the power suddenly,  then turn on the power, and the machine crash in
read_dirty()
due to the stale && dirty keys.

Kent Overstreet <kent.overstreet@xxxxxxxxx> 于2018年12月18日周二 下午10:02写道：
>
> On Tue, Dec 18, 2018 at 02:37:14PM +0800, Junhui Tang wrote:
> > From 8ca52771f1d3b83ff55dc80ba633901a081963bf Mon Sep 17 00:00:00 2001
> > From: Tang Junhui <tang.junhui.linux@xxxxxxxxx>
> > Date: Tue, 18 Dec 2018 10:38:26 +0800
> > Subject: [PATCH] bcache: treat stale && dirty keys as bad keys
> >
> > Stale && dirty keys can be produced in the follow way:
> > After writeback in write_dirty_finish(), dirty keys k1 will
> > replace by clean keys k2
> > ==>ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
> > ==>btree_insert_fn(struct btree_op *b_op, struct btree *b)
> > ==>static int bch_btree_insert_node(struct btree *b,
> >        struct btree_op *op,
> >        struct keylist *insert_keys,
> >        atomic_t *journal_ref,
> > Then two steps:
> > A) update k1 to k2 in btree node memory;
> >    bch_btree_insert_keys(b, op, insert_keys, replace_key)
> > B) Write the bset(contains k2) to cache disk by a 30s delay work
> >    bch_btree_leaf_dirty(b, journal_ref).
> > But before the 30s delay work write the bset to cache device,
> > these things happend:
> > A) GC works, and reclaim the bucket k2 point to;
> > B) Allocator works, and invalidate the bucket k2 point to,
> >    and increase the gen of the bucket, and place it into free_inc
> >    fifo;
> > C) Until now, the 30s delay work still does not finish work,
> >    so in the disk, the key still is k1, it is dirty and stale
> >    (its gen is smaller than the gen of the bucket). and then the
> >    machine power off suddenly happens;
> > D) When the machine power on again, after the btree reconstruction,
> >    the stale dirty key appear.
>
> Only prior to journal replay, right? Or did you uncover something more severe?
>
> > In bch_extent_bad(), when expensive_debug_checks is off, it would
> > treat the dirty key as good even it is stale keys, and it would
> > cause bellow probelms:
> > A) In read_dirty() it would cause machine crash:
> >    BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
> > B) It could be worse when reads hits stale dirty keys, it would
> >    read old incorrect data.
>
> Neither of these can happen until after journal replay is finished. Prior to
> journal replay we expect to find stale dirty keys - if we find any after journal
> replay then it's indicative of a real bug.
>
> >
> > This patch tolerate the existence of these stale && dirty keys,
> > and treat them as bad key in bch_extent_bad().
> >
> > Signed-off-by: Tang Junhui <tang.junhui.linux@xxxxxxxxx>
> > ---
> >  drivers/md/bcache/extents.c | 14 ++++++++------
> >  1 file changed, 8 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/md/bcache/extents.c b/drivers/md/bcache/extents.c
> > index 1d09674..f9eb03c 100644
> > --- a/drivers/md/bcache/extents.c
> > +++ b/drivers/md/bcache/extents.c
> > @@ -535,6 +535,7 @@ static bool bch_extent_bad(struct btree_keys *bk,
> > const struct bkey *k)
> >  {
> >   struct btree *b = container_of(bk, struct btree, keys);
> >   unsigned i, stale;
> > + char buf[80];
> >
> >   if (!KEY_PTRS(k) ||
> >       bch_extent_invalid(bk, k))
> > @@ -544,19 +545,20 @@ static bool bch_extent_bad(struct btree_keys
> > *bk, const struct bkey *k)
> >   if (!ptr_available(b->c, k, i))
> >   return true;
> >
> > - if (!expensive_debug_checks(b->c) && KEY_DIRTY(k))
> > - return false;
> > -
> >   for (i = 0; i < KEY_PTRS(k); i++) {
> >   stale = ptr_stale(b->c, k, i);
> >
> > + if (stale && KEY_DIRTY(k)) {
> > + bch_extent_to_text(buf, sizeof(buf), k);
> > + pr_info("stale dirty pointer, stale %u, key: %s",
> > +     stale,
> > +     buf);
> > + }
> > +
> >   btree_bug_on(stale > 96, b,
> >        "key too stale: %i, need_gc %u",
> >        stale, b->c->need_gc);
> >
> > - btree_bug_on(stale && KEY_DIRTY(k) && KEY_SIZE(k),
> > -      b, "stale dirty pointer");
> > -
> >   if (stale)
> >   return true;
> >
> > --
> > 1.8.3.1