Re: Recover from "journal entries X-Y missing! (replaying X-Z)", "IO error on writing btree."


 



Hi Junhui,

Now I am able to understand your patch. Yes, this patch may fix one of
the conditions under which a jset gets lost.

We should have this fix in v5.1; I will handle the formatting issues. And if
you don't mind, I may re-compose the commit log to explain what exactly is
fixed.

Thanks.

Coly Li

On 2019/3/21 7:04 PM, Junhui Tang wrote:
> I met this bug and sent a patch for it before.
> Please give this patch a try.
> 
> https://www.spinics.net/lists/linux-bcache/msg06555.html
> 
> From: Tang Junhui <tang.junhui.linux@xxxxxxxxx>
> Date: Wed, 12 Sep 2018 04:42:14 +0800
> Subject: [PATCH] bcache: fix failure in journal replay
> 
> journal replay failed with messages:
> Sep 10 19:10:43 ceph kernel: bcache: error on
> bb379a64-e44e-4812-b91d-a5599871a3b1: bcache: journal entries
> 2057493-2057567 missing! (replaying 2057493-2076601), disabling
> caching
> 
> The reason is that in journal_reclaim() we send discard commands and
> reclaim those journal buckets whose seq is older than last_seq_now,
> but before we write a journal with last_seq_now, the machine is
> restarted, so the journal with last_seq_now is not written to
> the journal bucket, and the last_seq_wrote in the newest journal is
> older than the last_seq_now we expect it to be, so when we do the
> replay, journals from last_seq_wrote to last_seq_now are missing.
> 
> It's hard to write a journal immediately after journal_reclaim(),
> and it is harmless if those missing journals were caused by discarding,
> since those journals were already written to btree nodes. So, if the
> missing seqs start from the beginning of the journal, we treat it as
> normal, and only print a message to report the missing journals, and
> point out that they may have been caused by discarding.
> 
> Signed-off-by: Tang Junhui <tang.junhui.linux@xxxxxxxxx>
> ---
>  drivers/md/bcache/journal.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 10748c6..9b4cd2e 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -328,9 +328,13 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
>   list_for_each_entry(i, list, list) {
>   BUG_ON(i->pin && atomic_read(i->pin) != 1);
> 
> -	cache_set_err_on(n != i->j.seq, s,
> -		"bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)",
> +	if (n != i->j.seq && n == start)
> +		pr_info("bcache: journal entries %llu-%llu may be discarded! (replaying %llu-%llu)",
>  			n, i->j.seq - 1, start, end);
> +	else
> +		cache_set_err_on(n != i->j.seq, s,
> +			"bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)",
> +			n, i->j.seq - 1, start, end);
> 
>   for (k = i->j.start;
>        k < bset_bkey_last(&i->j);
> -- 
> 1.8.3.1
> 
> 
> Coly Li <colyli@xxxxxxx> wrote on Thursday, March 21, 2019 at 12:52 PM:
> 
> On 2019/3/21 3:33 AM, Dennis Schridde wrote:
>> On Wednesday, 20 March 2019 12:16:29 CET Coly Li wrote:
>>> On 2019/3/20 5:42 AM, Dennis Schridde wrote:
>>>> Hello!
>>>>
>>>> During boot my bcache device cannot be activated anymore and
>>>> hence the filesystem content is inaccessible.  It appears that
>>>> parts of the journal are corrupted, since dmesg says:
>>>> ```
>>>> bcache: register_bdev() registered backing device sda3
>>>> bcache: error on UUID: bcache: journal entries X-Y missing! (replaying X-Z), disabling caching
>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>> bcache: bch_btree_insert() error -5
>>>> bcache: bch_cached_dev_attach() Can't attach sda3: shutting down
>>>> bcache: register_cache() registered cache device nvme0n1
>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>> bcache: cache_set_free() Cache set UUID unregistered
>>>> ```
>>>>
>>>> UUID represents a UUID.  X, Y, Z are integers, with X<Y<Z,
>>>> Y=X+12 and Z=Y+116.
>>>>
>>>> Error -5 is EIO, i.e. a generic I/O error.  Is there a way to
>>>> get more information on where that error originates from and
>>>> what exactly is broken? Did bcache just detect broken data, or
>>>> is the device itself broken?  Which device, the HDD or the NVMe
>>>> SSD?
>>>>
>>>> Is there a way to recover from this without losing all data
>>>> on the drive?  Is it maybe possible to just discard the
>>>> journal entries >X and return to the state the block device was
>>>> in at point X, losing only modifications after that point?
>>>>
>>>> Background: The situation appeared after my computer was
>>>> running for a few hours and the screen stayed dark when I tried
>>>> to wake the monitor from standby.  The machine did not react to
>>>> NumLock or Ctrl+Alt+Del, so I issued a magic SysRq and tried
>>>> to safely reboot the machine by slowly typing REISUB. Sadly
>>>> after this the machine ended up in the state described above.
>>>
>>> It seems some journal set was lost during bch_journal_replay()
>>> after rebooting and starting the cache set.
>>>
>>> During my testing of a journal deadlock fix, I also observed this
>>> issue. When I change the number of journal buckets from 256 to 8, the
>>> problem can be observed on almost every reboot.
>>>
>>> This one is not fixed yet and I am currently working on it.
>>>
>>> What kernel version do you use?  I thought this issue was only
>>> introduced by my current changes, but from your report it seems
>>> the problem happens in the upstream kernel as well.
> 
>> I was using Linux 5.0.2 (with Gentoo patches, which are minimal,
>> AFAIK).
> 
>> I would have expected that S and/or U in REISUB would write all
>> bcache metadata to disk and prevent such problems.  Is this a wrong
>> assumption?
> 
>> Will your patches allow me to use the cache again, or will they
>> prevent the metadata from breaking in the first place?
> 
> I am still looking into how this problem happens. Once I
> have a fix, I will let you know.
> 
> Thanks.
> 
> Coly Li
> 
> 
> 

-- 

Coly Li


