On 2019/3/21 10:16 PM, Coly Li wrote:
> Hi Junhui,
>
> Now I am able to understand your patch. Yes, this patch may fix one of
> the conditions under which a jset gets lost.
>
> We should have this fix in v5.1, and I will handle the format issue. If
> you don't mind, I may re-compose a commit log to explain what exactly
> is fixed.

Hi Junhui,

While reviewing the patch, I found one point I still do not understand;
could you please give me more hints?

From your commit log: "so when we doing replay, journals from
last_seq_wrote to last_seq_now are missing." Can you show me the code
that explains how such a condition happens?

Thanks in advance.

Coly Li

> On 2019/3/21 7:04 PM, Junhui Tang wrote:
>> I met this bug and sent a patch before.
>> Please give this patch a try:
>>
>> https://www.spinics.net/lists/linux-bcache/msg06555.html
>>
>> From: Tang Junhui <tang.junhui.linux@xxxxxxxxx>
>> Date: Wed, 12 Sep 2018 04:42:14 +0800
>> Subject: [PATCH] bcache: fix failure in journal replay
>>
>> Journal replay failed with these messages:
>> Sep 10 19:10:43 ceph kernel: bcache: error on
>> bb379a64-e44e-4812-b91d-a5599871a3b1: bcache: journal entries
>> 2057493-2057567 missing! (replaying 2057493-2076601), disabling
>> caching
>>
>> The reason is that in journal_reclaim() we send a discard command and
>> reclaim those journal buckets whose seq is older than last_seq_now,
>> but the machine is restarted before we write a journal with
>> last_seq_now. So the journal with last_seq_now is never written to
>> its journal bucket, and last_seq_wrote in the newest on-disk journal
>> is older than the last_seq_now we expect it to be; when we do the
>> replay, journals from last_seq_wrote to last_seq_now are missing.
>>
>> It is hard to write a journal immediately after journal_reclaim(),
>> and it is harmless if those missed journals were dropped by
>> discarding, since those journals were already written to btree nodes.
>> So, if the missed seqs start from the beginning of the journal, we
>> treat it as normal and only print a message to show the missed
>> journals, pointing out that they may have been caused by discarding.
>>
>> Signed-off-by: Tang Junhui <tang.junhui.linux@xxxxxxxxx>
>> ---
>>  drivers/md/bcache/journal.c | 8 ++++++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
>> index 10748c6..9b4cd2e 100644
>> --- a/drivers/md/bcache/journal.c
>> +++ b/drivers/md/bcache/journal.c
>> @@ -328,9 +328,13 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
>>  	list_for_each_entry(i, list, list) {
>>  		BUG_ON(i->pin && atomic_read(i->pin) != 1);
>>
>> -		cache_set_err_on(n != i->j.seq, s,
>> -"bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)",
>> +		if (n != i->j.seq && n == start)
>> +			pr_info("bcache: journal entries %llu-%llu may be discarded! (replaying %llu-%llu)",
>>  				 n, i->j.seq - 1, start, end);
>> +		else
>> +			cache_set_err_on(n != i->j.seq, s,
>> +"bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)",
>> +				 n, i->j.seq - 1, start, end);
>>
>>  		for (k = i->j.start;
>>  		     k < bset_bkey_last(&i->j);
>> --
>> 1.8.3.1
>>
>>
>> Coly Li <colyli@xxxxxxx> wrote on Thursday, March 21, 2019 at 12:52 PM:
>>
>> On 2019/3/21 3:33 AM, Dennis Schridde wrote:
>>> On Wednesday, 20 March 2019, 12:16:29 CET, Coly Li wrote:
>>>> On 2019/3/20 5:42 AM, Dennis Schridde wrote:
>>>>> Hello!
>>>>>
>>>>> During boot my bcache device cannot be activated anymore, and
>>>>> hence the filesystem content is inaccessible. It appears that
>>>>> parts of the journal are corrupted, since dmesg says:
>>>>> ```
>>>>> bcache: register_bdev() registered backing device sda3
>>>>> bcache: error on UUID: bcache: journal entries X-Y missing!
>>>>> (replaying X-Z), disabling caching
>>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>>> bcache: bch_btree_insert() error -5
>>>>> bcache: bch_cached_dev_attach() Can't attach sda3: shutting down
>>>>> bcache: register_cache() registered cache device nvme0n1
>>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>>> bcache: cache_set_free() Cache set UUID unregistered
>>>>> ```
>>>>>
>>>>> UUID represents a UUID. X, Y and Z are integers, with X < Y < Z,
>>>>> Y = X + 12 and Z = Y + 116.
>>>>>
>>>>> Error -5 is EIO, i.e. a generic I/O error. Is there a way to get
>>>>> more information on where that error originates and what exactly
>>>>> is broken? Did bcache just detect broken data, or is the device
>>>>> itself broken? Which device, the HDD or the NVMe SSD?
>>>>>
>>>>> Is there a way to recover from this without losing all the data
>>>>> on the drive? Is it maybe possible to just discard the journal
>>>>> entries > X and return to the state the block device was in at
>>>>> point X, losing only modifications after that point?
>>>>>
>>>>> Background: The situation appeared after my computer had been
>>>>> running for a few hours and the screen stayed dark when I tried
>>>>> to wake the monitor from standby. The machine did not react to
>>>>> NumLock or Ctrl+Alt+Del, so I issued a magic SysRq and tried to
>>>>> safely reboot the machine by slowly typing REISUB. Sadly, after
>>>>> this the machine ended up in the state described above.
>>>>
>>>> It seems some journal set was lost during bch_journal_replay()
>>>> after the reboot, when the cache set started.
>>>>
>>>> During my test of a journal deadlock fix, I also observed this
>>>> issue.
>>>> When I change the number of journal buckets from 256 to 8, the
>>>> problem can be observed on almost every reboot.
>>>>
>>>> This one is not fixed yet, and I am currently working on it.
>>>>
>>>> What kernel version do you use? I thought this issue was only
>>>> introduced by my current changes, but from your report it seems
>>>> such a problem happens in the upstream kernel as well.
>>
>>> I was using Linux 5.0.2 (with Gentoo patches, which are minimal,
>>> AFAIK).
>>
>>> I would have expected that the S and/or U in REISUB would write all
>>> bcache metadata to disk and prevent such problems. Is this a wrong
>>> assumption?
>>
>>> Will your patches allow me to use the cache again, or will they
>>> prevent the metadata from breaking in the first place?
>>
>> I am still looking into how such a problem happens. Once I have a
>> fix, I will let you know.
>>
>> Thanks.
>>
>> Coly Li

--
Coly Li