On 2019/3/22 2:14 PM, Coly Li wrote:
> On 2019/3/21 10:16 PM, Coly Li wrote:
>> Hi Junhui,
>>
>> Now I am able to understand your patch. Yes, this patch may fix one of
>> the conditions under which a jset gets lost.
>>
>> We should have this fix in v5.1; I will handle the format issue. And if
>> you don't mind, I may re-compose the commit log to explain what exactly
>> is fixed.
>
> Hi Junhui,
>
> While reviewing the patch, I found one point I still do not
> understand; could you please give me more hints?
>
> From your commit log, "so when we doing replay, journals from
> last_seq_wrote to last_seq_now are missing." Can you show me the code
> that explains how such a condition happens?

Aha, I realize this is for the discard-enabled condition. Hmm, but discard
is disabled by default.

Hi Dennis,

Is discard enabled in your environment, or is it just left at the default
(disabled)?

Thanks.

Coly Li

>
>> On 2019/3/21 7:04 PM, Junhui Tang wrote:
>>> I met this bug and sent a patch before.
>>> Please have a try with this patch:
>>>
>>> https://www.spinics.net/lists/linux-bcache/msg06555.html
>>>
>>> From: Tang Junhui <tang.junhui.linux@xxxxxxxxx>
>>> Date: Wed, 12 Sep 2018 04:42:14 +0800
>>> Subject: [PATCH] bcache: fix failure in journal replay
>>>
>>> Journal replay failed with these messages:
>>> Sep 10 19:10:43 ceph kernel: bcache: error on
>>> bb379a64-e44e-4812-b91d-a5599871a3b1: bcache: journal entries
>>> 2057493-2057567 missing! (replaying 2057493-2076601), disabling
>>> caching
>>>
>>> The reason is that in journal_reclaim() we send the discard command
>>> and reclaim those journal buckets whose seq is older than
>>> last_seq_now; but if the machine restarts before we write a journal
>>> entry carrying last_seq_now, that journal entry never reaches the
>>> journal bucket, and the last_seq_wrote in the newest on-disk journal
>>> is older than the last_seq_now we expect it to be. So when we do the
>>> replay, the journal entries from last_seq_wrote to last_seq_now are
>>> missing.
>>>
>>> It's hard to write a journal entry immediately after journal_reclaim(),
>>> and it is harmless if those missed journal entries were caused by
>>> discarding, since their contents have already been written to btree
>>> nodes. So, if the missing seqs start from the beginning of the journal,
>>> we treat it as normal, only print a message reporting the missing
>>> entries, and point out that it may be caused by discarding.
>>>
>>> Signed-off-by: Tang Junhui <tang.junhui.linux@xxxxxxxxx>
>>> ---
>>>  drivers/md/bcache/journal.c | 8 ++++++--
>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
>>> index 10748c6..9b4cd2e 100644
>>> --- a/drivers/md/bcache/journal.c
>>> +++ b/drivers/md/bcache/journal.c
>>> @@ -328,9 +328,13 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
>>>  	list_for_each_entry(i, list, list) {
>>>  		BUG_ON(i->pin && atomic_read(i->pin) != 1);
>>>
>>> -		cache_set_err_on(n != i->j.seq, s,
>>> -"bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)",
>>> +		if (n != i->j.seq && n == start)
>>> +			pr_info("bcache: journal entries %llu-%llu may be discarded! (replaying %llu-%llu)",
>>>  				n, i->j.seq - 1, start, end);
>>> +		else
>>> +			cache_set_err_on(n != i->j.seq, s,
>>> +				"bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)",
>>> +				n, i->j.seq - 1, start, end);
>>>
>>>  		for (k = i->j.start;
>>>  		     k < bset_bkey_last(&i->j);
>>> --
>>> 1.8.3.1
>>>
>>>
>>> Coly Li <colyli@xxxxxxx> wrote on Thu, Mar 21, 2019 at 12:52 PM:
>>>
>>> On 2019/3/21 3:33 AM, Dennis Schridde wrote:
>>>> On Wednesday, 20 March 2019 12:16:29 CET, Coly Li wrote:
>>>>> On 2019/3/20 5:42 AM, Dennis Schridde wrote:
>>>>>> Hello!
>>>>>>
>>>>>> During boot my bcache device cannot be activated anymore and
>>>>>> hence the filesystem content is inaccessible.
>>>>>> It appears that
>>>>>> parts of the journal are corrupted, since dmesg says:
>>>>>> ```
>>>>>> bcache: register_bdev() registered backing device sda3
>>>>>> bcache: error on UUID: bcache: journal entries X-Y missing! (replaying X-Z), disabling caching
>>>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>>>> bcache: bch_btree_insert() error -5
>>>>>> bcache: bch_cached_dev_attach() Can't attach sda3: shutting down
>>>>>> bcache: register_cache() registered cache device nvme0n1
>>>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>>>> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
>>>>>> bcache: cache_set_free() Cache set UUID unregistered
>>>>>> ```
>>>>>>
>>>>>> UUID represents a UUID. X, Y, Z are integers, with X<Y<Z,
>>>>>> Y=X+12 and Z=Y+116.
>>>>>>
>>>>>> Error -5 is EIO, i.e. a generic I/O error. Is there a way to
>>>>>> get more information on where that error originates and what
>>>>>> exactly is broken? Did bcache just detect broken data, or is
>>>>>> the device itself broken? Which device, the HDD or the NVMe
>>>>>> SSD?
>>>>>>
>>>>>> Is there a way to recover from this without losing all data
>>>>>> on the drive? Is it maybe possible to just discard the
>>>>>> journal entries >X and return to the state the block device
>>>>>> was in at point X, losing only modifications after that point?
>>>>>>
>>>>>> Background: The situation appeared after my computer had been
>>>>>> running for a few hours and the screen stayed dark when I tried
>>>>>> to wake the monitor from standby.
>>>>>> The machine did not react to
>>>>>> NumLock or Ctrl+Alt+Del, so I issued a magic SysRq and tried
>>>>>> to safely reboot the machine by slowly typing REISUB. Sadly,
>>>>>> after this the machine ended up in the state described above.
>>>>>
>>>>> It seems some journal set was lost during bch_journal_replay()
>>>>> after the reboot, when the cache set started.
>>>>>
>>>>> During my testing of a journal deadlock fix, I also observed
>>>>> this issue. If I change the number of journal buckets from 256
>>>>> to 8, the problem can be observed on almost every reboot.
>>>>>
>>>>> This one is not fixed yet and I am currently working on it.
>>>>>
>>>>> What kernel version do you use? I thought this issue was only
>>>>> introduced by my current changes, but from your report it seems
>>>>> such a problem happens in the upstream kernel as well.
>>>
>>>> I was using Linux 5.0.2 (with Gentoo patches, which are minimal,
>>>> AFAIK).
>>>>
>>>> I would have expected that S and/or U in REISUB would write all
>>>> bcache metadata to disk and prevent such problems. Is this a
>>>> wrong assumption?
>>>>
>>>> Will your patches allow me to use the cache again, or will they
>>>> prevent the metadata from breaking in the first place?
>>>
>>> Now I am still looking for the reason why such a problem happens.
>>> Once I have a fix, I will let you know.
>>>
>>> Thanks.
>>>
>>> Coly Li

-- 
Coly Li