Re: Recover from "journal entries X-Y missing! (replaying X-Z)", "IO error on writing btree."

Coly Li <colyli@xxxxxxx> · Wed, 20 Mar 2019 19:16:29 +0800

On 2019/3/20 5:42 AM, Dennis Schridde wrote:
> Hello!
> 
> During boot my bcache device cannot be activated anymore and hence 
> the filesystem content is inaccessible.  It appears that parts of 
> the journal are corrupted, since dmesg says:
> ```
> bcache: register_bdev() registered backing device sda3
> bcache: error on UUID: bcache: journal entries X-Y missing! (replaying X-Z) , disabling caching
> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
> bcache: bch_btree_insert() error -5
> bcache: bch_cached_dev_attach() Can't attach sda3: shutting down
> bcache: register_cache() registered cache device nvme0n1
> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
> bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
> bcache: cache_set_free() Cache set UUID unregistered
> ```
> 
> UUID represents a UUID.  X, Y, Z are integers, with X<Y<Z, Y=X+12 
> and Z=Y+116.
> 
> Error -5 is EIO, i.e. a generic I/O error.  Is there a way to get 
> more information on where that error originates from and what 
> exactly is broken? Did bcache just detect broken data, or is the 
> device itself broken?  Which device, the HDD or the NVMe SSD?
> 
> Is there a way to recover from this without losing all data on
> the drive?  Is it maybe possible to just discard the journal
> entries >X and return to the state the block device was at point X,
> losing only modifications after that point?
> 
> Background: The situation appeared after my computer had been
> running for a few hours and the screen stayed dark when I tried to
> wake the monitor from standby.  The machine did not react to NumLock
> or Ctrl+Alt+Del, so I issued a magic SysRq and tried to safely
> reboot the machine by slowly typing REISUB. Sadly, after this the
> machine ended up in the state described above.

It seems that a journal set was lost during bch_journal_replay(), which
runs when the cache set is started after the reboot.
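
For illustration only, the gap check behind the "journal entries X-Y
missing! (replaying X-Z)" message can be modeled roughly as below. This
is a simplified Python sketch, not the kernel code: the function name
find_missing_journal_entries and the sample sequence numbers are
hypothetical, and the only assumption is that replay expects the journal
sequence numbers to be contiguous from the oldest entry it needs.

```python
from typing import Iterable, List, Tuple


def find_missing_journal_entries(
    last_seq: int, seqs_on_disk: Iterable[int]
) -> List[Tuple[int, int]]:
    """Return (first_missing, last_missing) for each gap in the sequence.

    last_seq      oldest sequence number that still has to be replayed
    seqs_on_disk  sequence numbers of the journal sets actually found
    """
    gaps = []
    expected = last_seq
    for seq in sorted(set(seqs_on_disk)):
        if seq < expected:
            # Older than anything we need to replay; ignore it.
            continue
        if seq > expected:
            # Everything from `expected` up to `seq - 1` was not found.
            gaps.append((expected, seq - 1))
        expected = seq + 1
    return gaps


# Hypothetical numbers matching the report: Y = X + 12, Z = Y + 116.
X = 1000
Y = X + 12
Z = Y + 116

# Only the entries after Y were found, so X..Y are reported as missing.
found = range(Y + 1, Z + 1)
for first, last in find_missing_journal_entries(X, found):
    print(f"journal entries {first}-{last} missing! (replaying {X}-{Z})")
```

With these sample values the sketch reports one gap, X to Y, which is
the same shape as the message in the report: the oldest 13 entries that
replay needs are the ones that could not be found.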

While testing a journal deadlock fix, I observed this issue as well.
When I change the number of journal buckets from 256 to 8, the problem
can be observed on almost every reboot.

This one is not fixed yet and I am currently working on it.

What kernel version do you use?  I thought this issue was only
introduced by my current changes, but from your report it seems the
problem happens in the upstream kernel as well.

Thanks.

-- 

Coly Li