Re: bcache writeback infinite loop?

On 10/29/2019 1:19 AM, Coly Li wrote:
On 2019/10/28 11:14 PM, Larkin Lowrey wrote:
On 10/27/2019 11:01 PM, Coly Li wrote:
On 2019/10/25 7:22 AM, Eric Wheeler wrote:
On Wed, 23 Oct 2019, Larkin Lowrey wrote:
I have a backing device that is constantly writing due to bcache_writeback. It
has been at 14.3 MB dirty all day and has not changed. There's nothing else
writing to it.

This started after I "upgraded" from Fedora 29 to 30, and consequently from
kernel 5.2.18 to 5.3.6.

You can see from the info below that the writeback process is chewing up A LOT
of CPU and writing constantly at ~7 MB/s. It sure looks like it's in an
infinite loop, writing the same data over and over. At least I hope that's the
case and it's not just filling the array with garbage.

This configuration has been stable for many years and across many Fedora
upgrades. The host has ECC memory, so RAM corruption should not be a concern.
I have not had any recent controller or drive failures.

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  5915 root      20   0       0      0      0 R  94.1   0.0 891:45.42 bcache_writebac


Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
md3              0.02 1478.77      0.00      6.90     0.00     0.00   0.00  0.00    0.00    0.00   0.00     7.72     4.78   0.00   0.00
md3              0.00 1600.00      0.00      7.48     0.00     0.00   0.00  0.00    0.00    0.00   0.00     0.00     4.79   0.00   0.00
md3              0.00 1300.00      0.00      6.07     0.00     0.00   0.00  0.00    0.00    0.00   0.00     0.00     4.78   0.00   0.00
md3              0.00 1500.00      0.00      7.00     0.00     0.00   0.00  0.00    0.00    0.00   0.00     0.00     4.78   0.00   0.00

--- bcache ---
UUID                        dc2877bc-d1b3-43fa-9f15-cad018e73bf6
Block Size                  512 B
Bucket Size                 512.00 KiB
Congested?                  False
Read Congestion             2.0ms
Write Congestion            20.0ms
Total Cache Size            128 GiB
Total Cache Used            128 GiB     (100%)
Total Cache Unused          0 B (0%)
Evictable Cache             127 GiB     (99%)
Replacement Policy          [lru] fifo random
Cache Mode                  writethrough [writeback] writearound none
Total Hits                  49872       (97%)
Total Misses                1291
Total Bypass Hits           659 (77%)
Total Bypass Misses         189
Total Bypassed              5.8 MiB
--- Cache Device ---
    Device File               /dev/dm-3 (253:3)
    Size                      128 GiB
    Block Size                512 B
    Bucket Size               512.00 KiB
    Replacement Policy        [lru] fifo random
    Discard?                  False
    I/O Errors                0
    Metadata Written          5.0 MiB
    Data Written              86.1 MiB
    Buckets                   262144
    Cache Used                128 GiB     (100%)
    Cache Unused              0 B (0%)
--- Backing Device ---
    Device File               /dev/md3 (9:3)
    bcache Device File        /dev/bcache0 (252:0)
    Size                      73 TiB
    Cache Mode                writethrough [writeback] writearound none
    Readahead                 0.0k
    Sequential Cutoff         4.0 MiB
    Merge sequential?         False
    State                     dirty
    Writeback?                True
    Dirty Data                14.3 MiB
    Total Hits                49872       (97%)
    Total Misses              1291
    Total Bypass Hits         659 (77%)
    Total Bypass Misses       189
    Total Bypassed            5.8 MiB

I have not tried reverting back to an earlier kernel. I'm concerned about
possible corruption. Is that safe? Any other suggestions as to how to debug
and/or resolve this issue?
I don't think there have been any on-disk format changes; reverting should be
safe.

Coly?  Can you confirm?
There is no on-disk format change for now. It should be safe to revert
to previous version.

Reverting to 5.2.18 didn't make a difference, nor did moving forward to 5.3.7.

I see, so this might not be a regression in Linux v5.3 bcache code.

I noticed that on each reboot the point where it got stuck changed. It
had been stuck at 14.3MiB dirty then a couple of reboots later it was
down to 10.7MiB. For example...

On reboot, the dirty was 42.9MiB and proceeded to shrink until it got
stuck at 10.7MiB. I then rebooted and it was 44.9MiB and shrank to
10.0MiB. I rebooted again, and got unlucky, it started at 45.4MiB and
got stuck at 10.3MiB (an increase from the prior run). I assume that
mounting and unmounting the fs does generate some small amount of writes
which may explain the differences.
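(The dirty figure being watched across these reboots is also exposed in sysfs, so it can be polled without bcache-status. A minimal sketch, assuming the bcache device is /dev/bcache0; the sysfs path is the standard bcache attribute, not something quoted in this thread:

```shell
#!/bin/sh
# Hypothetical sketch: poll the writeback dirty-data counter for a
# bcache device so you can see whether writeback is making progress.

dirty_file=/sys/block/bcache0/bcache/dirty_data

read_dirty() {
    # Print the current dirty-data figure (e.g. "14.3M") from the
    # sysfs file passed as $1.
    cat "$1"
}

# Poll a few times, but only when the device is actually present.
if [ -e "$dirty_file" ]; then
    for _ in 1 2 3; do
        printf '%s  dirty=%s\n' "$(date +%T)" "$(read_dirty "$dirty_file")"
        sleep 5
    done
fi
```

If the printed figure never changes while bcache_writeback burns CPU, that matches the stuck behavior described above.)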

The cache device is a raid5 of SSDs and I've not had any unclean
shutdowns so I would not expect data corruption there. A raid check ran
yesterday with no errors.
Oh, since the cache is on a raid5 of SSDs, it is less likely that an SSD is
the problem (if you don't find media errors on the SSDs). Then it might be a
corrupted bcache B+tree.

In Linux v5.3 there are many fixes related to the bcache journal and internal
B+tree node flushing. So it is possible that a hidden problem corrupted an
internal B+tree node (e.g. a B+tree node was not flushed onto the SSD in
time).

So far there is no user-space utility to check the consistency of the bcache
B+tree, so I am not able to tell the exact reason why the B+tree node is
corrupted (this can be queued on my long to-do list).

I suggest running fsck without bcache to verify the data consistency of the
backing device, and then rebuilding a new cache set with Linux v5.3. Since
Linux v5.3 bcache is more stable; it can survive 12+ hours on my testing
machine (the Linux 5.2 kernel lasts around 40 minutes and then panics due to
a race bug).
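(A sketch of that recovery path, using the device names and cache-set UUID from earlier in this thread; the exact command sequence is an assumption, not Coly's literal instructions, and the filesystem must be unmounted first:

```shell
#!/bin/sh
# Hedged sketch: detach the cache so the backing device runs
# cache-less, fsck it, then build a fresh cache set.

sysfs_write() {
    # bcache is driven by writing values into sysfs control files;
    # $1 = value, $2 = sysfs file.
    echo "$1" > "$2"
}

DETACH=/sys/block/bcache0/bcache/detach
CSET=dc2877bc-d1b3-43fa-9f15-cad018e73bf6

if [ -e "$DETACH" ]; then
    # 1. Detach the backing device from the cache set; writeback must
    #    drain (state becomes "clean") before the detach completes.
    sysfs_write "$CSET" "$DETACH"

    # 2. Check the filesystem on the now cache-less bcache device.
    fsck -f /dev/bcache0

    # 3. Re-create the cache set on the SSD array (/dev/dm-3 in this
    #    thread) -- this wipes the old cache contents -- then attach
    #    again using the new cache-set UUID that make-bcache prints.
    make-bcache -C /dev/dm-3
fi
```

)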


I did a scrub with bcache running and 19 errors were found and corrected using duplicate metadata. That seems encouraging.

Unfortunately, I can't seem to shut down bcache in order to test as you suggest. I can stop bcache0, but I am unable to stop the cache device. I do the usual:

echo 1 > /sys/fs/bcache/dc2877bc-d1b3-43fa-9f15-cad018e73bf6/stop

... and nothing happens. I assume that's because it can't do a clean shutdown. Is there any other way to unload bcache?
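(The cache-set sysfs directory also exposes an unregister trigger alongside stop; whether it completes when stop already hangs is an open question, but it is the other standard knob. A minimal sketch, using the cache-set UUID from this thread:

```shell
#!/bin/sh
# Sketch: try to force-drop a hung cache set via its sysfs
# "unregister" file, then stop the bcache device itself.

poke() {
    # Write 1 into a bcache sysfs trigger file, if it exists.
    if [ -e "$1" ]; then
        echo 1 > "$1"
    fi
}

CSET_DIR=/sys/fs/bcache/dc2877bc-d1b3-43fa-9f15-cad018e73bf6

poke "$CSET_DIR/unregister"           # drop the cache set without a clean stop
poke /sys/block/bcache0/bcache/stop   # then stop the bcache device
```

)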

My alternative is to dig up a bootable usb drive that doesn't auto-start bcache. So far, all of the boot images I've tried init bcache automatically.

--Larkin


