On 14 Nov 2017, Michael Lyle outgrape:

> On Tue, Nov 14, 2017 at 5:27 AM, Nix <nix@xxxxxxxxxxxxx> wrote:
>> Every time I rebooted I got warnings that bcache couldn't clean up
>> in time, and I suspect this caused corruption in the end (fairly
>> fast, actually, less than a month after starting using bcache: it
>> had only just finished populating).
>
> What did you see? A message like this is normal:
>
> [    2.224767] bcache: bch_journal_replay() journal replay done, 432
> keys in 243 entries, seq 40691386
>
> but anything else is strange... If you were consistently seeing other
> messages that means something unusual was happening (an already
> badly-corrupted volume or bad hardware).

This registration code:

# Register all bcaches.
if [ -f /sys/fs/bcache/register_quiet ]; then
    # register_quiet suppresses kernel log noise for devices with no
    # bcache superblock, so we can throw every candidate at it.
    for name in /dev/sd*[0-9]* /dev/md/*; do
        echo "$name" > /sys/fs/bcache/register_quiet 2>/dev/null
    done
    # New devices registered: create their device nodes, after a
    # short delay to let the registration happen.
    sleep 1
    /sbin/mdev -s
fi

... did this (including the messages showing that the md array it's
caching is happy):

[   11.281907] md: md125 stopped.
[   11.294948] md/raid:md125: device sda3 operational as raid disk 0
[   11.305620] md/raid:md125: device sdf3 operational as raid disk 4
[   11.315899] md/raid:md125: device sdd3 operational as raid disk 3
[   11.325770] md/raid:md125: device sdc3 operational as raid disk 2
[   11.335245] md/raid:md125: device sdb3 operational as raid disk 1
[   11.344688] md/raid:md125: raid level 6 active with 5 out of 5 devices, algorithm 2
[   11.353810] md125: detected capacity change from 0 to 15761089757184
[   11.468956] bcache: prio_read() bad csum reading priorities
[   11.478010] bcache: prio_read() bad magic reading priorities
[   11.497911] bcache: error on 314dcdd2-9869-4110-99cc-9cd3a861afa6:
[   11.497914] bad checksum at bucket 28262, block 0, 36185 keys
[   11.507021] , disabling caching
[   11.529823] bcache: register_cache() registered cache device sde2
[   11.539054] bcache: cache_set_free() Cache set 314dcdd2-9869-4110-99cc-9cd3a861afa6 unregistered
[   11.558596] bcache: register_bdev() registered backing device md125

The hardware is fine (zero other problems ever encountered: the RAM is
all ECC and has had zero errors ever, and the disks and SSD are
otherwise faultless -- so far! -- and are in fairly heavy use for other
things, like the XFS and RAID journals, without incident). The cached
(XFS) volume was in use as my rootfs (not only / but also /usr/src and
/home: 4TiB, ~10% full) until the previous reboot. There had been a lot
of writes to the cache device because I'd only enabled the cache a week
before, rebooting several times after doing that as part of system
bringup.

Reboots with the cache enabled always featured a message from bcache an
instant before reboot saying it had timed out: from the code, the
timeout is a fixed (short!) delay with no concern for whether, say, the
SSD is in the middle of writing a bunch of data, and the delay is far
too short for the SSD in question (an ATA-connected DC3510) to write
more than a GiB or so, a small fraction of the 350GiB I have devoted to
bcache. I note that the SMART data's bus-reset count on the SSD
suggests that rebooting resets the bus as part of POST (the count of
bus resets is identical to the count of OS reboots plus firmware
upgrades recorded in the IPMI event log), which likely halts any
ongoing writes. I suspect this alone could explain the problem, but
it's all speculation.
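(For reference: the reset counts here come partly from the vendor
attribute table and partly from the ATA Device Statistics log.
Assuming smartmontools is installed and that the cache SSD is still
/dev/sde, as the dmesg above suggests, something like this dumps both:

# Everything smartctl knows about the drive, including the Device
# Statistics log where the reset counters live:
smartctl -x /dev/sde

# Or just the Device Statistics log on its own:
smartctl -l devstat /dev/sde

The figures quoted below came out of output like that.)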
However, SMART also says

0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion

which I believe suggests that this cannot be the problem. Indeed,

199 CRC_Error_Count         -OSRCK   100   100   000    -    0
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0

suggests zero problems. However, I do also see

174 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    17

(whatever *that* means. What defines a safe shutdown to Intel SSDs?
Search me.)

> I have not experienced bcache corruption yet, though that doesn't say
> there's not an issue. This doesn't sound at all like what Alexandr
> experienced, either.

Indeed not.

> There's not a bucket corruption message in the kernel, -- maybe you
> saw a bad btree message at bucket xxxxx, block xxx, dev xxxx?

dmesg above. Sorry, vague memory caused trouble there. I posted a
fuller description on 7th June, which got no response:
<https://www.spinics.net/lists/linux-bcache/msg04668.html>

> What SSD are you using? A known issue is that there are families of
> SSDs that do not do the right thing on shutdown -- e.g. some devices
> based around LSI/SandForce that do emergency-writeback-from-RAM that
> have underprovisioned / missing capacitors.

This is an Intel DC3510. I believe Intel's SSDs are about the only
ones that actually *work* reliably when powerfail happens (btw, Corsair
are rumoured to be even worse than those SandForce devices, sometimes
bricking the whole drive on powerfail! "Corsair" seems an appropriate
manufacturer name in that case). SMART says:

175 Power_Loss_Cap_Test     PO--CK   100   100   010    -    5350 (33 1413)

which I believe means "all is fine". isdct says that things are fine
too:

DeviceStatus : Healthy
EnduranceAnalyzer : 1102.31 years
LatencyTrackingEnabled : False
WriteCacheEnabled : True
WriteCacheReorderingStateEnabled : True
WriteCacheState : 1
WriteCacheSupported : True
WriteErrorRecoveryTimer : 0

(Hm, is it write caching that's doing it? Surely not, given that the
thing has capacitors.)

(In any case this machine has never experienced a power loss. :) )
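(For completeness: the isdct figures above are the sort of thing the
Intel SSD Data Center Tool prints. Assuming the drive is enumerated at
index 0 -- an assumption; check the listing first -- something like:

# List Intel SSDs and the indexes isdct assigns them:
isdct show -intelssd

# Dump all properties, DeviceStatus and EnduranceAnalyzer included,
# for the drive at index 0:
isdct show -a -intelssd 0

would reproduce it.)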