On Tue, Nov 14, 2017 at 10:25 AM, Nix <nix@xxxxxxxxxxxxx> wrote:
> [ 11.497914] bad checksum at bucket 28262, block 0, 36185 keys

That's no good-- shouldn't have checksum errors. It means either the
metadata we wrote got corrupted by the disk, or a metadata write didn't
happen in the order we requested.

> Reboots with the cache enabled always featured a message from bcache an
> instant before reboot saying it had timed out: from the code, the
> timeout is based on a (short!) delay without any concern for whether,
> say, the SSD is in the middle of writing a bunch of data, and the delay
> is way too short for the SSD in question (an ATA-connected DC3510) to
> write more than a GiB or so, a small fraction of the 350GiB I have
> devoted to bcache.

I've seen things hit this couple second timeout before. It basically
means that garbage collection is busy analyzing stuff on the disk and
doesn't get around to checking the "should I exit now?" flag in time.
Not ideal but relatively harmless. (It's not trying to write back the
dirty data at this phase or anything).

> I note that the SMART data's bus reset count on the SSD suggests that
> rebooting resets the bus as part of POST (the count of bus resets is
> identical to the count of OS reboots plus firmware upgrades from the
> IPMI event log), which likely halts any ongoing writes.

Even if it did, as long as acknowledged IO is written it's OK. That is,
it's OK for anything we're trying to write to be lost, as long as the
drive hasn't told us it's done and then later that write gets "undone".
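
To make that rule concrete, here is a toy Python model. It is not bcache
code: the ToyDrive class, its methods, and acked_writes_survived() are
all made up for illustration, and the "drive" is just a pair of dicts.
It only shows which losses are acceptable (writes still in flight at a
bus reset) and which are not (a write the drive acknowledged and then
lost).

```python
class ToyDrive:
    """Hypothetical model of a drive; not a real bcache or kernel API."""

    def __init__(self):
        self.media = {}      # blocks that are actually durable
        self.in_flight = {}  # submitted, not yet acknowledged
        self.acked = {}      # what the drive has told us is done

    def submit(self, block, data):
        # The write has been issued but nothing is promised yet.
        self.in_flight[block] = data

    def complete(self, block):
        # A well-behaved drive persists the data, then acknowledges it.
        data = self.in_flight.pop(block)
        self.media[block] = data
        self.acked[block] = data

    def bus_reset(self):
        # Anything still in flight is simply dropped.
        self.in_flight.clear()

    def undo(self, block):
        # Misbehaving drive: an acknowledged write disappears afterwards.
        self.media.pop(block, None)


def acked_writes_survived(drive):
    # The invariant: every acknowledged write must still be on the media.
    return all(drive.media.get(b) == d for b, d in drive.acked.items())


drive = ToyDrive()
drive.submit(1, b"journal entry")
drive.complete(1)                    # acknowledged
drive.submit(2, b"btree node")       # never acknowledged
drive.bus_reset()                    # block 2 is lost, which is fine
print(acked_writes_survived(drive))  # True

drive.undo(1)                        # acknowledged write gets "undone"
print(acked_writes_survived(drive))  # False
```

Losing block 2 leaves the invariant intact; only the "undo" of block 1
breaks it, and that is the case that would show up as corruption.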

I think there has to be something somewhat unique to your environment--
at an environment I used to administrate (before working on bcache
myself), there were about 100 bcache roots in writeback mode-- and we
both unceremoniously lost power with active workload a couple of times
and did several clean shutdowns for upgrades without losing a volume to
corruption (though we did lose many disks that didn't feel like working
at all again after power failure).

And now I have a bad arc-fault circuit breaker in my home that has
dumped power on my two ext4 root-on bcache-on md machines three times in
the past couple weeks without issue. Each of my production machines has
15 unsafe shutdowns in smartctl -- a number that I can't quite explain
because I think the real number should be 7-8 or so... and my bcache
development test rig has 145 (!).

Mike