Hi all,

Looks like I finally found myself with a serious Ceph explosion on my home cluster.

Executive summary:

Two days ago this cluster got upgraded under "exciting" circumstances (which I believe are ultimately irrelevant). This resulted in all OSDs that had been deployed when the cluster was built corrupting themselves, increasingly over time. The four OSDs that had been deployed recently are fine. I believe there is a bug that causes allocator/freelist corruption on OSDs with a specific history - in particular, the corrupted ones had seen a bdev expansion in the past and had been through prior upgrades. The bug causes data to be clobbered, which in turn causes RocksDB failures.

Full story:

This is a single-host home cluster that I use as a NAS and for testing purposes. It has:

- 8 HDD OSDs (0-7)
- 3 SSD OSDs (8-10)

The HDD OSDs were deployed a couple of years ago and have been through a bdev expansion. At some point there was a regression with a new allocator on devices that had undergone such an expansion, causing assert failures, so I have had this in ceph.conf as a workaround:

[osd]
bluestore allocator = bitmap
bluefs allocator = bitmap

The SSD OSDs were deployed recently. In addition, an HDD failed and was recently replaced. The cluster was healthy (data recovery had long since completed) and was in the process of rebalancing onto the freshly replaced disk. There are only two pools of note: a size 3 cephfs metadata pool on the SSDs and an RS 5,2 erasure-coded pool on the HDDs for cephfs data. Other than the past HDD failure, there is no evidence of any hardware issues during this whole ordeal (no SMART/IO errors, etc.).

The cluster was happily running 16.2.5. I had a pending upgrade to 17.2.0, so I had updated the package but was waiting until the rebalance was done before restarting the daemons.

Two days ago, in the middle of the night with the rebalance still in progress (but probably nearly complete?), the machine ended up in an OOM tailspin that ground everything to a halt and eventually caused OSDs to kill themselves. Some auto-restarted with the new daemon version. The machine didn't recover from the OOM tailspin, so in the morning I hard reset it (front reset button).

When the machine came back, 3 OSDs were failing to start. Two were asserting on startup due to RocksDB corruption:

ceph-17.2.0/src/kv/RocksDBStore.cc: 1863: ceph_abort_msg("block checksum mismatch: stored = 3040506372, computed = 1760649055 in db/311946.sst offset 49152517 size 4139")

The corrupted sst was generated that morning, soon after the machine restart:

2022-06-28T11:04:00.356+0900 7f2adb513640 4 rocksdb: [db/compaction/compaction_job.cc:1426] [default] [JOB 3] Generated table #311946: 228111 keys, 56950412 bytes

(These logs are from osd.0; the story is similar for osd.1.)

The third OSD did come back up, and I was hoping to be able to recover from a 2-OSD loss. However, I found that other OSDs would randomly die during RocksDB compaction with the same errors. At this point I thought the problem was latent RocksDB corruption, so I started using ceph-bluestore-tool to export bluefs and take a look at the files (roughly as sketched below). One pattern I found is that the corrupted ssts all seemed to have long strings of 'aaaaaaaaa' (ASCII) where the corruption started, and no other ssts had that kind of data. Eventually I took down all the other OSDs, since I started suspecting that even minor writes were causing further damage. At this point I still hoped it was a RocksDB issue and that I could somehow pull the data out and rescue the cluster.
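For reference, the export-and-inspect step was along these lines (a rough sketch rather than my exact commands; osd.0 as the example, and the output directory and the strings/grep scan are purely illustrative):

  # Export the bluefs contents (db/*.sst etc.) of a stopped OSD to a directory
  ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-0 --out-dir /root/osd0-bluefs

  # Flag sst files containing suspiciously long runs of ASCII 'a'
  for f in /root/osd0-bluefs/db/*.sst; do
      strings -n 64 "$f" | grep -q 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' && echo "$f"
  done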
Since the failures usually occurred during RocksDB compaction for the other OSDs, I was hoping I could perhaps disable compaction (this was mentioned in a past discussion) and get the OSDs up long enough to recover things and migrate the data to new OSDs, so I ordered some new HDDs yesterday. Meanwhile, while trying to do minimal experiments, even simple osd --flush-journal and ceph-kvstore-tool invocations eventually led me to suspect that every single thing I did was causing even more corruption. At one point osd.5 also started failing to start up with a corrupted superblock.

I suspected the allocator, so I tried qfsck, and sure enough:

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 qfsck
qfsck bluestore.quick-fsck
bluestore::NCB::operator()::(2)compare_allocators:: spillover
bluestore::NCB::compare_allocators::mismatch:: idx1=477259 idx2=477260
bluestore::NCB::read_allocation_from_drive_for_bluestore_tool::FAILURE. Allocator from file and allocator from metadata differ::ret=-1

All OSDs *except* 3, 8, 9 and 10 fail qfsck. Those are the four that were deployed recently and have not undergone a bdev expansion nor been through any previous upgrades.

At this point, is it fair to say that if the allocator/freelist management is borked, there's probably no way to recover the data on these OSDs? I imagine unpredictable corruption is likely to have happened on all of them... I do have a backup of most of the data in the cluster, so I'm ready to call it a loss and rebuild it if need be.

One remaining question is whether my bitmap allocator setting in ceph.conf had anything to do with the problem. It certainly didn't break the 3 new OSDs, but I can't say whether the problem would've happened had that config been absent.

Although this upgrade happened under exciting, uncontrolled OOM circumstances, at this point I think that's a red herring. I think there's a bug in the new Quincy NCB code that causes freelist corruption on OSDs that have (most likely) undergone a bdev expansion in the past, or that share some other distinguishing latent feature of my older OSDs. I get the feeling that the corruption slowly happens/gets worse across restarts (maybe the first boot on the new version is fine and only committing/reloading freelist data ends up causing trouble?); I'm not even sure I would have noticed the problem during a controlled, rolling upgrade until it was too late. So if I'm right about this, this is a significant footgun bug that will probably hit other people... :(

The qfsck errors are often of this form:

idx1=477259 idx2=477260
idx1=654489 idx2=654490
idx1=1773104 idx2=1773105
idx1=1547932 idx2=1547933

I.e. off by one. I seem to recall that back when I ran into the allocator issue with bdev-expanded OSDs, the problem was that the expansion created a freelist entry that was zero-sized or in some other degenerate condition, and that the new allocator didn't handle it properly. Random shot in the dark: perhaps encountering such a block causes an off-by-one condition in the new allocator when migrating, which ends up producing a bad freelist?
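As an aside, if anyone else wants to check whether their OSDs are affected, the check I'm running is just that qfsck invocation in a loop over the OSD data directories, with the daemons stopped. A minimal sketch, assuming the standard /var/lib/ceph/osd/ceph-$i paths and my OSD IDs:

  # Run the Quincy quick-fsck allocator check against each (stopped) OSD
  for i in $(seq 0 10); do
      echo "=== osd.$i ==="
      ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-$i qfsck \
          || echo "osd.$i: allocator mismatch reported"
  done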
Back to the qfsck output: two remaining failure cases are not off by one:

bluestore::NCB::compare_allocators::Failed memcmp(arr1, arr2, sizeof(extent_t)*idx2)
bluestore::NCB::compare_allocators::!!!![1205426] arr1::<7146797858816,27721728>
bluestore::NCB::compare_allocators::!!!![1205426] arr2::<7146797858816,25624576>

And:

bluestore::NCB::compare_allocators::!!!![1889072] arr1::<7146820468736,5111808>
bluestore::NCB::compare_allocators::!!!![1889072] arr2::<7146820468736,3014656>

(Interestingly, in both of these the extent length in arr1 is exactly 2097152 bytes - 2 MiB - larger than in arr2.)

(That makes 6; of the remaining 2 HDD OSDs, one is too dead to even start qfsck since reading the superblock triggers a RocksDB checksum error, and the other is the good, recently deployed one.)

I'd like to know whether there's any realistic chance of data recovery at this point, or anything else worth trying, or whether it's a lost cause and I should just rebuild the cluster and restore from backup (which will take a while over the internet...). Additionally, I'm happy to do more forensics on these OSDs to try to track down the bug. I have some new HDDs, so I can at least swap out a few of them and keep the old ones untouched for later.

-- 
Hector Martin (marcan@xxxxxxxxx)
Public Key: https://mrcn.st/pub