Hi all,

Looks like I finally found myself with a serious Ceph explosion on my home cluster.

Executive summary:

Two days ago this cluster got upgraded under "exciting" circumstances (which I believe are ultimately irrelevant). This resulted in all OSDs that had been deployed when the cluster was built corrupting themselves, increasingly over time. The four OSDs that had been deployed recently are fine. I believe there is a bug that causes allocator/freelist corruption on OSDs with a specific history - in particular, the corrupted ones had seen a bdev expansion in the past and had been through prior upgrades. The bug causes data to be clobbered, which in turn causes RocksDB failures.

Full story:

This is a single-host home cluster that I use as a NAS and for testing purposes. It has:

- 8 HDD OSDs (0-7)
- 3 SSD OSDs (8-10)

The HDD OSDs were deployed a couple of years ago and have been through a bdev expansion. At some point there was a regression with a new allocator on devices that had undergone such an expansion, causing assert failures, so I have had this in ceph.conf as a workaround:

[osd]
bluestore allocator = bitmap
bluefs allocator = bitmap

The SSD OSDs were deployed recently. In addition, an HDD failed and was recently replaced. The cluster was healthy (data recovery had long since completed) and was in the process of rebalancing onto the freshly replaced disk. There are only two pools of note: a size 3 cephfs metadata pool on the SSDs and an RS 5,2 erasure-coded pool on the HDDs for cephfs data. Other than the past HDD failure, there is no evidence of any hardware issues during this whole ordeal (no SMART/IO errors, etc.).

The cluster was happily running 16.2.5. I had a pending upgrade to 17.2.0, so I had updated the package but was waiting until the rebalance was done before restarting the daemons.

Two days ago, in the middle of the night with the rebalance still in progress (but probably nearly complete?), the machine ended up in an OOM tailspin that ground everything to a halt and eventually caused OSDs to kill themselves. Some auto-restarted with the new daemon version. The machine didn't recover from the OOM tailspin, so in the morning I hard reset it (front reset button).

When the machine came back, 3 OSDs were failing to start. Two were asserting on startup due to RocksDB corruption:

ceph-17.2.0/src/kv/RocksDBStore.cc: 1863: ceph_abort_msg("block checksum mismatch: stored = 3040506372, computed = 1760649055 in db/311946.sst offset 49152517 size 4139")

The corrupted sst was generated that morning, soon after the machine restart:

2022-06-28T11:04:00.356+0900 7f2adb513640 4 rocksdb: [db/compaction/compaction_job.cc:1426] [default] [JOB 3] Generated table #311946: 228111 keys, 56950412 bytes

(These logs are from osd.0; the story is similar for osd.1.)

The third OSD did come back up, and I was hoping to be able to recover from a 2-OSD loss. However, I found that other OSDs would randomly die during RocksDB compaction with the same errors. At this point I thought the problem was latent RocksDB corruption, so I started using ceph-bluestore-tool to export bluefs and take a look at the files (roughly as sketched below). One pattern I found is that the corrupted ssts all seemed to have long strings of 'aaaaaaaaa' (ASCII) where the corruption started, and no other ssts had that kind of data. Eventually I took down all the other OSDs, since I started suspecting that even minor writes were causing further damage. At this point I still hoped it was a RocksDB issue and that I could somehow pull the data out and rescue the cluster.
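For reference, the export-and-inspect step was along these lines (a rough sketch rather than my exact commands; osd.0 as the example, and the output directory and the strings/grep scan are purely illustrative):

  # Export the bluefs contents (db/*.sst etc.) of a stopped OSD to a directory
  ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-0 --out-dir /root/osd0-bluefs

  # Flag sst files containing suspiciously long runs of ASCII 'a'
  for f in /root/osd0-bluefs/db/*.sst; do
      strings -n 64 "$f" | grep -q 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' && echo "$f"
  done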
Since the failures usually occurred during RocksDB compaction for the other OSDs, I was hoping I could perhaps disable compaction (this was mentioned in a past discussion) and get the OSDs up long enough to recover things and migrate the data to new OSDs, so I ordered some new HDDs yesterday. Meanwhile, while trying to do minimal experiments, even simple osd --flush-journal and ceph-kvstore-tool invocations eventually led me to suspect that every single thing I did was causing even more corruption. At one point osd.5 also started failing to start up with a corrupted superblock.

I suspected the allocator, so I tried qfsck, and sure enough:

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 qfsck
qfsck bluestore.quick-fsck
bluestore::NCB::operator()::(2)compare_allocators:: spillover
bluestore::NCB::compare_allocators::mismatch:: idx1=477259 idx2=477260
bluestore::NCB::read_allocation_from_drive_for_bluestore_tool::FAILURE. Allocator from file and allocator from metadata differ::ret=-1

All OSDs *except* 3, 8, 9 and 10 fail qfsck. Those are the four that were deployed recently and have not undergone a bdev expansion nor been through any previous upgrades.

At this point, is it fair to say that if the allocator/freelist management is borked, there's probably no way to recover the data on these OSDs? I imagine unpredictable corruption is likely to have happened on all of them... I do have a backup of most of the data in the cluster, so I'm ready to call it a loss and rebuild it if need be.

One remaining question is whether my bitmap allocator setting in ceph.conf had anything to do with the problem. It certainly didn't break the 3 new OSDs, but I can't say whether the problem would've happened had that config been absent.

Although this upgrade happened under exciting, uncontrolled OOM circumstances, at this point I think that's a red herring. I think there's a bug in the new Quincy NCB code that causes freelist corruption on OSDs that have (most likely) undergone a bdev expansion in the past, or that share some other distinguishing latent feature of my older OSDs. I get the feeling that the corruption slowly happens/gets worse across restarts (maybe the first boot on the new version is fine and only committing/reloading freelist data ends up causing trouble?); I'm not even sure I would have noticed the problem during a controlled, rolling upgrade until it was too late. So if I'm right about this, this is a significant footgun bug that will probably hit other people... :(

The qfsck errors are often of this form:

idx1=477259 idx2=477260
idx1=654489 idx2=654490
idx1=1773104 idx2=1773105
idx1=1547932 idx2=1547933

I.e. off by one. I seem to recall that back when I ran into the allocator issue with bdev-expanded OSDs, the problem was that the expansion created a freelist entry that was zero-sized or in some other degenerate condition, and that the new allocator didn't handle it properly. Random shot in the dark: perhaps encountering such a block causes an off-by-one condition in the new allocator when migrating, which ends up producing a bad freelist?
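As an aside, if anyone else wants to check whether their OSDs are affected, the check I'm running is just that qfsck invocation in a loop over the OSD data directories, with the daemons stopped. A minimal sketch, assuming the standard /var/lib/ceph/osd/ceph-$i paths and my OSD IDs:

  # Run the Quincy quick-fsck allocator check against each (stopped) OSD
  for i in $(seq 0 10); do
      echo "=== osd.$i ==="
      ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-$i qfsck \
          || echo "osd.$i: allocator mismatch reported"
  done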
Back to the qfsck output: two remaining failure cases are not off by one:

bluestore::NCB::compare_allocators::Failed memcmp(arr1, arr2, sizeof(extent_t)*idx2)
bluestore::NCB::compare_allocators::!!!![1205426] arr1::<7146797858816,27721728>
bluestore::NCB::compare_allocators::!!!![1205426] arr2::<7146797858816,25624576>

And:

bluestore::NCB::compare_allocators::!!!![1889072] arr1::<7146820468736,5111808>
bluestore::NCB::compare_allocators::!!!![1889072] arr2::<7146820468736,3014656>

(Interestingly, in both of these the extent length in arr1 is exactly 2097152 bytes - 2 MiB - larger than in arr2.)

(That makes 6; of the remaining 2 HDD OSDs, one is too dead to even start qfsck since reading the superblock triggers a RocksDB checksum error, and the other is the good, recently deployed one.)

I'd like to know whether there's any realistic chance of data recovery at this point, or anything else worth trying, or whether it's a lost cause and I should just rebuild the cluster and restore from backup (which will take a while over the internet...). Additionally, I'm happy to do more forensics on these OSDs to try to track down the bug. I have some new HDDs, so I can at least swap out a few of them and keep the old ones untouched for later.

-- 
Hector Martin (marcan@xxxxxxxxx)
Public Key: https://mrcn.st/pub