Too bad. Let me continue trying to invoke Cunningham's Law for you ... ;)

Have you excluded any possible hardware issues?

15.2.10 has a new option to check for all zero reads; maybe try it with true?

    Option("bluefs_check_for_zeros", Option::TYPE_BOOL, Option::LEVEL_DEV)
    .set_default(false)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("Check data read for suspicious pages")
    .set_long_description("Looks into data read to check if there is a 4K block entirely filled with zeros. "
                          "If this happens, we re-read data. If there is difference, we print error to log.")
    .add_see_also("bluestore_retry_disk_reads"),

The "fix zombie spanning blobs" feature was added in 15.2.9. Does 15.2.8 work for you?

Cheers, Dan

On Sun, Apr 11, 2021 at 10:17 PM Jonas Jelten <jelten@xxxxxxxxx> wrote:
>
> Thanks for the idea, I've tried it with 1 thread, and it shredded another OSD.
> I've updated the tracker ticket :)
>
> At least non-racecondition bugs are hopefully easier to spot...
>
> I wouldn't just disable the fsck and upgrade anyway until the cause is rooted out.
>
> -- Jonas
>
>
> On 29/03/2021 14.34, Dan van der Ster wrote:
> > Hi,
> >
> > Saw that, looks scary!
> >
> > I have no experience with that particular crash, but I was thinking
> > that if you have already backfilled the degraded PGs, and can afford
> > to try another OSD, you could try:
> >
> > "bluestore_fsck_quick_fix_threads": "1", # because
> > https://github.com/facebook/rocksdb/issues/5068 showed a similar crash
> > and the dev said it occurs because WriteBatch is not thread safe.
> >
> > "bluestore_fsck_quick_fix_on_mount": "false", # should disable the
> > fsck during upgrade. See https://github.com/ceph/ceph/pull/40198
> >
> > -- Dan
> >
> > On Mon, Mar 29, 2021 at 2:23 PM Jonas Jelten <jelten@xxxxxxxxx> wrote:
> >>
> >> Hi!
> >>
> >> After upgrading MONs and MGRs successfully, the first OSD host I upgraded on Ubuntu Bionic from 14.2.16 to 15.2.10
> >> shredded all OSDs on it by corrupting RocksDB, and they now refuse to boot.
> >> RocksDB complains "Corruption: unknown WriteBatch tag".
> >>
> >> The initial crash/corruption occurred when the automatic fsck was run, and when it committed the changes for a lot of "zombie spanning blobs".
> >>
> >> Tracker issue with logs: https://tracker.ceph.com/issues/50017
> >>
> >>
> >> Anyone else encountered this error? I've "suspended" the upgrade for now :)
> >>
> >> -- Jonas
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
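
For reference, the behaviour described in the bluefs_check_for_zeros long description at the top of this message (look for a 4K block that is entirely zeros, re-read, log if the two reads differ) could be sketched roughly as below. This is only an illustration of the idea, not the BlueFS code: the names kBlockSize, has_zero_block, ReadFn and read_with_zero_check are hypothetical, the read callback stands in for the real disk read, and keeping the second read's data is an assumption the option text does not state.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not taken from the Ceph sources.
constexpr std::size_t kBlockSize = 4096;

// Returns true if any 4K block of `buf` (counted from offset 0) is all zeros.
// A trailing partial block is ignored.
bool has_zero_block(const std::vector<unsigned char>& buf) {
  for (std::size_t off = 0; off + kBlockSize <= buf.size(); off += kBlockSize) {
    auto first = buf.begin() + static_cast<std::ptrdiff_t>(off);
    auto last = first + static_cast<std::ptrdiff_t>(kBlockSize);
    if (std::all_of(first, last, [](unsigned char b) { return b == 0; })) {
      return true;
    }
  }
  return false;
}

// Stand-in for "read this extent from disk".
using ReadFn = std::function<std::vector<unsigned char>()>;

// Reads once; if a zero block shows up, reads again and compares.
// Returns true when the two reads differ, after logging an error.
// Assumption: on a mismatch the second read is kept in `out`.
bool read_with_zero_check(const ReadFn& read, std::vector<unsigned char>& out) {
  out = read();
  if (!has_zero_block(out)) {
    return false;  // nothing suspicious, no re-read
  }
  std::vector<unsigned char> again = read();
  if (again != out) {
    std::cerr << "zero-block check: re-read returned different data\n";
    out = std::move(again);
    return true;
  }
  return false;  // zeros were stable across both reads
}
```

A read that is legitimately full of zeros gets read twice and passes silently; only a difference between the two reads is reported, which is what would point at the hardware rather than at the data.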