Hello users,

We've managed to extract all the data off the filesystem, and the damage was caused by our rescue attempts. I'll share the story for context:

- MDS hung up during a heavy snaptrim
- MDS refused to boot after a fail due to an invalid journal head
- We attempted to reset the journal (here's where the stupidity started)
- MDS refused to start due to corruption of strays
- We reset the session (not a big deal) and snap (bad!!!) tables
- MDS refused to start due to invalid snap references
- We ran the cephfs-data-scan tools to recover anything that might've been lost
- We removed the strays manually (direct OMAP manipulation)
- We disabled crashes on newer corrupt dentries
- We extracted all the data off the filesystem

Nothing critical has been lost, but this might be useful as a reminder to others to read carefully before trying things out.

> Following up, the daemon doesn't crash anymore, and just stays in a damaged state.
>
> 2024-02-19 07:41:52.465975433 -14> 2024-02-19T07:41:51.007+0000 7f0cf08feb38 20 mds.0.cache.ino(0x2000027e1f4) decode_snap_blob snaprealm(0x2000027e1f4 seq 389f lc 0 cr 389f cps 38a0 snaps={} past_parent_snaps=36a4,37f3,382a,3846,3856,3
> 5a,385d,3861,3865,3869,386d,3871,3873,3875,3876,3878,387a,387c,387e,3880,3882,3884,3886,3888,388a,388c,388e,3890,3892,3894,3896,3898,389a,389c,389e last_modified 0.000000 change_attr 6403 0x7f0cf0c456f0)
> 2024-02-19 07:41:52.465976770 -13> 2024-02-19T07:41:51.007+0000 7f0cf08feb38 20 mds.0.cache.dir(0x602) lookup_exact_snap (head, '2000027e1f4')
> 2024-02-19 07:41:52.465980875 -12> 2024-02-19T07:41:51.007+0000 7f0cf08feb38 1 mds.0.cache.den(0x602 2000027e1f4) loaded already corrupt dentry: [dentry #0x100/stray2/2000027e1f4 [38a0,head] rep@0.0 NULL (dversion lock) pv=0 v=14846772
> 5 ino=(nil) state=0 0x7f0cefc0acd0]
> 2024-02-19 07:41:52.465991373 -11> 2024-02-19T07:41:51.007+0000 7f0cf08feb38 10 mds.0.cache.dir(0x602) go_bad_dentry 2000027e1f4
> 2024-02-19 07:41:52.465993109 -10> 2024-02-19T07:41:51.007+0000 7f0cf08feb38 -1 mds.0.damage notify_dentry Damage to dentries in fragment * of ino 0x602is fatal because it is a system directory for this rank
> 2024-02-19 07:41:52.466011324 -9> 2024-02-19T07:41:51.007+0000 7f0cf08feb38 5 mds.beacon.flying-fish-cove.christmas-island.0 set_want_state: up:rejoin -> down:damaged
> 2024-02-19 07:41:52.466012799 -8> 2024-02-19T07:41:51.007+0000 7f0cf08feb38 5 mds.beacon.flying-fish-cove.christmas-island.0 Sending beacon down:damaged seq 11
>
> The directories under '100' are the directories that have seemingly been strayed by the data-scan. The snaprealm problems still persist; however, they aren't fatal anymore.
>
> Attempting to run a scrub or list damage via `ceph tell mds.fs:0` results in:
> terminate called after throwing an instance of 'std::out_of_range'
> what(): map::at
> zsh: abort ceph tell mds.delta:0 damage ls
>
> Running the same thing via ceph daemon results in:
> ERROR: (38) Function not implemented

--
Alex D.
RedXen System & Infrastructure Administration
https://redxen.eu/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx