Re: Re-linking subdirectories with root inodes in CephFS

caskd <caskd@xxxxxxxxx> · Fri, 23 Feb 2024 07:14:47 +0000

Hello users,

We've managed to extract all the data off the filesystem. The damage turned out to have been caused by our own rescue attempts.
I'll share the story for context:

- MDS hung up during a heavy snaptrim
- MDS refused to boot after a fail due to an invalid journal head
- We've attempted to reset the journal (here's where the stupidity started)
- MDS refused to start due to corruption of strays
- We've reset the session table (not a big deal) and the snap table (bad!!!)
- MDS refused to start due to invalid snap references
- We've run the cephfs-data-scan tools to recover anything that might've been lost
- We've removed the strays manually (direct OMAP manipulation)
- We've disabled crashes on newer corrupt dentries
- We've extracted all the data off the filesystem
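
For anyone who ends up at the "direct OMAP manipulation" step, here is a minimal sketch of how a stray dentry maps onto an object and key in the metadata pool. It is not the exact procedure we followed. It assumes the usual CephFS constants (MAX_MDS = 0x100, 10 stray directories per rank) and an unfragmented stray directory, and the pool name `cephfs_metadata` is hypothetical. The script only prints rados commands; it does not run them.

```python
# Sketch: derive the RADOS object and omap key behind a stray dentry.
# Assumptions: MAX_MDS = 0x100, 10 stray dirs per rank, directory not fragmented.

MAX_MDS = 0x100
NUM_STRAY = 10
STRAY_OFFSET = 6 * MAX_MDS  # 0x600, first stray directory inode of rank 0


def stray_dir_ino(rank: int, index: int) -> int:
    """Inode number of stray directory `index` (0..9) owned by MDS `rank`."""
    if not 0 <= index < NUM_STRAY:
        raise ValueError("stray index must be in 0..9")
    return STRAY_OFFSET + rank * NUM_STRAY + index


def dirfrag_object(ino: int, frag: int = 0) -> str:
    """Name of the metadata-pool object holding a directory fragment's dentries."""
    return f"{ino:x}.{frag:08x}"


def head_dentry_key(name: str) -> str:
    """Omap key of the head (non-snapshot) dentry called `name`."""
    return f"{name}_head"


if __name__ == "__main__":
    pool = "cephfs_metadata"  # hypothetical pool name, substitute your own
    obj = dirfrag_object(stray_dir_ino(rank=0, index=2))
    key = head_dentry_key("2000027e1f4")
    # Inspect first, keep a copy of the value, and only then remove the key.
    print(f"rados -p {pool} listomapkeys {obj}")
    print(f"rados -p {pool} getomapval {obj} {key} {key}.bak")
    print(f"rados -p {pool} rmomapkey {obj} {key}")
```

With rank 0 and stray index 2 this yields the directory inode 0x602, which is the `dir(0x602)` / `#0x100/stray2` seen in the log quoted below.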

Nothing critical has been lost, but this might be a useful reminder to others to read carefully before trying things out.

> Following up, the daemon doesn't crash anymore, and just stays in a damaged state.
> 
> 2024-02-19 07:41:52.465975433    -14> 2024-02-19T07:41:51.007+0000 7f0cf08feb38 20 mds.0.cache.ino(0x2000027e1f4) decode_snap_blob snaprealm(0x2000027e1f4 seq 389f lc 0 cr 389f cps 38a0 snaps={} past_parent_snaps=36a4,37f3,382a,3846,3856,3
> 5a,385d,3861,3865,3869,386d,3871,3873,3875,3876,3878,387a,387c,387e,3880,3882,3884,3886,3888,388a,388c,388e,3890,3892,3894,3896,3898,389a,389c,389e last_modified 0.000000 change_attr 6403 0x7f0cf0c456f0)
> 2024-02-19 07:41:52.465976770    -13> 2024-02-19T07:41:51.007+0000 7f0cf08feb38 20 mds.0.cache.dir(0x602) lookup_exact_snap (head, '2000027e1f4')
> 2024-02-19 07:41:52.465980875    -12> 2024-02-19T07:41:51.007+0000 7f0cf08feb38  1 mds.0.cache.den(0x602 2000027e1f4) loaded already corrupt dentry: [dentry #0x100/stray2/2000027e1f4 [38a0,head] rep@0.0 NULL (dversion lock) pv=0 v=14846772
> 5 ino=(nil) state=0 0x7f0cefc0acd0]
> 2024-02-19 07:41:52.465991373    -11> 2024-02-19T07:41:51.007+0000 7f0cf08feb38 10 mds.0.cache.dir(0x602) go_bad_dentry 2000027e1f4
> 2024-02-19 07:41:52.465993109    -10> 2024-02-19T07:41:51.007+0000 7f0cf08feb38 -1 mds.0.damage notify_dentry Damage to dentries in fragment * of ino 0x602is fatal because it is a system directory for this rank
> 2024-02-19 07:41:52.466011324     -9> 2024-02-19T07:41:51.007+0000 7f0cf08feb38  5 mds.beacon.flying-fish-cove.christmas-island.0 set_want_state: up:rejoin -> down:damaged
> 2024-02-19 07:41:52.466012799     -8> 2024-02-19T07:41:51.007+0000 7f0cf08feb38  5 mds.beacon.flying-fish-cove.christmas-island.0 Sending beacon down:damaged seq 11
> 
> The directories under '100' are the directories that have seemingly been strayed by the data-scan. The snaprealm problems still persist, however they aren't fatal anymore.
> 
> Attempting to run a scrub or list damage via `ceph tell mds.fs:0` results in:
> terminate called after throwing an instance of 'std::out_of_range'
>   what():  map::at
>   zsh: abort      ceph tell mds.delta:0 damage ls
> 
> Running the same thing via ceph daemon results in:
> ERROR: (38) Function not implemented
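
Since `damage ls` was unusable here, one way to get a list of affected dentries is to pull the "loaded already corrupt dentry" lines out of the MDS log directly. The following is a rough sketch that assumes the message format shown in the quoted log above; the function name is made up for illustration and other Ceph versions may word the message differently.

```python
import re

# Matches e.g.
#   mds.0.cache.den(0x602 2000027e1f4) loaded already corrupt dentry: [dentry #0x100/stray2/2000027e1f4 ...
CORRUPT_RE = re.compile(
    r"mds\.(?P<rank>\d+)\.cache\.den\((?P<dir>0x[0-9a-f]+) (?P<name>[^)\s]+)\)\s+"
    r"loaded already corrupt dentry: \[dentry (?P<path>#\S+)"
)


def corrupt_dentries(lines):
    """Yield (rank, dir_ino, dentry_name, path) for each matching log line."""
    for line in lines:
        m = CORRUPT_RE.search(line)
        if m:
            yield (int(m["rank"]), int(m["dir"], 16), m["name"], m["path"])


if __name__ == "__main__":
    import sys

    # Usage: python3 corrupt_dentries.py < mds.log
    for rank, dir_ino, name, path in corrupt_dentries(sys.stdin):
        print(f"rank {rank} dir {dir_ino:#x} dentry {name} path {path}")
```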

-- 
Alex D.
RedXen System & Infrastructure Administration
https://redxen.eu/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx