Hi,
without any preceding outage or disaster, our CephFS (17.2.5, deployed with cephadm) is reporting damaged metadata:
[root@ceph106 ~]# zcat /var/log/ceph/3cacfa58-55cf-11ed-abaf-5cba2c03dec0/ceph-mds.disklib.ceph106.kbzjbg.log-20221211.gz
2022-12-10T10:12:35.161+0000 7fa46779d700 1 mds.disklib.ceph106.kbzjbg Updating MDS map to version 958 from mon.1
2022-12-10T10:12:50.974+0000 7fa46779d700 1 mds.disklib.ceph106.kbzjbg Updating MDS map to version 959 from mon.1
2022-12-10T15:18:36.609+0000 7fa461791700 0 mds.0.cache.dir(0x100001516b1) _fetched missing object for [dir 0x100001516b1 /volumes/_nogroup/ec-pool4p2/aa36abb9-a22e-405f-921c-76152599c6ba/LQ1WYG_10.28.2022_04.50/CV_MAGNETIC/V_7770505/ [2,head] auth v=0 cv=0/0 ap=1+0 state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 0x56541d3c5a80]
2022-12-10T15:18:36.615+0000 7fa461791700 -1 log_channel(cluster) log [ERR] : dir 0x100001516b1 object missing on disk; some files may be lost (/volumes/_nogroup/ec-pool4p2/aa36abb9-a22e-405f-921c-76152599c6ba/LQ1WYG_10.28.2022_04.50/CV_MAGNETIC/V_7770505)
2022-12-10T15:18:40.010+0000 7fa46779d700 1 mds.disklib.ceph106.kbzjbg Updating MDS map to version 960 from mon.1
2022-12-11T02:32:01.474+0000 7fa468fa0700 -1 received signal: Hangup from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
[root@ceph101 ~]# ceph tell mds.disklib:0 damage ls
2022-12-12T10:20:42.484+0100 7fa9e37fe700 0 client.165258 ms_handle_reset on v2:xxx.xxx.xxx.xxx:6800/519677707
2022-12-12T10:20:42.504+0100 7fa9e37fe700 0 client.165264 ms_handle_reset on v2:xxx.xxx.xxx.xxx:6800/519677707
[
    {
        "damage_type": "dir_frag",
        "id": 2085830739,
        "ino": 1099513009841,
        "frag": "*",
        "path": "/volumes/_nogroup/ec-pool4p2/aa36abb9-a22e-405f-921c-76152599c6ba/LQ1WYG_10.28.2022_04.50/CV_MAGNETIC/V_7770505"
    }
]
The reported path CV_MAGNETIC/V_7770505 is no longer visible, but I can't
tell whether that is because the directory was lost or because it was
removed by the application using the CephFS.
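As far as I understand, a directory fragment is stored as an omap object in the metadata pool named <inode-in-hex>.<frag>, so for inode 1099513009841 (0x100001516b1, matching the MDS log above) and an unfragmented directory that would be 100001516b1.00000000. If that is right, I suppose one could check for the object directly, something like this (the pool name is only my guess, "ceph fs ls" shows the real one):

  rados -p cephfs.disklib.meta stat 100001516b1.00000000          # does the dirfrag object still exist?
  rados -p cephfs.disklib.meta listomapkeys 100001516b1.00000000  # if so, its omap keys are the dentries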
Data is on an EC 4+2 pool; the root and metadata pools are replica=3.
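Since the metadata pool is replicated, I would expect RADOS itself to flag any real object loss; I suppose something like this would rule that out (pool name again only a guess):

  ceph health detail                               # would report unfound objects / damaged PGs
  rados list-inconsistent-pg cephfs.disklib.meta   # PGs with scrub-detected inconsistencies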
My questions are: What happened, and how do I fix it?
Is running "ceph tell mds.disklib:0 scrub start <path> recursive,repair"
the right thing to do, and if so, on which path (a rough sketch of what I
have in mind is below)? Is that command safe, and what is its impact on
production? Can the file system stay mounted and in use by clients while
it runs? How long would it take for 340 TB? And what exactly is
"dir_frag" damage?
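For concreteness, roughly this (fs name and path taken from the damage report above; I'm not sure whether the damaged directory itself or its parent is the right scrub target):

  ceph tell mds.disklib:0 scrub start /volumes/_nogroup/ec-pool4p2/aa36abb9-a22e-405f-921c-76152599c6ba/LQ1WYG_10.28.2022_04.50/CV_MAGNETIC recursive,repair
  ceph tell mds.disklib:0 scrub status   # watch progress / completion
  ceph tell mds.disklib:0 damage ls      # re-check the damage table afterwards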
TIA, Sascha.