Good morning everybody,
I have run into a problem where inodes are not updated in the journal
backlog, and scrubbing plus repair does not remove the stale information.
Some information about my Ceph installation:
* version: Pacific 16.2.5
* 6 nodes
* 48 OSDs
* one active MDS
* one standby-replay MDS
* one standby MDS
* one CephFS pool on spindles for data - cephfs_data - usage 6%
* one CephFS pool on NVMe for metadata - cephfs_meta - usage <1%
* mounted via kernel client:
o Ubuntu, kernel 5.11.22
o CentOS 7, kernel 3.10.0-1160.42.2.el7
We are using CephFS to hold the state files of our Slurm queueing system
in order to keep the masters in sync. These files somehow lead to
backtrace errors that are not repaired even by a scrub with repair and
force, using the following command:
ceph tell mds.scfs:0 scrub start / recursive repair force
The affected clients run CentOS 7 and mount CephFS via the kernel client.
I ran into this problem while failing over the active MDS service to the
standby-replay MDS. I also noticed that if I do another failover, the
errors disappear for a few hours, until these files are written again.
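For completeness, this is roughly how I check afterwards whether the
scrub has finished and whether the damage table has been cleared (just a
sketch, assuming rank 0 of the "scfs" file system, which is the only
active rank here):
# check whether the scrub is still running
ceph tell mds.scfs:0 scrub status
# re-check the damage table afterwards
ceph tell mds.scfs:0 damage ls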
If I do a "damage ls", these files are listed:
root@scvirt01:/home/urzadmin# ceph tell mds.0 damage ls
2021-10-11T09:27:52.286+0200 7fa8117fa700 0 client.395589209 ms_handle_reset on v2:172.26.8.153:6800/3237390256
2021-10-11T09:27:52.306+0200 7fa8117fa700 0 client.395589215 ms_handle_reset on v2:172.26.8.153:6800/3237390256
[
{
"damage_type": "backtrace",
"id": 389005317,
"ino": 1099539016806,
"path": "/slurmstate_galaxy/state/qos_usage.old"
},
{
"damage_type": "backtrace",
"id": 784942402,
"ino": 1099539034091,
"path": "/slurmstate_galaxy/state/trigger_state.old"
},
{
"damage_type": "backtrace",
"id": 800422439,
"ino": 1099539016280,
"path": "/slurmstate_galaxy/state/priority_last_decay_ran.old"
},
{
"damage_type": "backtrace",
"id": 1096079557,
"ino": 1099539034095,
"path": "/slurmstate_galaxy/state/fed_mgr_state.old"
},
{
"damage_type": "backtrace",
"id": 1478025581,
"ino": 1099539034678,
"path": "/slurmstate_galaxy/state/heartbeat"
},
{
"damage_type": "backtrace",
"id": 1850571320,
"ino": 1099539034090,
"path": "/slurmstate_galaxy/state/resv_state.old"
},
{
"damage_type": "backtrace",
"id": 2374363174,
"ino": 1099539016807,
"path": "/slurmstate_galaxy/state/fed_mgr_state.old"
},
{
"damage_type": "backtrace",
"id": 2476062375,
"ino": 1099539034092,
"path": "/slurmstate_galaxy/state/assoc_mgr_state.old"
},
{
"damage_type": "backtrace",
"id": 2615211078,
"ino": 1099539034088,
"path": "/slurmstate_galaxy/state/node_state.old"
},
{
"damage_type": "backtrace",
"id": 2872809546,
"ino": 1099539016538,
"path": "/slurmstate_galaxy/state/priority_last_decay_ran.old"
},
{
"damage_type": "backtrace",
"id": 2952984622,
"ino": 1099539034094,
"path": "/slurmstate_galaxy/state/qos_usage.old"
},
{
"damage_type": "backtrace",
"id": 3048617909,
"ino": 1099539017073,
"path": "/slurmstate_galaxy/state/trigger_state.old"
},
{
"damage_type": "backtrace",
"id": 4027167458,
"ino": 1099539035485,
"path": "/slurmstate_galaxy/state/heartbeat"
},
{
"damage_type": "backtrace",
"id": 4094349452,
"ino": 1099539034093,
"path": "/slurmstate_galaxy/state/assoc_usage.old"
},
{
"damage_type": "backtrace",
"id": 4274997805,
"ino": 1099539034089,
"path": "/slurmstate_galaxy/state/part_state.old"
}
]
When I do an ls on the client node, the inodes and timestamps are all
consistent and correct:
[root@galaxymaster01 state]# ls -ila
total 692
1099511628781 drwxr-xr-x 1 svcslurm root 42 Oct 11 09:49 .
1099511627776 drwxr-xr-x 1 root root 4 Apr 1 2021 ..
1099539038993 -rw------- 1 svcslurm domain users 48483 Oct 11 09:45 assoc_mgr_state
1099539038962 -rw------- 1 svcslurm domain users 48483 Oct 11 09:40 assoc_mgr_state.old
1099539038996 -rw------- 1 svcslurm domain users 14366 Oct 11 09:45 assoc_usage
1099539038963 -rw------- 1 svcslurm domain users 14366 Oct 11 09:40 assoc_usage.old
1099521678639 -rw-r--r-- 1 svcslurm domain users 7 Apr 1 2021 clustername
1099511628783 -rw-r--r-- 1 svcslurm domain users 7 Mar 26 2018 clustername_bkp
2199023255934 -rw-r--r-- 1 svcslurm domain users 7 Apr 1 2021 clustername_bkp2
1099536954641 -rw-r--r-- 1 svcslurm domain users 0 Apr 1 2021 clustername_bkp3
1099538397169 -rw------- 1 svcslurm domain users 211 Sep 11 14:36 dbd.messages
1099539039000 -rw------- 1 svcslurm domain users 19 Oct 11 09:45 fed_mgr_state
1099539038967 -rw------- 1 svcslurm domain users 19 Oct 11 09:40 fed_mgr_state.old
1099511628789 drwxr----- 1 svcslurm domain users 11 Oct 11 09:34 hash.0
1099511631702 drwxr----- 1 svcslurm domain users 9 Oct 11 09:48 hash.1
1099511634616 drwxr----- 1 svcslurm domain users 9 Oct 11 09:39 hash.2
1099511637533 drwxr----- 1 svcslurm domain users 14 Oct 11 09:39 hash.3
1099511640454 drwxr----- 1 svcslurm domain users 9 Oct 11 09:39 hash.4
1099511643373 drwxr----- 1 svcslurm domain users 8 Oct 11 09:44 hash.5
1099511646293 drwxr----- 1 svcslurm domain users 14 Oct 11 09:44 hash.6
1099511649213 drwxr----- 1 svcslurm domain users 15 Oct 11 09:45 hash.7
1099511652131 drwxr----- 1 svcslurm domain users 11 Oct 11 09:45 hash.8
1099511655048 drwxr----- 1 svcslurm domain users 13 Oct 11 09:30 hash.9
1099539039021 -rw------- 1 svcslurm domain users 16 Oct 11 09:49 heartbeat
1099539039011 -rw------- 1 svcslurm domain users 264068 Oct 11 09:46 job_state
1099539039006 -rw------- 1 svcslurm domain users 260650 Oct 11 09:45 job_state.old
1099538757056 -rw------- 1 svcslurm domain users 42 Sep 15 08:47 last_config_lite
1099538397170 -rw------- 1 svcslurm domain users 42 Sep 11 14:44 last_config_lite.old
1099539038991 -rw------- 1 svcslurm domain users 451 Oct 11 09:45 last_tres
1099539038959 -rw------- 1 svcslurm domain users 451 Oct 11 09:40 last_tres.old
1099521425633 -rw------- 1 svcslurm domain users 2874 Jun 8 2020 layouts_state_base
1099521425597 -rw------- 1 svcslurm domain users 2874 Jun 8 2020 layouts_state_base.old
1099539039012 -rw------- 1 svcslurm domain users 17925 Oct 11 09:46 node_state
1099539039007 -rw------- 1 svcslurm domain users 17925 Oct 11 09:45 node_state.old
1099539038995 -rw------- 1 svcslurm domain users 1018 Oct 11 09:45 part_state
1099539038964 -rw------- 1 svcslurm domain users 1018 Oct 11 09:40 part_state.old
1099539039016 -rw------- 1 svcslurm domain users 16 Oct 11 09:47 priority_last_decay_ran
1099539038974 -rw------- 1 svcslurm domain users 16 Oct 11 09:42 priority_last_decay_ran.old
1099539038998 -rw------- 1 svcslurm domain users 796 Oct 11 09:45 qos_usage
1099539038965 -rw------- 1 svcslurm domain users 796 Oct 11 09:40 qos_usage.old
1099539038997 -rw------- 1 svcslurm domain users 35 Oct 11 09:45 resv_state
1099539038966 -rw------- 1 svcslurm domain users 35 Oct 11 09:40 resv_state.old
1099539038999 -rw------- 1 svcslurm domain users 31 Oct 11 09:45 trigger_state
1099539038968 -rw------- 1 svcslurm domain users 31 Oct 11 09:40 trigger_state.old
So, my first question: is it safe to remove the damage entries? Via:
ceph tell mds.$filesystem:0 damage rm $id
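If it is safe, what I would run to clear all of the listed entries is
roughly the following (a rough sketch, assuming jq is installed and that
"scfs" is the file system name; rank 0 is the only active rank here):
# remove every entry currently in the damage table by its id
for id in $(ceph tell mds.scfs:0 damage ls 2>/dev/null | jq -r '.[].id'); do
    ceph tell mds.scfs:0 damage rm $id
done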
My second question: can I do something to avoid running into this error
again? Perhaps switch to the FUSE client?
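If the FUSE client is the way to go, I would mount it on the CentOS 7
nodes roughly like this (just a sketch; "client.slurm" and the mount
point are placeholders for our real client name and path, and ceph.conf
with the monitor addresses is assumed to be in /etc/ceph):
# mount CephFS via ceph-fuse instead of the kernel client
ceph-fuse -n client.slurm -k /etc/ceph/ceph.client.slurm.keyring /mnt/slurmstate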
Thanks in advance!
Cheers,
Vadim
--
Vadim Bulst
Universität Leipzig / URZ
04109 Leipzig, Augustusplatz 10
phone: +49-341-97-33380
mail: vadim.bulst@xxxxxxxxxxxxxx