Re: MDS crashes due to damaged metadata


 



I had to reduce the debug level back to normal. Debug level 20 generated about a 70 GB log file in one hour. Of course, there was no crash during that period.



---------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Registered office: Juelich
Registered in the commercial register of the district court of Dueren, No. HR B 3498
Chairman of the Supervisory Board: MinDir Volker Rieke
Executive Board: Prof. Dr.-Ing. Wolfgang Marquardt (Chairman),
Karsten Beneke (Deputy Chairman), Prof. Dr.-Ing. Harald Bolt,
Dr. Astrid Lambrecht, Prof. Dr. Frauke Melchior
---------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------

On 01.12.2022 at 09:55, Stolte, Felix <f.stolte@xxxxxxxxxxxxx> wrote:

I set debug_mds=20 in ceph.conf and also applied it to the running daemon via "ceph daemon mds.mon-e2-1 config set debug_mds 20". I have to check with my superiors whether I am allowed to provide you the logs, though.
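
For reference, raising and later reverting the MDS debug level can be done in a few ways; this is a rough sketch rather than an exact transcript of what I ran (mds.mon-e2-1 is simply our daemon's name):

    # persistently, in ceph.conf on the MDS host (takes effect on restart):
    [mds]
        debug_mds = 20

    # at runtime, through the admin socket on the MDS host:
    ceph daemon mds.mon-e2-1 config set debug_mds 20

    # or cluster-wide through the centralized config:
    ceph config set mds debug_mds 20

    # revert to the default once a crash has been captured:
    ceph config set mds debug_mds 1/5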

Regarding the tool:
<pool> refers to the cephfs_metadata pool, correct? (Just want to be sure.)

How long will the runs take? We have 15M objects in our metadata pool and 330M in our data pools.
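
To double-check which pool the tool should be pointed at, and to get the object counts, something like the following should do (the pool names here are just the usual defaults):

    ceph fs ls
    # e.g. "name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data]"

    ceph df
    # the OBJECTS column gives per-pool object counts without scanning the pool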

Regarding the root cause:
As far as I can tell, all damaged inodes have only been accessed via two Samba servers running with CTDB. We are also running NFS gateways on different systems, but there has not been a damaged inode there (yet).

The Samba servers run Ubuntu 18.04 with kernel 5.4.0-132 and Samba version 4.7.6.
CephFS is accessed via a kernel mount, and the Ceph version is 16.2.10 across all nodes.
We have one filesystem and two data pools, and we are using CephFS snapshots.


On 01.12.2022 at 01:26, Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:

You can run this tool. Be sure to read the comments.

https://github.com/ceph/ceph/blob/main/src/tools/cephfs/first-damage.py
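
Roughly, the read-only pass is run against the metadata pool with the filesystem failed, as described in the script's header; the pool name below is just an example, and the flags may differ between versions, so check the comments and --help before running anything:

    # list the damaged dentries found in the metadata pool (read-only pass)
    python3 first-damage.py --memo run.1 cephfs_metadata

The script also has an option to remove the affected dentries afterwards; see its comments before using it.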

As of now, what causes the damage is not yet known, but we are trying to
reproduce it. If your workload reliably produces the damage, a
debug_mds=20 MDS log would be extremely helpful.

On Wed, Nov 30, 2022 at 6:15 PM Stolte, Felix <f.stolte@xxxxxxxxxxxxx> wrote:

Hi Patrick,

It does seem like it. We are not using Postgres on CephFS as far as I know. We have narrowed it down to three damaged inodes; the files in question were xlsx, pdf, or pst.

Do you have any suggestions on how to fix this?

Is there a way to scan the CephFS for damaged inodes?
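
The closest things I am aware of are the MDS damage table and a recursive scrub; a minimal sketch, assuming a filesystem named cephfs with a single rank 0:

    # list damage entries the MDS has recorded so far
    ceph tell mds.cephfs:0 damage ls

    # scrub the whole tree and check progress
    ceph tell mds.cephfs:0 scrub start / recursive
    ceph tell mds.cephfs:0 scrub status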



On 30.11.2022 at 22:49, Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:

On Wed, Nov 30, 2022 at 3:10 PM Stolte, Felix <f.stolte@xxxxxxxxxxxxx> wrote:


Hey guys,

our MDS daemons are constantly crashing when someone tries to delete a file:

-26> 2022-11-29T12:32:58.807+0100 7f081b458700 -1 /build/ceph-16.2.10/src/mds/Server.cc: In function 'void Server::_unlink_local(MDRequestRef&, CDentry*, CDentry*)' thread 7f081b458700 time 2022-11-29T12:32:58.808844+0100

2022-11-29T12:32:58.807+0100 7f081b458700  4 mds.0.server handle_client_request client_request(client.1189402075:14014394 unlink #0x100197fa8e0/~$29.11. T.xlsx 2022-11-29T12:32:23.711889+0100 RETRY=1 caller_uid=133365,

I observed that the corresponding object in the CephFS data pool does not exist. Basically, our MDS daemons crash every time someone tries to delete a file whose object is missing from the data pool while the metadata says otherwise.
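
For reference, this is roughly how I checked whether the backing object exists; the mount path, pool name, and file shown here are placeholders, not the real ones:

    # get the file's inode number from a client mount and convert it to hex
    ino=$(stat -c %i /mnt/cephfs/path/to/file)
    obj=$(printf '%x.00000000' "$ino")

    # stat the first backing object in the data pool;
    # "No such file or directory" means the object is missing
    rados -p cephfs_data stat "$obj"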

Any suggestions on how to fix this problem?


Is this it?

https://tracker.ceph.com/issues/38452

Are you running postgres on CephFS by chance?

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D




--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



