Re: MDS crashes to damaged metadata

"Stolte, Felix" <f.stolte@xxxxxxxxxxxxx> · Thu, 15 Dec 2022 14:31:03 +0000

Hi Patrick,

we used your script to repair the damaged objects on the weekend and it went smoothly. Thanks for your support.

We adjusted your script to scan for damaged files on a daily basis; the runtime is about 6 hours. Until Thursday of last week, we had exactly the same 17 files. On Thursday at 13:05 a snapshot was created, and our active MDS crashed once at that time:

2022-12-08T13:05:48.919+0100 7f440afec700 -1 /build/ceph-16.2.10/src/mds/ScatterLock.h: In function 'void ScatterLock::set_xlock_snap_sync(MDSContext*)' thread 7f440afec700 time 2022-12-08T13:05:48.921223+0100
/build/ceph-16.2.10/src/mds/ScatterLock.h: 59: FAILED ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE)

12 minutes later the unlink_local error crashes appeared again, this time with a new file. During debugging we noticed an MTU mismatch between the MDS (1500) and a client (9000) with a CephFS kernel mount. This client is also the one creating the snapshots via mkdir in the .snap directory.

We disabled snapshot creation for now, but really need this feature. I uploaded the mds logs of the first crash along with the information above to https://tracker.ceph.com/issues/38452

I would greatly appreciate it if you could answer the following question:

Is the bug related to our MTU mismatch? We also fixed the MTU issue on the weekend by going back to 1500 on all nodes in the Ceph public network.

If you need a debug level 20 log of the ScatterLock for further analysis, I could schedule snapshots at the end of our workdays and increase the debug level for 5 minutes around snapshot creation.
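
A minimal sketch of such a time-boxed debug window, assuming the `ceph` CLI is available on the host and that `ceph config set mds debug_mds 20` followed by `ceph config rm mds debug_mds` is an acceptable way to raise and reset the level. The helper name `debug_window` and its `run` and `sleep` parameters are hypothetical; the snapshot itself would be created inside the `with` block.

```python
import contextlib
import subprocess
import time


def _run(argv):
    """Run a command and raise if it exits non-zero."""
    subprocess.run(argv, check=True)


@contextlib.contextmanager
def debug_window(level=20, settle_seconds=0, run=_run, sleep=time.sleep):
    """Raise debug_mds for the duration of the block, then reset it.

    Hypothetical helper: `run` and `sleep` are injectable so the logic can
    be dry-run without a cluster.
    """
    run(["ceph", "config", "set", "mds", "debug_mds", str(level)])
    try:
        yield
        # Keep logging at the raised level for a while after the action.
        sleep(settle_seconds)
    finally:
        # Always drop the override again, even if the block raised.
        run(["ceph", "config", "rm", "mds", "debug_mds"])


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    with debug_window(settle_seconds=0, run=print):
        print("create the snapshot here, e.g. mkdir in the .snap directory")
```

Resetting the option in a `finally` clause keeps a failed snapshot attempt from leaving the MDS at debug level 20, which matters when local log storage is limited.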

Regards
Felix
---------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Volker Rieke
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Dr. Astrid Lambrecht, Prof. Dr. Frauke Melchior
---------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------

On 02.12.2022 at 20:08, Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:

On Thu, Dec 1, 2022 at 5:08 PM Stolte, Felix <f.stolte@xxxxxxxxxxxxx> wrote:

The script has been running for ~2 hours, and according to the line count in the memo file we are at 40% (CephFS is still online).

We had to modify the script by putting a try/except around the for loop in lines 78 to 87. For some reason there are some objects (186 at this moment) which throw a UnicodeDecodeError exception during the iteration:

<rados.OmapIterator object at 0x7f9606f8bcf8>
Traceback (most recent call last):
  File "first-damage.py", line 138, in <module>
    traverse(f, ioctx)
  File "first-damage.py", line 79, in traverse
    for (dnk, val) in it:
  File "rados.pyx", line 1382, in rados.OmapIterator.__next__
  File "rados.pyx", line 311, in rados.decode_cstr
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 10-11: invalid continuation byte

I don't know if this is because the file system is still running. We saved the object names in a separate file and I will investigate further tomorrow. We should be able to modify the script to check only the objects which threw the exception instead of searching through the whole pool again.
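
The workaround described above can be sketched roughly as follows. This is not the actual first-damage.py code: `scan_objects`, `open_omap_iter` and `check_entry` are hypothetical stand-ins for the script's traversal loop, and the fake iterator only imitates an omap iterator that raises while decoding a key.

```python
def scan_objects(object_names, open_omap_iter, check_entry):
    """Scan each object's omap entries; collect objects that fail to decode.

    `open_omap_iter(name)` returns an iterator of (key, value) pairs and
    `check_entry(name, key, value)` inspects one pair. Both are hypothetical.
    """
    failed = []
    for name in object_names:
        try:
            for key, val in open_omap_iter(name):
                check_entry(name, key, val)
        except UnicodeDecodeError as exc:
            # The iterator itself raised while decoding a key, so the rest
            # of this object's entries are skipped; remember the object for
            # a later, targeted pass.
            failed.append((name, str(exc)))
    return failed


def fake_omap_iter(name):
    """Imitate an omap iterator; the object named 'bad' fails mid-iteration."""
    yield (name + "_head", b"")
    if name == "bad":
        b"M\xe4rz".decode("utf-8")  # raises UnicodeDecodeError


if __name__ == "__main__":
    seen = []
    failed = scan_objects(
        ["obj1", "bad", "obj2"],
        fake_omap_iter,
        lambda name, key, val: seen.append(key),
    )
    # A second pass would only need the names recorded in `failed`.
    print([name for name, _ in failed])
```

Recording the object name together with the error text makes it possible to revisit only the affected objects instead of walking the whole metadata pool again.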

That shouldn't be caused by the fs running. It may be that you have some
file names which contain invalid unicode characters?
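
To illustrate how such a name can produce this error, here is a small sketch. It assumes, purely for illustration, a file name written by a client in a single-byte encoding such as Latin-1; nothing in this thread confirms that this is the actual cause. The helper names are hypothetical, and the sketch does not change how the rados binding decodes keys; it only shows how raw bytes could be inspected without losing information.

```python
def utf8_error_reason(raw):
    """Return the decoder's reason if `raw` is not valid UTF-8, else None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


def decode_lossless(raw):
    """Decode bytes to str, mapping undecodable bytes to lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(name):
    """Inverse of decode_lossless: recover the original bytes exactly."""
    return name.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    # 0xE4 is 'a-umlaut' in Latin-1, but in UTF-8 it announces a 3-byte
    # sequence, so the plain ASCII byte that follows is rejected.
    raw_name = b"M\xe4rz.txt"
    print(utf8_error_reason(raw_name))
    print(repr(decode_lossless(raw_name)))
    print(encode_lossless(decode_lossless(raw_name)) == raw_name)
```

The `surrogateescape` error handler is a standard Python codec option; it lets a tool carry undecodable names through as `str` and write the exact original bytes back out.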

Regarding the mds logfiles with debug 20:
We cannot run this debug level for longer than one hour, since the log files grow too fast for the local storage on the MDS servers where logs are stored (we don't have central logging yet).

Okay.

But if you are just interested in the time frame around the crash, I could set the debug level to 20, trigger the crash on the weekend and send you the logs.

The crash is unlikely to point to what causes the corruption. I was
hoping we could locate an instance of damage while the MDS is running.

Regards Felix

On 01.12.2022 at 20:51, Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:

On Thu, Dec 1, 2022 at 3:55 AM Stolte, Felix <f.stolte@xxxxxxxxxxxxx> wrote:

I set debug_mds=20 in ceph.conf and injected it into the running daemon via "ceph daemon mds.mon-e2-1 config set debug_mds 20". I have to check with my superiors whether I am allowed to provide you the logs, though.

Suggest using `ceph config set` instead of ceph.conf. It's much easier.

Regarding the tool:
<pool> is referring to the cephfs_metadata pool? (just want to be sure)

Yes.

How long will the runs take? We have 15M objects in our metadata pool and 330M in the data pools.

Not sure. You can monitor the number of lines generated on the memo
file to get an idea of objects/s.
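
A rough sketch of that estimate, assuming the memo file gains one line per processed object (an assumption about the tool's output, not something verified here). The function names and the sample figures in the example are illustrative only; the 15M total is the metadata pool size quoted above.

```python
def count_lines(path):
    """Count the lines currently in the memo file."""
    with open(path, "rb") as fh:
        return sum(1 for _ in fh)


def estimate_remaining(lines_before, lines_after, interval_s, total_objects):
    """Return (objects per second, estimated seconds remaining).

    `lines_before` and `lines_after` are two line counts of the memo file
    taken `interval_s` seconds apart.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rate = (lines_after - lines_before) / interval_s
    if rate <= 0:
        raise ValueError("no progress between the two samples")
    remaining = max(total_objects - lines_after, 0)
    return rate, remaining / rate


if __name__ == "__main__":
    # Illustrative samples: 1,000,000 lines, then 1,050,000 one minute later.
    rate, eta_s = estimate_remaining(1_000_000, 1_050_000, 60, 15_000_000)
    print(f"{rate:.0f} objects/s, about {eta_s / 3600:.1f} h remaining")
```

Sampling the line count twice and dividing by the interval gives the current rate rather than the average since the start, which is more useful if throughput changes during the run.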

You can speed test the tool without bringing the file system down by
**not** using `--remove`.

Regarding the root cause:
As far as I can tell, all damaged inodes have only been accessed via two Samba servers running with CTDB. We are also running NFS gateways on different systems, but there hasn't been a damaged inode there (yet).

Samba Servers running Ubuntu 18.04 with kernel 5.4.0-132 and samba version 4.7.6.
Cephfs is accessed via kernel mount and

ceph version is 16.2.10 across all nodes
We have one file system and two data pools, and we are using CephFS snapshots.

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx