Re: MDS "newly corrupt dentry" after patch version upgrade

Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> · Fri, 12 May 2023 17:02:38 +0200

If it is thrown while decoding the file name, then somebody probably
managed to store files with non-UTF-8 bytes in their names, although I
don't really know how this can happen. Perhaps some OS quirk.
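
One plausible way such names can arise (an illustration only, not a diagnosis of this cluster): POSIX file names are byte strings, so a client writing names in a legacy encoding such as Latin-1 can produce bytes that are not valid UTF-8. A minimal Python sketch, where the name "café" and the helper is_valid_utf8 are made up for the example:

```python
# Illustration only: "café" is a made-up file name, not one from the thread.
name = "café"

latin1_bytes = name.encode("latin-1")  # b'caf\xe9' (one byte for 'é')
utf8_bytes = name.encode("utf-8")      # b'caf\xc3\xa9' (two bytes for 'é')


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin1_bytes))  # False
```

Both byte strings are legal file names on a POSIX file system; only the first is valid UTF-8.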

On 10/05/2023 22:33, Patrick Donnelly wrote:
Hi Janek,

All this indicates is that you have some files with binary keys that
cannot be decoded as UTF-8. Unfortunately, the rados python library
assumes that omap keys can be decoded this way. I have a ticket here:

https://tracker.ceph.com/issues/59716

I hope to have a fix soon.
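
To make the failure mode concrete, here is a small standalone Python sketch. It does not use the rados library and is not the fix tracked in the ticket; the key b"filename\xff_head" is a hypothetical value chosen so that the bad byte sits at position 8, as in the reported error. It shows the strict decode failing and two standard error handlers that tolerate such bytes:

```python
# Hypothetical key: 0xff can never appear in valid UTF-8.
raw_key = b"filename\xff_head"

# Strict decoding (the default) fails on the first invalid byte.
try:
    raw_key.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# "surrogateescape" maps each undecodable byte to a lone surrogate,
# so the original bytes can be recovered exactly when encoding again.
lossless = raw_key.decode("utf-8", errors="surrogateescape")
round_trip = lossless.encode("utf-8", errors="surrogateescape")

# "backslashreplace" gives a printable form that is handy for logs,
# but it is not meant to be converted back to the original bytes.
printable = raw_key.decode("utf-8", errors="backslashreplace")

print(error_message)
print(round_trip == raw_key)
print(printable)
```

Whether either handler is appropriate for omap keys is a design decision for the library; the sketch only shows that the bytes themselves can be carried through Python without loss.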

On Thu, May 4, 2023 at 3:15 AM Janek Bevendorff
<janek.bevendorff@xxxxxxxxxxxxx> wrote:
After running the tool for 11 hours straight, it exited with the
following exception:

Traceback (most recent call last):
    File "/home/webis/first-damage.py", line 156, in <module>
      traverse(f, ioctx)
    File "/home/webis/first-damage.py", line 84, in traverse
      for (dnk, val) in it:
    File "rados.pyx", line 1389, in rados.OmapIterator.__next__
    File "rados.pyx", line 318, in rados.decode_cstr
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 8:
invalid start byte
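
Note that the traceback places the exception inside the iterator's __next__, that is, while the next entry is being fetched and before the loop body sees it. A try/except inside the loop body therefore cannot catch it. The sketch below illustrates this with a stand-in class; DecodingIterator and skip_undecodable are hypothetical names, this is not the rados API, and it assumes the iterator stays usable after raising, which may not hold for the real rados.OmapIterator:

```python
class DecodingIterator:
    """Stand-in for an iterator that decodes keys inside __next__.

    Hypothetical: it only mimics the behaviour seen in the traceback.
    """

    def __init__(self, raw_items):
        self._raw = iter(raw_items)

    def __iter__(self):
        return self

    def __next__(self):
        key, val = next(self._raw)  # StopIteration ends the iteration
        return key.decode("utf-8"), val  # may raise UnicodeDecodeError


def skip_undecodable(it, skipped):
    """Yield items from it, recording the raw bytes of keys that fail to decode."""
    while True:
        try:
            item = next(it)
        except StopIteration:
            return
        except UnicodeDecodeError as exc:
            skipped.append(exc.object)  # the bytes that could not be decoded
            continue
        yield item


# Made-up entries; the second key contains an invalid UTF-8 byte.
items = [(b"a_head", b"1"), (b"filename\xff_head", b"2"), (b"b_head", b"3")]
skipped = []
good = list(skip_undecodable(DecodingIterator(items), skipped))
print(good)
print(skipped)
```

In this stand-in, the entries on either side of the bad key are processed normally and only the undecodable key is set aside.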

Does that mean that the last inode listed in the output file is corrupt?
Any way I can fix it?

The output file has 14 million lines. We have about 24.5 million objects
in the metadata pool.

Janek

On 03/05/2023 14:20, Patrick Donnelly wrote:
On Wed, May 3, 2023 at 4:33 AM Janek Bevendorff
<janek.bevendorff@xxxxxxxxxxxxx> wrote:
Hi Patrick,

I'll try that tomorrow and let you know, thanks!

I was unable to reproduce the crash today. Even with
mds_abort_on_newly_corrupt_dentry set to true, all MDS booted up
correctly (though they took forever to rejoin with logs set to 20).

To me it looks like the issue has resolved itself overnight. I had run a
recursive scrub on the file system and another snapshot was taken, in
case any of those might have had an effect on this. It could also be the
case that the (supposedly) corrupt journal entry has simply been
committed now and hence doesn't trigger the assertion any more. Is there
any way I can verify this?

You can run:

https://github.com/ceph/ceph/blob/main/src/tools/cephfs/first-damage.py

Just do:

python3 first-damage.py --memo run.1 <meta pool>

No need to do any of the other steps if you just want a read-only check.

--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx