Hi Patrick,

Sorry to keep bothering you, but I've found that the MDS service keeps crashing even though the cluster shows the MDS as up. I've attached another log from the MDS server (eowyn) below. Looking forward to any further insights. Thanks a lot.

https://drive.google.com/file/d/1nD_Ks7fNGQp0GE5Q_x8M57HldYurPhuN/view?usp=sharing

MDS crashed:

root@eowyn:~# systemctl status ceph-mds@eowyn
● ceph-mds@eowyn.service - Ceph metadata server daemon
     Loaded: loaded (/lib/systemd/system/ceph-mds@.service; enabled; vendor preset: enabled)
     Active: failed (Result: signal) since Wed 2023-05-24 08:55:12 AEST; 24s ago
    Process: 44349 ExecStart=/usr/bin/ceph-mds -f --cluster ${CLUSTER} --id eowyn --setuser ceph --setgroup ceph (code=kill>
   Main PID: 44349 (code=killed, signal=ABRT)

May 24 08:55:12 eowyn systemd[1]: ceph-mds@eowyn.service: Scheduled restart job, restart counter is at 3.
May 24 08:55:12 eowyn systemd[1]: Stopped Ceph metadata server daemon.
May 24 08:55:12 eowyn systemd[1]: ceph-mds@eowyn.service: Start request repeated too quickly.
May 24 08:55:12 eowyn systemd[1]: ceph-mds@eowyn.service: Failed with result 'signal'.
May 24 08:55:12 eowyn systemd[1]: Failed to start Ceph metadata server daemon.

Part of the MDS log on eowyn (MDS server):

    -2> 2023-05-24T08:55:11.854+1000 7f1f8ee93700 -1 log_channel(cluster) log [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry #0x100/stray0/1005480d3ac [19ce,head] auth (dversion lock) pv=2154265085 v=2154265074 ino=0x1005480d3ac state=1342177316 | purging=1 0x55b04517ca00]
    -1> 2023-05-24T08:55:11.858+1000 7f1f8ee93700 -1 /build/ceph-16.2.13/src/mds/CDentry.cc: In function 'bool CDentry::check_corruption(bool)' thread 7f1f8ee93700 time 2023-05-24T08:55:11.858329+1000
/build/ceph-16.2.13/src/mds/CDentry.cc: 697: ceph_abort_msg("abort() called")

 ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe0) [0x7f1f99404495]
 2: (CDentry::check_corruption(bool)+0x86b) [0x55b02652991b]
 3: (StrayManager::_purge_stray_purged(CDentry*, bool)+0xc64) [0x55b026480ed4]
 4: (MDSContext::complete(int)+0x61) [0x55b026601471]
 5: (MDSIOContextBase::complete(int)+0x4fc) [0x55b026601b9c]
 6: (Finisher::finisher_thread_entry()+0x19d) [0x7f1f994b8c6d]
 7: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f1f99146609]
 8: clone()

Justin Li
Senior Technical Officer
School of Information Technology
Faculty of Science, Engineering and Built Environment
Deakin University, Melbourne Burwood Campus, 221 Burwood Highway, Burwood, VIC 3125
+61 3 9246 8932
Justin.li@xxxxxxxxxxxxx
http://www.deakin.edu.au/

-----Original Message-----
From: Justin Li
Sent: Wednesday, May 24, 2023 8:25 AM
To: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Cc: ceph-users@xxxxxxx
Subject: RE: [Help appreciated] ceph mds damaged

Sorry Patrick, my last email was blocked because of the attachment size.
I've attached a link for you to download the log. Thanks.

https://drive.google.com/drive/folders/1bV_X7vyma_-gTfLrPnEV27QzsdmgyK4g?usp=sharing

-----Original Message-----
From: Justin Li
Sent: Wednesday, May 24, 2023 8:21 AM
To: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Cc: ceph-users@xxxxxxx
Subject: RE: [Help appreciated] ceph mds damaged

Hi Patrick,

I've attached two logs here. The two servers are one of the monitors and one of the MDS servers. Let me know if you need more logs. Thanks.

-----Original Message-----
From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Sent: Wednesday, May 24, 2023 7:35 AM
To: Justin Li <justin.li@xxxxxxxxxxxxx>
Cc: ceph-users@xxxxxxx
Subject: Re: [Help appreciated] ceph mds damaged

Hello Justin,

On Tue, May 23, 2023 at 4:55 PM Justin Li <justin.li@xxxxxxxxxxxxx> wrote:
>
> Dear All,
>
> After an unsuccessful upgrade to Pacific, the MDS daemons went offline and could not come back up. I checked the MDS log and found the entries below; cluster info is included below as well. I'd appreciate it if anyone could point me in the right direction. Thanks.
>
> MDS log:
>
> 2023-05-24T06:21:36.831+1000 7efe56e7d700  1 mds.0.cache.den(0x600 1005480d3b2) loaded already corrupt dentry: [dentry #0x100/stray0/1005480d3b2 [19ce,head] rep@0,-2.0 NULL (dversion lock) pv=0 v=2154265030 ino=(nil) state=0 0x556433addb80]
>     -5> 2023-05-24T06:21:36.831+1000 7efe56e7d700 -1 mds.0.damage notify_dentry Damage to dentries in fragment * of ino 0x600 is fatal because it is a system directory for this rank
>     -4> 2023-05-24T06:21:36.831+1000 7efe56e7d700  5 mds.beacon.posco set_want_state: up:active -> down:damaged
>     -3> 2023-05-24T06:21:36.831+1000 7efe56e7d700  5 mds.beacon.posco Sending beacon down:damaged seq 5339
>     -2> 2023-05-24T06:21:36.831+1000 7efe56e7d700 10 monclient: _send_mon_message to mon.ceph-3 at v2:10.120.0.146:3300/0
>     -1> 2023-05-24T06:21:37.659+1000 7efe60690700  5 mds.beacon.posco received beacon reply down:damaged seq 5339 rtt 0.827966
>      0> 2023-05-24T06:21:37.659+1000 7efe56e7d700  1 mds.posco respawn!
>
> Cluster info:
>
> root@ceph-1:~# ceph -s
>   cluster:
>     id:     e2b93a76-2f97-4b34-8670-727d6ac72a64
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 filesystem is offline
>             1 mds daemon damaged
>
>   services:
>     mon: 3 daemons, quorum ceph-1,ceph-2,ceph-3 (age 26h)
>     mgr: ceph-3(active, since 15h), standbys: ceph-1, ceph-2
>     mds: 0/1 daemons up, 3 standby
>     osd: 135 osds: 133 up (since 10h), 133 in (since 2w)
>
>   data:
>     volumes: 0/1 healthy, 1 recovering; 1 damaged
>     pools:   4 pools, 4161 pgs
>     objects: 230.30M objects, 276 TiB
>     usage:   836 TiB used, 460 TiB / 1.3 PiB avail
>     pgs:     4138 active+clean
>              13   active+clean+scrubbing
>              10   active+clean+scrubbing+deep
>
> root@ceph-1:~# ceph health detail
> HEALTH_ERR 1 filesystem is degraded; 1 filesystem is offline; 1 mds daemon damaged
> [WRN] FS_DEGRADED: 1 filesystem is degraded
>     fs cephfs is degraded
> [ERR] MDS_ALL_DOWN: 1 filesystem is offline
>     fs cephfs is offline because no MDS is active for it.
> [ERR] MDS_DAMAGE: 1 mds daemon damaged
>     fs cephfs mds.0 is damaged

Do you have a complete log you can share? Try:
https://docs.ceph.com/en/quincy/man/8/ceph-post-file/

To get your upgrade to complete, you may set:

    ceph config set mds mds_go_bad_corrupt_dentry false

--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
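
For anyone landing on this thread with the same symptoms, here is a minimal sketch of how the two suggestions above might be applied, assuming the daemon, filesystem name and default log path used elsewhere in this thread (mds.eowyn, fs "cephfs", /var/log/ceph/ceph-mds.eowyn.log); treat it as an illustration of the commands rather than a verified recovery procedure.

    # Upload the full MDS log for the developers, as Patrick asks; the
    # description string and log path here are only examples.
    ceph-post-file -d "mds.eowyn abort on corrupt stray dentry" /var/log/ceph/ceph-mds.eowyn.log

    # Apply the setting Patrick suggests so the MDS tolerates the
    # already-corrupt dentries instead of going down:damaged (my reading
    # of the option; see his note above).
    ceph config set mds mds_go_bad_corrupt_dentry false

    # The unit hit systemd's restart limit ("Start request repeated too
    # quickly"), so clear the failed state before starting it again.
    systemctl reset-failed ceph-mds@eowyn
    systemctl start ceph-mds@eowyn

    # Confirm a rank becomes active and the filesystem leaves the degraded state.
    ceph fs status cephfs
    ceph -s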