Thanks for replying, Greg. Here is the detailed sequence I followed during the upgrade.

Step 1: Upgraded the Ceph mgr and monitor, then rebooted. The mgr and mon daemons all came back up and running.

Step 2: Upgraded one OSD node, then rebooted. The OSDs all came back up.

Step 3: Upgraded a second OSD node, named OSD-node2. I didn't know that OSD-node2 had the MDS service enabled, although the MDS service was stopped at that time. I then upgraded all Ceph components (totally wrong, sigh) and rebooted OSD-node2. Before the reboot, the MDSs were all functioning well. After the reboot, the MDS on OSD-node2 kicked in and killed all the running MDSs, which is a mystery to me. From then on, none of the MDSs could be brought back up. Later, I found that OSD-node2 has a hardware memory error.

Thanks, Greg, for looking into it.

Justin Li
Senior Technical Officer
School of Information Technology
Faculty of Science, Engineering and Built Environment

For ICT Support please see https://www.deakin.edu.au/sebeicthelp

Deakin University
Melbourne Burwood Campus, 221 Burwood Highway, Burwood, VIC 3125
+61 3 9246 8932
Justin.li@xxxxxxxxxxxxx
http://www.deakin.edu.au
Deakin University CRICOS Provider Code 00113B

Important Notice: The contents of this email are intended solely for the named addressee and are confidential; any unauthorised use, reproduction or storage of the contents is expressly prohibited. If you have received this email in error, please delete it and any attachments immediately and advise the sender by return email or telephone.

Deakin University does not warrant that this email and any attachments are error or virus free.

-----Original Message-----
From: Gregory Farnum <gfarnum@xxxxxxxxxx>
Sent: Wednesday, May 24, 2023 7:22 AM
To: Justin Li <justin.li@xxxxxxxxxxxxx>
Cc: ceph-users@xxxxxxx
Subject: Re: [Help appreciated] ceph mds damaged

On Tue, May 23, 2023 at 1:55 PM Justin Li <justin.li@xxxxxxxxxxxxx> wrote:
>
> Dear All,
>
> After an unsuccessful upgrade to Pacific, the MDSs were offline and could not be brought back online. I checked the MDS log and found the output below.
> See the cluster info below as well. I'd appreciate it if anyone can point me in the right direction. Thanks.

What made it unsuccessful? Do you mean you tried to upgrade and then rolled back somehow, or that you ran the upgrade but this problem occurred?
-Greg

>
> MDS log:
>
> 2023-05-24T06:21:36.831+1000 7efe56e7d700 1 mds.0.cache.den(0x600 1005480d3b2) loaded already corrupt dentry: [dentry #0x100/stray0/1005480d3b2 [19ce,head] rep@0,-2.0 NULL (dversion lock) pv=0 v=2154265030 ino=(nil) state=0 0x556433addb80]
>
> -5> 2023-05-24T06:21:36.831+1000 7efe56e7d700 -1 mds.0.damage notify_dentry Damage to dentries in fragment * of ino 0x600is fatal because it is a system directory for this rank
>
> -4> 2023-05-24T06:21:36.831+1000 7efe56e7d700 5 mds.beacon.posco set_want_state: up:active -> down:damaged
>
> -3> 2023-05-24T06:21:36.831+1000 7efe56e7d700 5 mds.beacon.posco Sending beacon down:damaged seq 5339
>
> -2> 2023-05-24T06:21:36.831+1000 7efe56e7d700 10 monclient: _send_mon_message to mon.ceph-3 at v2:10.120.0.146:3300/0
>
> -1> 2023-05-24T06:21:37.659+1000 7efe60690700 5 mds.beacon.posco received beacon reply down:damaged seq 5339 rtt 0.827966
>
> 0> 2023-05-24T06:21:37.659+1000 7efe56e7d700 1 mds.posco respawn!
>
> Cluster info:
> root@ceph-1:~# ceph -s
>   cluster:
>     id:     e2b93a76-2f97-4b34-8670-727d6ac72a64
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 filesystem is offline
>             1 mds daemon damaged
>
>   services:
>     mon: 3 daemons, quorum ceph-1,ceph-2,ceph-3 (age 26h)
>     mgr: ceph-3(active, since 15h), standbys: ceph-1, ceph-2
>     mds: 0/1 daemons up, 3 standby
>     osd: 135 osds: 133 up (since 10h), 133 in (since 2w)
>
>   data:
>     volumes: 0/1 healthy, 1 recovering; 1 damaged
>     pools:   4 pools, 4161 pgs
>     objects: 230.30M objects, 276 TiB
>     usage:   836 TiB used, 460 TiB / 1.3 PiB avail
>     pgs:     4138 active+clean
>              13   active+clean+scrubbing
>              10   active+clean+scrubbing+deep
>
> root@ceph-1:~# ceph health detail
> HEALTH_ERR 1 filesystem is degraded; 1 filesystem is offline; 1 mds daemon damaged
> [WRN] FS_DEGRADED: 1 filesystem is degraded
>     fs cephfs is degraded
> [ERR] MDS_ALL_DOWN: 1 filesystem is offline
>     fs cephfs is offline because no MDS is active for it.
> [ERR] MDS_DAMAGE: 1 mds daemon damaged
>     fs cephfs mds.0 is damaged
>
> Justin Li
> Senior Technical Officer
> School of Information Technology
> Faculty of Science, Engineering and Built Environment
>
> Request for assistance can be lodged to the SIT Technical Team using this form<https://deakinesmprod.service-now.com/esc?id=sc_cat_item&sys_id=7afa8fa5db62e9101acbdbf2f39619a7>
>
> Deakin University
> Melbourne Burwood Campus, 221 Burwood Highway, Burwood, VIC 3125
> +61 3 9246 8932
> justin.li@xxxxxxxxxxxxx
> http://www.deakin.edu.au
> Deakin University CRICOS Provider Code 00113B
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
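
For anyone who wants to script a check around the `ceph health detail` output quoted earlier in this thread, here is a minimal Python sketch that pulls the health check codes out of that text and filters for the MDS-related ones. The function names (`parse_health_checks`, `mds_problems`) are hypothetical, and the parsing assumes the plain-text layout looks like the output pasted above. It is not an official Ceph tool and it does nothing to repair a damaged MDS rank.

```python
import re

# Matches entries such as "[ERR] MDS_DAMAGE: 1 mds daemon damaged".
# Assumption: severity tags are WRN or ERR, as in the output quoted in this thread.
CHECK_RE = re.compile(r"\[(WRN|ERR)\]\s+([A-Z_]+):\s+([^\n\[]*)")


def parse_health_checks(text):
    """Return a list of (severity, code, summary) tuples found in text."""
    return [
        (m.group(1), m.group(2), m.group(3).strip())
        for m in CHECK_RE.finditer(text)
    ]


def mds_problems(text):
    """Return only the checks whose code starts with 'MDS_'."""
    return [check for check in parse_health_checks(text) if check[1].startswith("MDS_")]


# Sample input copied from the `ceph health detail` output in this thread.
SAMPLE = """\
HEALTH_ERR 1 filesystem is degraded; 1 filesystem is offline; 1 mds daemon damaged
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded
[ERR] MDS_ALL_DOWN: 1 filesystem is offline
    fs cephfs is offline because no MDS is active for it.
[ERR] MDS_DAMAGE: 1 mds daemon damaged
    fs cephfs mds.0 is damaged
"""

if __name__ == "__main__":
    for severity, code, summary in mds_problems(SAMPLE):
        print(f"{severity} {code}: {summary}")
```

A check like this could be run between the steps of a rolling upgrade, so that the upgrade stops as soon as an MDS-related error shows up rather than continuing to the next node.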