Hi Frank,

Yes, I would purge the OSD. The cluster looks absolutely healthy apart from osd.580, so the purge will probably help the cluster forget this faulty one. I would also restart the monitors. With the amount of data you maintain in your cluster, I don't think your ceph.conf contains any information about particular OSDs, but if it does, don't forget to remove the configuration for osd.580 from ceph.conf.

________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Monday, May 3, 2021 8:37:09 AM
To: Vladimir Sigunov <vladimir.sigunov@xxxxxxxxx>; ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: Re: OSD slow ops warning not clearing after OSD down

Hi Vladimir,

thanks for your reply. I did, the cluster is healthy:

[root@gnosis ~]# ceph status
  cluster:
    id:     ---
    health: HEALTH_WARN
            430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-02, ceph-03
    mds: con-fs2-2/2/2 up {0=ceph-08=up:active,1=ceph-12=up:active}, 2 up:standby
    osd: 584 osds: 578 up, 578 in

  data:
    pools:   11 pools, 3215 pgs
    objects: 610.3 M objects, 1.2 PiB
    usage:   1.5 PiB used, 4.6 PiB / 6.0 PiB avail
    pgs:     3191 active+clean
             13   active+clean+scrubbing+deep
             9    active+clean+snaptrim_wait
             2    active+clean+snaptrim

  io:
    client:   358 MiB/s rd, 56 MiB/s wr, 2.35 kop/s rd, 1.32 kop/s wr

[root@gnosis ~]# ceph health detail
HEALTH_WARN 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
SLOW_OPS 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops

OSD 580 is down+out and the message does not even increment the seconds. It is probably stuck in some part of the health checking that tries to query osd.580 and doesn't understand that the OSD being down means there are no ops. I tried to restart the OSD on this disk, but the disk seems completely dead.
The iDRAC log on the server says that the disk was removed during operation, possibly due to a physical connection failure on the SAS lanes. I somehow need to get rid of this message and am wondering if purging the OSD would help.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Vladimir Sigunov <vladimir.sigunov@xxxxxxxxx>
Sent: 03 May 2021 13:45:19
To: ceph-users@xxxxxxx; Frank Schilder
Subject: Re: OSD slow ops warning not clearing after OSD down

Hi Frank,

Check your cluster for inactive/incomplete placement groups. I saw similar behavior on Octopus when some pgs were stuck in an incomplete/inactive or peering state.

________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Monday, May 3, 2021 3:42:48 AM
To: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: OSD slow ops warning not clearing after OSD down

Dear cephers,

I have a strange problem. An OSD went down and recovery finished. For some reason, a slow ops warning for the failed OSD is stuck in the system:

    health: HEALTH_WARN
            430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops

The OSD is auto-out:

| 580 | ceph-22 |    0 |    0 |    0 |    0 |    0 |    0 | autoout,exists |

It is probably a warning dating back to just before the failure. How can I clear it?

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
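For anyone landing on this thread later: the steps discussed above can be sketched with standard Ceph CLI commands. This is a sketch under assumptions, not a verified fix: it assumes a Luminous-or-later cluster (where `ceph osd purge` exists), systemd-managed daemons named after the hosts, and the OSD id 580 and monitor host ceph-01 from the messages above.

```shell
# Sketch only -- run against a live cluster, one step at a time.

# Purge the dead OSD so the cluster forgets it entirely
# (removes it from the CRUSH map, the OSD map, and auth):
ceph osd purge 580 --yes-i-really-mean-it

# Restart the monitors one at a time so quorum is never lost
# (repeat on each mon host; unit name assumed from the hostnames above):
systemctl restart ceph-mon@ceph-01

# Check for stuck placement groups, as suggested earlier in the thread:
ceph pg dump_stuck inactive
```

If ceph.conf carries a section for the removed OSD (e.g. `[osd.580]`), delete that section as well before restarting the monitors.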