Re: OSD slow ops warning not clearing after OSD down

Frank Schilder <frans@xxxxxx> · Tue, 4 May 2021 07:49:21 +0000

I created a ticket: https://tracker.ceph.com/issues/50637

Hope a purge will do the trick.
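
The symptom in this thread, a SLOW_OPS line that names an OSD which is already down, can be checked for mechanically before purging. Below is a minimal Python sketch, not an official Ceph tool. It assumes the single-daemon message form quoted further down ("osd.N has slow ops") and takes the set of down OSD ids as a hypothetical input that you would fill in yourself, for example from `ceph osd tree down`. The function name `stale_slow_ops` is made up for illustration.

```python
import re
from typing import Set

# Matches only the single-daemon form seen in this thread,
# e.g. "osd.580 has slow ops". Other message forms are not handled.
SLOW_OPS_RE = re.compile(r"\bosd\.(\d+) has slow ops")


def stale_slow_ops(health_detail: str, down_osds: Set[int]) -> Set[int]:
    """Return ids of OSDs reported with slow ops although they are down."""
    reported = {int(m.group(1)) for m in SLOW_OPS_RE.finditer(health_detail)}
    return reported & down_osds


# Text copied from the `ceph health detail` output quoted in this thread.
health_detail = """\
HEALTH_WARN 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
SLOW_OPS 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
"""

# Hypothetical input: ids of OSDs known to be down.
down = {580}

print(sorted(stale_slow_ops(health_detail, down)))  # prints [580]
```

A non-empty result means the warning refers to a daemon that cannot currently have any ops in flight, which is the situation described in the tracker ticket.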

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 03 May 2021 15:21:38
To: Dan van der Ster; Vladimir Sigunov
Cc: ceph-users@xxxxxxx
Subject: Re: OSD slow ops warning not clearing after OSD down

Hi Dan,

just restarted all MONs, no change though :(

Thanks for looking at this. I will wait until tomorrow. My plan is to get the disk up again with the same OSD ID; I would expect this to eventually allow the message to be cleared.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
Sent: 03 May 2021 15:08:03
To: Vladimir Sigunov
Cc: ceph-users@xxxxxxx; Frank Schilder
Subject: Re: Re: OSD slow ops warning not clearing after OSD down

Wait, first just restart the leader mon.

See: https://tracker.ceph.com/issues/47380 for a related issue.
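
For anyone following along: the current leader is reported in the `quorum_leader_name` field of `ceph quorum_status`. Below is a minimal Python sketch, assuming that JSON has been captured into a string. The sample document is a trimmed, hypothetical illustration and not output from Frank's cluster. The `ceph-mon@<name>` systemd unit name assumes a package-based, non-containerized deployment where the mon id equals the name shown in the quorum.

```python
import json


def leader_mon(quorum_status_json: str) -> str:
    """Return the leader monitor's name from `ceph quorum_status` JSON."""
    status = json.loads(quorum_status_json)
    return status["quorum_leader_name"]


# Hypothetical, heavily trimmed sample of `ceph quorum_status` output.
sample = """
{
  "quorum_names": ["ceph-01", "ceph-02", "ceph-03"],
  "quorum_leader_name": "ceph-01"
}
"""

leader = leader_mon(sample)

# Only print the command; run it yourself on the leader's host.
# The unit name is an assumption, see the note above.
print(f"systemctl restart ceph-mon@{leader}")
```

Restarting only the leader forces a new election, which is a smaller intervention than restarting every mon.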

-- dan

On Mon, May 3, 2021 at 2:55 PM Vladimir Sigunov
<vladimir.sigunov@xxxxxxxxx> wrote:
>
> Hi Frank,
> Yes, I would purge the OSD. The cluster looks absolutely healthy except for this osd.580. Probably the purge will help the cluster forget the faulty one. I would also restart the monitors.
> With the amount of data you maintain in your cluster, I don't think your ceph.conf contains any information about particular OSDs, but if it does, don't forget to remove the configuration of osd.580 from ceph.conf.
>
> ________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: Monday, May 3, 2021 8:37:09 AM
> To: Vladimir Sigunov <vladimir.sigunov@xxxxxxxxx>; ceph-users@xxxxxxx <ceph-users@xxxxxxx>
> Subject: Re: OSD slow ops warning not clearing after OSD down
>
> Hi Vladimir,
>
> thanks for your reply. I did, the cluster is healthy:
>
> [root@gnosis ~]# ceph status
>   cluster:
>     id:     ---
>     health: HEALTH_WARN
>             430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
>
>   services:
>     mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
>     mgr: ceph-01(active), standbys: ceph-02, ceph-03
>     mds: con-fs2-2/2/2 up  {0=ceph-08=up:active,1=ceph-12=up:active}, 2 up:standby
>     osd: 584 osds: 578 up, 578 in
>
>   data:
>     pools:   11 pools, 3215 pgs
>     objects: 610.3 M objects, 1.2 PiB
>     usage:   1.5 PiB used, 4.6 PiB / 6.0 PiB avail
>     pgs:     3191 active+clean
>              13   active+clean+scrubbing+deep
>              9    active+clean+snaptrim_wait
>              2    active+clean+snaptrim
>
>   io:
>     client:   358 MiB/s rd, 56 MiB/s wr, 2.35 kop/s rd, 1.32 kop/s wr
>
> [root@gnosis ~]# ceph health detail
> HEALTH_WARN 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
> SLOW_OPS 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
>
> OSD 580 is down+out and the message does not even increment the seconds. It's probably stuck in some part of the health checking that tries to query osd.580 and doesn't understand that a down OSD has no ops.
>
> I tried to restart the OSD on this disk, but the disk seems completely dead. The iDRAC log on the server says that the disk was removed during operation, possibly due to a physical connection failure on the SAS lanes. I somehow need to get rid of this message and am wondering if purging the OSD would help.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Vladimir Sigunov <vladimir.sigunov@xxxxxxxxx>
> Sent: 03 May 2021 13:45:19
> To: ceph-users@xxxxxxx; Frank Schilder
> Subject: Re: OSD slow ops warning not clearing after OSD down
>
> Hi Frank.
> Check your cluster for inactive/incomplete placement groups. I saw similar behavior on Octopus when some PGs were stuck in an incomplete/inactive or peering state.
>
> ________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: Monday, May 3, 2021 3:42:48 AM
> To: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
> Subject: OSD slow ops warning not clearing after OSD down
>
> Dear cephers,
>
> I have a strange problem. An OSD went down and recovery finished. For some reason, I have a slow ops warning for the failed OSD stuck in the system:
>
>     health: HEALTH_WARN
>             430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
>
> The OSD is auto-out:
>
> | 580 | ceph-22 |    0  |    0  |    0   |     0   |    0   |     0   | autoout,exists |
>
> It is probably a warning dating back to just before the failure. How can I clear it?
>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx