Hi Frank,

Yes, I would purge the OSD. The cluster looks absolutely healthy apart from osd.580, so the purge will probably help the cluster forget this faulty one. I would also restart the monitors. With the amount of data you maintain in your cluster, I don't think your ceph.conf contains any information about particular OSDs, but if it does, don't forget to remove the configuration for osd.580 from ceph.conf.

________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Monday, May 3, 2021 8:37:09 AM
To: Vladimir Sigunov <vladimir.sigunov@xxxxxxxxx>; ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: Re: OSD slow ops warning not clearing after OSD down

Hi Vladimir,

thanks for your reply. I did, the cluster is healthy:

[root@gnosis ~]# ceph status
  cluster:
    id:     ---
    health: HEALTH_WARN
            430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-02, ceph-03
    mds: con-fs2-2/2/2 up {0=ceph-08=up:active,1=ceph-12=up:active}, 2 up:standby
    osd: 584 osds: 578 up, 578 in

  data:
    pools:   11 pools, 3215 pgs
    objects: 610.3 M objects, 1.2 PiB
    usage:   1.5 PiB used, 4.6 PiB / 6.0 PiB avail
    pgs:     3191 active+clean
             13   active+clean+scrubbing+deep
             9    active+clean+snaptrim_wait
             2    active+clean+snaptrim

  io:
    client:   358 MiB/s rd, 56 MiB/s wr, 2.35 kop/s rd, 1.32 kop/s wr

[root@gnosis ~]# ceph health detail
HEALTH_WARN 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
SLOW_OPS 430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops

OSD 580 is down+out and the message does not even increment the seconds. It is probably stuck in some part of the health checking that tries to query osd.580 and doesn't understand that the OSD being down means there are no ops. I tried to restart the OSD on this disk, but the disk seems completely dead.
The iDRAC log on the server says that the disk was removed during operation, possibly due to a physical connection failure on the SAS lanes. I somehow need to get rid of this message and am wondering if purging the OSD would help.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Vladimir Sigunov <vladimir.sigunov@xxxxxxxxx>
Sent: 03 May 2021 13:45:19
To: ceph-users@xxxxxxx; Frank Schilder
Subject: Re: OSD slow ops warning not clearing after OSD down

Hi Frank,

Check your cluster for inactive/incomplete placement groups. I saw similar behavior on Octopus when some pgs were stuck in an incomplete/inactive or peering state.

________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Monday, May 3, 2021 3:42:48 AM
To: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: OSD slow ops warning not clearing after OSD down

Dear cephers,

I have a strange problem. An OSD went down and recovery finished. For some reason, a slow ops warning for the failed OSD is stuck in the system:

    health: HEALTH_WARN
            430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops

The OSD is auto-out:

| 580 | ceph-22 |    0 |    0 |    0 |    0 |    0 |    0 | autoout,exists |

It is probably a warning dating back to just before the failure. How can I clear it?

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
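For anyone landing on this thread later: the steps discussed above can be sketched with standard Ceph CLI commands. This is a sketch under assumptions, not a verified fix: it assumes a Luminous-or-later cluster (where `ceph osd purge` exists), systemd-managed daemons named after the hosts, and the OSD id 580 and monitor host ceph-01 from the messages above.

```shell
# Sketch only -- run against a live cluster, one step at a time.

# Purge the dead OSD so the cluster forgets it entirely
# (removes it from the CRUSH map, the OSD map, and auth):
ceph osd purge 580 --yes-i-really-mean-it

# Restart the monitors one at a time so quorum is never lost
# (repeat on each mon host; unit name assumed from the hostnames above):
systemctl restart ceph-mon@ceph-01

# Check for stuck placement groups, as suggested earlier in the thread:
ceph pg dump_stuck inactive
```

If ceph.conf carries a section for the removed OSD (e.g. `[osd.580]`), delete that section as well before restarting the monitors.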