Re: Laggy OSDs

We're definitely dealing with something that sounds similar, but it's hard to
say for sure without more detail. Do you have object lock or versioned buckets
in use (especially any that started being used around the time of the
slowdown)? Has this cluster always been on 16.2.7?
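If you're not sure, something like this will show it per bucket over the S3 API
(just a sketch; it assumes the AWS CLI is configured against your RGW endpoint,
and <rgw-endpoint> / <bucket> are placeholders):

  # bucket versioning status (no Status field means it was never enabled)
  aws --endpoint-url https://<rgw-endpoint> s3api get-bucket-versioning --bucket <bucket>
  # object lock configuration, if any
  aws --endpoint-url https://<rgw-endpoint> s3api get-object-lock-configuration --bucket <bucket>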

What is your pool configuration (EC k+m or replicated X setup), and do you use
the same pool for indexes and data? I'm assuming this is RGW usage via the S3
API; let us know if that's not correct.
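For the pool layout, something like the following should be enough (again just
a sketch, run from any node with an admin keyring; <ec-profile> is a
placeholder for whatever profile your pools reference):

  # replica size / EC profile, pg_num and application tag for every pool
  ceph osd pool ls detail
  # expand any erasure-code profile named in the output above
  ceph osd erasure-code-profile get <ec-profile>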

On Tue, Mar 29, 2022 at 4:13 PM Alex Closs <acloss@xxxxxxxxxxxxx> wrote:

> Hey folks,
>
> We have a 16.2.7 cephadm cluster that's had slow ops and several
> (constantly changing) laggy PGs. The set of OSDs with slow ops seems to
> change at random, among all 6 OSD hosts in the cluster. All drives are
> enterprise SATA SSDs, by either Intel or Micron. We're still not ruling out
> a network issue, but wanted to troubleshoot from the Ceph side in case
> something broke there.
>
> ceph -s:
>
>  health: HEALTH_WARN
>  3 slow ops, oldest one blocked for 246 sec, daemons
> [osd.124,osd.130,osd.141,osd.152,osd.27] have slow ops.
>
>  services:
>  mon: 5 daemons, quorum
> ceph-osd10,ceph-mon0,ceph-mon1,ceph-osd9,ceph-osd11 (age 28h)
>  mgr: ceph-mon0.sckxhj(active, since 25m), standbys: ceph-osd10.xmdwfh,
> ceph-mon1.iogajr
>  osd: 143 osds: 143 up (since 92m), 143 in (since 2w)
>  rgw: 3 daemons active (3 hosts, 1 zones)
>
>  data:
>  pools: 26 pools, 3936 pgs
>  objects: 33.14M objects, 144 TiB
>  usage: 338 TiB used, 162 TiB / 500 TiB avail
>  pgs: 3916 active+clean
>  19 active+clean+laggy
>  1 active+clean+scrubbing+deep
>
>  io:
>  client: 59 MiB/s rd, 98 MiB/s wr, 1.66k op/s rd, 1.68k op/s wr
>
> This is actually much faster than it's been for much of the past hour; it's
> been as low as 50 KB/s and dozens of IOPS in both directions (where the
> cluster typically does 300 MB/s to a few GB/s and ~4k IOPS).
>
> The cluster has been on 16.2.7 since a few days after release without
> issue. The only recent change was an apt upgrade and reboot on the hosts
> (which was last Friday and didn't show signs of problems).
>
> Happy to provide logs, let me know what would be useful. Thanks for
> reading this wall :)
>
> -Alex
>
> MIT CSAIL
> he/they
>