Re: Laggy OSDs

Hello,

Is swap enabled on your hosts? Is swap being used?

For our cluster we tend to allocate enough RAM and disable swap.

Maybe the reboot of your hosts re-activated swap?

Try disabling swap and see if it helps.
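
For example, to check and disable swap on each host (generic Linux
commands, adapt to your distribution):

  free -h          # show total and used swap
  swapon --show    # list active swap devices
  swapoff -a       # turn off all swap immediately

Also comment out any swap entries in /etc/fstab so a reboot does not
re-enable it.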

All the best

Arnaud

On Tue, Mar 29, 2022 at 11:41 PM David Orman <ormandj@xxxxxxxxxxxx> wrote:

> We're definitely dealing with something that sounds similar, but it's hard
> to say definitively without more detail. Do you have object lock/versioned
> buckets in use (especially if one started being used around the time of
> the slowdown)? Was this cluster always on 16.2.7?
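>
> If you're not sure, versioning status can be checked per bucket from the
> client side, for example with the AWS CLI pointed at your RGW endpoint
> (the endpoint and bucket name below are placeholders):
>
>   aws --endpoint-url http://<rgw-host>:8080 s3api get-bucket-versioning --bucket <bucket>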
>
> What is your pool configuration (EC k+m or replicated X setup), and do you
> use the same pool for indexes and data? I'm assuming this is RGW usage via
> the S3 API; let us know if that's not correct.
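>
> Something like this (run with the admin keyring, e.g. inside "cephadm
> shell") should show the pool layout; both are standard ceph CLI commands:
>
>   ceph osd pool ls detail   # replicated size or EC profile per pool
>   ceph df                   # per-pool object counts and usage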
>
> On Tue, Mar 29, 2022 at 4:13 PM Alex Closs <acloss@xxxxxxxxxxxxx> wrote:
>
> > Hey folks,
> >
> > We have a 16.2.7 cephadm cluster that's had slow ops and several
> > (constantly changing) laggy PGs. The set of OSDs with slow ops seems to
> > change at random, among all 6 OSD hosts in the cluster. All drives are
> > enterprise SATA SSDs, by either Intel or Micron. We're still not ruling
> > out a network issue, but wanted to troubleshoot from the Ceph side in
> > case something broke there.
> >
> > ceph -s:
> >
> >  health: HEALTH_WARN
> >  3 slow ops, oldest one blocked for 246 sec, daemons
> > [osd.124,osd.130,osd.141,osd.152,osd.27] have slow ops.
> >
> >  services:
> >  mon: 5 daemons, quorum
> > ceph-osd10,ceph-mon0,ceph-mon1,ceph-osd9,ceph-osd11 (age 28h)
> >  mgr: ceph-mon0.sckxhj(active, since 25m), standbys: ceph-osd10.xmdwfh,
> > ceph-mon1.iogajr
> >  osd: 143 osds: 143 up (since 92m), 143 in (since 2w)
> >  rgw: 3 daemons active (3 hosts, 1 zones)
> >
> >  data:
> >  pools: 26 pools, 3936 pgs
> >  objects: 33.14M objects, 144 TiB
> >  usage: 338 TiB used, 162 TiB / 500 TiB avail
> >  pgs: 3916 active+clean
> >  19 active+clean+laggy
> >  1 active+clean+scrubbing+deep
> >
> >  io:
> >  client: 59 MiB/s rd, 98 MiB/s wr, 1.66k op/s rd, 1.68k op/s wr
> >
> > This is actually much faster than it's been for much of the past hour;
> > throughput has been as low as 50 KB/s with dozens of IOPS in each
> > direction, where the cluster typically does 300 MB/s to a few GB/s and
> > ~4k IOPS.
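> >
> > (If it would help, I can dump ops on an affected OSD from its host via
> > the admin socket, e.g. "ceph daemon osd.124 dump_ops_in_flight" and
> > "ceph daemon osd.124 dump_historic_ops".)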
> >
> > The cluster has been on 16.2.7 since a few days after release without
> > issue. The only recent change was an apt upgrade and reboot on the hosts
> > (which was last Friday and didn't show signs of problems).
> >
> > Happy to provide logs, let me know what would be useful. Thanks for
> > reading this wall :)
> >
> > -Alex
> >
> > MIT CSAIL
> > he/they
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



