Re: Laggy OSDs

Hi - I've been bitten by that too, so I checked: that *did* happen, but I already turned swap back off a while ago.

Thanks for your quick reply :)
-Alex
On Mar 29, 2022, 6:26 PM -0400, Arnaud M <arnaud.meauzoone@xxxxxxxxx> wrote:
> Hello
>
> Is swap enabled on your host? Is swap being used?
>
> For our cluster we tend to allocate enough RAM and disable swap.
>
> Maybe the reboot of your host re-activated swap?
>
> Try disabling swap and see if it helps.
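>
> A rough sketch of what I mean (untested here; assumes systemd-based hosts,
> adjust for your distro):
>
>     # show active swap devices; no output means swap is off
>     swapon --show
>     # turn swap off immediately
>     sudo swapoff -a
>     # keep it off across reboots: comment out swap entries in /etc/fstab
>     sudo sed -i '/\sswap\s/s/^/#/' /etc/fstab
>     # if swap is activated by a systemd unit rather than fstab:
>     sudo systemctl mask swap.target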
>
> All the best
>
> Arnaud
>
> > On Tue, Mar 29, 2022 at 11:41 PM, David Orman <ormandj@xxxxxxxxxxxx> wrote:
> > > We're definitely dealing with something that sounds similar, but it's hard to
> > > say definitively without more detail. Do you have object lock/versioned
> > > buckets in use (especially if one started being used around the time of the
> > > slowdown)? Was this cluster always on 16.2.7?
> > >
> > > What is your pool configuration (EC k+m or replicated X setup), and do you
> > > use the same pool for indexes and data? I'm assuming this is RGW usage via
> > > the S3 API; let us know if this is not correct.
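> > >
> > > If it helps, a rough sketch of how to pull those answers (mybucket is a
> > > placeholder; the aws CLI would need --endpoint-url pointed at your RGW):
> > >
> > >     # pool layout: replicated size or EC profile per pool
> > >     ceph osd pool ls detail
> > >     # per-bucket versioning / object lock status via the S3 API
> > >     aws s3api get-bucket-versioning --bucket mybucket
> > >     aws s3api get-object-lock-configuration --bucket mybucket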
> > >
> > > On Tue, Mar 29, 2022 at 4:13 PM Alex Closs <acloss@xxxxxxxxxxxxx> wrote:
> > >
> > > > Hey folks,
> > > >
> > > > We have a 16.2.7 cephadm cluster that's had slow ops and several
> > > > (constantly changing) laggy PGs. The set of OSDs with slow ops seems to
> > > > change at random, among all 6 OSD hosts in the cluster. All drives are
> > > > enterprise SATA SSDs, by either Intel or Micron. We're still not ruling out
> > > > a network issue, but wanted to troubleshoot from the Ceph side in case
> > > > something broke there.
> > > >
> > > > ceph -s:
> > > >
> > > >   health: HEALTH_WARN
> > > >           3 slow ops, oldest one blocked for 246 sec, daemons [osd.124,osd.130,osd.141,osd.152,osd.27] have slow ops.
> > > >
> > > >   services:
> > > >     mon: 5 daemons, quorum ceph-osd10,ceph-mon0,ceph-mon1,ceph-osd9,ceph-osd11 (age 28h)
> > > >     mgr: ceph-mon0.sckxhj(active, since 25m), standbys: ceph-osd10.xmdwfh, ceph-mon1.iogajr
> > > >     osd: 143 osds: 143 up (since 92m), 143 in (since 2w)
> > > >     rgw: 3 daemons active (3 hosts, 1 zones)
> > > >
> > > >   data:
> > > >     pools:   26 pools, 3936 pgs
> > > >     objects: 33.14M objects, 144 TiB
> > > >     usage:   338 TiB used, 162 TiB / 500 TiB avail
> > > >     pgs:     3916 active+clean
> > > >              19   active+clean+laggy
> > > >              1    active+clean+scrubbing+deep
> > > >
> > > >   io:
> > > >     client: 59 MiB/s rd, 98 MiB/s wr, 1.66k op/s rd, 1.68k op/s wr
> > > >
> > > > This is actually much faster than it's been for much of the past hour;
> > > > it's dropped as low as 50 KB/s and dozens of IOPS in both directions,
> > > > where the cluster typically does 300 MB/s to a few GB/s and ~4k IOPS.
> > > >
> > > > The cluster has been on 16.2.7 since a few days after release without
> > > > issue. The only recent change was an apt upgrade and reboot on the hosts
> > > > (which was last Friday and didn't show signs of problems).
> > > >
> > > > Happy to provide logs; let me know what would be useful. Thanks for
> > > > reading this wall :)
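> > > >
> > > > For a start, I could grab something like this (a rough sketch; osd.124 is
> > > > just one of the OSDs currently reporting slow ops):
> > > >
> > > >     ceph health detail
> > > >     # see what the blocked requests are actually waiting on
> > > >     ceph tell osd.124 dump_ops_in_flight
> > > >     ceph tell osd.124 dump_historic_ops
> > > >     # per-OSD latency stats, to spot a single sick drive
> > > >     ceph osd perf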
> > > >
> > > > -Alex
> > > >
> > > > MIT CSAIL
> > > > he/they
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



