Are you observing this here:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/LAN6PTZ2NHF2ZHAYXZIQPHZ4CMJKMI5K/

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Boris Behrens <bb@xxxxxxxxx>
Sent: 13 September 2022 11:43:20
To: ceph-users@xxxxxxx
Subject: laggy OSDs and stalling krbd IO after upgrade from nautilus to octopus

Hi,
I really need your help. We are currently experiencing very bad cluster
hangups that occur sporadically (once around midday on 2022-09-08, 48 hours
after the upgrade, and once in the evening of 2022-09-12).

We use krbd without cephx for the qemu clients, and when the OSDs become
laggy, the krbd connection comes to a grinding halt, to the point that all
IO stalls and we cannot even unmap the rbd device.

From the logs, it looks like the cluster starts to snaptrim a lot of PGs,
those PGs become laggy, and the cluster then snowballs into laggy OSDs. I
have attached the monitor log and the log of one OSD covering the time when
it happened.

- Is this a known issue?
- What can I do to debug it further?
- Can I downgrade back to nautilus?
- Should I increase the PG count for the pool to 4096 or 8192?

The cluster contains a mixture of 2, 4, and 8 TB SSDs (no rotating disks);
the 8 TB disks hold ~120 PGs each and the 2 TB disks ~30 PGs. All hosts have
at least 128 GB RAM, and the kernel logs of all Ceph hosts show nothing for
that timeframe.

Cluster stats:

  cluster:
    id:     74313356-3b3d-43f3-bce6-9fb0e4591097
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-rbd-mon4,ceph-rbd-mon5,ceph-rbd-mon6 (age 25h)
    mgr: ceph-rbd-mon5(active, since 4d), standbys: ceph-rbd-mon4, ceph-rbd-mon6
    osd: 149 osds: 149 up (since 6d), 149 in (since 7w)

  data:
    pools:   4 pools, 2241 pgs
    objects: 25.43M objects, 82 TiB
    usage:   231 TiB used, 187 TiB / 417 TiB avail
    pgs:     2241 active+clean

  io:
    client:   211 MiB/s rd, 273 MiB/s wr, 1.43k op/s rd, 8.80k op/s wr

--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
ssd    417 TiB  187 TiB  230 TiB  231 TiB       55.30
TOTAL  417 TiB  187 TiB  230 TiB  231 TiB       55.30

--- POOLS ---
POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
isos                    7    64  455 GiB  117.92k  1.3 TiB   1.17     38 TiB
rbd                     8  2048   76 TiB   24.65M  222 TiB  66.31     38 TiB
archive                 9   128  2.4 TiB  669.59k  7.3 TiB   6.06     38 TiB
device_health_metrics  10     1   25 MiB      149   76 MiB      0     38 TiB

--
This time, as an exception, the "UTF-8 Problems" self-help group meets in
the large hall.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
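
The post's own reasoning is that snaptrim kicks off the laggy-PG snowball. A
minimal sketch of how one could confirm that and throttle snaptrim, using
standard Ceph CLI commands; the sleep value of 3 seconds is an illustrative
assumption, not a recommendation taken from this thread:

  # Count PGs currently in a snaptrim or snaptrim_wait state while the
  # stall is happening (pgs_brief prints one line per PG with its state).
  ceph pg dump pgs_brief 2>/dev/null | grep -c snaptrim

  # Throttle snaptrim so it yields to client IO; osd_snap_trim_sleep
  # inserts a pause (in seconds) between trim operations.
  ceph config set osd osd_snap_trim_sleep 3

  # Push the setting to running OSDs without restarting them.
  ceph tell 'osd.*' injectargs '--osd_snap_trim_sleep 3'

  # Octopus-era snaptrim slowness has also been tied to BlueFS IO
  # buffering; checking the current value is cheap. (Whether this is the
  # relevant knob for this cluster is an assumption.)
  ceph config get osd bluefs_buffered_io

If snaptrim turns out to be the trigger, setting the sleep back to 0 once
the trim backlog has drained restores normal trim throughput.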