Are you observing this here:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/LAN6PTZ2NHF2ZHAYXZIQPHZ4CMJKMI5K/

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Boris Behrens <bb@xxxxxxxxx>
Sent: 13 September 2022 11:43:20
To: ceph-users@xxxxxxx
Subject: laggy OSDs and stalling krbd IO after upgrade from nautilus to octopus

Hi,
I really need your help. We are currently experiencing very bad cluster
hangups that occur sporadically (once around midday on 2022-09-08, 48 hours
after the upgrade, and once in the evening of 2022-09-12).

We use krbd without cephx for the qemu clients, and when the OSDs become
laggy, the krbd connection comes to a grinding halt, to the point that all
IO stalls and we cannot even unmap the rbd device.

From the logs, it looks like the cluster starts to snaptrim a lot of PGs,
those PGs become laggy, and the cluster then snowballs into laggy OSDs. I
have attached the monitor log and the log of one OSD covering the time when
it happened.

- Is this a known issue?
- What can I do to debug it further?
- Can I downgrade back to nautilus?
- Should I increase the PG count for the pool to 4096 or 8192?

The cluster contains a mixture of 2, 4, and 8 TB SSDs (no rotating disks);
the 8 TB disks hold ~120 PGs each and the 2 TB disks ~30 PGs. All hosts have
at least 128 GB RAM, and the kernel logs of all Ceph hosts show nothing for
that timeframe.

Cluster stats:

  cluster:
    id:     74313356-3b3d-43f3-bce6-9fb0e4591097
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-rbd-mon4,ceph-rbd-mon5,ceph-rbd-mon6 (age 25h)
    mgr: ceph-rbd-mon5(active, since 4d), standbys: ceph-rbd-mon4, ceph-rbd-mon6
    osd: 149 osds: 149 up (since 6d), 149 in (since 7w)

  data:
    pools:   4 pools, 2241 pgs
    objects: 25.43M objects, 82 TiB
    usage:   231 TiB used, 187 TiB / 417 TiB avail
    pgs:     2241 active+clean

  io:
    client:   211 MiB/s rd, 273 MiB/s wr, 1.43k op/s rd, 8.80k op/s wr

--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
ssd    417 TiB  187 TiB  230 TiB  231 TiB       55.30
TOTAL  417 TiB  187 TiB  230 TiB  231 TiB       55.30

--- POOLS ---
POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
isos                    7    64  455 GiB  117.92k  1.3 TiB   1.17     38 TiB
rbd                     8  2048   76 TiB   24.65M  222 TiB  66.31     38 TiB
archive                 9   128  2.4 TiB  669.59k  7.3 TiB   6.06     38 TiB
device_health_metrics  10     1   25 MiB      149   76 MiB      0     38 TiB

--
This time, as an exception, the "UTF-8 Problems" self-help group meets in
the large hall.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
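
The post's own reasoning is that snaptrim kicks off the laggy-PG snowball. A
minimal sketch of how one could confirm that and throttle snaptrim, using
standard Ceph CLI commands; the sleep value of 3 seconds is an illustrative
assumption, not a recommendation taken from this thread:

  # Count PGs currently in a snaptrim or snaptrim_wait state while the
  # stall is happening (pgs_brief prints one line per PG with its state).
  ceph pg dump pgs_brief 2>/dev/null | grep -c snaptrim

  # Throttle snaptrim so it yields to client IO; osd_snap_trim_sleep
  # inserts a pause (in seconds) between trim operations.
  ceph config set osd osd_snap_trim_sleep 3

  # Push the setting to running OSDs without restarting them.
  ceph tell 'osd.*' injectargs '--osd_snap_trim_sleep 3'

  # Octopus-era snaptrim slowness has also been tied to BlueFS IO
  # buffering; checking the current value is cheap. (Whether this is the
  # relevant knob for this cluster is an assumption.)
  ceph config get osd bluefs_buffered_io

If snaptrim turns out to be the trigger, setting the sleep back to 0 once
the trim backlog has drained restores normal trim throughput.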