Dear list,

I have a small cluster (Reef 18.2.4) with 7 hosts and 3-4 OSDs each
(960GB/1.92TB mixed Intel D3-S4610, Samsung SM883 and PM897 SSDs):

  cluster:
    id:     ecff3ce8-539b-443e-a492-da428f4aa9e9
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum titan,mangan,kalium,argon,chromium (age 2w)
    mgr: mangan(active, since 2w), standbys: titan, argon
    osd: 22 osds: 22 up (since 2w), 22 in (since 3M)

  data:
    pools:   2 pools, 513 pgs
    objects: 2.76M objects, 7.0 TiB
    usage:   16 TiB used, 15 TiB / 31 TiB avail
    pgs:     513 active+clean

The cluster stores RBD volumes for virtual machines.

For a couple of months now the cluster has been reporting slow ops on some
OSDs and flagging some PGs as laggy. This happens once or twice a day,
sometimes more often and sometimes not at all for a few days, at completely
random times: independent of when snapshots are deleted and trimmed, and
independent of the I/O load on the cluster or the load on the hosts. For
about 30 seconds the write speed on the VMs drops to zero, then everything
returns to normal.

I cannot reproduce the slow ops manually by creating write load on the
cluster; even writing continuously at full speed (300-400 MB/s) for 20
minutes does not cause any problems. See the attached log file for a
typical occurrence. I have also measured the write load on the disks with
iostat while the problem was happening, which just shows the writes
stalling (also attached).

The OSDs with slow ops are completely random; every disk shows up once in
a while.

Current config (I've already tried optimising snaptrim and scrub, which
didn't help):

# ceph config dump
WHO     MASK  LEVEL     OPTION                                  VALUE         RO
global        advanced  auth_client_required                    cephx         *
global        advanced  auth_cluster_required                   cephx         *
global        advanced  auth_service_required                   cephx         *
global        advanced  bdev_async_discard                      true
global        advanced  bdev_enable_discard                     true
global        advanced  public_network                          10.0.4.0/24   *
mon           advanced  auth_allow_insecure_global_id_reclaim   false
mgr           advanced  mgr/balancer/active                     true
mgr           advanced  mgr/balancer/mode                       upmap
mgr           unknown   mgr/pg_autoscaler/autoscale_profile     scale-up      *
osd           basic     osd_memory_target                       4294967296
osd           advanced  osd_pg_max_concurrent_snap_trims        1
osd           advanced  osd_scrub_begin_hour                    23
osd           advanced  osd_scrub_end_hour                      4
osd           advanced  osd_scrub_sleep                         1.000000
osd           advanced  osd_snap_trim_priority                  1
osd           advanced  osd_snap_trim_sleep                     2.000000
osd.0         basic     osd_mclock_max_capacity_iops_ssd        29199.674019
osd.1         basic     osd_mclock_max_capacity_iops_ssd        31554.530141
osd.10        basic     osd_mclock_max_capacity_iops_ssd        25949.821194
osd.11        basic     osd_mclock_max_capacity_iops_ssd        26300.596265
osd.12        basic     osd_mclock_max_capacity_iops_ssd        25167.331294
osd.13        basic     osd_mclock_max_capacity_iops_ssd        21606.610828
osd.14        basic     osd_mclock_max_capacity_iops_ssd        27894.095121
osd.15        basic     osd_mclock_max_capacity_iops_ssd        25929.047047
osd.16        basic     osd_mclock_max_capacity_iops_ssd        15423.600235
osd.17        basic     osd_mclock_max_capacity_iops_ssd        25097.493934
osd.18        basic     osd_mclock_max_capacity_iops_ssd        25966.188007
osd.19        basic     osd_mclock_max_capacity_iops_ssd        23628.746459
osd.2         basic     osd_mclock_max_capacity_iops_ssd        32157.280832
osd.20        basic     osd_mclock_max_capacity_iops_ssd        22722.682745
osd.3         basic     osd_mclock_max_capacity_iops_ssd        33951.086556
osd.4         basic     osd_mclock_max_capacity_iops_ssd        22736.907664
osd.5         basic     osd_mclock_max_capacity_iops_ssd        21916.777510
osd.6         basic     osd_mclock_max_capacity_iops_ssd        29984.954749
osd.7         basic     osd_mclock_max_capacity_iops_ssd        26757.965797
osd.8         basic     osd_mclock_max_capacity_iops_ssd        22738.921429
osd.9         basic     osd_mclock_max_capacity_iops_ssd        24635.156413
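For reference, the attached iostat output was captured with something along
the lines of "iostat -x -t 1" on an affected host. If more detail would
help, I can also dump the in-flight and recent slow ops on an affected OSD
during the next incident, e.g. (osd.5 is only an example id here):

# ceph health detail
# ceph daemon osd.5 dump_ops_in_flight
# ceph daemon osd.5 dump_historic_slow_ops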
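The osd_mclock_max_capacity_iops_ssd values above are the ones the OSDs
recorded themselves (I believe from their automatic startup benchmark), and
there is no osd_mclock_profile override in the dump, so the OSDs should be
running the default mClock profile. That can be checked per OSD like this
(osd.0 picked arbitrarily):

# ceph config show osd.0 osd_mclock_profile
# ceph config show osd.0 osd_mclock_max_capacity_iops_ssd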
Any help would be much appreciated!

Thanks,
Tim

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx