Hello,

I have a Ceph cluster running 14.2.11. I am running benchmark tests with FIO concurrently on ~2000 volumes of 10G each. During the initial warm-up, FIO creates a 10G file on each volume before it runs the actual read/write I/O operations (a sketch of the per-volume job file is at the end of this mail). During this phase, the cluster reports about 35 GiB/s of write throughput for a while, but then I start seeing "long heartbeat" and "slow ops" warnings, and within a few minutes the throughput drops to ~1 GB/s and stays there until all FIO runs complete.

The cluster has 5 monitor nodes and 10 data nodes, each with 10 x 3.2 TB NVMe drives. I have set up 3 OSDs per NVMe drive, for a total of 300 OSDs. Each server has a 200 Gbps uplink, and there is no apparent network bottleneck, as the network is provisioned for over 1 Tbps of bandwidth. I don't see any CPU or memory pressure on the servers either. A single manager instance runs on one of the mons. The pool is configured with a replication factor of 3 and min_size of 2. I tried pg_num values of 8192 and 16384 and saw the issue with both settings.

Could you please let me know whether this is a known issue, or whether there are parameters I can tune? The warnings look like this:

Long heartbeat ping times on back interface seen, longest is 1202.120 msec
Long heartbeat ping times on front interface seen, longest is 1535.191 msec
35 slow ops, oldest one blocked for 122 sec, daemons [osd.135,osd.14,osd.141,osd.143,osd.149,osd.15,osd.151,osd.153,osd.157,osd.162]... have slow ops.

Regards,
Shridhar
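
P.S. For reference, the per-volume FIO job looks roughly like the sketch below. Treat the option values and mount paths as illustrative; I am reconstructing this from memory rather than pasting the exact job file:

; Hypothetical per-volume job file; values are illustrative, not the exact ones used.
[global]
ioengine=libaio
direct=1
; fio lays out the 10G file on each volume during warm-up, before the timed run
size=10g
bs=4k
iodepth=32
time_based=1
runtime=300

; one such job stanza per mounted volume, ~2000 in total
[vol-0001]
directory=/mnt/vol-0001
rw=randrw
rwmixread=70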