Hi,

We had quite a serious issue with slow ops. In our case the DB team used the cluster to read and write in the same pool at the same time, and it made the cluster unusable. When we ran fio we realised that Ceph does not handle concurrent reads and writes in the same pool well, so we tested with fio against two separate pools, sending the read workload to one pool and the write workload to the other, and magic happened: no slow ops and far higher performance. We asked the DB team to split their reads and writes the same way (as much as they could), and the issue was solved (after two weeks).

Thank you

________________________________________
From: Void Star Nill <void.star.nill@xxxxxxxxx>
Sent: Thursday, October 8, 2020 1:14 PM
To: ceph-users
Subject: [Suspicious newsletter] Weird performance issue with long heartbeat and slow ops warnings

Hello,

I have a Ceph cluster running 14.2.11. I am running benchmark tests with fio concurrently on ~2000 volumes of 10G each. During the initial warm-up, fio creates a 10G file on each volume before it runs the actual read/write I/O operations. During this time the cluster reports about 35 GiB/s write throughput for a while, but then I start seeing "long heartbeat" and "slow ops" warnings, and within a few minutes the throughput drops to ~1 GB/s and stays there until all fio runs complete.

The cluster has 5 monitor nodes and 10 data nodes, each with 10x 3.2 TB NVMe drives. I have set up 3 OSDs per NVMe drive, so there are 300 OSDs in total. Each server has a 200 Gb uplink, and there is no apparent network bottleneck, as the network is provisioned for over 1 Tbps of bandwidth. I don't see any CPU or memory pressure on the servers either. A single manager instance runs on one of the mons. The pool is configured with a replication factor of 3 and min_size of 2.
I tried pg_num values of 8192 and 16384 and saw the issue with both settings. Could you please suggest whether this is a known issue, or whether there are any parameters I can tune?

Long heartbeat ping times on back interface seen, longest is 1202.120 msec
Long heartbeat ping times on front interface seen, longest is 1535.191 msec
35 slow ops, oldest one blocked for 122 sec, daemons [osd.135,osd.14,osd.141,osd.143,osd.149,osd.15,osd.151,osd.153,osd.157,osd.162]... have slow ops.

Regards,
Shridhar
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
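For anyone wanting to reproduce the split-pool comparison described in the reply at the top of the thread, it can be sketched as a fio job file using fio's rbd ioengine, with the read job and the write job pointed at two different pools. This is a minimal sketch; the pool names (bench-read, bench-write), image names, and cephx user are illustrative assumptions, not anyone's actual configuration:

```ini
; split-pools.fio -- hypothetical job file: all reads go to one pool and
; all writes to another, so the two workloads never contend within the
; same pool's PGs. Pool/image/user names are assumptions for illustration.
[global]
ioengine=rbd        ; fio's librbd engine, talks to the cluster directly
clientname=admin    ; cephx user (assumed)
direct=1
bs=4k
iodepth=32
runtime=300
time_based=1

[reads]
pool=bench-read     ; pool dedicated to read traffic
rbdname=readimg
rw=randread

[writes]
pool=bench-write    ; pool dedicated to write traffic
rbdname=writeimg
rw=randwrite
```

The two pools and images would be created beforehand (e.g. with `ceph osd pool create` and `rbd create`), and the job file run with `fio split-pools.fio`. Comparing the result against an equivalent single-pool job with rw=randrw shows whether separating the traffic avoids the slow ops on a given cluster.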