Do you have test results for the same test without caching?

I have seen periodic stalls in any RBD IOP/s benchmark on ceph. The benchmarks create IO requests much faster than the OSDs can handle them. At some point all queues run full and you start seeing slow ops on OSDs. I would also prefer IO activity to be more steady and not so bursty, but for some reason IO throttling is pushed to the clients instead of being handled by the internal OPS queueing system (ceph is collaborative, meaning a rogue, non-collaborative client can ruin it for everyone).

If you know what your IO stack can handle without stalls, you can use libvirt QOS settings to limit clients with reasonable peak-load and steady-load settings.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: norman <norman.kern@xxxxxxx>
Sent: 20 November 2020 13:40:18
To: ceph-users
Subject: The serious side-effect of rbd cache setting

Hi All,

We're testing the rbd cache settings for OpenStack (Ceph 14.2.5, BlueStore, 3 replicas) and ran into an odd problem:

1. Set the librbd cache options:

[client]
rbd cache = true
rbd cache size = 16777216
rbd cache max dirty = 12582912
rbd cache target dirty = 8388608
rbd cache max dirty age = 1
rbd cache writethrough until flush = true

2. Run an rbd bench:

rbd -c /etc/ceph/ceph.conf \
    -k /etc/ceph/keyring2 \
    -n client.rbd-openstack-002 bench \
    --io-size 4K \
    --io-threads 1 \
    --io-pattern seq \
    --io-type read \
    --io-total 100G \
    openstack-volumes/image-you-can-drop-me

3. Start another test:

rbd -c /etc/ceph/ceph.conf \
    -k /etc/ceph/keyring2 \
    -n client.rbd-openstack-002 bench \
    --io-size 4K \
    --io-threads 1 \
    --io-pattern rand \
    --io-type write \
    --io-total 100G \
    openstack-volumes/image-you-can-drop-me2

After running for a few minutes, I found the read test almost hung for a while:

  SEC       OPS   OPS/SEC    BYTES/SEC
   69    152069   2375.21   9728858.72
   70    153627   2104.63   8620569.93
   71    155748   1956.04   8011953.10
   72    157665   1945.84   7970177.24
   73    159661   1947.64   7977549.44
   74    161522   1890.45   7743277.01
   75    163583   1991.04   8155301.58
   76    165791   2008.44   8226566.26
   77    168433   2153.43   8820438.66
   78    170269   2121.43   8689377.16
   79    172511   2197.62   9001467.33
   80    174845   2252.22   9225091.00
   81    177089   2259.42   9254579.83
   82    179675   2248.22   9208708.30
   83    182053   2356.61   9652679.11
   84    185087   2515.00  10301433.50
   99    185345    550.16   2253434.96
  101    185346    407.76   1670187.73
  102    185348    282.44   1156878.38
  103    185350    162.34    664931.53
  104    185353     12.86     52681.27
  105    185357      1.93      7916.89
  106    185361      2.74     11235.38
  107    185367      3.27     13379.95
  108    185375      5.08     20794.43
  109    185384      6.93     28365.91
  110    185403      9.19     37650.06
  111    185438     17.47     71544.17
  128    185467      4.94     20243.53
  129    185468      4.45     18210.82
  131    185469      3.89     15928.44
  132    185493      4.09     16764.16
  133    185529      4.16     17037.21
  134    185578     18.64     76329.67
  135    185631     27.78    113768.65

Why did this happen? This is unacceptable performance for reads.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
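
A note on the no-cache comparison asked about at the top of the thread: one simple way to get it is to turn the librbd cache off for the benchmark client and rerun the exact same bench command. This is only a sketch, assuming the settings live in the same [client] section used for the test above.

# /etc/ceph/ceph.conf on the benchmark client:
# disables the librbd cache entirely for this client
[client]
rbd cache = false

# then rerun the identical read benchmark and compare OPS/SEC
rbd -c /etc/ceph/ceph.conf \
    -k /etc/ceph/keyring2 \
    -n client.rbd-openstack-002 bench \
    --io-size 4K \
    --io-threads 1 \
    --io-pattern seq \
    --io-type read \
    --io-total 100G \
    openstack-volumes/image-you-can-drop-me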
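
To check whether the stall coincides with OSD-side queue pressure (the "slow ops" mentioned in the reply), the cluster health and the OSD admin socket can be inspected while the benchmark runs. A rough sketch; osd.0 is just a placeholder for whichever OSDs serve the image's PGs.

# cluster-wide health; shows a SLOW_OPS warning if ops are stuck
ceph health detail

# on an OSD host: ops currently queued/executing on a given OSD
ceph daemon osd.0 dump_ops_in_flight

# recently completed ops with per-stage timings
ceph daemon osd.0 dump_historic_ops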
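
As for the libvirt QOS suggestion, the per-disk <iotune> element (or virsh blkdeviotune) is where a steady-load limit plus a short peak-load allowance can be expressed, assuming a libvirt version recent enough to support the *_max/*_max_length knobs. The numbers, the device name vdb and the <domain> name below are made-up placeholders, to be replaced with whatever the IO stack was measured to sustain without stalls.

<!-- fragment of the guest's <disk> definition -->
<disk type='network' device='disk'>
  ...
  <target dev='vdb' bus='virtio'/>
  <iotune>
    <total_iops_sec>500</total_iops_sec>                        <!-- steady-load IOPS limit -->
    <total_iops_sec_max>2000</total_iops_sec_max>               <!-- peak-load (burst) IOPS limit -->
    <total_iops_sec_max_length>10</total_iops_sec_max_length>   <!-- burst may last this many seconds -->
    <total_bytes_sec>104857600</total_bytes_sec>                <!-- 100 MB/s steady throughput cap -->
  </iotune>
</disk>

# the same limits applied to a running guest:
virsh blkdeviotune <domain> vdb --total-iops-sec 500 \
    --total-iops-sec-max 2000 --total-iops-sec-max-length 10 \
    --total-bytes-sec 104857600 --live --config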