Hello. On the latest Jewel release I see a cyclic performance drop on read operations: every 4-5 seconds throughput drops from ~70k IOPS to ~20k IOPS. It looks like this (some fields were truncated to keep the lines short):

<CUT>
...
19:46:10.432125 4096 pgs: 4096 active+clean; 67378 MB data, 82433 kB/s rd, 20608 op/s
19:46:11.453338 4096 pgs: 4096 active+clean; 67378 MB data, 104 MB/s rd, 26857 op/s
19:46:12.486138 4096 pgs: 4096 active+clean; 67378 MB data, 276 MB/s rd, 70879 op/s
19:46:13.517175 4096 pgs: 4096 active+clean; 67378 MB data, 235 MB/s rd, 60375 op/s
19:46:15.530826 4096 pgs: 4096 active+clean; 67378 MB data, 81768 kB/s rd, 20442 op/s
19:46:16.561929 4096 pgs: 4096 active+clean; 67378 MB data, 132 MB/s rd, 33811 op/s
19:46:17.582495 4096 pgs: 4096 active+clean; 67378 MB data, 277 MB/s rd, 71027 op/s
19:46:18.614087 4096 pgs: 4096 active+clean; 67378 MB data, 200 MB/s rd, 51365 op/s
19:46:20.643567 4096 pgs: 4096 active+clean; 67378 MB data, 97849 kB/s rd, 24462 op/s
19:46:21.664988 4096 pgs: 4096 active+clean; 67378 MB data, 129 MB/s rd, 33108 op/s
19:46:22.693243 4096 pgs: 4096 active+clean; 67378 MB data, 270 MB/s rd, 69269 op/s
19:46:23.692111 4096 pgs: 4096 active+clean; 67378 MB data, 199 MB/s rd, 51186 op/s
19:46:25.725054 4096 pgs: 4096 active+clean; 67378 MB data, 84951 kB/s rd, 21238 op/s
19:46:26.746227 4096 pgs: 4096 active+clean; 67378 MB data, 132 MB/s rd, 33833 op/s
19:46:27.779780 4096 pgs: 4096 active+clean; 67378 MB data, 293 MB/s rd, 75189 op/s
19:46:28.775288 4096 pgs: 4096 active+clean; 67378 MB data, 204 MB/s rd, 52249 op/s
19:46:30.795561 4096 pgs: 4096 active+clean; 67378 MB data, 75260 kB/s rd, 18815 op/s
19:46:31.818544 4096 pgs: 4096 active+clean; 67378 MB data, 133 MB/s rd, 34243 op/s
19:46:32.851392 4096 pgs: 4096 active+clean; 67378 MB data, 295 MB/s rd, 75755 op/s
19:46:33.843960 4096 pgs: 4096 active+clean; 67378 MB data, 205 MB/s rd, 52649 op/s
19:46:34.861416 4096 pgs: 4096 active+clean; 67378 MB data, 69177 kB/s rd, 17294 op/s
19:46:35.872386 4096 pgs: 4096 active+clean; 67378 MB data, 85299 kB/s rd, 21324 op/s
19:46:36.898020 4096 pgs: 4096 active+clean; 67378 MB data, 155 MB/s rd, 39896 op/s
19:46:37.934147 4096 pgs: 4096 active+clean; 67378 MB data, 321 MB/s rd, 82209 op/s
19:46:39.966386 4096 pgs: 4096 active+clean; 67378 MB data, 163 MB/s rd, 41735 op/s
19:46:40.973110 4096 pgs: 4096 active+clean; 67378 MB data, 55481 kB/s rd, 13870 op/s
...
<CUT>

My test is to run 6 VMs with RBD disks from the SSD pool and to start fio in each of them:

<CUT>
# cat aio-read.fio
[global]
ioengine=libaio
buffered=0
rw=randread
bs=4k
size=2g
directory=/backup
group_reporting
thread

[file1]
iodepth=4
numjobs=4
#
<CUT>

What is this? Is it a bug or a wrong configuration?

P.S. About the cluster. The cluster is HEALTH_OK:

<CUT>
# ceph -s
    cluster 1894d33c-d75b-49d3-bf28-b28467d1754d
     health HEALTH_OK
     monmap e1: 5 mons at {c1=10.22.11.20:6789/0,c2=10.22.11.21:6789/0,c3=10.22.11.22:6789/0,c4=10.22.11.23:6789/0,c5=10.22.11.24:6789/0}
            election epoch 944, quorum 0,1,2,3,4 c1,c2,c3,c4,c5
     osdmap e5431: 80 osds: 80 up, 80 in
            flags sortbitwise
      pgmap v3607661: 4096 pgs, 2 pools, 67378 MB data, 16879 objects
            467 GB used, 221 TB / 221 TB avail
                4096 active+clean
[root@c1 current]#
<CUT>

I have 5 OSD nodes (the monitors are located on these nodes, but they use a dedicated SSD that is not used as an OSD drive).
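To see whether the dips line up with anything on the OSD side, something like this minimal sketch could be run from an admin node while fio is going (the 1-second interval and the log file name are arbitrary examples, not part of my setup):

# poll the per-OSD commit/apply latencies once a second during the fio run;
# the SSD OSDs in this cluster are ids 60-79, the HDD ones can be ignored
while sleep 1; do
    echo "--- $(date +%T) ---"
    ceph osd perf
done | tee osd-perf.log

If fs_commit_latency/fs_apply_latency on the SSD OSDs spikes in the same 4-5 second rhythm, the stall is at least visible on the server side.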
Each OSD node has:
* 6 x SSD Intel DC S3700 200GB (SSDSC2BA200G3): 2 connected to the Intel SATA controller, the other 4 to an LSI 9261-8i SAS RAID controller with RAID 0 created on it
* 12 x SAS HDD (not used in this test)
* 2 x Intel Xeon E5-2620 v2 @ 2.10GHz
* 64GB RAM
* 2 x dual-port Mellanox ConnectX-3 Pro EN 10GbE NICs, configured as two bonds: one for the cluster network and one for the client network

I made two pools: "hot" and "data". The "hot" pool is all SSD, the "data" pool is HDD only. Crush map:

<CUT>
...
host c5-ssd {
        id -11          # do not change unnecessarily
        # weight 0.688
        alg straw2
        hash 0  # rjenkins1
        item osd.76 weight 0.172
        item osd.77 weight 0.172
        item osd.78 weight 0.172
        item osd.79 weight 0.172
}
root ssd {
        id -12          # do not change unnecessarily
        # weight 1.376
        alg straw2
        hash 0  # rjenkins1
        item c1-ssd weight 0.000
        item c2-ssd weight 0.000
        item c3-ssd weight 0.000
        item c4-ssd weight 0.688
        item c5-ssd weight 0.688
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
}
rule data {
        ruleset 1
        type erasure
        min_size 3
        max_size 5
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step chooseleaf indep 0 type host
        step emit
}

# end crush map
<CUT>

I also have some kernel tuning parameters:

<CUT>
# cat /etc/sysctl.d/02-kernel.conf
## Kernel PID max
kernel.pid_max = 4194303
fs.file-max = 26234859
## VM swappiness
vm.swappiness = 0
vm.vfs_cache_pressure = 50
vm.min_free_kbytes = 1978336
<CUT>

Each SSD OSD has its own section in ceph.conf (OSD ids 60 and above are the SSD disks):

<CUT>
# cat /etc/ceph/ceph.conf
[global]
fsid = 1894d33c-d75b-49d3-bf28-b28467d1754d
mon initial members = c1, c2, c3, c4, c5
mon host = 10.22.11.20,10.22.11.21,10.22.11.22,10.22.11.23,10.22.11.24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 2
mon osd full ratio = .90
mon osd nearfull ratio = .85
public network = 10.22.11.0/24
cluster network = 10.22.10.0/24
admin socket = /var/run/ceph/$cluster-$name.asok
filestore btrfs snap = true
filestore btrfs clone range = true
filestore xattr use omap = true
filestore op threads = 4
filestore seek data hole = true

[osd]
osd crush update on start = false
osd mount options xfs = rw,noatime,logbsize=256k,logbufs=8,inode64,allocsize=4M
osd mount options btrfs = rw,noatime,autodefrag,user_subvol_rm_allowed
osd op threads = 4
journal block align = true
journal dio = true
journal aio = true
....

[osd.64]
filestore journal writeahead = true
osd op threads = 32
osd op num shards = 16
osd op num threads per shard = 2
filestore fd cache shards = 64
filestore fd cache size = 10240
filestore op threads = 16
filestore queue max ops = 5000
filestore queue committing max ops = 5000
journal queue max ops = 3000
journal queue max bytes = 10485760000
journal max write entries = 1000
....
<CUT>

--
Mike, run.
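P.P.S. A minimal sketch of how the effective values of those per-OSD overrides can be double-checked at runtime, assuming it is run on the node that hosts osd.64 and uses the admin socket configured in [global] above:

# ask the running OSD for its effective settings and filter for the overridden keys
ceph daemon osd.64 config show | egrep \
    'filestore_op_threads|filestore_queue_max_ops|filestore_queue_committing_max_ops|filestore_fd_cache|osd_op_num_shards|osd_op_num_threads_per_shard|journal_queue_max|journal_max_write_entries'

If one of them still shows its default, the corresponding line in the [osd.64] section is not being picked up.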