Re: Cyclic performance drop

On Fri, 14 Oct 2016, Mike wrote:
> Hello.
> On the latest Jewel release I see a cyclic performance drop on read operations.
> Performance significantly drops every 4-5 seconds from ~70k IOPS to ~20k IOPS.
> 
> It looks like this (some fields were truncated to keep the lines short):
> <CUT>
> ...
> 19:46:10.432125 4096 pgs: 4096 active+clean; 67378 MB data, 82433 kB/s rd, 20608 op/s
> 19:46:11.453338 4096 pgs: 4096 active+clean; 67378 MB data, 104 MB/s rd, 26857 op/s
> 19:46:12.486138 4096 pgs: 4096 active+clean; 67378 MB data, 276 MB/s rd, 70879 op/s
> 19:46:13.517175 4096 pgs: 4096 active+clean; 67378 MB data, 235 MB/s rd, 60375 op/s
> 19:46:15.530826 4096 pgs: 4096 active+clean; 67378 MB data, 81768 kB/s rd, 20442 op/s
> 19:46:16.561929 4096 pgs: 4096 active+clean; 67378 MB data, 132 MB/s rd, 33811 op/s
> 19:46:17.582495 4096 pgs: 4096 active+clean; 67378 MB data, 277 MB/s rd, 71027 op/s
> 19:46:18.614087 4096 pgs: 4096 active+clean; 67378 MB data, 200 MB/s rd, 51365 op/s
> 19:46:20.643567 4096 pgs: 4096 active+clean; 67378 MB data, 97849 kB/s rd, 24462 op/s
> 19:46:21.664988 4096 pgs: 4096 active+clean; 67378 MB data, 129 MB/s rd, 33108 op/s
> 19:46:22.693243 4096 pgs: 4096 active+clean; 67378 MB data, 270 MB/s rd, 69269 op/s
> 19:46:23.692111 4096 pgs: 4096 active+clean; 67378 MB data, 199 MB/s rd, 51186 op/s
> 19:46:25.725054 4096 pgs: 4096 active+clean; 67378 MB data, 84951 kB/s rd, 21238 op/s
> 19:46:26.746227 4096 pgs: 4096 active+clean; 67378 MB data, 132 MB/s rd, 33833 op/s
> 19:46:27.779780 4096 pgs: 4096 active+clean; 67378 MB data, 293 MB/s rd, 75189 op/s
> 19:46:28.775288 4096 pgs: 4096 active+clean; 67378 MB data, 204 MB/s rd, 52249 op/s
> 19:46:30.795561 4096 pgs: 4096 active+clean; 67378 MB data, 75260 kB/s rd, 18815 op/s
> 19:46:31.818544 4096 pgs: 4096 active+clean; 67378 MB data, 133 MB/s rd, 34243 op/s
> 19:46:32.851392 4096 pgs: 4096 active+clean; 67378 MB data, 295 MB/s rd, 75755 op/s
> 19:46:33.843960 4096 pgs: 4096 active+clean; 67378 MB data, 205 MB/s rd, 52649 op/s
> 19:46:34.861416 4096 pgs: 4096 active+clean; 67378 MB data, 69177 kB/s rd, 17294 op/s
> 19:46:35.872386 4096 pgs: 4096 active+clean; 67378 MB data, 85299 kB/s rd, 21324 op/s
> 19:46:36.898020 4096 pgs: 4096 active+clean; 67378 MB data, 155 MB/s rd, 39896 op/s
> 19:46:37.934147 4096 pgs: 4096 active+clean; 67378 MB data, 321 MB/s rd, 82209 op/s
> 19:46:39.966386 4096 pgs: 4096 active+clean; 67378 MB data, 163 MB/s rd, 41735 op/s
> 19:46:40.973110 4096 pgs: 4096 active+clean; 67378 MB data, 55481 kB/s rd, 13870 op/s
> ...
> <CUT>

You should probably confirm this result by looking at the raw perfcounter 
stats coming out of the OSD admin socket interface.  (Operators usually 
wire this up to graphite or similar monitoring tools.)
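
If you want to watch a specific counter, something along these lines
should work (a minimal sketch, assuming osd.0 runs on the host you are
logged into, the admin socket is at its default path, and jq is
installed; the exact counter names under the "osd" section vary between
releases, so check your own perf dump output first):

 # sample the cumulative read-op counter from osd.0 once a second
 while true; do
   ceph daemon osd.0 perf dump | jq '.osd.op_r'
   sleep 1
 done

Differencing consecutive samples gives a per-OSD read rate that does not
depend on the monitor's aggregation.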

If this is a smallish cluster, a simpler check would be

 ceph daemonperf osd.0

and see if the stats reported by a single OSD show the same behavior.

The numbers reported by the monitor are not very accurate.  They average 
over a short period of time and can be sensitive to the timing of stat 
reports from OSDs (we're effectively taking the differential of a very 
choppy stair-step function and hoping for the best).
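
If you want to see how that timing lines up with the period you are
observing, you can check how often each OSD pushes its stats to the
monitors (the option name below is from the Jewel-era config, so verify
it against your version):

 ceph daemon osd.0 config get osd_mon_report_interval_min

A value of around 5 seconds there would line up fairly well with the
4-5 second cycle in your output.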

sage



> 
>  For my test, I run 6 VMs with RBD disks from the SSD pool and start fio on them:
> <CUT>
> # cat aio-read.fio
> [global]
> ioengine=libaio
> buffered=0
> rw=randread
> bs=4k
> size=2g
> directory=/backup
> group_reporting
> thread
> 
> [file1]
> iodepth=4
> numjobs=4
> #
> <CUT>
> 
> What is this? Is it a bug or a misconfiguration?
> 
> 
> P.S. About the cluster.
> 
> The cluster is HEALTH_OK:
> <CUT>
> # ceph -s
>     cluster 1894d33c-d75b-49d3-bf28-b28467d1754d
>      health HEALTH_OK
>      monmap e1: 5 mons at
> {c1=10.22.11.20:6789/0,c2=10.22.11.21:6789/0,c3=10.22.11.22:6789/0,c4=10.22.11.23:6789/0,c5=10.22.11.24:6789/0}
>             election epoch 944, quorum 0,1,2,3,4 c1,c2,c3,c4,c5
>      osdmap e5431: 80 osds: 80 up, 80 in
>             flags sortbitwise
>       pgmap v3607661: 4096 pgs, 2 pools, 67378 MB data, 16879 objects
>             467 GB used, 221 TB / 221 TB avail
>                 4096 active+clean
> 
> [root@c1 current]#
> <CUT>
> 
> I have 5 OSD nodes (the monitors are located on these nodes, but use dedicated SSDs that
> are not used as OSD drives). Each OSD node has:
> * 6 x SSD Intel DC S3700 200GB (SSDSC2BA200G3): 2 connected to the Intel SATA controller and
> the other 4 to an LSI 9261-8i SAS RAID controller, with RAID 0 created on it.
> * 12 x SAS HDD (not used in this test)
> * 2 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
> * 64GB RAM
> * 2 x dual-port Mellanox ConnectX-3 Pro EN 10Gbit NICs, set up as 2 bonds for the cluster
> and client networks.
> 
> I made two pools: hot and data. The "hot" pool uses only SSDs, the "data" pool only HDDs.
> Crush map:
> <CUT>
> ...
> host c5-ssd {
>         id -11          # do not change unnecessarily
>         # weight 0.688
>         alg straw2
>         hash 0  # rjenkins1
>         item osd.76 weight 0.172
>         item osd.77 weight 0.172
>         item osd.78 weight 0.172
>         item osd.79 weight 0.172
> }
> root ssd {
>         id -12          # do not change unnecessarily
>         # weight 1.376
>         alg straw2
>         hash 0  # rjenkins1
>         item c1-ssd weight 0.000
>         item c2-ssd weight 0.000
>         item c3-ssd weight 0.000
>         item c4-ssd weight 0.688
>         item c5-ssd weight 0.688
> }
> 
> # rules
> rule replicated_ruleset {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take ssd
>         step chooseleaf firstn 0 type host
>         step emit
> }
> rule data {
>         ruleset 1
>         type erasure
>         min_size 3
>         max_size 5
>         step set_chooseleaf_tries 5
>         step set_choose_tries 100
>         step take default
>         step chooseleaf indep 0 type host
>         step emit
> }
> 
> # end crush map
> <CUT>
> 
> I also have some kernel tuning parameters:
> <CUT>
> # cat /etc/sysctl.d/02-kernel.conf
> ## Kernel PID max
> kernel.pid_max = 4194303
> fs.file-max = 26234859
> ## VM swappiness
> vm.swappiness = 0
> vm.vfs_cache_pressure = 50
> vm.min_free_kbytes = 1978336
> <CUT>
> 
> Each SSD OSD has its own configuration in ceph.conf (the OSDs with ids of 60 and above are the SSD disks):
> <CUT>
> # cat /etc/ceph/ceph.conf
> [global]
> fsid = 1894d33c-d75b-49d3-bf28-b28467d1754d
> mon initial members = c1, c2, c3, c4, c5
> mon host = 10.22.11.20,10.22.11.21,10.22.11.22,10.22.11.23,10.22.11.24
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> osd pool default size = 2
> mon osd full ratio = .90
> mon osd nearfull ratio = .85
> public network = 10.22.11.0/24
> cluster network = 10.22.10.0/24
> admin socket = /var/run/ceph/$cluster-$name.asok
> filestore btrfs snap = true
> filestore btrfs clone range = true
> filestore xattr use omap = true
> filestore op threads = 4
> filestore seek data hole = true
> 
> [osd]
> osd crush update on start = false
> osd mount options xfs = rw,noatime,logbsize=256k,logbufs=8,inode64,allocsize=4M
> osd mount options btrfs = rw,noatime,autodefrag,user_subvol_rm_allowed
> osd op threads = 4
> journal block align = true
> journal dio = true
> journal aio = true
> 
> ....
> 
> [osd.64]
> filestore journal writeahead = true
> osd op threads = 32
> osd op num shards = 16
> osd op num threads per shard = 2
> filestore fd cache shards = 64
> filestore fd cache size = 10240
> filestore op threads = 16
> filestore queue max ops = 5000
> filestore queue committing max ops = 5000
> journal queue max ops = 3000
> journal queue max bytes = 10485760000
> journal max write entries = 1000
> 
> ....
> <CUT>
> 
> -- 
> Mike, run.
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


