Hello all,

I'm testing out a new cluster that we hope to put into production soon. Performance has overall been great, but there's one benchmark that not only stresses the cluster, but causes it to degrade: async randwrites.

The benchmark:

# The file was previously laid out with dd'd random data to prevent sparseness
root@mc-3015-201:~# fio --rw=randwrite --bs=4k --size=100G --numjobs=$JOBS --group_reporting --directory=/mnt/ceph/ --name=largerandwrite --iodepth=16 --end_fsync=1
largerandwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=16
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
largerandwrite: (groupid=0, jobs=1): err= 0: pid=17230: Mon May 6 11:30:11 2019
  write: IOPS=14.7k, BW=57.4MiB/s (60.2MB/s)(100GiB/1782445msec)
    clat (nsec): min=1617, max=120033k, avg=12644.96, stdev=379152.20
     lat (nsec): min=1656, max=120033k, avg=12687.21, stdev=379152.31
    clat percentiles (usec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    3], 40.00th=[    3], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    6],
     | 99.00th=[    9], 99.50th=[   11], 99.90th=[   24], 99.95th=[10290],
     | 99.99th=[19530]
   bw (  KiB/s): min=19424, max=1395544, per=100.00%, avg=306914.86, stdev=392390.46, samples=683
   iops        : min= 4856, max=348886, avg=76728.66, stdev=98097.60, samples=683
  lat (usec)   : 2=0.01%, 4=78.02%, 10=21.37%, 20=0.43%, 50=0.10%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 10=0.01%, 20=0.06%, 50=0.01%, 250=0.01%
  cpu          : usr=0.78%, sys=5.03%, ctx=30215, majf=0, minf=1657
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,26214400,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=57.4MiB/s (60.2MB/s), 57.4MiB/s-57.4MiB/s (60.2MB/s-60.2MB/s), io=100GiB (107GB), run=1782445-1782445msec

Setup:

Ubuntu 18.04.2 + Nautilus repo (deb https://download.ceph.com/debian-nautilus bionic main)

3 hosts, each with a 100Gbit/s NIC and:
  - Dual-socket Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz (20 cores each)
  - 2 Optane NVMe cards, in one LVM VG: vg_optane
  - 18 x 12TB 7200rpm SAS drives, each running a bluestore OSD
  - each HDD OSD has a 32GB LV on vg_optane for wal+db
  - 3 SSD OSDs, each on a 32GB LV on vg_optane
  - 1 mon, 1 mgr, and 1 mds

Pools:
  - cephfs_data on the hdd OSDs, 512 PGs
  - cephfs_metadata on the ssd OSDs, 16 PGs
  - both replicated, size = 3, min_size = 2

root@m3-3101-422:~# ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED       RAW USED     %RAW USED
    hdd       591 TiB     582 TiB     8.7 TiB     8.8 TiB          1.49
    ssd       288 GiB     257 GiB      14 GiB      31 GiB         10.75
    TOTAL     591 TiB     582 TiB     8.8 TiB     8.8 TiB          1.49

POOLS:
    POOL                ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    cephfs_metadata     23     5.3 GiB       5.71k     5.6 GiB      2.30        79 GiB
    cephfs_data         24     1.6 TiB      11.43M     6.8 TiB      1.23       183 TiB

Observations:
  - When starting the benchmark, it's over 10 seconds before iostat shows any activity on the OSD drives.
  - fio gets to 100% very quickly, then the end_fsync takes a long time.
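For what it's worth, the delay is easy to see with something like the commands below. This is an illustrative sketch rather than an exact transcript; adjust the interval and device selection for your layout.

# On the fio client: dirty pages pile up for ~10s while the OSD disks stay idle
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'

# On each OSD host: per-device utilisation of the SAS drives
iostat -x 1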
While the benchmark runs, the cluster goes into HEALTH_WARN:

root@m3-3101-422:~# ceph health detail | head
HEALTH_WARN noscrub,nodeep-scrub flag(s) set; Degraded data redundancy: 207/34295076 objects degraded (0.001%), 28 pgs degraded, 5 pgs undersized; 18 pgs not deep-scrubbed in time; 528 pgs not scrubbed in time; 1 pools have too many placement groups; too few PGs per OSD (25 < min 30)
OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
PG_DEGRADED Degraded data redundancy: 207/34295076 objects degraded (0.001%), 28 pgs degraded, 5 pgs undersized
    pg 24.8 is active+recovery_wait+degraded, acting [10,40,36]
    pg 24.11 is active+recovery_wait+degraded, acting [40,0,36]
    pg 24.15 is stuck undersized for 137.001663, current state active+recovery_wait+undersized+degraded+remapped, last acting [16,33]
    pg 24.25 is active+recovery_wait+degraded, acting [7,20,40]
    pg 24.2e is active+recovering+degraded, acting [45,40,32]
    pg 24.49 is active+recovery_wait+degraded, acting [1,40,9]
    pg 24.5d is active+recovery_wait+degraded, acting [16,40,27]

It seems that the OSDs cannot keep up and thus become degraded. I understand that 4k random writes are a very harsh benchmark for a distributed cluster of spinning disks, and the actual performance is acceptable. What I want is for Ceph not to enter HEALTH_WARN while running it.

Things I have tried (the P.S. below shows roughly how the OSD settings were applied):
  - Increasing the OSD journal size to 10GB (bluestore_block_wal_size) - no effect.
  - Setting write_congestion_kb to a few MiB in /etc/fstab - no effect.
  - Increasing OSD shards (osd_op_num_shards_hdd 5 -> 10) - no effect.
  - Increasing threads per OSD shard (osd_op_num_threads_per_shard_hdd 1 -> 2) - hopeful. No degraded PGs when running 1 fio job, though performance actually seems slightly lower (more seeks on the drives, perhaps?). Running 2 fio jobs causes the same problem, though.

So, what is the best solution here? In future I think I would buy CPUs with faster single-thread performance. Is there another tweakable I am missing? Should I keep cranking up osd_op_num_threads_per_shard_hdd?

Thanks in advance,
Nathan
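P.S. In case it helps to see them in context, applying the shard tweaks above via the Nautilus centralized config would look something like the following. This is a sketch rather than a copy of my shell history, and it assumes the OSDs run under the standard ceph-osd systemd units:

# Values from the "Things I have tried" list above
ceph config set osd osd_op_num_shards_hdd 10
ceph config set osd osd_op_num_threads_per_shard_hdd 2

# The op-queue shards are set up at OSD start, so restart the OSDs on each host
systemctl restart ceph-osd.target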