Hello all,

I'm testing out a new cluster that we hope to put into production soon. Performance has overall been great, but there's one benchmark that not only stresses the cluster, but causes it to degrade: async randwrites.

The benchmark:

# The file was previously laid out with dd'd random data to prevent sparseness
root@mc-3015-201:~# fio --rw=randwrite --bs=4k --size=100G --numjobs=$JOBS --group_reporting --directory=/mnt/ceph/ --name=largerandwrite --iodepth=16 --end_fsync=1
largerandwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=16
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
largerandwrite: (groupid=0, jobs=1): err= 0: pid=17230: Mon May 6 11:30:11 2019
  write: IOPS=14.7k, BW=57.4MiB/s (60.2MB/s)(100GiB/1782445msec)
    clat (nsec): min=1617, max=120033k, avg=12644.96, stdev=379152.20
     lat (nsec): min=1656, max=120033k, avg=12687.21, stdev=379152.31
    clat percentiles (usec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    3], 40.00th=[    3], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    6],
     | 99.00th=[    9], 99.50th=[   11], 99.90th=[   24], 99.95th=[10290],
     | 99.99th=[19530]
   bw (  KiB/s): min=19424, max=1395544, per=100.00%, avg=306914.86, stdev=392390.46, samples=683
   iops        : min= 4856, max=348886, avg=76728.66, stdev=98097.60, samples=683
  lat (usec)   : 2=0.01%, 4=78.02%, 10=21.37%, 20=0.43%, 50=0.10%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 10=0.01%, 20=0.06%, 50=0.01%, 250=0.01%
  cpu          : usr=0.78%, sys=5.03%, ctx=30215, majf=0, minf=1657
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,26214400,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=57.4MiB/s (60.2MB/s), 57.4MiB/s-57.4MiB/s (60.2MB/s-60.2MB/s), io=100GiB (107GB), run=1782445-1782445msec

Setup:

Ubuntu 18.04.2 + Nautilus repo (deb https://download.ceph.com/debian-nautilus bionic main)

3 hosts, each with a 100Gbit/s NIC and:
  - Dual-socket Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz (20 cores each)
  - 2 Optane NVMe cards, in one LVM VG: vg_optane
  - 18 x 12TB 7200rpm SAS drives, each running a bluestore OSD
  - each HDD OSD has a 32GB LV on vg_optane for wal+db
  - 3 SSD OSDs, each on a 32GB LV on vg_optane
  - 1 mon, 1 mgr, and 1 mds

Pools:
  - cephfs_data on the hdd OSDs, 512 PGs
  - cephfs_metadata on the ssd OSDs, 16 PGs
  - both replicated, size = 3, min_size = 2

root@m3-3101-422:~# ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED       RAW USED     %RAW USED
    hdd       591 TiB     582 TiB     8.7 TiB     8.8 TiB          1.49
    ssd       288 GiB     257 GiB      14 GiB      31 GiB         10.75
    TOTAL     591 TiB     582 TiB     8.8 TiB     8.8 TiB          1.49

POOLS:
    POOL                ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    cephfs_metadata     23     5.3 GiB       5.71k     5.6 GiB      2.30        79 GiB
    cephfs_data         24     1.6 TiB      11.43M     6.8 TiB      1.23       183 TiB

Observations:
  - When starting the benchmark, it's over 10 seconds before iostat shows any activity on the OSD drives.
  - fio gets to 100% very quickly, then the end_fsync takes a long time.
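For what it's worth, the delay is easy to see with something like the commands below. This is an illustrative sketch rather than an exact transcript; adjust the interval and device selection for your layout.

# On the fio client: dirty pages pile up for ~10s while the OSD disks stay idle
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'

# On each OSD host: per-device utilisation of the SAS drives
iostat -x 1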
While the benchmark runs, the cluster goes into HEALTH_WARN:

root@m3-3101-422:~# ceph health detail | head
HEALTH_WARN noscrub,nodeep-scrub flag(s) set; Degraded data redundancy: 207/34295076 objects degraded (0.001%), 28 pgs degraded, 5 pgs undersized; 18 pgs not deep-scrubbed in time; 528 pgs not scrubbed in time; 1 pools have too many placement groups; too few PGs per OSD (25 < min 30)
OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
PG_DEGRADED Degraded data redundancy: 207/34295076 objects degraded (0.001%), 28 pgs degraded, 5 pgs undersized
    pg 24.8 is active+recovery_wait+degraded, acting [10,40,36]
    pg 24.11 is active+recovery_wait+degraded, acting [40,0,36]
    pg 24.15 is stuck undersized for 137.001663, current state active+recovery_wait+undersized+degraded+remapped, last acting [16,33]
    pg 24.25 is active+recovery_wait+degraded, acting [7,20,40]
    pg 24.2e is active+recovering+degraded, acting [45,40,32]
    pg 24.49 is active+recovery_wait+degraded, acting [1,40,9]
    pg 24.5d is active+recovery_wait+degraded, acting [16,40,27]

It seems that the OSDs cannot keep up and thus become degraded. I understand that 4k random writes are a very harsh benchmark for a distributed cluster of spinning disks, and the actual performance is acceptable. What I want is for Ceph not to enter HEALTH_WARN while running it.

Things I have tried (the P.S. below shows roughly how the OSD settings were applied):
  - Increasing the OSD journal size to 10GB (bluestore_block_wal_size) - no effect.
  - Setting write_congestion_kb to a few MiB in /etc/fstab - no effect.
  - Increasing OSD shards (osd_op_num_shards_hdd 5 -> 10) - no effect.
  - Increasing threads per OSD shard (osd_op_num_threads_per_shard_hdd 1 -> 2) - hopeful. No degraded PGs when running 1 fio job, though performance actually seems slightly lower (more seeks on the drives, perhaps?). Running 2 fio jobs causes the same problem, though.

So, what is the best solution here? In future I think I would buy CPUs with faster single-thread performance. Is there another tweakable I am missing? Should I keep cranking up osd_op_num_threads_per_shard_hdd?

Thanks in advance,
Nathan
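P.S. In case it helps to see them in context, applying the shard tweaks above via the Nautilus centralized config would look something like the following. This is a sketch rather than a copy of my shell history, and it assumes the OSDs run under the standard ceph-osd systemd units:

# Values from the "Things I have tried" list above
ceph config set osd osd_op_num_shards_hdd 10
ceph config set osd osd_op_num_threads_per_shard_hdd 2

# The op-queue shards are set up at OSD start, so restart the OSDs on each host
systemctl restart ceph-osd.target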