very low read performance

Dear list,


we experience very poor single-thread read performance (~35 MB/s) on our 5-node Ceph cluster. I first encountered it in VMs transferring data via rsync, but I could reproduce the problem with rbd and rados bench on the physical nodes.
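
For illustration, a single-threaded read against an RBD image can be run along these lines (test-img is a hypothetical image name; rbd bench with --io-type read should be available on Luminous):

```
# sequentially read 1 GiB from a test image with a single thread
rbd bench --io-type read --io-pattern seq --io-size 4M --io-threads 1 \
    --io-total 1G rbd/test-img
```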

Let me briefly give an overview of our infrastructure, which is surely far from optimal for Ceph:
  - 5 Dell R720xd nodes running Proxmox 5.2 act as KVM hypervisors and also provide the Ceph infrastructure (3 are mons, all 5 are OSD nodes)
  - each node has 12 × 4 TB 7200 rpm SAS HDDs attached via the internal H710P controller, which maps each physical disk to a virtual one. Each virtual disk is one BlueStore OSD (60 in total). The OS resides on two additional disks.
  - each node has two Intel DC P3700 NVMe 400 GB drives, each with 6 partitions of ~54 GB for RocksDB (block.db)
  - each node has two 10 GBit/s NICs teamed together (an OVS bond in balance-slb mode with a bridge to attach VMs and host interfaces; the networks are VLAN tagged). Ceph's cluster and public network are the same. Ping latency between nodes is ~0.1 ms and the MTU is 1500.
  - we disabled cephx to try to gain performance
  - deep scrubbing is restricted to the hours 19:00 to 06:00 and the interval was raised from weekly to 28 days, as it reduced performance even further
  - we reduced several debugging options, as this is often suggested to gain performance
  - replication factor is 3
  - the Ceph pool provides RBD images and has 2048 PGs (the current distribution is 80 to 129 PGs per OSD)
  

Some more information is available at https://git.idiv.de/dsarpe/ceph-perf (etc/pve/ceph.conf, etc/network/interfaces, ceph_osd_df, ceph_osd_tree).
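
To verify that the tuning described above (cephx disabled, the scrub window and interval, reduced debug levels) is actually in effect on a running OSD, the admin socket can be queried; osd.0 is only an example id and the command has to be run on the node hosting that OSD:

```
# dump the effective configuration of one OSD and filter the options of interest
ceph daemon osd.0 config show |
    egrep 'auth_(cluster|service|client)_required|osd_scrub_begin_hour|osd_scrub_end_hour|osd_deep_scrub_interval|debug_(osd|ms|bluestore)'
```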

Here are two example rados bench runs, a write and a sequential read, taken while the cluster was relatively idle (low CPU and memory load on the nodes, Ceph capacity used < 50%, no recovery, and hardly any other client I/O):

```
# rados bench -p rbd --run-name benchmark_t1 --no-cleanup -b 4M 300 write -t 1
[…]
Total time run:         300.018203
Total writes made:      13468
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     179.562
Stddev Bandwidth:       18.155
Max bandwidth (MB/sec): 220
Min bandwidth (MB/sec): 108
Average IOPS:           44
Stddev IOPS:            4
Max IOPS:               55
Min IOPS:               27
Average Latency(s):     0.0222748
Stddev Latency(s):      0.0134969
Max latency(s):         0.27939
Min latency(s):         0.0114312
```

```
# rados bench -p rbd --run-name benchmark_t1 300 seq -t 1
[…]
Total time run:       300.239245
Total reads made:     2612
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   34.7989
Average IOPS:         8
Stddev IOPS:          1
Max IOPS:             15
Min IOPS:             5
Average Latency(s):   0.114472
Max latency(s):       0.471255
Min latency(s):       0.0182361
```
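
For comparison, the objects left behind by the --no-cleanup write run can be re-read with more concurrent operations, e.g. 16 threads (an arbitrary value):

```
# sequential read of the benchmark_t1 objects with 16 threads instead of 1
rados bench -p rbd --run-name benchmark_t1 60 seq -t 16
```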

Performance scales with the number of threads, but I think that even with spinners the sequential read from a single thread should be higher. If I test the sequential read speed of the individual Ceph block partitions with dd, I get values of ~180 MB/s per partition (there is one Ceph block partition per device), even when reading from all partitions in parallel:

```
# drop the page cache so dd measures the disks, not memory
echo 3 | tee /proc/sys/vm/drop_caches
# resolve the block symlink of every OSD and read 100 x 4 MiB from each device in parallel
ls -l /var/lib/ceph/osd/ceph-*/ |
grep "block -" |
awk '{ print $11 }' |          # field 11 is the symlink target (the block device)
while read -r partition
do dd if="$partition" of=/dev/null bs=4M count=100 &
done
```

The block.db partitions on the NVMe devices all report >= 1.1 GB/s if read sequentially on their own; read in parallel this goes down to ~350 MB/s per partition (there are 6 Ceph block.db partitions per NVMe device):

```
# same test for the RocksDB partitions: drop caches, then read all block.db
# symlink targets (on the NVMe devices) in parallel
echo 3 | tee /proc/sys/vm/drop_caches
ls -l /var/lib/ceph/osd/ceph-*/ |
grep "block.db -" |
awk '{ print $11 }' |
while read -r partition
do dd if="$partition" of=/dev/null bs=4M count=100 &
done
```

During deep scrubbing, iostat usually shows values of 50 to 100 MB/s per device (not all at the same time, of course). So I wonder why the single-threaded sequential read is so much lower. Any pointers on where to look?
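
For reference, a typical way to watch those per-device rates is an extended iostat in MB/s (the 5-second interval is arbitrary):

```
# extended per-device statistics in MB/s, refreshed every 5 seconds
iostat -xm 5
```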

On a side note, is there a command on Luminous to list all currently connected Ceph clients and get their configuration?
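
For context, something along the lines of the following is what I have in mind, though as far as I can tell it only lists the connected clients and their feature bits, not their configuration (the mon id is assumed to equal the short hostname, as on our Proxmox nodes):

```
# list the sessions currently connected to this monitor (run on a mon node)
ceph daemon mon.$(hostname -s) sessions
# summary of the releases/feature bits of all connected clients (Luminous and later)
ceph features
```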

Cheers,
Dirk


-- 
general it-support unit

Phone  +49 341 97-33118
Email  dirk.sarpe@xxxxxxx

German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig
Deutscher Platz 5e, 04103 Leipzig, Germany



iDiv is a research centre of the DFG - Deutsche Forschungsgemeinschaft





