Dear list,

we are seeing very poor single-thread read performance (~35 MB/s) on our 5-node Ceph cluster. I first encountered it in VMs transferring data via rsync, but I can reproduce the problem with rbd and rados bench on the physical nodes. Let me briefly give an overview of our infrastructure, which is surely far from optimal for Ceph:

- 5 Dell R720xd nodes running Proxmox 5.2 act as KVM hypervisors and also provide the Ceph infrastructure (3 are mons, all 5 are OSD nodes)
- each node has 12 × 4 TB 7200 rpm SAS HDDs attached via the internal H710P, which maps each physical disk to a virtual one; each virtual disk is one BlueStore OSD (60 in total). The OS resides on two additional disks.
- each node has two Intel DC P3700 NVMe 400 GB devices, each with 6 ~54 GB partitions for RocksDB
- each node has two 10 Gbit/s NICs teamed together (OVS bond in slb-balance mode, with a bridge to attach VMs and host interfaces; the networks are VLAN tagged). Ceph's cluster and public network are the same. Ping latency between nodes is ~0.1 ms, MTU is 1500.
- we disabled cephx trying to gain performance
- deep scrubbing is restricted to 19:00–06:00 and the interval raised from weekly to 28 days, as it reduced performance even further
- we lowered several debugging options, as this is often suggested to gain performance
- the replication factor is 3
- the Ceph pool provides RBD images and has 2048 PGs (current distribution is 80 to 129 PGs per OSD)

More information at https://git.idiv.de/dsarpe/ceph-perf (etc/pve/ceph.conf, etc/network/interfaces, ceph_osd_df, ceph_osd_tree).
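For reference, the cephx/scrub/debug tweaks above would look roughly like this in ceph.conf (a sketch with option names as I recall them for luminous, not a copy of our file; the actual ceph.conf is in the repo linked above):

```
[global]
# disable cephx authentication (done in the hope of gaining performance)
auth cluster required = none
auth service required = none
auth client required = none
# one of the debug subsystems we turned down
debug ms = 0/0

[osd]
# allow (deep) scrubbing only between 19:00 and 06:00
osd scrub begin hour = 19
osd scrub end hour = 6
# deep scrub interval raised from weekly to 28 days (value in seconds)
osd deep scrub interval = 2419200
```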
Here are two example rados bench runs, write and sequential read, while the cluster was relatively idle (low CPU and memory load on the nodes, Ceph capacity used < 50%, no recovery and hardly any other client I/O):

```
# rados bench -p rbd --run-name benchmark_t1 --no-cleanup -b 4M 300 write -t 1
[…]
Total time run:         300.018203
Total writes made:      13468
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     179.562
Stddev Bandwidth:       18.155
Max bandwidth (MB/sec): 220
Min bandwidth (MB/sec): 108
Average IOPS:           44
Stddev IOPS:            4
Max IOPS:               55
Min IOPS:               27
Average Latency(s):     0.0222748
Stddev Latency(s):      0.0134969
Max latency(s):         0.27939
Min latency(s):         0.0114312
```

```
# rados bench -p rbd --run-name benchmark_t1 300 seq -t 1
[…]
Total time run:       300.239245
Total reads made:     2612
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   34.7989
Average IOPS:         8
Stddev IOPS:          1
Max IOPS:             15
Min IOPS:             5
Average Latency(s):   0.114472
Max latency(s):       0.471255
Min latency(s):       0.0182361
```

Performance scales with the number of threads, but I think that even with spinners the sequential read from a single thread should be higher. If I test sequential read speed from the individual Ceph block partitions with dd, I get ~180 MB/s per partition (there is one Ceph block partition per device), even when reading from all partitions in parallel:

```
# drop the page cache first (run as root)
echo 3 | tee /proc/sys/vm/drop_caches
# resolve the block symlink of every OSD and read 400 MiB from each in parallel
ls -l /var/lib/ceph/osd/ceph-*/ | grep "block -" | awk '{ print $11 }' |
while read -r partition
do
    dd if="$partition" of=/dev/null bs=4M count=100 &
done
```

The block.db partitions on the NVMe devices all report >= 1.1 GB/s when read individually; read in parallel this goes down to ~350 MB/s each (there are 6 Ceph block.db partitions per NVMe device).
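Incidentally, the single-thread read result above lines up with a simple latency bound: with only one 4 MiB read in flight (-t 1), bandwidth cannot exceed object size divided by per-object latency. A quick sanity check using the measured numbers (pure arithmetic, no cluster needed):

```shell
# With queue depth 1, each 4 MiB object read waits for the full round trip.
# At the ~0.114 s average latency reported by rados bench, that alone caps
# single-thread bandwidth at roughly 4 / 0.114 MB/s.
awk 'BEGIN { printf "%.1f MB/s\n", 4 / 0.114 }'
# prints: 35.1 MB/s
```

So the ~35 MB/s figure is what a single thread can deliver at that latency; the question is rather why a 4 MiB read takes ~0.114 s in the first place.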
```
# same test for the RocksDB partitions (run as root)
echo 3 | tee /proc/sys/vm/drop_caches
ls -l /var/lib/ceph/osd/ceph-*/ | grep "block.db -" | awk '{ print $11 }' |
while read -r partition
do
    dd if="$partition" of=/dev/null bs=4M count=100 &
done
```

During deep scrubbing, iostat usually shows 50–100 MB/s per device (not all at the same time, of course). So I wonder why the single-thread sequential read is so much lower. Any pointers where to look?

On a side note: is there a command on luminous to list all current Ceph clients and get their configuration?

Cheers,
Dirk

--
general it-support unit
Phone +49 341 97-33118
Email dirk.sarpe@xxxxxxx

German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig
Deutscher Platz 5e
04103 Leipzig
Germany

iDiv is a research centre of the DFG - Deutsche Forschungsgemeinschaft
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com