Re: cephfs (rbd) read performance low - where is the bottleneck?

Have you looked at your file layout?

On a test cluster running 10.2.3 I created a 5GB file and then looked
at the layout:

# ls -l test.dat
  -rw-r--r-- 1 root root 5242880000 Nov 20 23:09 test.dat
# getfattr -n ceph.file.layout test.dat
  # file: test.dat
  ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"

From what I understand, with this layout you are reading 4 MB of data
from one OSD at a time, so I think you are seeing the sequential read
speed of a single SATA drive: with stripe_count=1 and a 4 MB
object_size, each 4 MB chunk of the file is a single RADOS object on a
single OSD, and a single-threaded sequential read never touches more
than one disk at once.  I do not think increasing your MON/MDS links to
10Gb will help, nor, for a single file read, will moving the metadata
to SSD.

To test this, you could create 10 x 50GB files and then read them in
parallel to see whether your overall throughput increases (a sketch
follows below).  If it does, take a look at the layout parameters and
see whether you can change the file layout to get more parallelism.
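
Something along these lines should do; the mount point and file names
here are only placeholders:

  # create ten 50 GB test files (12800 x 4 MiB each)
  for i in $(seq 1 10); do
      dd if=/dev/zero of=/mnt/cephfs/test$i.dat bs=4M count=12800
  done

  # drop the client page cache so the reads actually hit the cluster
  # (needs root)
  sync; echo 3 > /proc/sys/vm/drop_caches

  # read all ten files in parallel and watch the aggregate throughput
  for i in $(seq 1 10); do
      dd if=/mnt/cephfs/test$i.dat of=/dev/null bs=4M &
  done
  wait

If the aggregate is well above 122 MB/s, the per-file layout is the
limiting factor rather than the network or the daemons.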

https://github.com/ceph/ceph/blob/master/doc/dev/file-striping.rst
https://github.com/ceph/ceph/blob/master/doc/cephfs/file-layouts.rst
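
For reference, layouts are controlled through the same virtual xattrs
shown above; for example, to have new files in a directory striped
across 8 objects (the path and the value are just an example):

  # setfattr -n ceph.dir.layout.stripe_count -v 8 /mnt/cephfs/testdir

Note that a file's own layout can only be changed while the file is
still empty, so the layout has to be in place before the data is
written.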

Regards,
Eric

On Sun, Nov 20, 2016 at 3:24 AM, Mike Miller <millermike287@xxxxxxxxx> wrote:
> Hi,
>
> reading a big 50 GB file (I tried larger ones too) with
>
> dd if=bigfile of=/dev/null bs=4M
>
> in a cluster with 112 SATA disks across 10 OSD hosts (6272 PGs, replication 3)
> gives me only about *122 MB/s* read speed in a single thread. Scrubbing was
> turned off during the measurement.
>
> I have been searching for possible bottlenecks. The network is not the
> problem: the machine running dd is connected to the cluster public network
> with a 20 GBASE-T bond, and the OSD hosts have dual networks, 10 GBASE-T
> public and 10 GBASE-T private.
>
> The OSD SATA disks are utilized at only about 10% to 20%, not more than
> that. CPUs on the OSD hosts are idle too. CPUs on the mon are idle, and mds
> usage is about 1.0 (one core used on that 6-core machine). mon and mds are
> connected with only 1 GbE (I would expect some latency from that, but no
> bandwidth issues; in fact the bandwidth there is about 20 Mbit max).
>
> If I read a 50 GB file once, then clear the cache on the reading machine
> (but not the OSD caches) and read it again, I get much better read
> performance of about *620 MB/s*. That seems logical to me, as much (most)
> of the data is still in the OSD cache buffers. But even then the read
> performance is not great, considering that the reading machine is connected
> to the cluster with a 20 Gbit/s bond.
>
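> (Clearing the cache here means the standard
>
>   sync; echo 3 > /proc/sys/vm/drop_caches
>
> on the reading machine.)
>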
> How can I improve this? I am not really sure, but from my understanding two
> possible bottlenecks come to mind:
>
> 1) 1 GbE connection to mon / mds
>
> Is this the reason why reads are slow and the OSD disks are never hammered
> by read requests, and thus never fully utilized?
>
> 2) Move metadata to SSD
>
> Currently, the cephfs_metadata pool sits on the same spinning SATA disks as
> the data pool. Is this the bottleneck? Would moving the metadata to SSD be
> a solution?
>
> Or is it both?
>
> Your experience and insight are highly appreciated.
>
> Thanks,
>
> Mike


