Re: cephfs (rbd) read performance low - where is the bottleneck?

JiaJia, all,

thanks, yes, the mount opts show up in mtab, and you are correct: if I leave out the "-v" option, there are no complaints.

mtab:
mounted ... type ceph (name=cephfs,rasize=134217728,key=client.cephfs)

It has to be rasize (rsize will not work).
One can check here:

cat /sys/class/bdi/ceph-*/read_ahead_kb
-> 131072
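
For quick experiments, that value can apparently also be changed on the fly without remounting - just a sketch, 65536 KB = 64 MB is an arbitrary example, and I have not checked whether the setting survives a remount:

for f in /sys/class/bdi/ceph-*/read_ahead_kb; do echo 65536 > "$f"; done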

And YES! I am so happy, a single-threaded dd of a 40 GB file does a lot better now.

rasize= 67108864      222 MB/s
rasize=134217728      360 MB/s
rasize=268435456      474 MB/s
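
For reference, the mount command for the largest rasize test looked roughly like this (monitor address and mount point are placeholders; name and secretfile mirror my earlier attempt), with the bdi file above used to verify the effective readahead:

mount.ceph mon1:6789:/ /mnt/cephfs -o name=cephfs,secretfile=secret.key,rasize=268435456
cat /sys/class/bdi/ceph-*/read_ahead_kb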

Thank you all very much for putting me on the right track, highly appreciated.

Regards,

Mike

On 11/23/16 5:55 PM, JiaJia Zhong wrote:
Mike,
if you run mount.ceph with the "-v" option, you may get "ceph: Unknown
mount option rsize". You can actually ignore this; both rsize and rasize
will still be passed to the mount syscall.

I believe that you have the cephfs mounted successfully;
run "mount" in a terminal to check the actual mount opts in mtab.
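
For example, either of these should show the options actually in effect:

mount -t ceph
grep ceph /proc/mounts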

------------------ Original ------------------
From: "Mike Miller" <millermike287@xxxxxxxxx>
Date: Wed, Nov 23, 2016 02:38 PM
To: "Eric Eastman" <eric.eastman@xxxxxxxxxxxxxx>
Cc: "Ceph Users" <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: cephfs (rbd) read performance low - where is the bottleneck?

Hi,

did some testing with multithreaded access and dd; performance scales as
it should.

Any ideas to improve single-threaded read performance further would be
highly appreciated. Some of our use cases require reading large files
with a single thread.

I have also tried changing the readahead on the kernel client cephfs
mount via the rsize and rasize options.

mount.ceph ... -o name=cephfs,secretfile=secret.key,rsize=67108864

Doing this on kernel 4.5.2 gives the error message
"ceph: Unknown mount option rsize" (or "Unknown mount option rasize").

Can someone explain to me how I can experiment with readahead on cephfs?

Mike

On 11/21/16 12:33 PM, Eric Eastman wrote:
Have you looked at your file layout?

On a test cluster running 10.2.3 I created a 5GB file and then looked
at the layout:

# ls -l test.dat
  -rw-r--r-- 1 root root 5242880000 Nov 20 23:09 test.dat
# getfattr -n ceph.file.layout test.dat
  # file: test.dat
  ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"

From what I understand, with this layout you are reading 4 MB of data
from 1 OSD at a time, so I think you are seeing the overall speed of a
single SATA drive. I do not think increasing your MON/MDS links to 10Gb
will help, nor, for a single-file read, will moving the metadata to SSD.

To test this, you may want to try creating 10 x 50GB files, reading them
in parallel, and seeing if your overall throughput increases. If so,
take a look at the layout parameters and see if you can change the file
layout to get more parallelization.
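
Something along these lines would give you a rough parallel-read number (file names are just placeholders, and the files are assumed to already exist on the cephfs mount):

for i in $(seq 1 10); do
    dd if=test$i.dat of=/dev/null bs=4M &
done
wait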

https://github.com/ceph/ceph/blob/master/doc/dev/file-striping.rst
https://github.com/ceph/ceph/blob/master/doc/cephfs/file-layouts.rst
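
As a rough, untested sketch (directory path and stripe_count are arbitrary), a directory layout with a higher stripe_count spreads each file across more objects and therefore more OSDs; note it only applies to files created after the change:

mkdir /mnt/cephfs/striped
setfattr -n ceph.dir.layout.stripe_unit  -v 4194304 /mnt/cephfs/striped
setfattr -n ceph.dir.layout.stripe_count -v 8 /mnt/cephfs/striped
setfattr -n ceph.dir.layout.object_size  -v 4194304 /mnt/cephfs/striped
getfattr -n ceph.dir.layout /mnt/cephfs/striped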

Regards,
Eric

On Sun, Nov 20, 2016 at 3:24 AM, Mike Miller <millermike287@xxxxxxxxx>
wrote:
Hi,

reading a big 50 GB file (tried larger ones too)

dd if=bigfile of=/dev/zero bs=4M

in a cluster with 112 SATA disks across 10 OSD hosts (6272 PGs,
replication 3) gives me only about *122 MB/s* read speed in a single
thread. Scrubbing was turned off during the measurement.

I have been searching for possible bottlenecks. The network is not the
problem; the machine running dd is connected to the cluster public
network with a 20 GBASE-T bond. The OSD hosts have dual networks:
cluster public 10 GBASE-T, private 10 GBASE-T.

The OSD SATA disks are utilized only up to about 10% or 20%, not more
than that. CPUs on the OSD hosts are idle too. CPUs on the mon are idle;
mds usage is about 1.0 (1 core used on this 6-core machine). The mon and
mds are connected with only 1 GbE (I would expect some latency from
that, but no bandwidth issues; in fact network bandwidth is about
20 Mbit max).

If I read a 50 GB file, clear the cache on the reading machine (but not
the OSD caches), and read it again, I get much better read performance
of about *620 MB/s*. That seems logical to me, as much (most) of the
data is still in the OSD cache buffers. But the read performance is
still not great considering that the reading machine is connected to the
cluster with a 20 Gbit/s bond.
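
(To be precise, by "clear the cache" I mean dropping the page cache on the client between runs, roughly like this:)

sync
echo 3 > /proc/sys/vm/drop_caches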

How can I improve this? I am not really sure, but from my understanding
two possible bottlenecks come to mind:

1) 1 GbE connection to mon / mds

Is this the reason why reads are slow and the OSD disks are not hammered
by read requests and therefore not fully utilized?

2) Move metadata to SSD

Currently, the cephfs_metadata pool is on the same spinning SATA disks
as the data pool. Is this the bottleneck? Would moving the metadata to
SSD be a solution?

Or is it both?

Your experience and insight are highly appreciated.

Thanks,

Mike
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


