Thanks for your response!

2016-12-09 1:27 GMT+01:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
> On Wed, Dec 7, 2016 at 5:45 PM, Andreas Gerstmayr
> <andreas.gerstmayr@xxxxxxxxx> wrote:
>> Hi,
>>
>> does the CephFS kernel module (as of kernel version 4.8.8) support
>> parallel reads of file stripes?
>> When an application requests a 500 MB block from a file (which is
>> split into multiple objects and stripes on different OSDs) at once,
>> does the CephFS kernel client request these blocks in parallel?
>
> You're definitely exceeding the default readahead parameters at that
> point, but somebody who works with the kernel client more will have
> to tell you how to adjust them.

Sorry, I should have been more specific: does a single execution of the
read(fd, buf, 500*1024*1024) syscall request the file stripes/chunks in
parallel from multiple OSDs?

A read() syscall gets routed to the ceph_read_iter() function [*] of the
Ceph kernel module with a pointer to an iov_iter struct as a parameter.
In this struct the count field is set to 500*1024*1024, i.e. the kernel
module knows the exact block size. With a maximum object size of 10 MB,
a 500 MB block is stored in 50 different objects, so the kernel client
could request 50 objects in parallel if I'm not mistaken.

[*] https://github.com/ceph/ceph-client/blob/for-linus/fs/ceph/file.c#L1204

>> My benchmarks suggest it does not (there is no significant difference
>> in throughput whether I'm reading a file in chunks of 64 KB, 500 MB
>> or 1 GB blocks).
>> In the architecture docs [1] under the protocol section, data
>> striping and the resulting performance gains are explained - but I'm
>> not sure whether this optimization is already implemented in the
>> current CephFS kernel module or not.
>
> The general strategy is usually to do readahead up to a full block (or
> maybe the current and next block, if it's sure enough about how much
> data you need). You'll need to set looser limits to let it do more
> than that; having extremely large defaults on normal workloads tends
> to quickly lead you to using a lot of bandwidth spuriously.

The kernel client has a maximum readahead size parameter (rasize, 8 MB
by default). Increasing it really helps to achieve better sequential
read throughput (thanks for the tip!), so I assume the readahead runs in
a separate kernel thread and fetches these blocks in parallel with the
read() calls of the user-space application.

Follow-up question: when the readahead requests multiple blocks, does it
request them in parallel?

And more generally: are there other tunables for optimizing the read
performance of a Ceph cluster, and of CephFS in particular (apart from
using SSDs on all storage nodes or cache tiering)?

Andreas
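
For reference, below is a minimal, self-contained sketch (not part of
the original thread) of the kind of benchmark discussed above: it times
a single read() of 500 MB from a file on a CephFS kernel mount. The
mount point, the file name and the rasize value in the example mount
command are made up for illustration; rasize is the kernel client's
readahead mount option and is given in bytes.

/*
 * Time a single 500 MB read() on a CephFS kernel mount.
 *
 * Assumed (hypothetical) mount, with a larger readahead window:
 *     mount -t ceph mon-host:/ /mnt/cephfs -o name=admin,rasize=67108864
 *
 * Build: gcc -O2 -o cephfs_read_bench cephfs_read_bench.c
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE (500UL * 1024 * 1024)    /* 500 MB, as in the question */

int main(void)
{
    int fd = open("/mnt/cephfs/testfile", O_RDONLY);  /* hypothetical path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char *buf = malloc(BLOCK_SIZE);
    if (buf == NULL) {
        perror("malloc");
        close(fd);
        return 1;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /*
     * A single syscall: ceph_read_iter() in fs/ceph/file.c receives an
     * iov_iter whose count field is the full 500*1024*1024 bytes.
     */
    ssize_t n = read(fd, buf, BLOCK_SIZE);

    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (n < 0) {
        perror("read");
    } else {
        double secs = (t1.tv_sec - t0.tv_sec) +
                      (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("read %zd bytes in %.2f s (%.1f MB/s)\n",
               n, secs, n / secs / 1e6);
    }

    free(buf);
    close(fd);
    return 0;
}

To get cold-cache numbers, drop the page cache between runs
(echo 3 > /proc/sys/vm/drop_caches); comparing throughput with rasize
left at its 8 MB default versus a much larger value shows how much of
the sequential-read gain comes from readahead.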