Thanks for your response!

2016-12-09 1:27 GMT+01:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
> On Wed, Dec 7, 2016 at 5:45 PM, Andreas Gerstmayr
> <andreas.gerstmayr@xxxxxxxxx> wrote:
>> Hi,
>>
>> does the CephFS kernel module (as of kernel version 4.8.8) support
>> parallel reads of file stripes?
>> When an application requests a 500 MB block from a file (which is
>> split into multiple objects and stripes on different OSDs) at once,
>> does the CephFS kernel client request these blocks in parallel?
>
> You're definitely exceeding the default readahead parameters at that
> point, but somebody who works with the kernel client more will have
> to tell you how to adjust them.

Sorry, I should have been more specific: does a single execution of the
read(fd, buf, 500*1024*1024) syscall request the file stripes/chunks in
parallel from multiple OSDs?

A read() syscall gets routed to the ceph_read_iter() function [*] of the
Ceph kernel module with a pointer to an iov_iter struct as a parameter.
In this struct the count field is set to 500*1024*1024, i.e. the kernel
module knows the exact block size. With a maximum object size of 10 MB,
a 500 MB block is stored in 50 different objects, so the kernel client
could request 50 objects in parallel if I'm not mistaken.

[*] https://github.com/ceph/ceph-client/blob/for-linus/fs/ceph/file.c#L1204

>> My benchmarks suggest it does not (there is no significant difference
>> in throughput whether I'm reading a file in chunks of 64 KB, 500 MB
>> or 1 GB blocks).
>> In the architecture docs [1] under the protocol section, data
>> striping and the resulting performance gains are explained - but I'm
>> not sure whether this optimization is already implemented in the
>> current CephFS kernel module or not.
>
> The general strategy is usually to do readahead up to a full block (or
> maybe the current and next block, if it's sure enough about how much
> data you need). You'll need to set looser limits to let it do more
> than that; having extremely large defaults on normal workloads tends
> to quickly lead you to using a lot of bandwidth spuriously.

The kernel client has a maximum readahead size parameter (rasize, 8 MB
by default). Increasing it really helps to achieve better sequential
read throughput (thanks for the tip!), so I assume the readahead runs in
a separate kernel thread and fetches these blocks in parallel with the
read() calls of the user-space application.

Follow-up question: when the readahead requests multiple blocks, does it
request them in parallel?

And more generally: are there other tunables for optimizing the read
performance of a Ceph cluster, and of CephFS in particular (apart from
using SSDs on all storage nodes or cache tiering)?

Andreas
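
For reference, below is a minimal, self-contained sketch (not part of
the original thread) of the kind of benchmark discussed above: it times
a single read() of 500 MB from a file on a CephFS kernel mount. The
mount point, the file name and the rasize value in the example mount
command are made up for illustration; rasize is the kernel client's
readahead mount option and is given in bytes.

/*
 * Time a single 500 MB read() on a CephFS kernel mount.
 *
 * Assumed (hypothetical) mount, with a larger readahead window:
 *     mount -t ceph mon-host:/ /mnt/cephfs -o name=admin,rasize=67108864
 *
 * Build: gcc -O2 -o cephfs_read_bench cephfs_read_bench.c
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE (500UL * 1024 * 1024)    /* 500 MB, as in the question */

int main(void)
{
    int fd = open("/mnt/cephfs/testfile", O_RDONLY);  /* hypothetical path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char *buf = malloc(BLOCK_SIZE);
    if (buf == NULL) {
        perror("malloc");
        close(fd);
        return 1;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /*
     * A single syscall: ceph_read_iter() in fs/ceph/file.c receives an
     * iov_iter whose count field is the full 500*1024*1024 bytes.
     */
    ssize_t n = read(fd, buf, BLOCK_SIZE);

    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (n < 0) {
        perror("read");
    } else {
        double secs = (t1.tv_sec - t0.tv_sec) +
                      (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("read %zd bytes in %.2f s (%.1f MB/s)\n",
               n, secs, n / secs / 1e6);
    }

    free(buf);
    close(fd);
    return 0;
}

To get cold-cache numbers, drop the page cache between runs
(echo 3 > /proc/sys/vm/drop_caches); comparing throughput with rasize
left at its 8 MB default versus a much larger value shows how much of
the sequential-read gain comes from readahead.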