Thanks a lot for your explanation! I just increased the 'rasize' option of
the kernel module and got significantly better throughput for sequential
reads.

Thanks,
Andreas

2016-09-15 0:29 GMT+02:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
> Oh hrm, I missed the stripe count settings. I'm not sure if that's
> helping you or not; I don't have a good intuitive grasp of what
> readahead will do in that case. I think you may need to adjust the
> readahead config knob in order to make it read all those objects
> together instead of one or two at a time.
> -Greg
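[For the archives: the readahead knob in question is the 'rasize' mount
option of the CephFS kernel client, given in bytes. A rough sketch of what
I mean - the monitor address, mount point, secret file and the 128MB value
below are just placeholders, not a recommendation:

$ sudo mount -t ceph 192.168.1.10:6789:/ /mnt/cephfs \
      -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=134217728

A rasize of 134217728 bytes (128MB) lets readahead span many of the 1MB
stripe units at once, so the client should be able to request objects from
several OSDs in parallel instead of one or two at a time.]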
> On Wed, Sep 14, 2016 at 3:24 PM, Andreas Gerstmayr
> <andreas.gerstmayr@xxxxxxxxx> wrote:
>> 2016-09-14 23:19 GMT+02:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
>>> This is pretty standard behavior within Ceph as a whole — the journals
>>> really help on writes;
>>
>> How does the journal help with large blocks? I thought the journal
>> speed-up comes from coalescing lots of small writes into bigger
>> blocks - but in my benchmark the block size is already 1MB.
>>
>>> and especially with big block sizes you'll
>>> exceed the size of readahead, but writes will happily flush out in
>>> parallel.
>>
>> The client buffers lots of pages in the page cache and sends them in
>> bulk to the storage nodes, where multiple OSDs can write the data in
>> parallel (because each OSD has its own disk), whereas the readahead
>> window is much smaller than the buffer cache, so reads can't be
>> parallelized to the same degree (too little data is requested from the
>> cluster at a time)?
>> Did I get that right?
>>
>> I just started a rados benchmark with similar settings (same block
>> size, 10 threads - the same as the stripe count):
>> $ rados bench -p repl1 180 -b 1M -t 10 write --no-cleanup
>> Total time run:       180.148379
>> Total writes made:    47285
>> Write size:           1048576
>> Object size:          1048576
>> Bandwidth (MB/sec):   262.478
>>
>> Reading:
>> $ rados bench -p repl1 60 -t 10 seq
>> Total time run:       49.936949
>> Total reads made:     47285
>> Read size:            1048576
>> Object size:          1048576
>> Bandwidth (MB/sec):   946.894
>>
>> Here the write benchmark is slower than the read benchmark. Is that
>> because rados sync()s each object after writing? And there is no
>> readahead, so all 10 threads are busy the whole time during the
>> benchmark, whereas in the CephFS scenario it depends on the client
>> readahead setting whether 10 stripes are requested in parallel all the
>> time?
>>
>>> On Wed, Sep 14, 2016 at 12:51 PM, Henrik Korkuc <lists@xxxxxxxxx> wrote:
>>>> On 16-09-14 18:21, Andreas Gerstmayr wrote:
>>>>> Hello,
>>>>>
>>>>> I'm currently performing some benchmark tests with our Ceph storage
>>>>> cluster and trying to find the bottleneck in our system.
>>>>>
>>>>> I'm writing a random 30GB file with the following command:
>>>>> $ time fio --name=job1 --rw=write --blocksize=1MB --size=30GB
>>>>> --randrepeat=0 --end_fsync=1
>>>>> [...]
>>>>>   WRITE: io=30720MB, aggrb=893368KB/s, minb=893368KB/s,
>>>>> maxb=893368KB/s, mint=35212msec, maxt=35212msec
>>>>>
>>>>> real    0m35.539s
>>>>>
>>>>> This makes use of the page cache, but fsync()s at the end (network
>>>>> traffic from the client stops here, so the OSDs should have the data).
>>>>>
>>>>> When I read the same file back:
>>>>> $ time fio --name=job1 --rw=read --blocksize=1MB --size=30G
>>>>> [...]
>>>>>   READ: io=30720MB, aggrb=693854KB/s, minb=693854KB/s,
>>>>> maxb=693854KB/s, mint=45337msec, maxt=45337msec
>>>>>
>>>>> real    0m45.627s
>>>>>
>>>>> It takes 10s longer. Why? When writing data to a Ceph storage cluster,
>>>>> the data is written twice (unbuffered to the journal and buffered to
>>>>> the backing filesystem [1]). Reading, on the other hand, should be
>>>>> much faster: it needs only a single operation, the data should already
>>>>> be in the page cache of the OSDs (I'm reading the same file I've just
>>>>> written, and the OSDs have plenty of RAM), and reading from disks is
>>>>> generally faster than writing. Any idea what is going on in the
>>>>> background that makes reads more expensive than writes?
>>>>
>>>> I am not an expert here, but I think it basically boils down to the
>>>> fact that you read linearly and write (flush the cache) in parallel.
>>>>
>>>> If you could read multiple parts of the same file in parallel, you
>>>> could achieve better speeds.
>>
>> I thought the striping feature of CephFS does exactly that? Write and
>> read stripe_count stripes in parallel?
>>
>>>>> I've run these tests multiple times with fairly consistent results.
>>>>>
>>>>> Cluster config:
>>>>> Ceph Jewel, 3 nodes with 256GB RAM and 25 disks each (only HDDs,
>>>>> journal on the same disk)
>>>>> Pool with size=1 and 2048 PGs, CephFS stripe unit: 1MB, stripe count:
>>>>> 10, object size: 10MB
>>>>> 10 GbE, separate frontend and backend network
>>>>>
>>>>> [1] https://www.sebastien-han.fr/blog/2014/02/17/ceph-io-patterns-the-bad/
>>>>>
>>>>> Thanks,
>>>>> Andreas
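[Also for the archives: a CephFS layout like the one above (1MB stripe
unit, stripe count 10, 10MB objects) can be set on a test directory before
creating the benchmark files, e.g. via the layout xattrs. A minimal sketch -
the directory path is just an example:

$ setfattr -n ceph.dir.layout.stripe_unit  -v 1048576  /mnt/cephfs/bench
$ setfattr -n ceph.dir.layout.stripe_count -v 10       /mnt/cephfs/bench
$ setfattr -n ceph.dir.layout.object_size  -v 10485760 /mnt/cephfs/bench
$ getfattr -n ceph.dir.layout /mnt/cephfs/bench

Files created in that directory afterwards inherit the layout; existing
files keep the layout they were created with.]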