Oh hrm, I missed the stripe count settings. I'm not sure if that's
helping you or not; I don't have a good intuitive grasp of what
readahead will do in that case. I think you may need to adjust the
readahead config knob in order to make it read all those objects
together instead of one or two at a time.
-Greg
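For what it's worth, with the kernel client the readahead knob is the
rasize mount option (the default is 8MB if I remember right, which is
less than one 10MB stripe row of your stripe_unit=1MB / stripe_count=10
layout, so sequential reads probably never have all 10 objects in
flight at once). A rough sketch, untested - the monitor address, mount
point and size below are only placeholders you'd need to adapt:

$ # allow up to ~40MB (four stripe rows) of client readahead
$ sudo mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,rasize=41943040

For ceph-fuse the equivalent settings live in the [client] section of
ceph.conf (client_readahead_max_bytes / client_readahead_max_periods),
e.g. client_readahead_max_bytes = 41943040. Then re-run the read half
of the fio job and see whether the gap to the write numbers closes.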
On Wed, Sep 14, 2016 at 3:24 PM, Andreas Gerstmayr
<andreas.gerstmayr@xxxxxxxxx> wrote:
> 2016-09-14 23:19 GMT+02:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
>> This is pretty standard behavior within Ceph as a whole — the journals
>> really help on writes;
>
> How does the journal help with large blocks? I thought the journal
> speedup comes from coalescing lots of small writes into bigger
> blocks - but in my benchmark the block size is already 1MB.
>
>> and especially with big block sizes you'll exceed the size of
>> readahead, but writes will happily flush out in parallel.
>
> The client buffers lots of pages in the page cache and sends them in
> bulk to the storage nodes, where multiple OSDs can write the data in
> parallel (because each OSD has its own disk), whereas the readahead
> size is much smaller than the buffer cache, so reads can't be
> parallelized as much (too little data is requested from the cluster
> at a time)? Did I get that right?
>
> I just started a rados benchmark with similar settings (same block
> size, 10 threads - the same as the stripe count):
>
> $ rados bench -p repl1 180 -b 1M -t 10 write --no-cleanup
> Total time run:       180.148379
> Total writes made:    47285
> Write size:           1048576
> Object size:          1048576
> Bandwidth (MB/sec):   262.478
>
> Reading:
> $ rados bench -p repl1 60 -t 10 seq
> Total time run:       49.936949
> Total reads made:     47285
> Read size:            1048576
> Object size:          1048576
> Bandwidth (MB/sec):   946.894
>
> Here the write is slower than the read benchmark. Is it because rados
> bench sync()s each object after writing? And there is no readahead,
> so all 10 threads are busy for the whole benchmark, whereas in the
> CephFS scenario it depends on the client readahead setting whether 10
> stripes are requested in parallel all the time?
>
>> On Wed, Sep 14, 2016 at 12:51 PM, Henrik Korkuc <lists@xxxxxxxxx> wrote:
>>> On 16-09-14 18:21, Andreas Gerstmayr wrote:
>>>> Hello,
>>>>
>>>> I'm currently performing some benchmark tests with our Ceph storage
>>>> cluster and trying to find the bottleneck in our system.
>>>>
>>>> I'm writing a random 30GB file with the following command:
>>>> $ time fio --name=job1 --rw=write --blocksize=1MB --size=30GB \
>>>>     --randrepeat=0 --end_fsync=1
>>>> [...]
>>>> WRITE: io=30720MB, aggrb=893368KB/s, minb=893368KB/s,
>>>> maxb=893368KB/s, mint=35212msec, maxt=35212msec
>>>>
>>>> real    0m35.539s
>>>>
>>>> This makes use of the page cache, but fsync()s at the end (network
>>>> traffic from the client stops here, so the OSDs should have the
>>>> data).
>>>>
>>>> When I read the same file back:
>>>> $ time fio --name=job1 --rw=read --blocksize=1MB --size=30G
>>>> [...]
>>>> READ: io=30720MB, aggrb=693854KB/s, minb=693854KB/s,
>>>> maxb=693854KB/s, mint=45337msec, maxt=45337msec
>>>>
>>>> real    0m45.627s
>>>>
>>>> It takes 10s longer. Why? When writing data to a Ceph storage
>>>> cluster, the data is written twice (unbuffered to the journal and
>>>> buffered to the backing filesystem [1]). On the other hand, reading
>>>> should be much faster because it needs only a single operation, the
>>>> data should already be in the page cache of the OSDs (I'm reading
>>>> the same file I've written before, and the OSDs have plenty of
>>>> RAM), and reading from disks is generally faster than writing. Any
>>>> idea what is going on in the background that makes reads more
>>>> expensive than writes?
>>>
>>> I am not an expert here, but I think it basically boils down to the
>>> fact that you read linearly and write (flush the cache) in parallel.
>>>
>>> If you could read multiple parts of the same file in parallel, you
>>> could achieve better speeds.
>
> I thought the striping feature of CephFS does exactly that? Write and
> read stripe_count stripes in parallel?
>
>>>> I've run these tests multiple times with fairly consistent results.
>>>>
>>>> Cluster Config:
>>>> Ceph jewel, 3 nodes with 256GB RAM and 25 disks each (only HDDs,
>>>> journal on same disk)
>>>> Pool with size=1 and 2048 PGs, CephFS stripe unit: 1MB, stripe
>>>> count: 10, object size: 10MB
>>>> 10 GbE, separate frontend+backend network
>>>>
>>>> [1] https://www.sebastien-han.fr/blog/2014/02/17/ceph-io-patterns-the-bad/
>>>>
>>>> Thanks,
>>>> Andreas
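
As for the striping question above: the client does fan reads and
writes out across stripe_count objects, but only for the byte range it
actually has in flight, and for buffered reads that range is bounded by
readahead. If you want to double-check the layout that really applies
to the benchmark file, the CephFS layout vxattrs are the easiest way; a
quick sketch (the mount point and file name are just placeholders for
your setup, and the output line is roughly what I'd expect, not
verbatim):

$ getfattr -n ceph.file.layout /mnt/cephfs/job1.0.0
# ceph.file.layout="stripe_unit=1048576 stripe_count=10 object_size=10485760 pool=..."

$ # layouts for new files are inherited from the directory, e.g.:
$ setfattr -n ceph.dir.layout.stripe_count -v 10 /mnt/cephfs/testdir

With stripe_unit=1MB and stripe_count=10, a single stripe row spans
10MB across 10 different objects, so the read path only keeps all 10
objects busy once readahead covers at least one full row at a time.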