2016-09-14 23:19 GMT+02:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
> This is pretty standard behavior within Ceph as a whole — the journals
> really help on writes;

How does the journal help with large blocks? I thought the journal speedup comes from coalescing lots of small writes into bigger blocks - but in my benchmark the block size is already 1MB.

> and especially with big block sizes you'll
> exceed the size of readahead, but writes will happily flush out in
> parallel.

So the client buffers lots of pages in the page cache and sends them in bulk to the storage nodes, where multiple OSDs can write the data in parallel (because each OSD has its own disk), whereas the readahead window is much smaller than the write-back cache, so reads can't be parallelized to the same degree (too little data is requested from the cluster at once)? Did I get that right?

I just started a rados benchmark with similar settings (same block size, 10 threads - same as the stripe count):

$ rados bench -p repl1 180 -b 1M -t 10 write --no-cleanup
Total time run:         180.148379
Total writes made:      47285
Write size:             1048576
Object size:            1048576
Bandwidth (MB/sec):     262.478

Reading:

$ rados bench -p repl1 60 -t 10 seq
Total time run:         49.936949
Total reads made:       47285
Read size:              1048576
Object size:            1048576
Bandwidth (MB/sec):     946.894

Here the write benchmark is slower than the read benchmark. Is that because rados bench syncs each object after writing it? And because there is no readahead involved, all 10 threads stay busy for the whole benchmark, whereas in the CephFS scenario the client readahead setting determines whether 10 stripes are requested in parallel all the time?

>
> On Wed, Sep 14, 2016 at 12:51 PM, Henrik Korkuc <lists@xxxxxxxxx> wrote:
>> On 16-09-14 18:21, Andreas Gerstmayr wrote:
>>>
>>> Hello,
>>>
>>> I'm currently performing some benchmark tests with our Ceph storage
>>> cluster and trying to find the bottleneck in our system.
>>>
>>> I'm writing a random 30GB file with the following command:
>>> $ time fio --name=job1 --rw=write --blocksize=1MB --size=30GB
>>> --randrepeat=0 --end_fsync=1
>>> [...]
>>> WRITE: io=30720MB, aggrb=893368KB/s, minb=893368KB/s,
>>> maxb=893368KB/s, mint=35212msec, maxt=35212msec
>>>
>>> real    0m35.539s
>>>
>>> This makes use of the page cache, but fsync()s at the end (network
>>> traffic from the client stops here, so the OSDs should have the data).
>>>
>>> When I read the same file back:
>>> $ time fio --name=job1 --rw=read --blocksize=1MB --size=30G
>>> [...]
>>> READ: io=30720MB, aggrb=693854KB/s, minb=693854KB/s,
>>> maxb=693854KB/s, mint=45337msec, maxt=45337msec
>>>
>>> real    0m45.627s
>>>
>>> It takes 10s longer. Why? When writing data to a Ceph storage cluster,
>>> the data is written twice (unbuffered to the journal and buffered to
>>> the backing filesystem [1]). On the other hand, reading should be much
>>> faster because it needs only a single operation, the data should be
>>> already in the page cache of the OSDs (I'm reading the same file I've
>>> written before, and the OSDs have plenty of RAM) and reading from
>>> disks is generally faster than writing. Any idea what is going on in
>>> the background, which makes reads more expensive than writes?
>>
>> I am not an expert here, but I think it basically boils down to that you
>> read it linearly and write (flush cache) in parallel.
>>
>> If you could read multiple parts of the same file in parallel you could
>> achieve better speeds

I thought the striping feature of CephFS does exactly that: write and read stripe_count stripes in parallel?
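
To test the "parallel reads" theory independently of readahead, I'll try a read with several requests in flight. Something like the fio invocation below should keep about 10 reads outstanding (just a sketch on my side; the queue depth is picked to match the stripe count and I haven't verified it on our cluster yet):

$ time fio --name=job1 --rw=read --blocksize=1MB --size=30G \
      --ioengine=libaio --iodepth=10 --direct=1

With --direct=1 the page cache (and with it readahead) is bypassed, so any speedup compared to the buffered read above should come purely from the parallel in-flight requests.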
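
And if readahead really is the limiting factor, raising it on the client should close the gap. Assuming we're on the kernel client, I would experiment along these lines (the bdi name, the monitor address and the rasize value are assumptions/placeholders on my part and need to be adapted to the actual setup):

# check and raise the readahead window of the CephFS mount
$ cat /sys/class/bdi/ceph-*/read_ahead_kb
$ echo 40960 | sudo tee /sys/class/bdi/ceph-*/read_ahead_kb

# or remount with a larger readahead size (rasize is in bytes, here 40MB = 4 stripe periods of 10MB)
$ sudo mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,rasize=41943040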
>>
>>>
>>> I've run these tests multiple times with fairly consistent results.
>>>
>>> Cluster Config:
>>> Ceph jewel, 3 nodes with 256GB RAM and 25 disks each (only HDDs,
>>> journal on same disk)
>>> Pool with size=1 and 2048 PGs, CephFS stripe unit: 1MB, stripe count:
>>> 10, object size: 10MB
>>> 10 GbE, separate frontend+backend network
>>>
>>> [1] https://www.sebastien-han.fr/blog/2014/02/17/ceph-io-patterns-the-bad/
>>>
>>> Thanks,
>>> Andreas

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com