Re: CephFS: Writes are faster than reads?

Oh hrm, I missed the stripe count settings. I'm not sure if that's
helping you or not; I don't have a good intuitive grasp of what
readahead will do in that case. I think you may need to adjust the
readahead config knob in order to make it read all those objects
together instead of one or two at a time.
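
For reference, and from memory (so double-check the names against the
docs for your version): on the kernel client the knob is the rasize
mount option, and on ceph-fuse/libcephfs it's the client_readahead_*
options. Something like:

# kernel client: raise max readahead to 128MB (rasize is in bytes;
# the mon address and mount point below are just placeholders)
$ mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,rasize=134217728

# ceph-fuse / libcephfs: [client] section of ceph.conf
[client]
client_readahead_min = 4194304       # read ahead at least 4MB at a time
client_readahead_max_bytes = 0       # 0 = no hard byte cap; instead...
client_readahead_max_periods = 4     # ...read up to 4 stripe periods ahead

With your 10MB stripe period, 4 periods would keep up to ~40MB in
flight, spread across all 10 objects in a stripe set, which is much
closer to what the write path gets for free.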
-Greg

On Wed, Sep 14, 2016 at 3:24 PM, Andreas Gerstmayr
<andreas.gerstmayr@xxxxxxxxx> wrote:
> 2016-09-14 23:19 GMT+02:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
>> This is pretty standard behavior within Ceph as a whole — the journals
>> really help on writes;
> How does the journal help with large blocks? I thought the journal
> speedup comes from coalescing lots of small writes into bigger
> blocks, but in my benchmark the block size is already 1MB.
>
>> and especially with big block sizes you'll
>> exceed the size of readahead, but writes will happily flush out in
>> parallel.
> The client buffers lots of pages in the page cache and sends them in
> bulk to the storage nodes, where multiple OSDs can write the data in
> parallel (because each OSD has its own disk), whereas the readahead
> window is much smaller than the buffer cache, so reads can't be
> parallelized nearly as much (too little data is requested from the
> cluster at a time)?
> Did I get that right?
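>
> To put rough numbers on it (assuming the kernel client's default
> readahead of about 8MB, which I'd still have to verify):
>
>   write path: gigabytes of dirty pages flushing in parallel to 75 OSDs
>   read path:  8MB readahead / 1MB stripe unit = at most 8 objects
>               (and thus 8 OSDs) busy at any one time
>
> That asymmetry alone could explain most of the gap.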
>
> I just started a rados benchmark with similar settings: the same
> block size, and 10 threads (matching the stripe count):
> $ rados bench -p repl1 180 -b 1M -t 10 write --no-cleanup
> Total time run:         180.148379
> Total writes made:      47285
> Write size:             1048576
> Object size:            1048576
> Bandwidth (MB/sec):     262.478
>
> Reading:
> $ rados bench -p repl1 60 -t 10 seq
> Total time run:       49.936949
> Total reads made:     47285
> Read size:            1048576
> Object size:          1048576
> Bandwidth (MB/sec):   946.894
>
> Here writing is slower than reading. Is that because rados bench
> syncs each object after writing it? And since there is no readahead,
> all 10 threads stay busy for the whole read benchmark, whereas in the
> CephFS scenario it depends on the client readahead setting whether 10
> stripes are requested in parallel all the time?
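>
> One way I could test that (untried so far): rerun the sequential read
> benchmark with a single thread and see whether the bandwidth drops to
> roughly a tenth:
> $ rados bench -p repl1 60 -t 1 seq
> If it does, the CephFS read gap is probably just the read-side
> parallelism being capped by readahead.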
>
>
>>
>> On Wed, Sep 14, 2016 at 12:51 PM, Henrik Korkuc <lists@xxxxxxxxx> wrote:
>>> On 16-09-14 18:21, Andreas Gerstmayr wrote:
>>>>
>>>> Hello,
>>>>
>>>> I'm currently performing some benchmark tests with our Ceph storage
>>>> cluster and trying to find the bottleneck in our system.
>>>>
>>>> I'm writing a random 30GB file with the following command:
>>>> $ time fio --name=job1 --rw=write --blocksize=1MB --size=30GB
>>>> --randrepeat=0 --end_fsync=1
>>>> [...]
>>>>   WRITE: io=30720MB, aggrb=893368KB/s, minb=893368KB/s,
>>>> maxb=893368KB/s, mint=35212msec, maxt=35212msec
>>>>
>>>> real    0m35.539s
>>>>
>>>> This makes use of the page cache, but fsync()s at the end (network
>>>> traffic from the client stops here, so the OSDs should have the data).
>>>>
>>>> When I read the same file back:
>>>> $ time fio --name=job1 --rw=read --blocksize=1MB --size=30G
>>>> [...]
>>>>     READ: io=30720MB, aggrb=693854KB/s, minb=693854KB/s,
>>>> maxb=693854KB/s, mint=45337msec, maxt=45337msec
>>>>
>>>> real    0m45.627s
>>>>
>>>> It takes 10s longer. Why? When writing data to a Ceph storage cluster,
>>>> the data is written twice (unbuffered to the journal and buffered to
>>>> the backing filesystem [1]). Reading, on the other hand, should be much
>>>> faster: it needs only a single operation, the data should already be in
>>>> the page cache of the OSDs (I'm reading the same file I've written
>>>> before, and the OSDs have plenty of RAM), and reading from disks is
>>>> generally faster than writing. Any idea what is going on in the
>>>> background that makes reads more expensive than writes?
>>>
>>> I am not an expert here, but I think it basically boils down to the
>>> fact that you read linearly but write (i.e. flush the page cache) in
>>> parallel.
>>>
>>> If you could read multiple parts of the same file in parallel, you
>>> could achieve better speeds.
>
> I thought the striping feature of CephFS does exactly that: write and
> read stripe_count stripes in parallel?
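>
> For reference, I configured the layout through the CephFS virtual
> xattrs before writing the test file (the paths below are just
> examples; note that a file's layout can only be changed while the
> file is still empty):
> $ setfattr -n ceph.dir.layout.stripe_unit  -v 1048576  /mnt/cephfs/bench
> $ setfattr -n ceph.dir.layout.stripe_count -v 10       /mnt/cephfs/bench
> $ setfattr -n ceph.dir.layout.object_size  -v 10485760 /mnt/cephfs/bench
> $ getfattr -n ceph.file.layout /mnt/cephfs/bench/testfile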
>
>>>
>>>
>>>>
>>>> I've run these tests multiple times with fairly consistent results.
>>>>
>>>> Cluster Config:
>>>> Ceph jewel, 3 nodes with 256GB RAM and 25 disks each (only HDDs,
>>>> journal on same disk)
>>>> Pool with size=1 and 2048 PGs, CephFS stripe unit: 1MB, stripe count:
>>>> 10, object size: 10MB
>>>> 10 GbE, separate frontend+backend network
>>>>
>>>> [1] https://www.sebastien-han.fr/blog/2014/02/17/ceph-io-patterns-the-bad/
>>>>
>>>>
>>>> Thanks,
>>>> Andreas
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



