Re: CephFS: Writes are faster than reads?

Thanks a lot for your explanation!
I just increased the 'rasize' option of the kernel module and got
significantly better throughput for sequential reads.
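
For anyone else who runs into this, a minimal sketch of how the readahead
size can be set at mount time (monitor address, secret file and the 128MB
value are placeholders, not my actual settings; rasize is in bytes):

$ sudo mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=134217728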


Thanks,
Andreas


2016-09-15 0:29 GMT+02:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
> Oh hrm, I missed the stripe count settings. I'm not sure if that's
> helping you or not; I don't have a good intuitive grasp of what
> readahead will do in that case. I think you may need to adjust the
> readahead config knob in order to make it read all those objects
> together instead of one or two at a time.
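> (For reference, with the kernel client that knob is the 'rasize' mount
> option; once it is set away from the default it should be visible in
> the mount options, e.g. via
> $ grep ceph /proc/mounts
> though the exact output format may vary.)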
> -Greg
>
> On Wed, Sep 14, 2016 at 3:24 PM, Andreas Gerstmayr
> <andreas.gerstmayr@xxxxxxxxx> wrote:
>> 2016-09-14 23:19 GMT+02:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
>>> This is pretty standard behavior within Ceph as a whole — the journals
>>> really help on writes;
>> How does the journal help with large blocks? I thought the journal
>> speed-up comes from coalescing lots of small writes into bigger
>> blocks - but in my benchmark the block size is already 1MB.
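>> (If coalescing were the main effect, I'd expect the difference to show
>> up mostly with small blocks; a rough way to check - job name and size
>> are arbitrary - would be something like
>> $ fio --name=smallbs --rw=randwrite --blocksize=4k --size=1GB --end_fsync=1
>> and compare that against the 1MB run.)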
>>
>>> and especially with big block sizes you'll
>>> exceed the size of readahead, but writes will happily flush out in
>>> parallel.
>> The client buffers lots of pages in the page cache and sends them in
>> bulk to the storage nodes, where multiple OSDs can write the data in
>> parallel (because each OSD has its own disk), whereas the readahead
>> size is way smaller than the buffer cache, so reads can't be
>> parallelized as much (too little data is requested from the
>> cluster)?
>> Did I get that right?
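>> (Side note: the readahead actually in effect can be read back on the
>> client; the CephFS mount should show up as a ceph-* entry under
>> /sys/class/bdi, though the exact name may differ:
>> $ cat /sys/class/bdi/ceph-*/read_ahead_kb )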
>>
>> I just started a rados benchmark with similar settings (same
>> blocksize, 10 threads (same as the stripe count)):
>> $ rados bench -p repl1 180 -b 1M -t 10 write --no-cleanup
>> Total time run:         180.148379
>> Total writes made:      47285
>> Write size:             1048576
>> Object size:            1048576
>> Bandwidth (MB/sec):     262.478
>>
>> Reading:
>> $ rados bench -p repl1 60 -t 10 seq
>> Total time run:       49.936949
>> Total reads made:     47285
>> Read size:            1048576
>> Object size:          1048576
>> Bandwidth (MB/sec):   946.894
>>
>> Here the write benchmark is slower than the read benchmark. Is that
>> because rados syncs each object after writing? And there is no
>> readahead, so all 10 threads are busy the whole time during the
>> benchmark, whereas in the CephFS scenario it depends on the client
>> readahead setting whether 10 stripes are requested in parallel all
>> the time?
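>> (One way to confirm the parallelism theory would be to repeat the seq
>> bench with a single thread and see whether the bandwidth drops to
>> roughly a tenth:
>> $ rados bench -p repl1 60 -t 1 seq )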
>>
>>
>>>
>>> On Wed, Sep 14, 2016 at 12:51 PM, Henrik Korkuc <lists@xxxxxxxxx> wrote:
>>>> On 16-09-14 18:21, Andreas Gerstmayr wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I'm currently performing some benchmark tests with our Ceph storage
>>>>> cluster and trying to find the bottleneck in our system.
>>>>>
>>>>> I'm writing a random 30GB file with the following command:
>>>>> $ time fio --name=job1 --rw=write --blocksize=1MB --size=30GB
>>>>> --randrepeat=0 --end_fsync=1
>>>>> [...]
>>>>>   WRITE: io=30720MB, aggrb=893368KB/s, minb=893368KB/s,
>>>>> maxb=893368KB/s, mint=35212msec, maxt=35212msec
>>>>>
>>>>> real    0m35.539s
>>>>>
>>>>> This makes use of the page cache, but fsync()s at the end (network
>>>>> traffic from the client stops here, so the OSDs should have the data).
>>>>>
>>>>> When I read the same file back:
>>>>> $ time fio --name=job1 --rw=read --blocksize=1MB --size=30G
>>>>> [...]
>>>>>     READ: io=30720MB, aggrb=693854KB/s, minb=693854KB/s,
>>>>> maxb=693854KB/s, mint=45337msec, maxt=45337msec
>>>>>
>>>>> real    0m45.627s
>>>>>
>>>>> It takes 10s longer. Why? When writing data to a Ceph storage cluster,
>>>>> the data is written twice (unbuffered to the journal and buffered to
>>>>> the backing filesystem [1]). On the other hand, reading should be much
>>>>> faster because it needs only a single operation, the data should be
>>>>> already in the page cache of the OSDs (I'm reading the same file I've
>>>>> written before, and the OSDs have plenty of RAM) and reading from
>>>>> disks is generally faster than writing. Any idea what is going on in
>>>>> the background, which makes reads more expensive than writes?
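>>>>> (To rule out the client's own page cache serving the read-back of a
>>>>> file that was just written, the caches could also be dropped on the
>>>>> client between the write and the read run, e.g.:
>>>>> $ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches )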
>>>>
>>>> I am not an expert here, but I think it basically boils down to the
>>>> fact that you read linearly but write (flush the cache) in parallel.
>>>>
>>>> If you could read multiple parts of the same file in parallel, you
>>>> could achieve better speeds.
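>>>> As an untested sketch (ioengine and iodepth values are arbitrary),
>>>> direct I/O with a deeper queue would issue several 1MB reads at once
>>>> and take readahead out of the picture:
>>>> $ fio --name=parread --rw=read --blocksize=1MB --size=30G \
>>>>       --ioengine=libaio --direct=1 --iodepth=16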
>>
>> I thought the striping feature of CephFS did exactly that - write and
>> read stripe_count stripes in parallel?
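>> (For reference, the layout that applies to a given file can be checked
>> via the CephFS virtual xattrs - the path here is just a placeholder:
>> $ getfattr -n ceph.file.layout /mnt/cephfs/testfile
>> which reports stripe_unit, stripe_count, object_size and the pool.)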
>>
>>>>
>>>>
>>>>>
>>>>> I've run these tests multiple times with fairly consistent results.
>>>>>
>>>>> Cluster Config:
>>>>> Ceph jewel, 3 nodes with 256GB RAM and 25 disks each (only HDDs,
>>>>> journal on same disk)
>>>>> Pool with size=1 and 2048 PGs, CephFS stripe unit: 1MB, stripe count:
>>>>> 10, object size: 10MB
>>>>> 10 GbE, separate frontend+backend network
>>>>>
>>>>> [1] https://www.sebastien-han.fr/blog/2014/02/17/ceph-io-patterns-the-bad/
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Andreas
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



