2016-09-14 23:19 GMT+02:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
> This is pretty standard behavior within Ceph as a whole — the journals
> really help on writes;

How does the journal help with large blocks? I thought the journal speedup comes from coalescing lots of small writes into bigger blocks - but in my benchmark the block size is already 1MB.

> and especially with big block sizes you'll
> exceed the size of readahead, but writes will happily flush out in
> parallel.

So the client buffers lots of pages in the page cache and sends them in bulk to the storage nodes, where multiple OSDs can write the data in parallel (because each OSD has its own disk), whereas the readahead window is much smaller than the write-back cache, so reads can't be parallelized to the same degree (too little data is requested from the cluster at once)? Did I get that right?

I just started a rados benchmark with similar settings (same block size, 10 threads - same as the stripe count):

$ rados bench -p repl1 180 -b 1M -t 10 write --no-cleanup
Total time run:         180.148379
Total writes made:      47285
Write size:             1048576
Object size:            1048576
Bandwidth (MB/sec):     262.478

Reading:

$ rados bench -p repl1 60 -t 10 seq
Total time run:         49.936949
Total reads made:       47285
Read size:              1048576
Object size:            1048576
Bandwidth (MB/sec):     946.894

Here the write benchmark is slower than the read benchmark. Is that because rados bench syncs each object after writing it? And because there is no readahead involved, all 10 threads stay busy for the whole benchmark, whereas in the CephFS scenario the client readahead setting determines whether 10 stripes are requested in parallel all the time?

>
> On Wed, Sep 14, 2016 at 12:51 PM, Henrik Korkuc <lists@xxxxxxxxx> wrote:
>> On 16-09-14 18:21, Andreas Gerstmayr wrote:
>>>
>>> Hello,
>>>
>>> I'm currently performing some benchmark tests with our Ceph storage
>>> cluster and trying to find the bottleneck in our system.
>>>
>>> I'm writing a random 30GB file with the following command:
>>> $ time fio --name=job1 --rw=write --blocksize=1MB --size=30GB
>>> --randrepeat=0 --end_fsync=1
>>> [...]
>>> WRITE: io=30720MB, aggrb=893368KB/s, minb=893368KB/s,
>>> maxb=893368KB/s, mint=35212msec, maxt=35212msec
>>>
>>> real    0m35.539s
>>>
>>> This makes use of the page cache, but fsync()s at the end (network
>>> traffic from the client stops here, so the OSDs should have the data).
>>>
>>> When I read the same file back:
>>> $ time fio --name=job1 --rw=read --blocksize=1MB --size=30G
>>> [...]
>>> READ: io=30720MB, aggrb=693854KB/s, minb=693854KB/s,
>>> maxb=693854KB/s, mint=45337msec, maxt=45337msec
>>>
>>> real    0m45.627s
>>>
>>> It takes 10s longer. Why? When writing data to a Ceph storage cluster,
>>> the data is written twice (unbuffered to the journal and buffered to
>>> the backing filesystem [1]). On the other hand, reading should be much
>>> faster because it needs only a single operation, the data should be
>>> already in the page cache of the OSDs (I'm reading the same file I've
>>> written before, and the OSDs have plenty of RAM) and reading from
>>> disks is generally faster than writing. Any idea what is going on in
>>> the background, which makes reads more expensive than writes?
>>
>> I am not an expert here, but I think it basically boils down to that you
>> read it linearly and write (flush cache) in parallel.
>>
>> If you could read multiple parts of the same file in parallel you could
>> achieve better speeds

I thought the striping feature of CephFS does exactly that: write and read stripe_count stripes in parallel?
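
To test the "parallel reads" theory independently of readahead, I'll try a read with several requests in flight. Something like the fio invocation below should keep about 10 reads outstanding (just a sketch on my side; the queue depth is picked to match the stripe count and I haven't verified it on our cluster yet):

$ time fio --name=job1 --rw=read --blocksize=1MB --size=30G \
      --ioengine=libaio --iodepth=10 --direct=1

With --direct=1 the page cache (and with it readahead) is bypassed, so any speedup compared to the buffered read above should come purely from the parallel in-flight requests.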
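
And if readahead really is the limiting factor, raising it on the client should close the gap. Assuming we're on the kernel client, I would experiment along these lines (the bdi name, the monitor address and the rasize value are assumptions/placeholders on my part and need to be adapted to the actual setup):

# check and raise the readahead window of the CephFS mount
$ cat /sys/class/bdi/ceph-*/read_ahead_kb
$ echo 40960 | sudo tee /sys/class/bdi/ceph-*/read_ahead_kb

# or remount with a larger readahead size (rasize is in bytes, here 40MB = 4 stripe periods of 10MB)
$ sudo mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,rasize=41943040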
>>
>>>
>>> I've run these tests multiple times with fairly consistent results.
>>>
>>> Cluster Config:
>>> Ceph jewel, 3 nodes with 256GB RAM and 25 disks each (only HDDs,
>>> journal on same disk)
>>> Pool with size=1 and 2048 PGs, CephFS stripe unit: 1MB, stripe count:
>>> 10, object size: 10MB
>>> 10 GbE, separate frontend+backend network
>>>
>>> [1] https://www.sebastien-han.fr/blog/2014/02/17/ceph-io-patterns-the-bad/
>>>
>>> Thanks,
>>> Andreas

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com