Re: CephFS: Writes are faster than reads?

Gregory Farnum <gfarnum@xxxxxxxxxx> · Wed, 14 Sep 2016 14:19:54 -0700

This is pretty standard behavior within Ceph as a whole. The journals
really help on writes, and especially with big block sizes a sequential
read will exceed the size of readahead, whereas writes will happily
flush out in parallel.
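
To make the read side of that concrete, here is a minimal sketch of
reading several parts of one file concurrently instead of in a single
linear stream. It is an illustration only, not something measured on
this cluster: the function names, the worker count and the 1 MiB block
size are assumptions chosen for the example, and it relies on
os.pread(), which is only available on Unix-like systems.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_range(path, offset, length, block_size=1 << 20):
    """Read `length` bytes starting at `offset`; return the byte count read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        done = 0
        while done < length:
            # pread takes an explicit offset, so workers never share a file position
            chunk = os.pread(fd, min(block_size, length - done), offset + done)
            if not chunk:  # reached end of file early
                break
            done += len(chunk)
        return done
    finally:
        os.close(fd)


def parallel_read(path, workers=4, block_size=1 << 20):
    """Split the file into up to `workers` contiguous ranges, read them concurrently."""
    size = os.path.getsize(path)
    if size == 0:
        return 0
    span = -(-size // workers)  # ceiling division: bytes per worker
    ranges = [(off, min(span, size - off)) for off in range(0, size, span)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(read_range, path, off, n, block_size)
                   for off, n in ranges]
        return sum(f.result() for f in futures)
```

Calling parallel_read() on the benchmark file under the CephFS mount
(wherever that is mounted on the client) keeps several reads in flight
at once, which is roughly what the write path already gets from
flushing the page cache. Whether it actually closes the gap depends on
the cluster; time it before drawing conclusions.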

On Wed, Sep 14, 2016 at 12:51 PM, Henrik Korkuc <lists@xxxxxxxxx> wrote:
> On 16-09-14 18:21, Andreas Gerstmayr wrote:
>>
>> Hello,
>>
>> I'm currently performing some benchmark tests with our Ceph storage
>> cluster and trying to find the bottleneck in our system.
>>
>> I'm writing a random 30GB file with the following command:
>> $ time fio --name=job1 --rw=write --blocksize=1MB --size=30GB
>> --randrepeat=0 --end_fsync=1
>> [...]
>>   WRITE: io=30720MB, aggrb=893368KB/s, minb=893368KB/s,
>> maxb=893368KB/s, mint=35212msec, maxt=35212msec
>>
>> real    0m35.539s
>>
>> This makes use of the page cache, but fsync()s at the end (network
>> traffic from the client stops here, so the OSDs should have the data).
>>
>> When I read the same file back:
>> $ time fio --name=job1 --rw=read --blocksize=1MB --size=30G
>> [...]
>>     READ: io=30720MB, aggrb=693854KB/s, minb=693854KB/s,
>> maxb=693854KB/s, mint=45337msec, maxt=45337msec
>>
>> real    0m45.627s
>>
>> It takes 10s longer. Why? When writing data to a Ceph storage cluster,
>> the data is written twice (unbuffered to the journal and buffered to
>> the backing filesystem [1]). Reading, on the other hand, should be much
>> faster: it needs only a single operation, the data should already be
>> in the page cache of the OSDs (I'm reading the same file I've just
>> written, and the OSDs have plenty of RAM), and reading from disks is
>> generally faster than writing. Any idea what is going on in the
>> background that makes reads more expensive than writes?
>
> I am not an expert here, but I think it basically boils down to this:
> you read linearly, but write (flush the cache) in parallel.
>
> If you could read multiple parts of the same file in parallel, you could
> achieve better speeds.
>
>
>>
>> I've run these tests multiple times with fairly consistent results.
>>
>> Cluster Config:
>> Ceph jewel, 3 nodes with 256GB RAM and 25 disks each (only HDDs,
>> journal on same disk)
>> Pool with size=1 and 2048 PGs, CephFS stripe unit: 1MB, stripe count:
>> 10, object size: 10MB
>> 10 GbE, separate frontend+backend network
>>
>> [1] https://www.sebastien-han.fr/blog/2014/02/17/ceph-io-patterns-the-bad/
>>
>>
>> Thanks,
>> Andreas
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com