This is pretty standard behavior within Ceph as a whole: the journals
really help on writes; and especially with big block sizes you'll
exceed the size of readahead, but writes will happily flush out in
parallel.

On Wed, Sep 14, 2016 at 12:51 PM, Henrik Korkuc <lists@xxxxxxxxx> wrote:
> On 16-09-14 18:21, Andreas Gerstmayr wrote:
>>
>> Hello,
>>
>> I'm currently performing some benchmark tests with our Ceph storage
>> cluster and trying to find the bottleneck in our system.
>>
>> I'm writing a random 30GB file with the following command:
>> $ time fio --name=job1 --rw=write --blocksize=1MB --size=30GB
>> --randrepeat=0 --end_fsync=1
>> [...]
>> WRITE: io=30720MB, aggrb=893368KB/s, minb=893368KB/s,
>> maxb=893368KB/s, mint=35212msec, maxt=35212msec
>>
>> real 0m35.539s
>>
>> This makes use of the page cache, but fsync()s at the end (network
>> traffic from the client stops here, so the OSDs should have the data).
>>
>> When I read the same file back:
>> $ time fio --name=job1 --rw=read --blocksize=1MB --size=30G
>> [...]
>> READ: io=30720MB, aggrb=693854KB/s, minb=693854KB/s,
>> maxb=693854KB/s, mint=45337msec, maxt=45337msec
>>
>> real 0m45.627s
>>
>> It takes 10s longer. Why? When writing data to a Ceph storage cluster,
>> the data is written twice (unbuffered to the journal and buffered to
>> the backing filesystem [1]). On the other hand, reading should be much
>> faster because it needs only a single operation, the data should be
>> already in the page cache of the OSDs (I'm reading the same file I've
>> written before, and the OSDs have plenty of RAM) and reading from
>> disks is generally faster than writing. Any idea what is going on in
>> the background, which makes reads more expensive than writes?
>
> I am not an expert here, but I think it basically boils down to that you
> read it linearly and write (flush cache) in parallel.
>
> If you could read multiple parts of the same file in parallel you could
> achieve better speeds
>
>>
>> I've run these tests multiple times with fairly consistent results.
>>
>> Cluster Config:
>> Ceph jewel, 3 nodes with 256GB RAM and 25 disks each (only HDDs,
>> journal on same disk)
>> Pool with size=1 and 2048 PGs, CephFS stripe unit: 1MB, stripe count:
>> 10, object size: 10MB
>> 10 GbE, separate frontend+backend network
>>
>> [1] https://www.sebastien-han.fr/blog/2014/02/17/ceph-io-patterns-the-bad/
>>
>>
>> Thanks,
>> Andreas
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
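
To make the "read multiple parts of the same file in parallel" suggestion
from the thread concrete, here is a minimal sketch in Python. It is not a
Ceph API and not what fio does internally; it only issues plain POSIX
positional reads from several threads, so that more than one request is in
flight at a time. The function names (read_range, parallel_read), the
default of 10 workers (picked only to mirror the stripe count of 10 in the
cluster config above, not a tuned value) and the 1 MiB block size are all
assumptions made for illustration. Whether this actually closes the gap to
the write throughput depends on the cluster and was not measured here.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_range(fd, offset, length, block_size=1 << 20):
    """Read `length` bytes starting at `offset`, block_size at a time.

    Returns the number of bytes actually read. The data is discarded,
    since the point is only to generate read traffic.
    """
    done = 0
    while done < length:
        data = os.pread(fd, min(block_size, length - done), offset + done)
        if not data:  # reached end of file earlier than expected
            break
        done += len(data)
    return done


def parallel_read(path, workers=10, block_size=1 << 20):
    """Split the file into up to `workers` contiguous ranges and read them
    concurrently. Returns the total number of bytes read."""
    size = os.path.getsize(path)
    if size == 0:
        return 0
    span = -(-size // workers)  # ceiling division: bytes per worker
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [
                pool.submit(read_range, fd, off, min(span, size - off), block_size)
                for off in range(0, size, span)
            ]
            return sum(f.result() for f in futures)
    finally:
        os.close(fd)
```

Wrapping a call such as parallel_read("/mnt/cephfs/testfile") (a
hypothetical path) in a timer and comparing against a run with workers=1
would show whether a single sequential reader is the limiting factor. The
client's page cache should be dropped between runs, otherwise the second
run mostly measures local memory.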