Hello,

I'm currently performing some benchmark tests with our Ceph storage cluster and trying to find the bottleneck in our system.

I'm writing a random 30GB file with the following command:

$ time fio --name=job1 --rw=write --blocksize=1MB --size=30GB --randrepeat=0 --end_fsync=1
[...]
  WRITE: io=30720MB, aggrb=893368KB/s, minb=893368KB/s, maxb=893368KB/s, mint=35212msec, maxt=35212msec
real    0m35.539s

This makes use of the page cache, but fsync()s at the end (network traffic from the client stops at that point, so the OSDs should have the data).

When I read the same file back:

$ time fio --name=job1 --rw=read --blocksize=1MB --size=30G
[...]
   READ: io=30720MB, aggrb=693854KB/s, minb=693854KB/s, maxb=693854KB/s, mint=45337msec, maxt=45337msec
real    0m45.627s

it takes 10 seconds longer. Why?

When writing data to a Ceph storage cluster, the data is written twice: unbuffered to the journal and buffered to the backing filesystem [1]. Reading, on the other hand, should be much faster: it needs only a single operation, the data should already be in the page cache of the OSDs (I'm reading the same file I've just written, and the OSDs have plenty of RAM), and reading from disk is generally faster than writing.

Any idea what is going on in the background that makes reads more expensive than writes? I've run these tests multiple times with fairly consistent results.

Cluster config:
- Ceph jewel, 3 nodes with 256GB RAM and 25 disks each (HDDs only, journal on the same disk)
- Pool with size=1 and 2048 PGs; CephFS stripe unit: 1MB, stripe count: 10, object size: 10MB
- 10 GbE, separate frontend and backend networks

[1] https://www.sebastien-han.fr/blog/2014/02/17/ceph-io-patterns-the-bad/

Thanks,
Andreas
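
P.S. To rule out client-side page-cache and readahead effects, my rough plan is to repeat the read after dropping the client's page cache, and once more with direct I/O. This is only a sketch I haven't run yet; the job name is just a placeholder:

# drop the client's page cache so the read really has to go over the network
$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# re-read the file with the client page cache bypassed entirely
$ time fio --name=job1 --rw=read --blocksize=1MB --size=30G --direct=1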
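
P.P.S. To separate the CephFS client from the OSD side, I also intend to benchmark the pool directly with rados bench. Again only a sketch; "cephfs_data" stands in for our actual data pool name:

# write 1MB objects for 30 seconds and keep them around for the read test
$ rados bench -p cephfs_data 30 write -b 1048576 --no-cleanup

# sequential read of the objects just written
$ rados bench -p cephfs_data 30 seq

# remove the benchmark objects afterwards
$ rados -p cephfs_data cleanup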