Hello,

I'm currently performing some benchmark tests with our Ceph storage cluster and trying to find the bottleneck in our system.

I'm writing a random 30GB file with the following command:

$ time fio --name=job1 --rw=write --blocksize=1MB --size=30GB --randrepeat=0 --end_fsync=1
[...]
  WRITE: io=30720MB, aggrb=893368KB/s, minb=893368KB/s, maxb=893368KB/s, mint=35212msec, maxt=35212msec
real    0m35.539s

This makes use of the page cache, but fsync()s at the end (network traffic from the client stops at that point, so the OSDs should have the data).

When I read the same file back:

$ time fio --name=job1 --rw=read --blocksize=1MB --size=30G
[...]
   READ: io=30720MB, aggrb=693854KB/s, minb=693854KB/s, maxb=693854KB/s, mint=45337msec, maxt=45337msec
real    0m45.627s

it takes 10 seconds longer. Why?

When writing data to a Ceph storage cluster, the data is written twice: unbuffered to the journal and buffered to the backing filesystem [1]. Reading, on the other hand, should be much faster: it needs only a single operation, the data should already be in the page cache of the OSDs (I'm reading the same file I've just written, and the OSDs have plenty of RAM), and reading from disk is generally faster than writing.

Any idea what is going on in the background that makes reads more expensive than writes? I've run these tests multiple times with fairly consistent results.

Cluster config:
- Ceph jewel, 3 nodes with 256GB RAM and 25 disks each (HDDs only, journal on the same disk)
- Pool with size=1 and 2048 PGs; CephFS stripe unit: 1MB, stripe count: 10, object size: 10MB
- 10 GbE, separate frontend and backend networks

[1] https://www.sebastien-han.fr/blog/2014/02/17/ceph-io-patterns-the-bad/

Thanks,
Andreas
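
P.S. To rule out client-side page-cache and readahead effects, my rough plan is to repeat the read after dropping the client's page cache, and once more with direct I/O. This is only a sketch I haven't run yet; the job name is just a placeholder:

# drop the client's page cache so the read really has to go over the network
$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# re-read the file with the client page cache bypassed entirely
$ time fio --name=job1 --rw=read --blocksize=1MB --size=30G --direct=1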
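
P.P.S. To separate the CephFS client from the OSD side, I also intend to benchmark the pool directly with rados bench. Again only a sketch; "cephfs_data" stands in for our actual data pool name:

# write 1MB objects for 30 seconds and keep them around for the read test
$ rados bench -p cephfs_data 30 write -b 1048576 --no-cleanup

# sequential read of the objects just written
$ rados bench -p cephfs_data 30 seq

# remove the benchmark objects afterwards
$ rados -p cephfs_data cleanup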