On Tue, Jul 19, 2016 at 3:25 PM, Fabiano de O. Lucchese <flucchese@xxxxxxxxx> wrote:
> Hi, folks.
>
> I'm conducting a series of experiments and tests with CephFS and have been
> facing a behavior over which I can't seem to have much control.
>
> I configured a 5-node Ceph cluster running on enterprise servers. Each
> server has 10 x 6TB HDDs and 2 x 800GB SSDs. I configured the SSDs as a
> RAID-1 device for journaling and also two of the HDDs for the same purpose
> for the sake of comparison. All other 8 HDDs are configured as OSDs. The
> servers have 196GB of RAM and our private network is backed by a 40GB/s
> Brocade switch (frontend is 10Gb/s).
>
> When benchmarking the HDDs directly, here's the performance I get:
>
> dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/deleteme bs=10G count=1
> oflag=direct &
>
> 0+1 records in
> 0+1 records out
> 2147479552 bytes (2.1 GB) copied, 11.684 s, 184 MB/s
>
> For read performance:
>
> dd if=/var/lib/ceph/osd/ceph-0/deleteme of=/dev/null bs=10G count=1
> iflag=direct &
>
> 0+1 records in
> 0+1 records out
> 2147479552 bytes (2.1 GB) copied, 8.30168 s, 259 MB/s
>
> Now, when I benchmark the OSDs configured with HDD-based journaling, here's
> what I get:
>
> [root@cephnode1 ceph-cluster]# ceph tell osd.1 bench
>
> {
>     "bytes_written": 1073741824,
>     "blocksize": 4194304,
>     "bytes_per_sec": 426840870.000000
> }
>
> which looks coherent. If I switch to the SSD-based journal, here's the new
> figure:
>
> [root@cephnode1 ~]# ceph tell osd.1 bench
> {
>     "bytes_written": 1073741824,
>     "blocksize": 4194304,
>     "bytes_per_sec": 805229549.000000
> }
>
> which, again, looks as expected to me.
>
> Finally, when I run the rados bench, here's what I get:
>
> rados bench -p cephfs_data 300 write --no-cleanup && rados bench -p
> cephfs_data 300 seq
>
> Total time run: 300.345098
> Total writes made: 48327
> Write size: 4194304
> Bandwidth (MB/sec): 643.620
>
> Stddev Bandwidth: 114.222
> Max bandwidth (MB/sec): 1196
> Min bandwidth (MB/sec): 0
> Average Latency: 0.0994289
> Stddev Latency: 0.112926
> Max latency: 1.85983
> Min latency: 0.0139412
>
> ----------------------------------------
>
> Total time run: 300.121930
> Total reads made: 31990
> Read size: 4194304
> Bandwidth (MB/sec): 426.360
>
> Average Latency: 0.149346
> Max latency: 1.77489
> Min latency: 0.00382452
>
> I configured the cluster to replicate data twice (3 copies), so these
> numbers fall within my expectations. So far so good, but here comes the
> issue: I configured CephFS and mounted a share locally on one of my servers.
> When I write data to it, it shows abnormally high performance at the
> beginning for about 5 seconds, stalls for about 20 seconds and then picks up
> again. For long-running tests, the observed write throughput is very close
> to what the rados bench provided (about 640 MB/s), but for short-lived
> tests, I get peak performance of over 5GB/s. I know that journaling is
> expected to cause spiky performance patterns like that, but not to this
> level, which makes me think that CephFS is buffering my writes and returning
> control to the client before persisting them to the journal, which looks
> undesirable.

If you want to bypass caching on any filesystem, use the O_DIRECT
flag when opening a file. You don't say exactly what your benchmark
is, but presumably it makes too few fsync calls, so you're not
actually waiting for things to persist?

John

> I searched the web for a couple of days looking for ways to disable this
> apparent write buffering, but couldn't find anything. So here comes my
> question: how can I disable it?
>
> Thanks and regards,
>
> F. Lucchese
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
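
To make the fsync point in the reply concrete, here is a minimal sketch (not the original poster's benchmark, which was never shown) that times the same sequential write twice: once stopping the clock when write() returns, and once after os.fsync() has returned. The function name timed_write, the 4 MiB block size and the temporary target path are assumptions chosen for illustration; to test CephFS, the target would need to point at a file on the mount.

```python
import os
import tempfile
import time


def timed_write(path, total_bytes, block_size=4 * 1024 * 1024, sync=False):
    """Write total_bytes of zeros to path and return throughput in MB/s.

    With sync=False the timer stops as soon as write() has returned, which
    may only mean the data reached a cache on the client. With sync=True
    the timer also covers os.fsync(), so it includes the time needed to
    flush the data to stable storage.
    """
    block = bytes(block_size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        start = time.perf_counter()
        written = 0
        while written < total_bytes:
            # Never write past total_bytes; os.write may also write less
            # than requested, so advance by what it actually reports.
            chunk = block[: min(block_size, total_bytes - written)]
            written += os.write(fd, chunk)
        if sync:
            os.fsync(fd)
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.close(fd)
    return written / elapsed / 1e6


if __name__ == "__main__":
    # Hypothetical target: replace with a file on the CephFS mount.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "deleteme")
        size = 64 * 1024 * 1024
        print("no fsync: %.0f MB/s" % timed_write(target, size))
        print("fsync:    %.0f MB/s" % timed_write(target, size, sync=True))
```

If the figure without fsync is far above the one with fsync, the short-lived peaks are most likely measuring the client cache rather than the cluster. With GNU dd, a comparable check is to add conv=fsync, or oflag=direct as in the raw HDD test quoted above.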