Hi, folks.
I'm running a series of experiments and tests with CephFS and have run into a behavior I can't seem to get much control over.
I set up a 5-node Ceph cluster running on enterprise servers. Each server has 10 x 6TB HDDs and 2 x 800GB SSDs. I configured the SSDs as a RAID-1 device for journaling and, for the sake of comparison, also set aside two of the HDDs for the same purpose. The remaining 8 HDDs are configured as OSDs. The servers have 196GB of RAM, and our private network is backed by a 40Gb/s Brocade switch (the frontend is 10Gb/s).
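For reference, the journal placement itself is nothing exotic: each OSD just points its journal at a partition on the journal device via the standard osd journal option, roughly like this (device names are placeholders, not my exact layout):

[osd.0]
osd journal = /dev/md0p1          ; partition on the SSD RAID-1 mirror
osd journal size = 10240          ; in MB; only relevant when the journal is a file

The HDD-journaled OSDs are set up the same way, just pointing at a partition on one of the two HDDs reserved for that purpose.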
When benchmarking the HDDs directly, here's the performance I get:
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/deleteme bs=10G count=1 oflag=direct &
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB) copied, 11.684 s, 184 MB/s
For read performance:
dd if=/var/lib/ceph/osd/ceph-0/deleteme of=/dev/null bs=10G count=1 iflag=direct &
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB) copied, 8.30168 s, 259 MB/s
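As an aside, dd caps a single block at just under 2 GiB, which is why only 2.1 GB was actually copied despite bs=10G (hence the "0+1 records" above). If anyone wants to reproduce the raw-disk test over the full 10GB, a smaller block size with more counts should do it, e.g.:

# 10GB written as 2560 x 4MB blocks, still bypassing the page cache
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/deleteme bs=4M count=2560 oflag=direct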
Now, when I benchmark the OSDs configured with HDD-based journaling, here's what I get:
[root@cephnode1 ceph-cluster]# ceph tell osd.1 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": 426840870.000000
}
which looks coherent. If I switch to the SSD-based journal, here's the new figure:
[root@cephnode1 ~]# ceph tell osd.1 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": 805229549.000000
}
which, again, looks as expected to me.
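In case it's relevant, both bench runs above used the defaults (1 GB written in 4 MB blocks, as the output shows). I believe the bench also accepts an explicit total size and block size, so it could be rerun with small writes to stress the journal specifically, e.g.:

# syntax: ceph tell osd.N bench [TOTAL_BYTES] [BLOCK_SIZE_BYTES]
# here: 1 GB total in 4 KB writes
ceph tell osd.1 bench 1073741824 4096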
Finally, when I run the rados bench, here's what I get:
rados bench -p cephfs_data 300 write --no-cleanup && rados bench -p cephfs_data 300 seq
Total time run: 300.345098
Total writes made: 48327
Write size: 4194304
Bandwidth (MB/sec): 643.620
Stddev Bandwidth: 114.222
Max bandwidth (MB/sec): 1196
Min bandwidth (MB/sec): 0
Average Latency: 0.0994289
Stddev Latency: 0.112926
Max latency: 1.85983
Min latency: 0.0139412
----------------------------------------
Total time run: 300.121930
Total reads made: 31990
Read size: 4194304
Bandwidth (MB/sec): 426.360
Average Latency: 0.149346
Max latency: 1.77489
Min latency: 0.00382452
I configured the cluster to keep 3 copies of the data (replication size 3), so these numbers fall within my expectations.

So far so good, but here comes the issue: I configured CephFS and mounted a share locally on one of my servers. When I write data to it, it shows abnormally high performance for about the first 5 seconds, stalls for about 20 seconds, and then picks up again. For long-running tests, the observed write throughput is very close to what rados bench reported (about 640 MB/s), but for short-lived tests I see peaks of over 5GB/s. I know journaling is expected to cause spiky performance patterns like that, but not to this degree, which makes me think that CephFS is buffering my writes and returning control to the client before persisting them to the journal, which looks undesirable.
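To illustrate what I mean: if my theory is right, the client-side variants below should all avoid the initial burst by forcing data out to the cluster before returning (the mount point, monitor address and secret file are placeholders; I haven't verified every one of them on my setup):

# make dd flush everything to the cluster before it reports throughput
dd if=/dev/zero of=/mnt/cephfs/deleteme bs=4M count=2560 conv=fdatasync

# bypass the client page cache entirely with direct I/O
dd if=/dev/zero of=/mnt/cephfs/deleteme bs=4M count=2560 oflag=direct

# or mount the share with synchronous writes (kernel client)
mount -t ceph cephnode1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret,sync

What I'd really like, though, is a way to get that behavior by default rather than relying on every client to mount with sync or use direct I/O.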
I spent a couple of days searching the web for ways to disable this apparent write buffering, but couldn't find anything. So here is my question: how can I disable it?
Thanks and regards,
F. Lucchese