Hi again,

I've fiddled around a lot with journal settings, so to make sure I'm comparing apples to apples, I went back and systematically re-ran the benchmark tests I've been running (and some more). A long data dump follows, but the end result is that it does look like something fishy is going on for small file sizes. For example, the performance difference between 4MB and 4KB objects in the rados write benchmark is a factor of 25 or more. Here are the details, with a recap of the configuration at the end.

I started out by remaking the underlying xfs filesystems on the OSD hosts, and then rerunning mkcephfs. The journals are 120 GB SSDs.

First, the rsync tests again:

* Rsync of a ~60 GB directory tree (mostly small files) from the ceph client to the mounted cephfs goes at about 5.2 MB/s.

* I then turned off ceph (service ceph -a stop) and did the same rsync between the same two hosts, onto the same RAID array on one of the OSD hosts, but over ssh this time. This time it goes at about 37 MB/s.

This implies to me that the slowdown is somewhere in ceph, not in the RAID array or the network connectivity.

I then remade the xfs filesystems again, re-ran mkcephfs, restarted ceph, and did some rados benchmarks:

* rados bench -p pbench 900 write -t 256 -b 4096

  Total time run:         900.184096
  Total writes made:      1052511
  Write size:             4096
  Bandwidth (MB/sec):     4.567
  Stddev Bandwidth:       4.34241
  Max bandwidth (MB/sec): 23.1719
  Min bandwidth (MB/sec): 0
  Average Latency:        0.218949
  Stddev Latency:         0.566181
  Max latency:            9.92952
  Min latency:            0.001449

* rados bench -p pbench 900 write -t 256   (default 4 MB object size)

  Total time run:         900.816140
  Total writes made:      25263
  Write size:             4194304
  Bandwidth (MB/sec):     112.178
  Stddev Bandwidth:       27.1239
  Max bandwidth (MB/sec): 840
  Min bandwidth (MB/sec): 0
  Average Latency:        9.08281
  Stddev Latency:         0.505372
  Max latency:            9.31865
  Min latency:            0.818949

I repeated each of these benchmarks three times and saw similar results each time: a factor of 25 or more in speed between small and large object sizes (4.567 vs. 112.178 MB/s in the runs above).

Next, I stopped ceph and took a look at local RAID performance as a function of file size using "iozone":

http://ayesha.phys.virginia.edu/~bryan/iozone-write-local-raid.pdf

Then I re-made the ceph filesystem, restarted ceph, and used iozone on the ceph client to look at the mounted ceph filesystem:

http://ayesha.phys.virginia.edu/~bryan/iozone-write-cephfs.pdf

I'm not sure how to interpret the iozone performance numbers in detail, but the distribution certainly looks much less uniform across different file and chunk sizes for the mounted ceph filesystem.
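In case anyone wants to reproduce the iozone plots: I don't have the exact command line in my notes, but a write-only sweep of this general form produces the same kind of file-size vs. record-size surface (the test-file path and size cap here are placeholders, not necessarily what I used):

  # Auto-mode sweep over file and record sizes, write/rewrite tests
  # only (-i 0); -R and -b dump an Excel-style table for plotting.
  iozone -a -i 0 -g 1g -R -b iozone-write.xls -f /mnt/ceph/iozone.tmp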
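Finally, I took a look at the results of bonnie++ benchmarks for I/O directly to the RAID array, or to the mounted ceph filesystem. The invocations were along the lines of the following (I'm reconstructing the flags, so treat them as approximate; the -d directories in particular are placeholders, and -s was left at its default of twice RAM, which is why the two runs show different file sizes):

  # On one of the OSD hosts, writing directly to the RAID array:
  bonnie++ -d /data/osd.0 -n 16 -m "RAID on OSD"

  # On the ceph client, writing to the mounted ceph filesystem:
  bonnie++ -d /mnt/ceph -n 16 -m "cephfs, client"

(-n 16 corresponds to the "files 16" column in the output below; -m just sets the machine label.)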
* Looking at the RAID array from one of the OSD hosts:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RAID on OSD  23800M  1155  99 318264  26 132959  19  2884  99 293464  20 535.4  23
Latency              7354us   30955us     129ms    8220us     119ms   62188us

Version  1.96       ------Sequential Create------ --------Random Create--------
RAID on OSD         -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 17680  58 +++++ +++ 26994  78 24715  81 +++++ +++ 26597  78
Latency               113us     105us     153us     109us      15us      94us

* Looking at the mounted ceph filesystem from the ceph client:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
cephfs, client  16G  1101  95 114623   8  45713   2  2665  98 133537   3 882.0  14
Latency             44515us   37018us    6437ms   12747us     469ms   60004us

Version  1.96       ------Sequential Create------ --------Random Create--------
cephfs, client      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   653   3 19886   9   601   3   746   3 +++++ +++   585   2
Latency              1171ms    7467us     174ms     104ms      19us     228ms

This seems to show about a factor of 3 difference in speed between writing to the mounted ceph filesystem and writing directly to the RAID array (e.g. 114623 vs. 318264 K/sec for sequential block output).

While I was doing these benchmarks, I kept an eye on the OSDs and MDSs with collectl and atop, but I didn't see anything that looked like an obvious problem. The MDSs didn't show very high CPU, I/O or memory usage, for example.

Finally, to recap the configuration:

* 3 MDS hosts
* 4 OSD hosts, each with a RAID array for object storage and an SSD journal
* xfs filesystems for the object stores
* gigabit network on the front end, and a separate back-end gigabit network for the ceph hosts
* 64-bit CentOS 6.3 and ceph 0.48.2 everywhere
* ceph servers running the stock CentOS 2.6.32-279.9.1 kernel; client running an "elrepo" 3.5.4-1 kernel

Bryan

-- 
========================================================================
Bryan Wright              |"If you take cranberries and stew them like
Physics Department        | applesauce, they taste much more like prunes
University of Virginia    | than rhubarb does."  -- Groucho
Charlottesville, VA 22901 |
(434) 924-7218            | bryan@xxxxxxxxxxxx
========================================================================