Hi again,

A few answers to questions from various people on the list after my last
e-mail:

greg@xxxxxxxxxxx said:
> Yes. Bryan, you mentioned that you didn't see a lot of resource usage --
> was it perhaps flatlined at (100 * 1 / num_cpus)? The MDS is
> multi-threaded in theory, but in practice it has the equivalent of a Big
> Kernel Lock, so it's not going to get much past one cpu core of time...

The CPU usage on the MDSs hovered around a few percent.  They're quad-core
machines, and while watching with atop I never saw usage get as high as
25% on any of the cores.

greg@xxxxxxxxxxx said:
> The rados bench results do indicate some pretty bad small-file write
> performance as well, though, so I guess it's possible your testing is
> running long enough that the page cache isn't absorbing that hit.  Did
> performance start out higher or has it been flat?

Looking at the details of the rados benchmark output, performance does
start out better for the first few iterations and then goes bad.  Here's
the beginning of a typical small-file run:

 Maintaining 256 concurrent writes of 4096 bytes for at least 900 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1     255      3683      3428   13.3894   13.3906  0.002569 0.0696906
     2     256      7561      7305   14.2661   15.1445  0.106437 0.0669534
     3     256     10408     10152   13.2173   11.1211  0.002176 0.0689543
     4     256     11256     11000    10.741    3.3125  0.002097 0.0846414
     5     256     11256     11000    8.5928         0         - 0.0846414
     6     256     11370     11114   7.23489  0.222656  0.002399 0.0962989
     7     255     12480     12225   6.82126   4.33984  0.117658  0.142335
     8     256     13289     13033   6.36311   3.15625  0.002574  0.151261
     9     256     13737     13481   5.85051      1.75  0.120657  0.158865
    10     256     14341     14085   5.50138   2.35938  0.022544  0.178298

I see the same behavior every time I repeat the small-file rados
benchmark.  Here's a graph showing the first 100 "cur MB/s" values for a
small-file benchmark:

http://ayesha.phys.virginia.edu/~bryan/rados-bench-t256-b4096-run1-09282012-curmbps.pdf

On the other hand, with 4MB files I see results that start out like this:

 Maintaining 256 concurrent writes of 4194304 bytes for at least 900 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      49        49         0         0         0         -         0
     2      76        76         0         0         0         -         0
     3     105       105         0         0         0         -         0
     4     133       133         0         0         0         -         0
     5     159       159         0         0         0         -         0
     6     188       188         0         0         0         -         0
     7     218       218         0         0         0         -         0
     8     246       246         0         0         0         -         0
     9     256       274        18   7.99904         8   8.97759   8.66218
    10     255       301        46   18.3978       112    9.1456   8.94095
    11     255       330        75   27.2695       116   9.06968     9.013
    12     255       358       103   34.3292       112   9.12486   9.04374

Here's a graph showing the first 100 "cur MB/s" values for a typical 4MB
file benchmark:

http://ayesha.phys.virginia.edu/~bryan/rados-bench-t256-b4MB-run1-09282012-curmbps.pdf

mark.nelson@xxxxxxxxxxx said:
> When you were doing this, what kind of results did collectl give you for
> average write sizes to the underlying OSD disks?

While running the small-file rados benchmark, the average "rwsize"
reported by collectl hovered around 6, plus or minus a few (in whatever
units collectl reports), for the RAID array, and around 15 for the
journal SSD.
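For reference, the runs above came from invocations along these lines --
the pool name is just an example, and I'm quoting the collectl flags from
memory, so treat this as a sketch rather than gospel:

  # Small-file run: 4 KB objects, 256 concurrent ops, 900 seconds:
  rados -p testpool bench 900 write -t 256 -b 4096

  # Large-file run: 4 MB objects (4194304 bytes), same concurrency:
  rados -p testpool bench 900 write -t 256 -b 4194304

  # Meanwhile, on each OSD host, collectl's per-disk detail view
  # (this is where the "rwsize" and "pct util" columns come from):
  collectl -sD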
Here's a screenshot showing atop running on each of the MDS hosts, and
collectl running on each of the OSD hosts, while the benchmark was
running:

http://ayesha.phys.virginia.edu/~bryan/collectl-atop-t256-b4096.png

Here's the same, but with collectl running on the MDSs instead of atop:

http://ayesha.phys.virginia.edu/~bryan/collectl-collectl-t256-b4096.png

Looking at the last screenshot again, it does look like the disks on the
MDSs are getting some exercise, at ~40% utilization (if I'm interpreting
the collectl output correctly).

Here's a similar snapshot for the 4MB test:

http://ayesha.phys.virginia.edu/~bryan/collectl-collectl-t256-b4MB.png

It shows similar "pct util" on the MDS disks, but much higher average
rwsize values on the OSDs.

mark.nelson@xxxxxxxxxxx said:
> There's multiple issues potentially here.  Part of it might be how
> writes are coalesced by XFS in each scenario.  Part of it might also be
> overhead due to XFS metadata reads/writes.  You could probably get a
> better idea of both of these by running blktrace during the tests and
> making seekwatcher movies of the results.  You not only can look at the
> numbers of seeks, but also the kind (read/writes) and where on the disk
> they are going.  That, and some of the raw blktrace data, can give you a
> lot of information about what is going on and whether or not seeks
> are [...]

I'll take a look at blktrace and see what I can find out (a rough sketch
of what I have in mind is at the end of this mail).

mark.nelson@xxxxxxxxxxx said:
> Beyond that, I do think you are correct in suspecting that there are
> some Ceph limitations as well.  Some things that may be interesting to
> try:
>
> - 1 OSD per disk
> - Multiple OSDs on the RAID array
> - Increasing various thread counts
> - Increasing various op and byte limits (such as
>   journal_max_write_entries and journal_max_write_bytes)
> - EXT4 or BTRFS under the OSDs

And I'll give some of these a try (for the journal limits, see the
ceph.conf sketch at the end of this mail).

Regarding the iozone benchmarks:

mark.nelson@xxxxxxxxxxx said:
> Do you happen to have the settings you used when you ran these tests?
> I probably don't have time to try to repeat them now, but I can at
> least take a quick look at them.
>
> I'm slightly confused by the labels on the graph.  They can't possibly
> mean that 2^16384 KB record sizes were tested.  Was that just up to
> 16MB records and 16GB files?  That would make a lot more sense.

I just did something like this, with the cephfs mounted at /mnt/tmp:

  cd /mnt/tmp
  iozone -a > /tmp/iozone.log

By default, iozone does its tests in the current working directory.  The
graphs were just produced with the Generate_Graphs script that comes with
iozone.  There are certainly some problems with the axis labeling, but I
think your interpretation is correct.

mark.nelson@xxxxxxxxxxx said:
> This might be a dumb question, but was the ceph version of this test on
> a single client on gigabit Ethernet?  If so, wouldn't that be the
> reason you are maxing out at like 114MB/s?

Duh.  You're exactly right.  I should have noticed this.

And finally:

tv@xxxxxxxxxxx said:
> If you want to benchmark just the metadata part, rsync with 0-size
> files might actually be an interesting workload.

I'll see if I can work out a way to do this (a first attempt is sketched
below).
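For the 0-size-file rsync test, maybe something like this (the paths and
file count are just placeholders):

  # Create a tree of 100,000 empty files, then rsync it onto the cephfs
  # mount; with 0-byte files nearly all of the work is metadata.
  mkdir -p /tmp/zerofiles
  for i in $(seq 1 100000); do : > /tmp/zerofiles/file.$i; done
  time rsync -a /tmp/zerofiles/ /mnt/tmp/zerofiles/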
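And for the blktrace/seekwatcher idea mentioned above, the plan is
roughly the following (the device name is an example, and I'm writing
this from memory, so the flags may need checking):

  # Trace the block device under one OSD for 60 seconds while a
  # benchmark is running (writes files named osd-trace.blktrace.*):
  blktrace -d /dev/sdb -o osd-trace -w 60

  # Turn the trace into a seek/throughput graph, or a movie:
  seekwatcher -t osd-trace -o osd-trace.png
  seekwatcher -t osd-trace -o osd-trace.mpg --movie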
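And here's what I take Mark's journal-limit suggestion to look like in
ceph.conf; the values below are just guesses to experiment with, not
recommendations:

  [osd]
      ; Raise the per-write-cycle limits on the journal; if I'm reading
      ; the docs right, the defaults are considerably smaller.
      journal max write entries = 1000
      journal max write bytes   = 104857600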
Thanks to everyone for the suggestions.

Bryan

--
========================================================================
Bryan Wright              |"If you take cranberries and stew them like
Physics Department        | applesauce, they taste much more like prunes
University of Virginia    | than rhubarb does."
                          |                            -- Groucho
Charlottesville, VA 22901 | (434) 924-7218 | bryan@xxxxxxxxxxxx
========================================================================