Re: Slow ceph fs performance

Gregory Farnum <greg@xxxxxxxxxxx> · Wed, 3 Oct 2012 11:35:42 -0700

I think I'm with Mark now — this does indeed look like too much random
IO for the disks to handle. In particular, Ceph requires that each
write be synced to disk before it's considered complete, which rsync
definitely doesn't. In the filesystem this is generally disguised
fairly well by all the caches and such in the way, but this use case
is unfriendly to that arrangement.

However, I am particularly struck by seeing one of your OSDs at 96%
disk utilization while the others remain <50%, and I've just realized
we never saw output from ceph -s. Can you provide that, please?
-Greg

On Wed, Oct 3, 2012 at 7:55 AM, Bryan K. Wright
<bkw1a@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi again,
>
>         A few answers to questions from various people on the list
> after my last e-mail:
>
> greg@xxxxxxxxxxx said:
>> Yes. Bryan, you mentioned that you didn't see a lot of resource usage — was it
>> perhaps flatlined at (100 * 1 / num_cpus)? The MDS is multi-threaded in
>> theory, but in practice it has the equivalent of a Big Kernel Lock so it's not
>> going to get much past one cpu core of time...
>
>         The CPU usage on the MDSs hovered around a few percent.
> They're quad-core machines, and I didn't see it ever get as high
> as 25% usage on any of the cores while watching with atop.
>
> greg@xxxxxxxxxxx said:
>> The rados bench results do indicate some pretty bad small-file write
>> performance as well though, so I guess it's possible your testing is running
>> long enough that the page cache isn't absorbing that hit. Did performance
>> start out higher or has it been flat?
>
>         Looking at the details of the rados benchmark output, it does
> look like performance starts out better for the first few iterations,
> and then goes bad.  Here's the begining of a typical small-file run:
>
>  Maintaining 256 concurrent writes of 4096 bytes for at least 900 seconds.
>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>      0       0         0         0         0         0         -         0
>      1     255      3683      3428   13.3894   13.3906  0.002569 0.0696906
>      2     256      7561      7305   14.2661   15.1445  0.106437 0.0669534
>      3     256     10408     10152   13.2173   11.1211  0.002176 0.0689543
>      4     256     11256     11000    10.741    3.3125  0.002097 0.0846414
>      5     256     11256     11000    8.5928         0         - 0.0846414
>      6     256     11370     11114   7.23489  0.222656  0.002399 0.0962989
>      7     255     12480     12225   6.82126   4.33984  0.117658  0.142335
>      8     256     13289     13033   6.36311   3.15625  0.002574  0.151261
>      9     256     13737     13481   5.85051      1.75  0.120657  0.158865
>     10     256     14341     14085   5.50138   2.35938  0.022544  0.178298
>
> I see the same behavior every time I repeat the small-file
> rados benchmark.  Here's a graph showing the first 100 "cur MB/s" values
> for a short-file benchmark:
>
> http://ayesha.phys.virginia.edu/~bryan/rados-bench-t256-b4096-run1-09282012-curmbps.pdf
>
>         On the other hand, with 4MB files, I see results that start out like
> this:
>
>  Maintaining 256 concurrent writes of 4194304 bytes for at least 900 seconds.
>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>      0       0         0         0         0         0         -         0
>      1      49        49         0         0         0         -         0
>      2      76        76         0         0         0         -         0
>      3     105       105         0         0         0         -         0
>      4     133       133         0         0         0         -         0
>      5     159       159         0         0         0         -         0
>      6     188       188         0         0         0         -         0
>      7     218       218         0         0         0         -         0
>      8     246       246         0         0         0         -         0
>      9     256       274        18   7.99904         8   8.97759   8.66218
>     10     255       301        46   18.3978       112    9.1456   8.94095
>     11     255       330        75   27.2695       116   9.06968     9.013
>     12     255       358       103   34.3292       112   9.12486   9.04374
>
> Here's a graph showing the first 100 "cur MB/s" values for a typical
> 4MB file benchmark:
>
> http://ayesha.phys.virginia.edu/~bryan/rados-bench-t256-b4MB-run1-09282012-curmbps.pdf
>
> mark.nelson@xxxxxxxxxxx said:
>> When you were doing this, what kind of results did collectl give you for
>> average write sizes to the underlying OSD disks?
>
>         The average "rwsize" reported by collectl hovered around
> 6 +/- a few (in whatever units collectl reports) for the RAID
> array, and around 15 for the journal SSD, while doing the small-file
> rados benchmark.  Here's a screenshot showing atop running on
> each of the MDS hosts, and collectl running on each of the OSD
> hosts, while the benchmark was running:
>
> http://ayesha.phys.virginia.edu/~bryan/collectl-atop-t256-b4096.png
>
> Here's the same, but with collectl running on the MDSs instead of atop:
>
> http://ayesha.phys.virginia.edu/~bryan/collectl-collectl-t256-b4096.png
>
> Looking at the last screenshot again, it does look like the disks on
> the MDSs are getting some exercise, with ~40% utilization (if I'm
> interpreting the collectl output correctly).
>
> Here's a similar snapshot for the 4MB test:
>
> http://ayesha.phys.virginia.edu/~bryan/collectl-collectl-t256-b4MB.png
>
> It looks like similar "pct util" on the MDS disks, but much higher
> average rwsize values on the OSDs.
>
> mark.nelson@xxxxxxxxxxx said:
>> There's multiple issues potentially here.  Part of it might be how  writes are
>> coalesced by XFS in each scenario.  Part of it might also be  overhead due to
>> XFS metadata reads/writes.  You could probably get a  better idea of both of
>> these by running blktrace during the tests and  making seekwatcher movies of
>> the results.  You not only can look at the  numbers of seeks, but also the
>> kind (read/writes) and where on the disk  they are going.  That, and some of
>> the raw blktrace data can give you a  lot of information about what is going
>> on and whether or not seeks are
>
>         I'll take a look at blktrace and see what I can find out.
>
> mark.nelson@xxxxxxxxxxx said:
>> Beyond that, I do think you are correct in suspecting that there are  some
>> Ceph limitations as well.  Some things that may be interesting to try:
>
>> - 1 OSD per Disk - Multiple OSDs on the RAID array. - Increasing various
>> thread counts - Increasing various op and byte limits (such as
>> journal_max_write_entries and journal_max_write_bytes). - EXT4 or BTRFS under
>> the OSDs.
>
>         And I'll give some of these a try.
>
>         Regarding the iozone benchmarks:
> mark.nelson@xxxxxxxxxxx said:
>> Do you happen to have the settings you used when you ran these tests?  I
>> probably don't have time to try to repeat them now, but I can at least  take a
>> quick look at them.
>> I'm slightly confused by the labels on the graph.  They can't possibly  mean
>> that 2^16384 KB record sizes were tested.  Was that just up to 16MB  records
>> and 16GB files?  That would make a lot more sense.
>
> I just did something like:
>
>         cd /mnt/tmp (where the cephfs was mounted)
>         iozone -a > /tmp/iozone.log
>
> By default, iozone does its tests in the current working directory.
> The graphs were just produced with the Generate_Graphs script
> that comes with iozone.  There are certainly some problems with
> the axis labeling, but I think your interpretation is correct.
>
> mark.nelson@xxxxxxxxxxx said:
>> This might be a dumb question, but was the ceph version of this test on  a
>> single client on gigabit Ethernet?  If so, wouldn't that be the reason  you
>> are maxing out at like 114MB/s?
>
>         Duh.  You're exactly right.  I should have noticed this.
>
>         And finally:
> tv@xxxxxxxxxxx said:
>> If you want to benchmark just the metadata part, rsync with 0-size files might
>> actually be an interesting workload.
>
>         I'll see if I can work out a way to do this.
>
>                         Thanks to everyone for the suggestions.
>                         Bryan
> --
> ========================================================================
> Bryan Wright              |"If you take cranberries and stew them like
> Physics Department        | applesauce, they taste much more like prunes
> University of Virginia    | than rhubarb does."  --  Groucho
> Charlottesville, VA  22901|
> (434) 924-7218            |         bryan@xxxxxxxxxxxx
> ========================================================================
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html