On Wed, Jan 27, 2016 at 9:01 AM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
>
>> -----Original Message-----
>> From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
>> Sent: Tuesday, January 26, 2016 1:33 PM
>> To: Deneau, Tom
>> Cc: ceph-devel@xxxxxxxxxxxxxxx; cbt@xxxxxxxxxxxxxx
>> Subject: Re: [Cbt] rados benchmark, even distribution vs. random distribution
>>
>> On Tue, Jan 26, 2016 at 11:11 AM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
>> > Looking for some help with a Ceph performance puzzler...
>> >
>> > Background:
>> > I had been experimenting with running rados benchmarks with a more
>> > controlled distribution of objects across osds. The main reason was
>> > to reduce run-to-run variability, since I was running on fairly small
>> > clusters.
>> >
>> > I modified rados bench itself to optionally take a file with a list of
>> > names and use those instead of the usual randomly generated names that
>> > rados bench uses, "benchmark_data_hostname_pid_objectxxx". It was
>> > then possible to generate and use a list of names which hashed to an
>> > "even distribution" across osds if desired. The list of names was
>> > generated by finding a set of pgs that gave an even distribution and
>> > then generating names that mapped to those pgs. So, for example, with
>> > 5 osds on a replicated=2 pool we might find 5 pgs mapping to [0,2]
>> > [1,4] [2,3] [3,1] [4,0], and then all the names generated would map
>> > to only those 5 pgs.
>> >
>> > Up until recently, the results from these experiments were what I expected:
>> > * better repeatability across runs
>> > * even disk utilization
>> > * generally higher total bandwidth
>> >
>> > Recently, however, I saw results on one platform that had much lower
>> > bandwidth (less than half) with the "even distribution" run compared
>> > to the default random-distribution run. Some notes:
>> >
>> > * Showed up in an erasure-coded pool with k=2, m=1, with 4M-size
>> >   objects. Did not show the discrepancy on a replicated=2 pool on
>> >   the same cluster.
>> >
>> > * In general, larger objects showed the problem more than smaller
>> >   objects.
>> >
>> > * Showed up only on reads from the erasure pool. Writes to the
>> >   same pool had higher bandwidth with the "even distribution".
>> >
>> > * Showed up on only one cluster; a separate cluster with the same
>> >   number of nodes and disks but a different system architecture did
>> >   not show this.
>> >
>> > * Showed up only when the total number of client threads got "high
>> >   enough". For example, it showed up with 64 total client threads but
>> >   not with 16. The distribution of threads across client processes
>> >   did not seem to matter.
>> >
>> > I tried looking at "dump_historic_ops" and did indeed see some read
>> > ops logged with high latency in the "even distribution" case. The
>> > larger elapsed times in the historic ops were always in the
>> > "reached_pg" and "done" steps. But I saw similarly high latencies and
>> > elapsed times for "reached_pg" and "done" for historic read ops in the
>> > random case.
>> >
>> > I have perf counters captured before and after the read tests. I see
>> > big differences in op_r_out_bytes, which makes sense because the
>> > higher-bandwidth run processed more bytes. For some osds,
>> > op_r_latency/sum is slightly higher in the "even" run, but I am not
>> > sure whether that is significant.
>> >
>> > Anyway, I will probably just stop doing these "even distribution"
>> > runs, but I was hoping to get an understanding of why they might have
>> > such reduced bandwidth in this particular case. Is there something
>> > about mapping to a smaller number of pgs that becomes a bottleneck?
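
The name-generation step Tom describes above can be approximated with a short
script: pick a set of target pgs, then brute-force candidate object names and
keep the ones that "ceph osd map" reports as landing in those pgs. The sketch
below is an illustrative reconstruction, not Tom's actual rados bench change;
the pool name, target pg ids, name prefix, and counts are hypothetical
placeholders, and it assumes a working ceph CLI with client credentials.

  #!/usr/bin/env python
  # Illustrative sketch only (not the actual rados bench patch): collect
  # object names whose pg falls in a chosen "target" set, by asking
  # "ceph osd map" where each candidate name lands. Pool name, target pg
  # ids, name prefix, and counts are hypothetical placeholders.
  import re
  import subprocess

  POOL = "testpool"                     # hypothetical pool
  TARGET_PGS = {"1.0", "1.3", "1.7"}    # hypothetical pgs chosen for even osd coverage
  NAMES_PER_PG = 100                    # names to collect for each target pg

  def pg_of(pool, name):
      # "ceph osd map <pool> <obj>" prints e.g.
      #   ... object 'x' -> pg 1.6b8b4567 (1.7) -> up ([2,0], p2) acting ([2,0], p2)
      out = subprocess.check_output(["ceph", "osd", "map", pool, name]).decode()
      return re.search(r"-> pg \S+ \((\S+)\)", out).group(1)

  def generate_names(pool, target_pgs, per_pg):
      found = dict((pg, []) for pg in target_pgs)
      i = 0
      while any(len(v) < per_pg for v in found.values()):
          name = "evenobj_%d" % i       # hypothetical name pattern
          i += 1
          pg = pg_of(pool, name)
          if pg in found and len(found[pg]) < per_pg:
              found[pg].append(name)
      return [n for names in found.values() for n in names]

  if __name__ == "__main__":
      for n in generate_names(POOL, TARGET_PGS, NAMES_PER_PG):
          print(n)                      # feed the resulting list to the modified rados bench

One "ceph osd map" call per candidate name is slow; computing the object-name
hash locally would be much faster, but the CLI keeps the sketch short and
independent of hash implementation details.
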
>>
>> There's a lot of per-pg locking and pipelining that happens within the
>> OSD process. If you're mapping to only a single PG per OSD, then you're
>> basically forcing it to run single-threaded and to only handle one read
>> at a time. If you want to force an even distribution of operations
>> across OSDs, you'll need to calculate names for enough PGs to exceed the
>> sharding counts you're using in order to avoid "artificial" bottlenecks.
>> -Greg
>
> Greg --
>
> Is there any performance counter which would show the fact that we were
> basically single-threading in the OSDs?

I'm not aware of anything covering that. It's probably not too hard to add
counters on how many ops per shard have been performed; PRs and tickets
welcome.
-Greg
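
As Greg notes, there is no counter that directly shows per-shard activity, but
one can at least read back how many op shards each OSD is configured with
(osd_op_num_shards) and compare the existing op_r_* counters between the
"even" and "random" runs. The sketch below is a minimal illustration under
some assumptions: it runs on the OSD host with access to the admin sockets,
the osd id list is a placeholder, and the option/counter names shown may
differ across Ceph releases.

  #!/usr/bin/env python
  # Sketch: read the op-sharding config and the read-op counters mentioned
  # in the thread (op_r_out_bytes, op_r_latency) from local OSD admin
  # sockets, so the "even" and "random" runs can be compared side by side.
  # Assumes it runs on the OSD host; the osd id list is a placeholder, and
  # option/counter names may differ across Ceph releases.
  import json
  import subprocess

  OSD_IDS = [0, 1, 2, 3, 4]             # hypothetical local osd ids

  def admin_socket(osd_id, *args):
      cmd = ["ceph", "daemon", "osd.%d" % osd_id] + list(args)
      return json.loads(subprocess.check_output(cmd).decode())

  for osd_id in OSD_IDS:
      shards = admin_socket(osd_id, "config", "get", "osd_op_num_shards")
      osd_perf = admin_socket(osd_id, "perf", "dump")["osd"]
      lat = osd_perf["op_r_latency"]
      avg = lat["sum"] / lat["avgcount"] if lat["avgcount"] else 0.0
      print("osd.%d  op shards: %s  op_r_out_bytes: %d  avg op_r_latency: %.6fs" % (
          osd_id, shards["osd_op_num_shards"], osd_perf["op_r_out_bytes"], avg))

Greg's suggestion above amounts to generating names for enough pgs per OSD to
comfortably exceed that shard count, so the even-distribution runs are not
artificially serialized.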