Re: [Cbt] rados benchmark, even distribution vs. random distribution

I recollect doing something along those lines during the tcmalloc perf
issue days, when we wanted to see how evenly the shards in an OSD
work queue were populated for purely random workloads. I can dig that
up in a usable form if that helps.

Thanks,
Pavan.

On Wed, Jan 27, 2016 at 11:37 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Wed, Jan 27, 2016 at 9:01 AM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
>>
>>> -----Original Message-----
>>> From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
>>> Sent: Tuesday, January 26, 2016 1:33 PM
>>> To: Deneau, Tom
>>> Cc: ceph-devel@xxxxxxxxxxxxxxx; cbt@xxxxxxxxxxxxxx
>>> Subject: Re: [Cbt] rados benchmark, even distribution vs. random
>>> distribution
>>>
>>> On Tue, Jan 26, 2016 at 11:11 AM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
>>> > Looking for some help with a Ceph performance puzzler...
>>> >
>>> > Background:
>>> > I had been experimenting with running rados benchmarks with a more
>>> > controlled distribution of objects across osds.  The main reason was
>>> > to reduce run to run variability since I was running on fairly small
>>> > clusters.
>>> >
>>> > I modified rados bench itself to optionally take a file with a list of
>>> > names and use those instead of the usual randomly generated names that
>>> > rados bench uses, "benchmark_data_hostname_pid_objectxxx".  It was
>>> > then possible to generate and use a list of names which hashed to an
>>> > "even distribution" across osds if desired.  The list of names was
>>> > generated by finding a set of pgs that gave even distribution and then
>>> > generating names that mapped to those pgs.  So for example with 5 osds
>>> > on a replicated=2 pool we might find 5 pgs mapping to [0,2] [1,4]
>>> > [2,3] [3,1] [4,0] and then all the names generated would map to only
>>> > those 5 pgs.
>>> >
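>>> > For illustration, a rough sketch of how such a name list can be
>>> > brute-forced by asking the cluster itself.  The pool name, the target
>>> > PG ids, and the exact JSON fields returned by
>>> > "ceph osd map <pool> <obj> -f json" are assumptions to check against
>>> > your release:
>>> >
>>> > #!/usr/bin/env python
>>> > # Sketch: collect object names whose PG is in a chosen target set by
>>> > # asking the cluster via "ceph osd map".  Slow (one CLI call per
>>> > # candidate name) but simple.
>>> > import json
>>> > import subprocess
>>> >
>>> > POOL = "testpool"                    # hypothetical pool name
>>> > TARGET_PGS = {"1.0", "1.5", "1.9"}   # PGs picked for even OSD coverage
>>> > WANTED = 1000                        # how many names to collect
>>> >
>>> > def pg_of(name):
>>> >     out = subprocess.check_output(
>>> >         ["ceph", "osd", "map", POOL, name, "-f", "json"])
>>> >     return json.loads(out)["pgid"]
>>> >
>>> > names = []
>>> > i = 0
>>> > while len(names) < WANTED:
>>> >     candidate = "evenbench_object%d" % i
>>> >     if pg_of(candidate) in TARGET_PGS:
>>> >         names.append(candidate)
>>> >     i += 1
>>> >
>>> > with open("object_names.txt", "w") as f:
>>> >     f.write("\n".join(names) + "\n")
>>> >
>>> > The resulting object_names.txt is the sort of file the modified rados
>>> > bench above can consume.
>>> >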
>>> > Up until recently, the results from these experiments were what I
>>> > expected:
>>> >    * better repeatability across runs
>>> >    * even disk utilization.
>>> >    * generally higher total Bandwidth
>>> >
>>> > Recently, however, I saw results on one platform that had much lower
>>> > bandwidth (less than half) with the "even distribution" run compared
>>> > to the default random distribution run.  Some notes were:
>>> >
>>> >    * showed up in an erasure-coded pool with k=2, m=1, with 4M
>>> >      objects.  The discrepancy did not show up on a replicated (size 2)
>>> >      pool on the same cluster.
>>> >
>>> >    * In general, larger objects showed the problem more than smaller
>>> >      objects.
>>> >
>>> >    * showed up only on reads from the erasure pool.  Writes to the
>>> >      same pool had higher bandwidth with the "even distribution".
>>> >
>>> >    * showed up on only one cluster; a separate cluster with the same
>>> >      number of nodes and disks but a different system architecture did
>>> >      not show this.
>>> >
>>> >    * showed up only when the total number of client threads got "high
>>> >      enough".  For example, it showed up with 64 total client threads
>>> >      but not with 16.  The distribution of threads across client
>>> >      processes did not seem to matter.
>>> >
>>> > I tried looking at "dump_historic_ops" and did indeed see some read
>>> > ops logged with high latency in the "even distribution" case.  The
>>> > larger elapsed times in the historic ops were always in the
>>> > "reached_pg" and "done" steps.  But I saw similar high latencies and
>>> > elapsed times for "reached_pg" and "done" for historic read ops in the
>>> > random case.
>>> >
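>>> > For reference, a rough sketch of pulling the per-event gaps out of
>>> > dump_historic_ops.  The nesting of "type_data"/"events" and the
>>> > timestamp format differ between releases, so the script just walks the
>>> > JSON looking for event lists rather than assuming an exact layout:
>>> >
>>> > #!/usr/bin/env python
>>> > # Sketch: print the largest gaps between consecutive events
>>> > # (e.g. queued_for_pg -> reached_pg -> done) for one OSD.
>>> > import json
>>> > import subprocess
>>> > from datetime import datetime
>>> >
>>> > def event_lists(node):
>>> >     # Recursively yield anything that looks like an op event timeline.
>>> >     if isinstance(node, list):
>>> >         if node and all(isinstance(e, dict) and "event" in e
>>> >                         and "time" in e for e in node):
>>> >             yield node
>>> >         else:
>>> >             for item in node:
>>> >                 for found in event_lists(item):
>>> >                     yield found
>>> >     elif isinstance(node, dict):
>>> >         for value in node.values():
>>> >             for found in event_lists(value):
>>> >                 yield found
>>> >
>>> > def parse(ts):
>>> >     # Assumed timestamp format, e.g. "2016-01-27 10:12:34.567890".
>>> >     return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
>>> >
>>> > out = subprocess.check_output(["ceph", "daemon", "osd.0",
>>> >                                "dump_historic_ops"])
>>> > for timeline in event_lists(json.loads(out)):
>>> >     for prev, cur in zip(timeline, timeline[1:]):
>>> >         gap = (parse(cur["time"]) - parse(prev["time"])).total_seconds()
>>> >         if gap > 0.1:   # report anything slower than 100 ms
>>> >             print("%8.3fs  %s -> %s"
>>> >                   % (gap, prev["event"], cur["event"]))
>>> >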
>>> > I captured perf counters before and after the read tests.  I see big
>>> > differences in op_r_out_bytes, which makes sense because the
>>> > higher-bandwidth run processed more bytes.  For some osds,
>>> > op_r_latency/sum is slightly higher in the "even" run, but I am not
>>> > sure whether that is significant.
>>> >
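>>> > For reference, a rough sketch of turning those counters into an average
>>> > read latency per OSD over a run.  It assumes op_r_latency sits under the
>>> > "osd" section of "perf dump" as {"avgcount": ..., "sum": ...}, which is
>>> > how the latency counters look in this era:
>>> >
>>> > #!/usr/bin/env python
>>> > # Sketch: average read latency for one OSD between two snapshots.
>>> > import json
>>> > import subprocess
>>> >
>>> > def read_latency(osd_id):
>>> >     out = subprocess.check_output(
>>> >         ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
>>> >     lat = json.loads(out)["osd"]["op_r_latency"]
>>> >     return lat["avgcount"], lat["sum"]
>>> >
>>> > before = read_latency(0)   # snapshot before the bench run
>>> > # ... run the rados bench read test here ...
>>> > after = read_latency(0)    # snapshot after
>>> >
>>> > reads = after[0] - before[0]
>>> > if reads:
>>> >     print("avg read latency over the run: %.3f ms"
>>> >           % (1000.0 * (after[1] - before[1]) / reads))
>>> >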
>>> > Anyway, I will probably just stop doing these "even distribution" runs,
>>> > but I was hoping to understand why they might have such
>>> > reduced bandwidth in this particular case.  Is there something about
>>> > mapping to a smaller number of pgs that becomes a bottleneck?
>>>
>>> There's a lot of per-pg locking and pipelining that happens within the OSD
>>> process. If you're mapping to only a single PG per OSD, then you're
>>> basically forcing it to run single-threaded and to only handle one read at
>>> a time. If you want to force an even distribution of operations across
>>> OSDs, you'll need to calculate names for enough PGs to exceed the sharding
>>> counts you're using in order to avoid "artificial" bottlenecks.
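>>>
>>> As a quick sanity check, something along these lines can compare the
>>> number of target PGs an OSD serves against its shard count.  The option
>>> name (osd_op_num_shards) and the admin-socket "config get" output key
>>> are assumptions that may differ by release:
>>>
>>> #!/usr/bin/env python
>>> # Sketch: flag OSDs whose target PG count is at or below the number of
>>> # op work-queue shards, i.e. OSDs that cannot use all their shards.
>>> import json
>>> import subprocess
>>>
>>> def shard_count(osd_id):
>>>     out = subprocess.check_output(
>>>         ["ceph", "daemon", "osd.%d" % osd_id,
>>>          "config", "get", "osd_op_num_shards"])
>>>     return int(json.loads(out)["osd_op_num_shards"])
>>>
>>> # Hypothetical mapping built from the target PG list of the "even" run.
>>> pgs_per_osd = {0: ["1.0", "1.9"], 1: ["1.5"]}
>>>
>>> for osd_id, pgs in sorted(pgs_per_osd.items()):
>>>     shards = shard_count(osd_id)
>>>     flag = "ok" if len(pgs) > shards else "under-sharded"
>>>     print("osd.%d: %d target pgs vs %d shards  (%s)"
>>>           % (osd_id, len(pgs), shards, flag))
>>>
>>> With one target PG per OSD, every OSD comes out under-sharded and its
>>> reads effectively run single-threaded.
>>>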
>>> -Greg
>>
>> Greg --
>>
>> Is there any performance counter that would show that we were
>> basically single-threading in the OSDs?
>
> I'm not aware of anything covering that. It's probably not too hard to
> add counters on how many ops per shard have been performed; PRs and
> tickets welcome.
> -Greg