Re: [Cbt] rados benchmark, even distribution vs. random distribution


 



On Wed, Jan 27, 2016 at 11:14 AM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
>
>
>> -----Original Message-----
>> From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
>> Sent: Wednesday, January 27, 2016 12:08 PM
>> To: Deneau, Tom
>> Cc: ceph-devel@xxxxxxxxxxxxxxx; cbt@xxxxxxxxxxxxxx
>> Subject: Re: [Cbt] rados benchmark, even distribution vs. random
>> distribution
>>
>> On Wed, Jan 27, 2016 at 9:01 AM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
>> >
>> >> -----Original Message-----
>> >> From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
>> >> Sent: Tuesday, January 26, 2016 1:33 PM
>> >> To: Deneau, Tom
>> >> Cc: ceph-devel@xxxxxxxxxxxxxxx; cbt@xxxxxxxxxxxxxx
>> >> Subject: Re: [Cbt] rados benchmark, even distribution vs. random
>> >> distribution
>> >>
>> >> On Tue, Jan 26, 2016 at 11:11 AM, Deneau, Tom <tom.deneau@xxxxxxx>
>> >> wrote:
>> >> > Looking for some help with a Ceph performance puzzler...
>> >> >
>> >> > Background:
>> >> > I had been experimenting with running rados benchmarks with a more
>> >> > controlled distribution of objects across osds.  The main reason
>> >> > was to reduce run to run variability since I was running on fairly
>> >> > small clusters.
>> >> >
>> >> > I modified rados bench itself to optionally take a file with a list
>> >> > of names and use those instead of the usual randomly generated
>> >> > names that rados bench uses,
>> >> > "benchmark_data_hostname_pid_objectxxx".  It was then possible to
>> >> > generate and use a list of names which hashed to an "even
>> >> > distribution" across osds if desired.  The list of names was
>> >> > generated by finding a set of pgs that gave even distribution and
>> >> > then generating names that mapped to those pgs.  So for example
>> >> > with 5 osds on a replicated=2 pool we might find 5 pgs mapping to
>> >> > [0,2] [1,4] [2,3] [3,1] [4,0] and then all the names generated would
>> >> > map to only those 5 pgs.
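
A rough sketch of one way to build such a name list, assuming you let the cluster do the mapping via "ceph osd map" rather than re-deriving the CRUSH placement yourself; the pool name, target PG ids, and name prefix below are placeholders, not the values used in the runs described above:

#!/usr/bin/env python3
# Sketch only: collect object names that the cluster maps to a chosen set of
# PGs, by asking "ceph osd map" for each candidate name.  The "pgid" field is
# assumed from the command's JSON output; pool, PGs and prefix are placeholders.
import json
import subprocess

POOL = "testpool"                     # placeholder pool name
TARGET_PGS = {"1.0", "1.5", "1.a"}    # PG ids picked for an "even" mapping
WANTED = 1000                         # how many matching names to collect

names = []
candidate_id = 0
while len(names) < WANTED:
    name = "bench_object_%d" % candidate_id
    out = subprocess.check_output(
        ["ceph", "osd", "map", POOL, name, "--format=json"])
    if json.loads(out)["pgid"] in TARGET_PGS:
        names.append(name)
    candidate_id += 1

print("\n".join(names))

Feeding the resulting list to the modified rados bench reproduces the controlled placement, at the cost of one CLI round trip per candidate name; for large name counts the same filtering could be done in-process against a copy of the osdmap, but that is beyond this sketch.
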
>> >> >
>> >> > Up until recently, the results from these experiments were what I
>> >> > expected,
>> >> >    * better repeatability across runs
>> >> >    * even disk utilization.
>> >> >    * generally higher total Bandwidth
>> >> >
>> >> > Recently however I saw results on one platform that had much lower
>> >> > bandwidth (less than half) with the "even distribution" run
>> >> > compared to the default random distribution run.  Some notes were:
>> >> >
>> >> >    * showed up in erasure coded pool with k=2, m=1, with 4M size
>> >> >      objects.  Did not show the discrepancy on a replicated 2 pool on
>> >> >      the same cluster.
>> >> >
>> >> >    * In general, larger objects showed the problem more than smaller
>> >> >      objects.
>> >> >
>> >> >    * showed up only on reads from the erasure pool.  Writes to the
>> >> >      same pool had higher bandwidth with the "even distribution"
>> >> >
>> >> >    * showed up on only one cluster, a separate cluster with the same
>> >> >      number of nodes and disks but different system architecture did
>> >> >      not show this.
>> >> >
>> >> >    * showed up only when the total number of client threads got "high
>> >> >      enough".  For example showed up with 64 total client threads but
>> >> >      not with 16.  The distribution of threads across client
>> >> >      processes
>> >> >      did not seem to matter.
>> >> >
>> >> > I tried looking at "dump_historic_ops" and did indeed see some read
>> >> > ops logged with high latency in the "even distribution" case.  The
>> >> > larger elapsed times in the historic ops were always in the
>> >> > "reached_pg" and "done" steps.  But I saw similar high latencies
>> >> > and elapsed times for "reached_pg" and "done" for historic read ops
>> >> > in the random case.
>> >> >
>> >> > I have perf counters before and after the read tests.  I see big
>> >> > differences in the op_r_out_bytes which makes sense because the
>> >> > higher bw run processed more bytes.  For some osds,
>> >> > op_r_latency/sum is slightly higher in the "even" run but not sure if
>> >> > this is significant.
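
For anyone reproducing this, the numbers above come straight off the OSD admin sockets; a minimal sketch of pulling them, assuming osd.0 and the usual layout of "perf dump" (section and field names may differ slightly between releases):

#!/usr/bin/env python3
# Sketch: read the counters Tom mentions (op_r_latency, historic ops) from an
# OSD admin socket.  "osd.0" is an assumption; adjust to your daemons.
import json
import subprocess

def admin(osd, *cmd):
    out = subprocess.check_output(["ceph", "daemon", osd] + list(cmd))
    return json.loads(out)

perf = admin("osd.0", "perf", "dump")
lat = perf["osd"]["op_r_latency"]             # {"avgcount": N, "sum": seconds}
if lat["avgcount"]:
    print("avg read latency: %.6f s over %d reads"
          % (lat["sum"] / lat["avgcount"], lat["avgcount"]))

historic = admin("osd.0", "dump_historic_ops")
ops = historic.get("ops") or historic.get("Ops") or []  # key varies by release
print("historic ops retained:", len(ops))

Comparing op_r_latency sum/avgcount per OSD between the "even" and "random" runs is a quick way to tell whether the slightly higher latency mentioned above is consistent or just noise.
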
>> >> >
>> >> > Anyway, I will probably just stop doing these "even distribution"
>> >> > runs but I was hoping to get an understanding of why they might
>> >> > have such reduced bandwidth in this particular case.  Is there
>> >> > something about mapping to a smaller number of pgs that becomes a
>> >> > bottleneck?
>> >>
>> >> There's a lot of per-pg locking and pipelining that happens within
>> >> the OSD process. If you're mapping to only a single PG per OSD, then
>> >> you're basically forcing it to run single-threaded and to only handle
>> >> one read at a time. If you want to force an even distribution of
>> >> operations across OSDs, you'll need to calculate names for enough PGs
>> >> to exceed the sharding counts you're using in order to avoid
>> >> "artificial" bottlenecks.
>> >> -Greg
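
To make the "sharding counts" concrete: the number of op-queue shards per OSD is a config option you can read off a running daemon. A small sketch, assuming osd.0 and the shard options available in this era's releases (osd_op_num_shards, osd_op_num_threads_per_shard):

#!/usr/bin/env python3
# Sketch: read the op-queue sharding settings from a running OSD so the
# "even distribution" name list can target at least that many PGs per OSD.
# "osd.0" is an assumption about the deployment.
import json
import subprocess

def get_opt(osd, option):
    out = subprocess.check_output(
        ["ceph", "daemon", osd, "config", "get", option])
    return json.loads(out)[option]

shards = int(get_opt("osd.0", "osd_op_num_shards"))
threads_per_shard = int(get_opt("osd.0", "osd_op_num_threads_per_shard"))

print("op queue shards per OSD:  ", shards)
print("worker threads per shard: ", threads_per_shard)
print("distinct busy PGs per OSD needed to cover every shard: >=", shards)

With names confined to one PG per OSD, at most one of those shards ever has work queued, which matches the effectively single-threaded behaviour described here.
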
>> >
>> > Greg --
>> >
>> > Is there any performance counter which would show the fact that we
>> > were basically single-threading in the OSDs?
>>
>> I'm not aware of anything covering that. It's probably not too hard to add
>> counters on how many ops per shard have been performed; PRs and tickets
>> welcome.
>> -Greg
>
> Greg --
>
> What is the meaning of 'shard' in this context?  Would this tell us
> how much parallelism was going on in the osd?

We have a "ShardedOpQueue" (or similar) in the OSD which handles all
the worker threads. PGs are mapped to a single shard for all
processing, and while operations within a single shard might be
concurrent (eg, a write can go to disk and leave the CPU free to
process an op on another PG within the same shard), it is the unit of
parallelism. So if you've got ops within only a single shard, you'll
know you're not getting an even spread and are probably bottlenecking
on that thread. You can do similar comparisons across time by taking
snapshots of the counters and seeing how they change, or introducing
more complicated counters to try and directly measure parallelism.
-Greg
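
A minimal sketch of the counter-snapshot comparison suggested above, assuming the admin socket for osd.0 is reachable and that a rados bench read run happens during the sleep; it is illustrative, not an existing tool:

#!/usr/bin/env python3
# Sketch: snapshot "perf dump" before and after a benchmark run and print the
# counters that changed.  "osd.0" and the 60 s window are assumptions.
import json
import subprocess
import time

def perf_dump(osd):
    out = subprocess.check_output(["ceph", "daemon", osd, "perf", "dump"])
    return json.loads(out)

def flatten(d, prefix=""):
    """Flatten nested sections into dotted keys, keeping numeric leaves only."""
    flat = {}
    for key, val in d.items():
        name = prefix + key
        if isinstance(val, dict):
            flat.update(flatten(val, name + "."))
        elif isinstance(val, (int, float)):
            flat[name] = val
    return flat

before = flatten(perf_dump("osd.0"))
time.sleep(60)                        # run "rados bench ... seq" meanwhile
after = flatten(perf_dump("osd.0"))

for key in sorted(after):
    delta = after[key] - before.get(key, 0)
    if delta:
        print("%-60s %g" % (key, delta))

Comparing the deltas for op_r, op_r_out_bytes and op_r_latency between an "even" run and a "random" run, per OSD, gives a rough picture of how the read work actually spread across daemons, even without a dedicated per-shard counter.
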