RE: [Cbt] rados benchmark, even distribution vs. random distribution

> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
> Sent: Tuesday, January 26, 2016 1:33 PM
> To: Deneau, Tom
> Cc: ceph-devel@xxxxxxxxxxxxxxx; cbt@xxxxxxxxxxxxxx
> Subject: Re: [Cbt] rados benchmark, even distribution vs. random
> distribution
> 
> On Tue, Jan 26, 2016 at 11:11 AM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
> > Looking for some help with a Ceph performance puzzler...
> >
> > Background:
> > I had been experimenting with running rados benchmarks with a more
> > controlled distribution of objects across osds.  The main reason was
> > to reduce run-to-run variability since I was running on fairly small
> > clusters.
> >
> > I modified rados bench itself to optionally take a file with a list of
> > names and use those instead of the usual randomly generated names that
> > rados bench uses, "benchmark_data_hostname_pid_objectxxx".  It was
> > then possible to generate and use a list of names which hashed to an
> > "even distribution" across osds if desired.  The list of names was
> > generated by finding a set of pgs that gave even distribution and then
> > generating names that mapped to those pgs.  So for example with 5 osds
> > on a replicated=2 pool we might find 5 pgs mapping to [0,2] [1,4]
> > [2,3] [3,1] [4,0] and then all the names generated would map to only
> > those 5 pgs.
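
(For anyone who wants to try the same thing, a minimal sketch of how such a
name list might be generated, assuming a hypothetical pool "testpool" and the
standard "ceph osd map <pool> <objname>" CLI; this is not the exact code used
for the runs above, and the JSON field names may vary between releases.)

#!/usr/bin/env python
# Brute-force object names until enough of them land on a chosen set of pgs.
# Assumptions (not from the original post): pool "testpool", target pgids
# "1.0".."1.4", and a "pgid" field in the JSON output of "ceph osd map".
import json
import subprocess

POOL = "testpool"                                  # hypothetical pool name
TARGET_PGS = {"1.0", "1.1", "1.2", "1.3", "1.4"}   # pgs picked for even mapping
WANTED_PER_PG = 100                                # names to collect per pg

def pg_of(name):
    out = subprocess.check_output(
        ["ceph", "osd", "map", POOL, name, "--format", "json"])
    return json.loads(out)["pgid"]

names = {pg: [] for pg in TARGET_PGS}
i = 0
while any(len(v) < WANTED_PER_PG for v in names.values()):
    candidate = "object%06d" % i
    i += 1
    pg = pg_of(candidate)
    if pg in names and len(names[pg]) < WANTED_PER_PG:
        names[pg].append(candidate)

with open("even_names.txt", "w") as f:   # list read by the modified rados bench
    for pg in sorted(names):
        f.write("\n".join(names[pg]) + "\n")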
> >
> > Up until recently, the results from these experiments were what I
> > expected:
> >    * better repeatability across runs
> >    * even disk utilization.
> >    * generally higher total Bandwidth
> >
> > Recently however I saw results on one platform that had much lower
> > bandwidth (less than half) with the "even distribution" run compared
> > to the default random distribution run.  Some notes were:
> >
> >    * showed up in an erasure-coded pool with k=2, m=1, with 4M-size
> >      objects.  Did not show the discrepancy on a replicated=2 pool on
> >      the same cluster.
> >
> >    * In general, larger objects showed the problem more than smaller
> >      objects.
> >
> >    * showed up only on reads from the erasure pool.  Writes to the
> >      same pool had higher bandwidth with the "even distribution".
> >
> >    * showed up on only one cluster, a separate cluster with the same
> >      number of nodes and disks but different system architecture did
> >      not show this.
> >
> >    * showed up only when the total number of client threads got "high
> >      enough".  For example showed up with 64 total client threads but
> >      not with 16.  The distribution of threads across client processes
> >      did not seem to matter.
> >
> > I tried looking at "dump_historic_ops" and did indeed see some read
> > ops logged with high latency in the "even distribution" case.  The
> > larger elapsed times in the historic ops were always in the
> > "reached_pg" and "done" steps.  But I saw similar high latencies and
> > elapsed times for "reached_pg" and "done" for historic read ops in the
> > random case.
> >
> > I have perf counters before and after the read tests.  I see big
> > differences in the op_r_out_bytes which makes sense because the higher
> > bw run processed more bytes.  For some osds, op_r_latency/sum is
> > slightly higher in the "even" run but not sure if this is significant.
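
(For comparing runs, a small sketch that turns those counters into a per-osd
average read latency, assuming the admin socket command
"ceph daemon osd.N perf dump" and op_r_latency reported as avgcount/sum under
the "osd" section; osd ids 0-4 are just an example, and the command has to
run on the node hosting each osd.)

#!/usr/bin/env python
# Compute average read latency and bytes read per OSD from "perf dump".
import json
import subprocess

def read_stats(osd_id):
    dump = json.loads(subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"]))
    osd = dump["osd"]
    lat = osd["op_r_latency"]              # {"avgcount": N, "sum": seconds}
    avg = lat["sum"] / lat["avgcount"] if lat["avgcount"] else 0.0
    return avg, osd["op_r_out_bytes"]

for osd_id in range(5):                    # example: osd.0 .. osd.4
    avg, out_bytes = read_stats(osd_id)
    print("osd.%d  avg op_r_latency %.4f s  op_r_out_bytes %d"
          % (osd_id, avg, out_bytes))

Capturing the same numbers before and after a test and differencing them
gives the per-run figures.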
> >
> > Anyway, I will probably just stop doing these "even distribution" runs
> > but I was hoping to get an understanding of why they might have such
> > reduced bandwidth in this particular case.  Is there something about
> > mapping to a smaller number of pgs that becomes a bottleneck?
> 
> There's a lot of per-pg locking and pipelining that happens within the OSD
> process. If you're mapping to only a single PG per OSD, then you're
> basically forcing it to run single-threaded and to only handle one read at
> a time. If you want to force an even distribution of operations across
> OSDs, you'll need to calculate names for enough PGs to exceed the sharding
> counts you're using in order to avoid "artificial" bottlenecks.
> -Greg

Greg --

Is there any performance counter that would show that we were
basically single-threading in the OSDs?

-- Tom
