RE: [Cbt] rados benchmark, even distribution vs. random distribution

> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
> Sent: Wednesday, January 27, 2016 12:08 PM
> To: Deneau, Tom
> Cc: ceph-devel@xxxxxxxxxxxxxxx; cbt@xxxxxxxxxxxxxx
> Subject: Re: [Cbt] rados benchmark, even distribution vs. random
> distribution
> 
> On Wed, Jan 27, 2016 at 9:01 AM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
> >
> >> -----Original Message-----
> >> From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
> >> Sent: Tuesday, January 26, 2016 1:33 PM
> >> To: Deneau, Tom
> >> Cc: ceph-devel@xxxxxxxxxxxxxxx; cbt@xxxxxxxxxxxxxx
> >> Subject: Re: [Cbt] rados benchmark, even distribution vs. random
> >> distribution
> >>
> >> On Tue, Jan 26, 2016 at 11:11 AM, Deneau, Tom <tom.deneau@xxxxxxx>
> >> wrote:
> >> > Looking for some help with a Ceph performance puzzler...
> >> >
> >> > Background:
> >> > I had been experimenting with running rados benchmarks with a more
> >> > controlled distribution of objects across osds.  The main reason
> >> > was to reduce run to run variability since I was running on fairly
> >> > small clusters.
> >> >
> >> > I modified rados bench itself to optionally take a file with a list
> >> > of names and use those instead of the usual randomly generated
> >> > names that rados bench uses,
> >> > "benchmark_data_hostname_pid_objectxxx".  It was then possible to
> >> > generate and use a list of names which hashed to an "even
> >> > distribution" across osds if desired.  The list of names was
> >> > generated by finding a set of pgs that gave even distribution and
> >> > then generating names that mapped to those pgs.  So for example
> >> > with 5 osds on a replicated=2 pool we might find 5 pgs mapping to
> >> > [0,2] [1,4] [2,3] [3,1] [4,0] and then all the names generated would
> >> > map to only those 5 pgs.
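
For reference, a minimal sketch of how such a name list might be generated
by brute force, using "ceph osd map" to test candidate names (the pool name,
target PG set, and output file below are made up for illustration, and the
regex assumes the usual "-> pg <hash> (<pgid>) ->" output format, which can
vary between releases):

    # Generate object names that land on a chosen set of PGs by testing
    # candidates with 'ceph osd map <pool> <name>'.
    import re
    import subprocess

    POOL = "testpool"                    # hypothetical pool name
    TARGET_PGS = {"1.0", "1.5", "1.9"}   # PGs picked for the "even" layout
    NAMES_PER_PG = 100

    def pg_of(name):
        out = subprocess.check_output(["ceph", "osd", "map", POOL, name])
        m = re.search(r"-> pg \S+ \((\S+)\)", out.decode())
        return m.group(1) if m else None

    found = {pg: [] for pg in TARGET_PGS}
    i = 0
    while any(len(v) < NAMES_PER_PG for v in found.values()):
        name = "evenobj%07d" % i
        i += 1
        pg = pg_of(name)
        if pg in found and len(found[pg]) < NAMES_PER_PG:
            found[pg].append(name)

    with open("names.txt", "w") as f:
        for names in found.values():
            f.write("\n".join(names) + "\n")
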
> >> >
> >> > Up until recently, the results from these experiments were what I
> >> expected,
> >> >    * better repeatability across runs
> >> >    * even disk utilization.
> >> >    * generally higher total Bandwidth
> >> >
> >> > Recently however I saw results on one platform that had much lower
> >> > bandwidth (less than half) with the "even distribution" run
> >> > compared to the default random distribution run.  Some notes were:
> >> >
> >> >    * showed up in erasure coded pool with k=2, m=1, with 4M size
> >> >      objects.  Did not show the discrepancy on a replicated 2 pool on
> >> >      the same cluster.
> >> >
> >> >    * In general, larger objects showed the problem more than smaller
> >> >      objects.
> >> >
> >> >    * showed up only on reads from the erasure pool.  Writes to the
> >> >      same pool had higher bandwidth with the "even distribution"
> >> >
> >> >    * showed up on only one cluster, a separate cluster with the same
> >> >      number of nodes and disks but different system architecture did
> >> >      not show this.
> >> >
> >> >    * showed up only when the total number of client threads got "high
> >> >      enough".  For example showed up with 64 total client threads but
> >> >      not with 16.  The distribution of threads across client
> >> >      processes
> >> >      did not seem to matter.
> >> >
> >> > I tried looking at "dump_historic_ops" and did indeed see some read
> >> > ops logged with high latency in the "even distribution" case.  The
> >> > larger elapsed times in the historic ops were always in the
> >> > "reached_pg" and "done" steps.  But I saw similar high latencies
> >> > and elapsed times for "reached_pg" and "done" for historic read ops
> >> > in the random case.
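
A rough sketch of pulling the slowest recent ops from an OSD admin socket,
assuming the JSON layout of that era (the "ops"/"Ops", "duration", and
"description" field names vary a bit between releases):

    # Print the slowest ops currently held in an OSD's historic-ops buffer.
    # Run on the host that has the OSD's admin socket.
    import json
    import subprocess

    OSD_ID = 0   # hypothetical OSD to inspect

    out = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % OSD_ID, "dump_historic_ops"])
    data = json.loads(out.decode())
    ops = data.get("ops") or data.get("Ops") or []

    # Sort by total duration and show the worst offenders.
    for op in sorted(ops, key=lambda o: o.get("duration", 0), reverse=True)[:10]:
        print("%8.3fs  %s" % (op.get("duration", 0), op.get("description", "")))
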
> >> >
> >> > I have perf counters before and after the read tests.  I see big
> >> > differences in the op_r_out_bytes which makes sense because the
> >> > higher bw run processed more bytes.  For some osds,
> >> > op_r_latency/sum is slightly higher in the "even" run but not sure if
> >> > this is significant.
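
A sketch of that before/after counter comparison, assuming the "perf dump"
admin socket command and the op_r* counter names mentioned here (python3;
the exact counter layout can differ by release):

    # Snapshot read-op counters from each OSD before and after a benchmark
    # run and print the per-OSD deltas.  Assumes local admin sockets for
    # the listed OSDs.
    import json
    import subprocess

    OSD_IDS = [0, 1, 2, 3, 4]   # hypothetical OSD ids in the test cluster

    def perf(osd_id):
        out = subprocess.check_output(
            ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
        return json.loads(out.decode())["osd"]

    before = {i: perf(i) for i in OSD_IDS}
    input("Run the rados bench read test now, then press Enter...")
    after = {i: perf(i) for i in OSD_IDS}

    for i in OSD_IDS:
        d_ops = after[i]["op_r"] - before[i]["op_r"]
        d_bytes = after[i]["op_r_out_bytes"] - before[i]["op_r_out_bytes"]
        d_lat = after[i]["op_r_latency"]["sum"] - before[i]["op_r_latency"]["sum"]
        avg = d_lat / d_ops if d_ops else 0.0
        print("osd.%d: reads=%d bytes=%d avg_lat=%.4fs" % (i, d_ops, d_bytes, avg))
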
> >> >
> >> > Anyway, I will probably just stop doing these "even distribution"
> >> > runs but I was hoping to get an understanding of why they might
> >> > have such reduced bandwidth in this particular case.  Is there
> >> > something about mapping to a smaller number of pgs that becomes a
> >> > bottleneck?
> >>
> >> There's a lot of per-pg locking and pipelining that happens within
> >> the OSD process. If you're mapping to only a single PG per OSD, then
> >> you're basically forcing it to run single-threaded and to only handle
> >> one read at a time. If you want to force an even distribution of
> >> operations across OSDs, you'll need to calculate names for enough PGs
> >> to exceed the sharding counts you're using in order to avoid
> "artificial" bottlenecks.
> >> -Greg
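
To put numbers on this, a small sketch that reads the shard settings from an
OSD admin socket and shows the parallelism cap being described (the
osd_op_num_shards / osd_op_num_threads_per_shard option names and defaults
are the hammer/jewel-era ones, so treat them as an assumption):

    # Ops for a given PG are always queued on the same shard, so usable
    # parallelism per OSD is roughly shards * threads_per_shard, and the
    # "even" name list needs well more distinct PGs per OSD than that.
    import json
    import subprocess

    OSD_ID = 0   # hypothetical OSD to query (run on its host)

    def conf(opt):
        out = subprocess.check_output(
            ["ceph", "daemon", "osd.%d" % OSD_ID, "config", "get", opt])
        return int(json.loads(out.decode())[opt])

    shards = conf("osd_op_num_shards")
    threads = conf("osd_op_num_threads_per_shard")
    print("op queue shards: %d, threads per shard: %d" % (shards, threads))
    print("parallel ops per OSD capped near %d" % (shards * threads))
    print("so aim for well over %d distinct PGs per OSD in the name list" % shards)
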
> >
> > Greg --
> >
> > Is there any performance counter which would show the fact that we
> > were basically single-threading in the OSDs?
> 
> I'm not aware of anything covering that. It's probably not too hard to add
> counters on how many ops per shard have been performed; PRs and tickets
> welcome.
> -Greg

Greg --

What is the meaning of 'shard' in this context?  Would this tell us
how much parallelism was going on in the osd?

-- Tom