> -----Original Message-----
> From: Deneau, Tom
> Sent: Wednesday, September 23, 2015 3:05 PM
> To: 'Mark Nelson'; Gregory Farnum; Sage Weil
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: perf counters from a performance discrepancy
>
> > -----Original Message-----
> > From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> > Sent: Wednesday, September 23, 2015 1:43 PM
> > To: Gregory Farnum; Sage Weil
> > Cc: Deneau, Tom; ceph-devel@xxxxxxxxxxxxxxx
> > Subject: Re: perf counters from a performance discrepancy
> >
> > On 09/23/2015 01:25 PM, Gregory Farnum wrote:
> > > On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > >> On Wed, 23 Sep 2015, Deneau, Tom wrote:
> > >>> Hi all --
> > >>>
> > >>> Looking for guidance with perf counters...
> > >>> I am trying to see whether the perf counters can tell me anything
> > >>> about the following discrepancy.
> > >>>
> > >>> I populate a number of 40k-size objects in each of two pools, poolA
> > >>> and poolB.  Both pools cover osds on a single node, 5 osds total.
> > >>>
> > >>> * Config 1 (1p):
> > >>>   * use a single rados bench client with 32 threads to do a seq
> > >>>     read of 20000 objects from poolA.
> > >>>
> > >>> * Config 2 (2p):
> > >>>   * use two concurrent rados bench clients (running on the same
> > >>>     client node) with 16 threads each,
> > >>>     one reading 10000 objects from poolA,
> > >>>     one reading 10000 objects from poolB.
> > >>>
> > >>> So in both configs we have 32 threads total, and the number of
> > >>> objects read is the same.
> > >>> Note: in all cases, we drop the caches before doing the seq reads.
> > >>>
> > >>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is
> > >>> about 1/3 of the bandwidth for the single client in config 1.
> > >>
> > >> How were the objects written?  I assume the cluster is backed by
> > >> spinning disks?
> > >>
> > >> I wonder if this is a disk layout issue.  If the 20,000 objects are
> > >> written in order, they will be roughly sequential on disk, and the
> > >> 32-thread case will read them in order.  In the 2x 10,000 case, the
> > >> two clients are reading two sequences of objects written at
> > >> different times, and the disk arms will be swinging around more.
> > >>
> > >> My guess is that if the reads were reading the objects in a random
> > >> order the performance would be the same... I'm not sure that rados
> > >> bench does that, though?
> > >>
> > >> sage
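For reference, the configs above correspond roughly to rados bench
invocations like the following.  This is a sketch, not the exact scripts
I ran; the durations and flags are illustrative (the real runs read a
fixed number of objects rather than running for a fixed time), and as I
note further down, the two write runs were concurrent, one per pool.

    # populate a pool with 40k objects; --no-cleanup keeps them
    # around for the read tests (one of these ran per pool, concurrently)
    rados bench -p poolA 60 write -b 40960 --no-cleanup

    # Config 1 (1p): one client, 32 concurrent ops, sequential reads
    rados bench -p poolA 60 seq -t 32

    # Config 2 (2p): two concurrent clients, 16 concurrent ops each
    rados bench -p poolA 60 seq -t 16 &
    rados bench -p poolB 60 seq -t 16 &
    wait

    # the random-order variant Sage suggests, for comparison
    rados bench -p poolA 60 rand -t 32

Note that -t is concurrent ops per client, so both read configs keep 32
ops in flight in total.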
> > >>> I gathered perf counters before and after each run and looked at
> > >>> the difference of the before and after counters for both the 1p
> > >>> and 2p cases.  Here are some things I noticed that are different
> > >>> between the two runs.  Can someone take a look and let me know
> > >>> whether any of these differences are significant?  I ask in
> > >>> particular about the throttle-msgr_dispatch_throttler ones, since
> > >>> I don't know the detailed definitions of those fields.
> > >>> Note: these are the numbers for one of the 5 osds; the other osds
> > >>> are similar...
> > >>>
> > >>> * The field osd/loadavg is always about 3 times higher in the 2p case.
> > >>>
> > >>> some latency-related counters
> > >>> ------------------------------
> > >>> osd/op_latency/sum           1p=6.24801117205061,  2p=579.722513078945
> > >>> osd/op_process_latency/sum   1p=3.48506945394911,  2p=42.6278494549915
> > >>> osd/op_r_latency/sum         1p=6.2480111719924,   2p=579.722513079003
> > >>> osd/op_r_process_latency/sum 1p=3.48506945399276,  2p=42.6278494550061
> > >
> > > So, yep, the individual read ops are taking much longer in the
> > > two-client case.  Naively that's pretty odd.
> > >
> > >>> and some throttle-msgr_dispatch_throttler related counters
> > >>> ----------------------------------------------------------
> > >>> throttle-msgr_dispatch_throttler-client/get              1p=1337,   2p=1339,   diff=2
> > >>> throttle-msgr_dispatch_throttler-client/get_sum          1p=222877, 2p=223088, diff=211
> > >>> throttle-msgr_dispatch_throttler-client/put              1p=1337,   2p=1339,   diff=2
> > >>> throttle-msgr_dispatch_throttler-client/put_sum          1p=222877, 2p=223088, diff=211
> > >>> throttle-msgr_dispatch_throttler-hb_back_server/get      1p=58,     2p=134,    diff=76
> > >>> throttle-msgr_dispatch_throttler-hb_back_server/get_sum  1p=2726,   2p=6298,   diff=3572
> > >>> throttle-msgr_dispatch_throttler-hb_back_server/put      1p=58,     2p=134,    diff=76
> > >>> throttle-msgr_dispatch_throttler-hb_back_server/put_sum  1p=2726,   2p=6298,   diff=3572
> > >>> throttle-msgr_dispatch_throttler-hb_front_server/get     1p=58,     2p=134,    diff=76
> > >>> throttle-msgr_dispatch_throttler-hb_front_server/get_sum 1p=2726,   2p=6298,   diff=3572
> > >>> throttle-msgr_dispatch_throttler-hb_front_server/put     1p=58,     2p=134,    diff=76
> > >>> throttle-msgr_dispatch_throttler-hb_front_server/put_sum 1p=2726,   2p=6298,   diff=3572
> > >>> throttle-msgr_dispatch_throttler-hbclient/get            1p=168,    2p=252,    diff=84
> > >>> throttle-msgr_dispatch_throttler-hbclient/get_sum        1p=7896,   2p=11844,  diff=3948
> > >>> throttle-msgr_dispatch_throttler-hbclient/put            1p=168,    2p=252,    diff=84
> > >>> throttle-msgr_dispatch_throttler-hbclient/put_sum        1p=7896,   2p=11844,  diff=3948
> > >
> > > IIRC these are just saying how many times the dispatch throttler was
> > > accessed on each messenger — nothing here is surprising; you're
> > > doing basically the same number of messages on the client
> > > messengers, and the heartbeat messengers are passing more because
> > > the test takes longer.
> > >
> > > I'd go with Sage's idea for what is actually causing this, or try
> > > and look at how the latency changes over time — if you're going to
> > > two pools instead of one, presumably you're doubling the amount of
> > > metadata that needs to be read into memory during the run?  Perhaps
> > > that's just a significant enough effect with your settings that
> > > you're seeing a bunch of extra directory lookups impact your
> > > throughput more than expected... :/
> >
> > FWIW, typically if I've seen an effect, it's been the opposite, where
> > multiple rados bench processes are slightly faster (maybe simply
> > related to the client-side implementation).  Running collectl or
> > iostat would show various interval statistics that would help
> > diagnose whether this is related to slower accesses on the disks.
> > blktrace of course would give a more nuanced view.  Might be worth
> > doing if there are extra metadata accesses.
> >
> > A couple of other random thoughts:
> >
> > - Are any of the reads coming from buffer cache?
> > - Readahead not working well?
> > - Were the pools recreated between tests?
> > - If not, what were the total number of objects and PGs for each pool?
> >   (Is it possible that the per-pg directory hierarchy was deeper for
> >   the 2nd set of tests?)
> > - If the pools still exist, what does the following script say about
> >   them?
> >
> > https://github.com/ceph/cbt/blob/master/tools/readpgdump.py
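To be clear about where the counter diffs above came from: I snapshot
each osd's counters through the admin socket before and after a run and
subtract.  A minimal sketch of the idea (my actual script differs; this
assumes default admin-socket locations, osd ids 0-4, running on the osd
node, and uses jq as just one way to pull fields out):

    # snapshot perf counters on each osd before the run
    for osd in 0 1 2 3 4; do
        ceph daemon osd.$osd perf dump > before.osd$osd.json
    done

    # ... run the 1p or 2p read test, then snapshot again ...
    for osd in 0 1 2 3 4; do
        ceph daemon osd.$osd perf dump > after.osd$osd.json
        # e.g. print the read-latency sum before and after
        jq '.osd.op_r_latency.sum' before.osd$osd.json after.osd$osd.json
    done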
>
> Thanks Sage, Greg and Mark for the suggestions --
>
> The objects in the pools were created by running two simultaneous
> rados bench write client processes, one to each pool.  I started with
> empty osd disks.  The pools both cover the same osds, and since there
> was a replication factor of 2, I would have expected that not all the
> primary objects in each pool would be sequential on the disk.
>
> But when I changed the read test to do rados bench rand rather than
> seq, both the 1p and 2p configs went down in bandwidth, so clearly the
> disk layout had something to do with it.  With random reads, 1p is now
> "only" 40% faster than 2p.
>
> Mark had mentioned that he's often seen the opposite effect, i.e.,
> more client processes increasing the bandwidth, and we have seen the
> same, which is why I started using multiple clients.  In fact, if I do
> the same configuration tests as above but with 4M objects rather than
> 40K, I see 2p beat 1p by 30% or so (and I've seen 4p help even more).
>
> So there's something about smaller object sizes that gets hurt by
> multiple clients vs. one client.
>
> I will try to look into this with some of the low-level stat tools
> that Mark mentioned.
>
> -- Tom
>
> > > -Greg

I finally got around to looking at the dump_historic_ops output for the
1-client and 2-client cases.  As you recall, these are all read ops, so
the events in the dump are:

    initiated
    reached_pg
    started
    done

The pattern I see for most of the slow ops recorded in the dump is:

* In the 1-client case the typical slow op has a duration between
  50-65 ms, and usually most of this is the interval between
  reached_pg and started.

* In the 2-client case the typical slow op has a duration between
  95-120 ms, and again usually most of this is the interval between
  reached_pg and started.

Could someone describe what the interval between reached_pg and started
means?  I did see this earlier mail thread on the definitions of the
events, but maybe that discussion only applies to writes (or else
things have changed for v9.0.3):

https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg12783.html

-- Tom
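P.S. For anyone who wants to poke at the same data, this is roughly how
the historic ops can be captured (a sketch of what I did, not a recipe;
the intervals I quote above come from subtracting consecutive event
timestamps in the dumped JSON):

    # keep more completed ops per osd, so the slow ops from a whole
    # run are still retained afterwards (the defaults are fairly small)
    for osd in 0 1 2 3 4; do
        ceph daemon osd.$osd config set osd_op_history_size 200
        ceph daemon osd.$osd config set osd_op_history_duration 3600
    done

    # ... run the 1p or 2p read test ...

    # each dumped op carries a total duration plus a timestamped event
    # list (initiated, reached_pg, started, done)
    for osd in 0 1 2 3 4; do
        ceph daemon osd.$osd dump_historic_ops > historic.osd$osd.json
    done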