I will be out of office for a week but will put this on the list of things to try when I get back.

-- Tom

> -----Original Message-----
> From: Samuel Just [mailto:sjust@xxxxxxxxxx]
> Sent: Wednesday, September 23, 2015 3:28 PM
> To: Deneau, Tom
> Cc: Mark Nelson; Gregory Farnum; Sage Weil; ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: perf counters from a performance discrepancy
>
> Just to eliminate a variable, can you reproduce this on master, first with
> the simple messenger and then with the async messenger? (Make sure to
> switch the messengers on all daemons and clients; just put it in the
> [global] section of all configs.)
> -Sam
>
> On Wed, Sep 23, 2015 at 1:05 PM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
> >
> >> -----Original Message-----
> >> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> >> Sent: Wednesday, September 23, 2015 1:43 PM
> >> To: Gregory Farnum; Sage Weil
> >> Cc: Deneau, Tom; ceph-devel@xxxxxxxxxxxxxxx
> >> Subject: Re: perf counters from a performance discrepancy
> >>
> >> On 09/23/2015 01:25 PM, Gregory Farnum wrote:
> >> > On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> >> On Wed, 23 Sep 2015, Deneau, Tom wrote:
> >> >>> Hi all --
> >> >>>
> >> >>> Looking for guidance with perf counters...
> >> >>> I am trying to see whether the perf counters can tell me anything
> >> >>> about the following discrepancy.
> >> >>>
> >> >>> I populate a number of 40k-sized objects in each of two pools,
> >> >>> poolA and poolB. Both pools cover osds on a single node, 5 osds total.
> >> >>>
> >> >>> * Config 1 (1p):
> >> >>>     * use a single rados bench client with 32 threads to do a seq
> >> >>>       read of 20000 objects from poolA.
> >> >>>
> >> >>> * Config 2 (2p):
> >> >>>     * use two concurrent rados bench clients (running on the same
> >> >>>       client node) with 16 threads each, one reading 10000 objects
> >> >>>       from poolA and one reading 10000 objects from poolB.
> >> >>>
> >> >>> So in both configs we have 32 threads total, and the number of
> >> >>> objects read is the same.
> >> >>> Note: in all cases, we drop the caches before doing the seq reads.
> >> >>>
> >> >>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is
> >> >>> about 1/3 of the bandwidth for the single client in config 1.
> >> >>
> >> >> How were the objects written?  I assume the cluster is backed by
> >> >> spinning disks?
> >> >>
> >> >> I wonder if this is a disk layout issue.  If the 20,000 objects
> >> >> are written in order, they will be roughly sequential on disk, and
> >> >> the 32-thread case will read them in order.  In the 2x 10,000
> >> >> case, the two clients are reading two sequences of objects written
> >> >> at different times, and the disk arms will be swinging around more.
> >> >>
> >> >> My guess is that if the reads were reading the objects in a random
> >> >> order, the performance would be the same... I'm not sure that rados
> >> >> bench does that, though?
> >> >>
> >> >> sage
> >> >>
> >> >>> I gathered perf counters before and after each run and looked at
> >> >>> the difference of the before and after counters for both the 1p
> >> >>> and 2p cases.  Here are some things I noticed that are different
> >> >>> between the two runs.  Can someone take a look and let me know
> >> >>> whether any of these differences are significant, in particular
> >> >>> the throttle-msgr_dispatch_throttler ones, since I don't know the
> >> >>> detailed definitions of these fields.
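(For reference, a before/after counter snapshot of the kind described above can be taken from each OSD's admin socket. The sketch below is not what was run in this thread; it assumes a default admin socket setup, a bash shell, and that jq is installed, and the osd id and file names are placeholders.)

    # on the host running osd.0, snapshot the counters before the run
    ceph daemon osd.0 perf dump > osd0.before.json

    # ... run the rados bench read test ...

    # snapshot again afterwards and compare the two dumps offline
    ceph daemon osd.0 perf dump > osd0.after.json
    diff <(jq -S . osd0.before.json) <(jq -S . osd0.after.json) | less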
> >> >>> Note: these are the numbers for one of the 5 osds; the other
> >> >>> osds are similar...
> >> >>>
> >> >>> * The field osd/loadavg is always about 3 times higher in the 2p case.
> >> >>>
> >> >>> some latency-related counters
> >> >>> ------------------------------
> >> >>> osd/op_latency/sum            1p=6.24801117205061,  2p=579.722513078945
> >> >>> osd/op_process_latency/sum    1p=3.48506945394911,  2p=42.6278494549915
> >> >>> osd/op_r_latency/sum          1p=6.2480111719924,   2p=579.722513079003
> >> >>> osd/op_r_process_latency/sum  1p=3.48506945399276,  2p=42.6278494550061
> >> >
> >> > So, yep, the individual read ops are taking much longer in the
> >> > two-client case. Naively that's pretty odd.
> >> >
> >> >>> and some throttle-msgr_dispatch_throttler related counters
> >> >>> ----------------------------------------------------------
> >> >>> throttle-msgr_dispatch_throttler-client/get               1p=1337,    2p=1339,    diff=2
> >> >>> throttle-msgr_dispatch_throttler-client/get_sum           1p=222877,  2p=223088,  diff=211
> >> >>> throttle-msgr_dispatch_throttler-client/put               1p=1337,    2p=1339,    diff=2
> >> >>> throttle-msgr_dispatch_throttler-client/put_sum           1p=222877,  2p=223088,  diff=211
> >> >>> throttle-msgr_dispatch_throttler-hb_back_server/get       1p=58,      2p=134,     diff=76
> >> >>> throttle-msgr_dispatch_throttler-hb_back_server/get_sum   1p=2726,    2p=6298,    diff=3572
> >> >>> throttle-msgr_dispatch_throttler-hb_back_server/put       1p=58,      2p=134,     diff=76
> >> >>> throttle-msgr_dispatch_throttler-hb_back_server/put_sum   1p=2726,    2p=6298,    diff=3572
> >> >>> throttle-msgr_dispatch_throttler-hb_front_server/get      1p=58,      2p=134,     diff=76
> >> >>> throttle-msgr_dispatch_throttler-hb_front_server/get_sum  1p=2726,    2p=6298,    diff=3572
> >> >>> throttle-msgr_dispatch_throttler-hb_front_server/put      1p=58,      2p=134,     diff=76
> >> >>> throttle-msgr_dispatch_throttler-hb_front_server/put_sum  1p=2726,    2p=6298,    diff=3572
> >> >>> throttle-msgr_dispatch_throttler-hbclient/get             1p=168,     2p=252,     diff=84
> >> >>> throttle-msgr_dispatch_throttler-hbclient/get_sum         1p=7896,    2p=11844,   diff=3948
> >> >>> throttle-msgr_dispatch_throttler-hbclient/put             1p=168,     2p=252,     diff=84
> >> >>> throttle-msgr_dispatch_throttler-hbclient/put_sum         1p=7896,    2p=11844,   diff=3948
> >> >
> >> > IIRC these are just saying how many times the dispatch throttler was
> >> > accessed on each messenger -- nothing here is surprising. You're
> >> > doing basically the same number of messages on the client
> >> > messengers, and the heartbeat messengers are passing more because
> >> > the test takes longer.
> >> >
> >> > I'd go with Sage's idea for what is actually causing this, or try to
> >> > look at how the latency changes over time -- if you're going to
> >> > two pools instead of one, presumably you're doubling the amount of
> >> > metadata that needs to be read into memory during the run?  Perhaps
> >> > that's just a significant enough effect with your settings that
> >> > you're seeing a bunch of extra directory lookups impact your
> >> > throughput more than expected... :/
> >>
> >> FWIW, typically when I've seen an effect it's been the opposite, where
> >> multiple rados bench processes are slightly faster (maybe simply
> >> related to the client-side implementation).  Running collectl or
> >> iostat would show various interval statistics that would help
> >> diagnose whether this is related to slower accesses on the disks.
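(As a concrete starting point for that suggestion, something along these lines could be left running on the osd node during each test; the device names are placeholders and the option spellings should be checked against the local man pages.)

    # extended per-device statistics every second; await and %util are the
    # interesting columns for spotting seek-bound disks
    iostat -x 1 sdb sdc sdd sde sdf

    # roughly the same idea with collectl's per-disk detail subsystem
    collectl -sD -i 1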
> >> blktrace of course would give a more nuanced view.  Might be worth
> >> doing if there are extra metadata accesses.
> >>
> >> A couple of other random thoughts:
> >>
> >> - Are any of the reads coming from buffer cache?
> >> - Readahead not working well?
> >> - Were the pools recreated between tests?
> >> - If not, what were the total number of objects and PGs for each pool?
> >>   (Is it possible that the per-pg directory hierarchy was deeper for
> >>   the 2nd set of tests?)
> >> - If the pools still exist, what does the following script say about them?
> >>
> >>   https://github.com/ceph/cbt/blob/master/tools/readpgdump.py
> >>
> >
> > Thanks Sage, Greg and Mark for the suggestions --
> >
> > The objects in the pools were created by running two simultaneous
> > rados bench write clients, one to each pool.  I started with empty osd
> > disks.  The pools both cover the same osds, and since there was a
> > replication factor of 2, I would have expected that not all the
> > primary objects in each pool would be sequential on the disk.
> >
> > But when I changed the read test to do rados bench rand rather than
> > seq, both the 1p and 2p configs went down in bandwidth, so clearly the
> > disk layout had something to do with it.  With random reads, 1p is now
> > "only" 40% faster than 2p.
> >
> > Mark had mentioned that he's often seen the opposite effect, i.e. more
> > client processes increasing the bandwidth, and we have seen the same,
> > which is why I started using multiple clients.  In fact, if I do the
> > same configuration tests as above but with 4M objects rather than 40K,
> > I see 2p beat 1p by 30% or so (and I've seen 4p help even more).
> >
> > So there's something about smaller object sizes that gets hurt by
> > multiple clients vs. 1 client.
> >
> > I will try to look into this with some of the low-level stat tools
> > that Mark mentioned.
> >
> > -- Tom
> >
> >> > -Greg
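(For completeness, the populate/read sequence discussed in this thread could be scripted roughly as below. This is only a sketch: the pool names, object size and thread counts are taken from the thread, the exact rados bench options can vary between releases, and the ms type comment reflects Sam's simple-vs-async suggestion rather than anything that was actually run here.)

    # Sam's messenger experiment: put one of these in the [global] section of
    # ceph.conf on all daemons and clients, then restart them:
    #   ms type = simple
    #   ms type = async

    # populate both pools with 40k objects, keeping the objects for the read tests
    rados -p poolA bench 120 write -b 40960 -t 16 --no-cleanup &
    rados -p poolB bench 120 write -b 40960 -t 16 --no-cleanup &
    wait

    # drop caches on the osd node before each read test
    sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

    # Config 1 (1p): one client, 32 threads, seq (or rand) reads from poolA
    rados -p poolA bench 60 seq -t 32

    # Config 2 (2p): two concurrent clients, 16 threads each, one per pool
    rados -p poolA bench 60 seq -t 16 &
    rados -p poolB bench 60 seq -t 16 &
    wait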