Re: perf counters from a performance discrepancy

On 09/23/2015 01:25 PM, Gregory Farnum wrote:
On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
On Wed, 23 Sep 2015, Deneau, Tom wrote:
Hi all --

Looking for guidance with perf counters...
I am trying to see whether the perf counters can tell me anything about the following discrepancy:

I populate a number of 40k size objects in each of two pools, poolA and poolB.
Both pools cover osds on a single node, 5 osds total.

    * Config 1 (1p):
       * use single rados bench client with 32 threads to do seq read of 20000 objects from poolA.

    * Config 2 (2p):
       * use two concurrent rados bench clients (running on same client node) with 16 threads each,
            one reading 10000 objects from poolA,
            one reading 10000 objects from poolB.

So in both configs, we have 32 threads total and the number of objects read is the same.
Note: in all cases, we drop the caches before doing the seq reads

The combined bandwidth (MB/sec) for the 2 clients in config 2 is about 1/3 of the bandwidth for
the single client in config 1.
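
For reference, a minimal sketch of how the two configurations might be driven from one
script, assuming standard rados bench options (-p pool, -t threads, "seq" reading objects
left behind by an earlier "rados bench ... write --no-cleanup" run); the OSD host name and
runtime below are placeholders:

    import subprocess

    RUNTIME = 120          # seconds per read phase (placeholder)
    OSD_NODE = "osd-host"  # hypothetical OSD host name

    def drop_caches():
        # Drop page/dentry/inode caches on the OSD node before each seq read
        # (requires root on that node).
        subprocess.check_call(
            ["ssh", OSD_NODE, "sync && echo 3 > /proc/sys/vm/drop_caches"])

    def bench_read(pool, threads):
        # Start one rados bench seq-read client and return its process handle.
        return subprocess.Popen(
            ["rados", "bench", "-p", pool, str(RUNTIME), "seq", "-t", str(threads)])

    # Config 1 (1p): one client, 32 threads, poolA only.
    drop_caches()
    bench_read("poolA", 32).wait()

    # Config 2 (2p): two concurrent clients, 16 threads each, poolA and poolB.
    drop_caches()
    clients = [bench_read("poolA", 16), bench_read("poolB", 16)]
    for c in clients:
        c.wait()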

How were the objects written?  I assume the cluster is backed by spinning
disks?

I wonder if this is a disk layout issue.  If the 20,000 objects are
written in order, they will be roughly sequential on disk, and the 32
thread case will read them in order.  In the 2x 10,000 case, the two
clients are reading two sequences of objects written at different
times, and the disk arms will be swinging around more.

My guess is that if the objects were read in a random order
the performance would be the same... I'm not sure that rados bench does
that, though?

sage
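
A quick way to test that hypothesis: rados bench does have a "rand" mode that reads
previously written objects in random order, or the same objects can be read back in a
shuffled order with the python rados bindings.  A minimal sketch (single-threaded here
just to show the ordering), assuming the default ceph.conf location and that poolA still
holds the 40k objects:

    import random
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("poolA")
        try:
            # List every object in the pool, then read them back in random order,
            # defeating any benefit from the on-disk write order.
            names = [o.key for o in ioctx.list_objects()]
            random.shuffle(names)
            for name in names:
                size, _ = ioctx.stat(name)      # objects here are ~40k
                ioctx.read(name, length=size)   # synchronous full-object read
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()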



I gathered perf counters before and after each run and looked at the difference between
the before and after values for both the 1p and 2p cases.  Here are some things I noticed
that differ between the two runs.  Can someone take a look and let me know
whether any of these differences are significant?  I'm asking in particular about the
throttle-msgr_dispatch_throttler ones, since I don't know the detailed definitions of those fields.
Note: these are the numbers for one of the 5 osds, the other osds are similar...
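
For anyone wanting to reproduce that kind of before/after diff, a minimal sketch, assuming
it is run on the OSD node so the admin socket is reachable via "ceph daemon osd.N perf dump"
(which returns nested JSON):

    import json
    import subprocess

    def perf_dump(osd_id):
        # Snapshot all perf counters of one OSD via its admin socket.
        out = subprocess.check_output(
            ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
        return json.loads(out.decode("utf-8"))

    def diff(before, after, prefix=""):
        # Recursively walk the nested JSON and report every numeric delta.
        deltas = {}
        for key, a in after.items():
            b = before.get(key)
            path = prefix + "/" + key if prefix else key
            if isinstance(a, dict) and isinstance(b, dict):
                deltas.update(diff(b, a, path))
            elif isinstance(a, (int, float)) and isinstance(b, (int, float)) and a != b:
                deltas[path] = a - b
        return deltas

    before = perf_dump(0)
    # ... run the benchmark here ...
    after = perf_dump(0)
    for path, delta in sorted(diff(before, after).items()):
        print("%s %s" % (path, delta))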

* The field osd/loadavg is always about 3 times higher in the 2p case.

some latency-related counters
------------------------------
osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
osd/op_r_process_latency/sum 1p=3.48506945399276, 2p=42.6278494550061

So, yep, the individual read ops are taking much longer in the
two-client case. Naively that's pretty odd.



and some throttle-msgr_dispatch_throttler related counters
----------------------------------------------------------
throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339, diff=2
throttle-msgr_dispatch_throttler-client/get_sum 1p=222877, 2p=223088, diff=211
throttle-msgr_dispatch_throttler-client/put 1p=1337, 2p=1339, diff=2
throttle-msgr_dispatch_throttler-client/put_sum 1p=222877, 2p=223088, diff=211
throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58, 2p=134, diff=76
throttle-msgr_dispatch_throttler-hb_back_server/get_sum 1p=2726, 2p=6298, diff=3572
throttle-msgr_dispatch_throttler-hb_back_server/put 1p=58, 2p=134, diff=76
throttle-msgr_dispatch_throttler-hb_back_server/put_sum 1p=2726, 2p=6298, diff=3572
throttle-msgr_dispatch_throttler-hb_front_server/get 1p=58, 2p=134, diff=76
throttle-msgr_dispatch_throttler-hb_front_server/get_sum 1p=2726, 2p=6298, diff=3572
throttle-msgr_dispatch_throttler-hb_front_server/put 1p=58, 2p=134, diff=76
throttle-msgr_dispatch_throttler-hb_front_server/put_sum 1p=2726, 2p=6298, diff=3572
throttle-msgr_dispatch_throttler-hbclient/get 1p=168, 2p=252, diff=84
throttle-msgr_dispatch_throttler-hbclient/get_sum 1p=7896, 2p=11844, diff=3948
throttle-msgr_dispatch_throttler-hbclient/put 1p=168, 2p=252, diff=84
throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=7896, 2p=11844, diff=3948

IIRC these just count how many times the dispatch throttler was
accessed on each messenger, so nothing here is surprising: you're sending
basically the same number of messages on the client messengers, and
the heartbeat messengers are passing more because the test takes
longer.

I'd go with Sage's idea for what is actually causing this, or try to
look at how the latency changes over time. If you're reading from two
pools instead of one, presumably you're doubling the amount of
metadata that needs to be read into memory during the run? Perhaps
that's just a significant enough effect with your settings that you're
seeing a bunch of extra directory lookups impact your throughput more
than expected... :/

FWIW, when I've seen an effect it's typically been the opposite: multiple rados bench processes are slightly faster (maybe simply related to the client-side implementation). Running collectl or iostat would show various interval statistics that would help diagnose whether this is related to slower accesses on the disks; blktrace would of course give a more nuanced view. It might be worth doing if there are extra metadata accesses.
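
As a concrete example of the iostat suggestion, one minimal way to capture per-second
extended disk statistics alongside a run (the iostat part needs to run on the OSD node so
the OSDs' devices are visible; the pool, runtime, and log name are placeholders):

    import subprocess

    # Log extended per-device statistics once per second for the whole run.
    with open("iostat-2p.log", "w") as log:
        iostat = subprocess.Popen(["iostat", "-x", "1"], stdout=log)
        try:
            # The run under test; a single seq-read client shown as an example.
            subprocess.call(["rados", "bench", "-p", "poolA", "120", "seq", "-t", "32"])
        finally:
            iostat.terminate()

Comparing await and %util between the 1p and 2p logs should show whether the disks are
seeking more in the two-client case.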

A couple of other random thoughts:

- Are any of the reads coming from buffer cache?
- Readahead not working well?
- Were the pools recreated between tests?
- If not, what were the total numbers of objects and PGs for each pool?  (Is it possible that the per-PG directory hierarchy was deeper for the 2nd set of tests?  A quick objects-per-PG check is sketched below.)
- If the pools still exist, what does the following script say about them?

https://github.com/ceph/cbt/blob/master/tools/readpgdump.py
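
If running that script isn't convenient, a rough objects-per-PG check can also be pulled
straight out of "ceph pg dump".  A sketch, assuming the JSON dump exposes a "pg_stats"
list with "pgid" and "stat_sum"/"num_objects" fields (the layout may differ between
releases):

    import json
    import subprocess
    from collections import defaultdict

    out = subprocess.check_output(["ceph", "pg", "dump", "--format=json"])
    dump = json.loads(out.decode("utf-8"))

    # Group object counts by pool id; pgids look like "<pool>.<seed>".
    per_pool = defaultdict(list)
    for pg in dump["pg_stats"]:
        pool_id = pg["pgid"].split(".")[0]
        per_pool[pool_id].append(pg["stat_sum"]["num_objects"])

    for pool_id, counts in sorted(per_pool.items()):
        print("pool %s: %d pgs, %d objects, min/pg %d, max/pg %d"
              % (pool_id, len(counts), sum(counts), min(counts), max(counts)))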


-Greg


