> -----Original Message-----
> From: Deneau, Tom
> Sent: Wednesday, September 23, 2015 3:05 PM
> To: 'Mark Nelson'; Gregory Farnum; Sage Weil
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: perf counters from a performance discrepancy
>
> > -----Original Message-----
> > From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> > Sent: Wednesday, September 23, 2015 1:43 PM
> > To: Gregory Farnum; Sage Weil
> > Cc: Deneau, Tom; ceph-devel@xxxxxxxxxxxxxxx
> > Subject: Re: perf counters from a performance discrepancy
> >
> > On 09/23/2015 01:25 PM, Gregory Farnum wrote:
> > > On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > >> On Wed, 23 Sep 2015, Deneau, Tom wrote:
> > >>> Hi all --
> > >>>
> > >>> Looking for guidance with perf counters...
> > >>> I am trying to see whether the perf counters can tell me anything
> > >>> about the following discrepancy.
> > >>>
> > >>> I populate a number of 40k-size objects in each of two pools, poolA
> > >>> and poolB.  Both pools cover osds on a single node, 5 osds total.
> > >>>
> > >>> * Config 1 (1p):
> > >>>   * use a single rados bench client with 32 threads to do a seq
> > >>>     read of 20000 objects from poolA.
> > >>>
> > >>> * Config 2 (2p):
> > >>>   * use two concurrent rados bench clients (running on the same
> > >>>     client node) with 16 threads each,
> > >>>     one reading 10000 objects from poolA,
> > >>>     one reading 10000 objects from poolB.
> > >>>
> > >>> So in both configs we have 32 threads total, and the number of
> > >>> objects read is the same.
> > >>> Note: in all cases, we drop the caches before doing the seq reads.
> > >>>
> > >>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is
> > >>> about 1/3 of the bandwidth for the single client in config 1.
> > >>
> > >> How were the objects written?  I assume the cluster is backed by
> > >> spinning disks?
> > >>
> > >> I wonder if this is a disk layout issue.  If the 20,000 objects are
> > >> written in order, they will be roughly sequential on disk, and the
> > >> 32-thread case will read them in order.  In the 2x 10,000 case, the
> > >> two clients are reading two sequences of objects written at
> > >> different times, and the disk arms will be swinging around more.
> > >>
> > >> My guess is that if the reads were reading the objects in a random
> > >> order the performance would be the same... I'm not sure that rados
> > >> bench does that, though?
> > >>
> > >> sage
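For reference, the configs above correspond roughly to rados bench
invocations like the following.  This is a sketch, not the exact scripts
I ran; the durations and flags are illustrative (the real runs read a
fixed number of objects rather than running for a fixed time), and as I
note further down, the two write runs were concurrent, one per pool.

    # populate a pool with 40k objects; --no-cleanup keeps them
    # around for the read tests (one of these ran per pool, concurrently)
    rados bench -p poolA 60 write -b 40960 --no-cleanup

    # Config 1 (1p): one client, 32 concurrent ops, sequential reads
    rados bench -p poolA 60 seq -t 32

    # Config 2 (2p): two concurrent clients, 16 concurrent ops each
    rados bench -p poolA 60 seq -t 16 &
    rados bench -p poolB 60 seq -t 16 &
    wait

    # the random-order variant Sage suggests, for comparison
    rados bench -p poolA 60 rand -t 32

Note that -t is concurrent ops per client, so both read configs keep 32
ops in flight in total.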
> > >>> I gathered perf counters before and after each run and looked at
> > >>> the difference of the before and after counters for both the 1p
> > >>> and 2p cases.  Here are some things I noticed that are different
> > >>> between the two runs.  Can someone take a look and let me know
> > >>> whether any of these differences are significant?  I ask in
> > >>> particular about the throttle-msgr_dispatch_throttler ones, since
> > >>> I don't know the detailed definitions of those fields.
> > >>> Note: these are the numbers for one of the 5 osds; the other osds
> > >>> are similar...
> > >>>
> > >>> * The field osd/loadavg is always about 3 times higher in the 2p case.
> > >>>
> > >>> some latency-related counters
> > >>> ------------------------------
> > >>> osd/op_latency/sum           1p=6.24801117205061,  2p=579.722513078945
> > >>> osd/op_process_latency/sum   1p=3.48506945394911,  2p=42.6278494549915
> > >>> osd/op_r_latency/sum         1p=6.2480111719924,   2p=579.722513079003
> > >>> osd/op_r_process_latency/sum 1p=3.48506945399276,  2p=42.6278494550061
> > >
> > > So, yep, the individual read ops are taking much longer in the
> > > two-client case.  Naively that's pretty odd.
> > >
> > >>> and some throttle-msgr_dispatch_throttler related counters
> > >>> ----------------------------------------------------------
> > >>> throttle-msgr_dispatch_throttler-client/get              1p=1337,   2p=1339,   diff=2
> > >>> throttle-msgr_dispatch_throttler-client/get_sum          1p=222877, 2p=223088, diff=211
> > >>> throttle-msgr_dispatch_throttler-client/put              1p=1337,   2p=1339,   diff=2
> > >>> throttle-msgr_dispatch_throttler-client/put_sum          1p=222877, 2p=223088, diff=211
> > >>> throttle-msgr_dispatch_throttler-hb_back_server/get      1p=58,     2p=134,    diff=76
> > >>> throttle-msgr_dispatch_throttler-hb_back_server/get_sum  1p=2726,   2p=6298,   diff=3572
> > >>> throttle-msgr_dispatch_throttler-hb_back_server/put      1p=58,     2p=134,    diff=76
> > >>> throttle-msgr_dispatch_throttler-hb_back_server/put_sum  1p=2726,   2p=6298,   diff=3572
> > >>> throttle-msgr_dispatch_throttler-hb_front_server/get     1p=58,     2p=134,    diff=76
> > >>> throttle-msgr_dispatch_throttler-hb_front_server/get_sum 1p=2726,   2p=6298,   diff=3572
> > >>> throttle-msgr_dispatch_throttler-hb_front_server/put     1p=58,     2p=134,    diff=76
> > >>> throttle-msgr_dispatch_throttler-hb_front_server/put_sum 1p=2726,   2p=6298,   diff=3572
> > >>> throttle-msgr_dispatch_throttler-hbclient/get            1p=168,    2p=252,    diff=84
> > >>> throttle-msgr_dispatch_throttler-hbclient/get_sum        1p=7896,   2p=11844,  diff=3948
> > >>> throttle-msgr_dispatch_throttler-hbclient/put            1p=168,    2p=252,    diff=84
> > >>> throttle-msgr_dispatch_throttler-hbclient/put_sum        1p=7896,   2p=11844,  diff=3948
> > >
> > > IIRC these are just saying how many times the dispatch throttler was
> > > accessed on each messenger — nothing here is surprising; you're
> > > doing basically the same number of messages on the client
> > > messengers, and the heartbeat messengers are passing more because
> > > the test takes longer.
> > >
> > > I'd go with Sage's idea for what is actually causing this, or try
> > > and look at how the latency changes over time — if you're going to
> > > two pools instead of one, presumably you're doubling the amount of
> > > metadata that needs to be read into memory during the run?  Perhaps
> > > that's just a significant enough effect with your settings that
> > > you're seeing a bunch of extra directory lookups impact your
> > > throughput more than expected... :/
> >
> > FWIW, typically if I've seen an effect, it's been the opposite, where
> > multiple rados bench processes are slightly faster (maybe simply
> > related to the client-side implementation).  Running collectl or
> > iostat would show various interval statistics that would help
> > diagnose whether this is related to slower accesses on the disks.
> > blktrace of course would give a more nuanced view.  Might be worth
> > doing if there are extra metadata accesses.
> >
> > A couple of other random thoughts:
> >
> > - Are any of the reads coming from buffer cache?
> > - Readahead not working well?
> > - Were the pools recreated between tests?
> > - If not, what were the total number of objects and PGs for each pool?
> >   (Is it possible that the per-pg directory hierarchy was deeper for
> >   the 2nd set of tests?)
> > - If the pools still exist, what does the following script say about
> >   them?
> >
> > https://github.com/ceph/cbt/blob/master/tools/readpgdump.py
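To be clear about where the counter diffs above came from: I snapshot
each osd's counters through the admin socket before and after a run and
subtract.  A minimal sketch of the idea (my actual script differs; this
assumes default admin-socket locations, osd ids 0-4, running on the osd
node, and uses jq as just one way to pull fields out):

    # snapshot perf counters on each osd before the run
    for osd in 0 1 2 3 4; do
        ceph daemon osd.$osd perf dump > before.osd$osd.json
    done

    # ... run the 1p or 2p read test, then snapshot again ...
    for osd in 0 1 2 3 4; do
        ceph daemon osd.$osd perf dump > after.osd$osd.json
        # e.g. print the read-latency sum before and after
        jq '.osd.op_r_latency.sum' before.osd$osd.json after.osd$osd.json
    done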
>
> Thanks Sage, Greg and Mark for the suggestions --
>
> The objects in the pools were created by running two simultaneous
> rados bench write client processes, one to each pool.  I started with
> empty osd disks.  The pools both cover the same osds, and since there
> was a replication factor of 2, I would have expected that not all the
> primary objects in each pool would be sequential on the disk.
>
> But when I changed the read test to do rados bench rand rather than
> seq, both the 1p and 2p configs went down in bandwidth, so clearly the
> disk layout had something to do with it.  With random reads, 1p is now
> "only" 40% faster than 2p.
>
> Mark had mentioned that he's often seen the opposite effect, i.e.,
> more client processes increasing the bandwidth, and we have seen the
> same, which is why I started using multiple clients.  In fact, if I do
> the same configuration tests as above but with 4M objects rather than
> 40K, I see 2p beat 1p by 30% or so (and I've seen 4p help even more).
>
> So there's something about smaller object sizes that gets hurt by
> multiple clients vs. one client.
>
> I will try to look into this with some of the low-level stat tools
> that Mark mentioned.
>
> -- Tom
>
> > > -Greg

I finally got around to looking at the dump_historic_ops output for the
1-client and 2-client cases.  As you recall, these are all read ops, so
the events in the dump are:

    initiated
    reached_pg
    started
    done

The pattern I see for most of the slow ops recorded in the dump is:

* In the 1-client case the typical slow op has a duration between
  50-65 ms, and usually most of this is the interval between
  reached_pg and started.

* In the 2-client case the typical slow op has a duration between
  95-120 ms, and again usually most of this is the interval between
  reached_pg and started.

Could someone describe what the interval between reached_pg and started
means?  I did see this earlier mail thread on the definitions of the
events, but maybe that discussion only applies to writes (or else
things have changed for v9.0.3):

https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg12783.html

-- Tom
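P.S. For anyone who wants to poke at the same data, this is roughly how
the historic ops can be captured (a sketch of what I did, not a recipe;
the intervals I quote above come from subtracting consecutive event
timestamps in the dumped JSON):

    # keep more completed ops per osd, so the slow ops from a whole
    # run are still retained afterwards (the defaults are fairly small)
    for osd in 0 1 2 3 4; do
        ceph daemon osd.$osd config set osd_op_history_size 200
        ceph daemon osd.$osd config set osd_op_history_duration 3600
    done

    # ... run the 1p or 2p read test ...

    # each dumped op carries a total duration plus a timestamped event
    # list (initiated, reached_pg, started, done)
    for osd in 0 1 2 3 4; do
        ceph daemon osd.$osd dump_historic_ops > historic.osd$osd.json
    done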