I will be out of office for a week but will put this on the list of things to try when I get back.

-- Tom

> -----Original Message-----
> From: Samuel Just [mailto:sjust@xxxxxxxxxx]
> Sent: Wednesday, September 23, 2015 3:28 PM
> To: Deneau, Tom
> Cc: Mark Nelson; Gregory Farnum; Sage Weil; ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: perf counters from a performance discrepancy
>
> Just to eliminate a variable, can you reproduce this on master, first with
> the simple messenger and then with the async messenger? (Make sure to
> switch the messengers on all daemons and clients; just put it in the
> [global] section of all configs.)
> -Sam
>
> On Wed, Sep 23, 2015 at 1:05 PM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
> >
> >> -----Original Message-----
> >> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> >> Sent: Wednesday, September 23, 2015 1:43 PM
> >> To: Gregory Farnum; Sage Weil
> >> Cc: Deneau, Tom; ceph-devel@xxxxxxxxxxxxxxx
> >> Subject: Re: perf counters from a performance discrepancy
> >>
> >> On 09/23/2015 01:25 PM, Gregory Farnum wrote:
> >> > On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> >> On Wed, 23 Sep 2015, Deneau, Tom wrote:
> >> >>> Hi all --
> >> >>>
> >> >>> Looking for guidance with perf counters...
> >> >>> I am trying to see whether the perf counters can tell me anything
> >> >>> about the following discrepancy.
> >> >>>
> >> >>> I populate a number of 40k-sized objects in each of two pools,
> >> >>> poolA and poolB. Both pools cover osds on a single node, 5 osds total.
> >> >>>
> >> >>> * Config 1 (1p):
> >> >>>     * use a single rados bench client with 32 threads to do a seq
> >> >>>       read of 20000 objects from poolA.
> >> >>>
> >> >>> * Config 2 (2p):
> >> >>>     * use two concurrent rados bench clients (running on the same
> >> >>>       client node) with 16 threads each, one reading 10000 objects
> >> >>>       from poolA and one reading 10000 objects from poolB.
> >> >>>
> >> >>> So in both configs we have 32 threads total, and the number of
> >> >>> objects read is the same.
> >> >>> Note: in all cases, we drop the caches before doing the seq reads.
> >> >>>
> >> >>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is
> >> >>> about 1/3 of the bandwidth for the single client in config 1.
> >> >>
> >> >> How were the objects written?  I assume the cluster is backed by
> >> >> spinning disks?
> >> >>
> >> >> I wonder if this is a disk layout issue.  If the 20,000 objects
> >> >> are written in order, they will be roughly sequential on disk, and
> >> >> the 32-thread case will read them in order.  In the 2x 10,000
> >> >> case, the two clients are reading two sequences of objects written
> >> >> at different times, and the disk arms will be swinging around more.
> >> >>
> >> >> My guess is that if the reads were reading the objects in a random
> >> >> order, the performance would be the same... I'm not sure that rados
> >> >> bench does that, though?
> >> >>
> >> >> sage
> >> >>
> >> >>> I gathered perf counters before and after each run and looked at
> >> >>> the difference of the before and after counters for both the 1p
> >> >>> and 2p cases.  Here are some things I noticed that are different
> >> >>> between the two runs.  Can someone take a look and let me know
> >> >>> whether any of these differences are significant, in particular
> >> >>> the throttle-msgr_dispatch_throttler ones, since I don't know the
> >> >>> detailed definitions of these fields.
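(For reference, a before/after counter snapshot of the kind described above can be taken from each OSD's admin socket. The sketch below is not what was run in this thread; it assumes a default admin socket setup, a bash shell, and that jq is installed, and the osd id and file names are placeholders.)

    # on the host running osd.0, snapshot the counters before the run
    ceph daemon osd.0 perf dump > osd0.before.json

    # ... run the rados bench read test ...

    # snapshot again afterwards and compare the two dumps offline
    ceph daemon osd.0 perf dump > osd0.after.json
    diff <(jq -S . osd0.before.json) <(jq -S . osd0.after.json) | less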
> >> >>> Note: these are the numbers for one of the 5 osds; the other
> >> >>> osds are similar...
> >> >>>
> >> >>> * The field osd/loadavg is always about 3 times higher in the 2p case.
> >> >>>
> >> >>> some latency-related counters
> >> >>> ------------------------------
> >> >>> osd/op_latency/sum            1p=6.24801117205061,  2p=579.722513078945
> >> >>> osd/op_process_latency/sum    1p=3.48506945394911,  2p=42.6278494549915
> >> >>> osd/op_r_latency/sum          1p=6.2480111719924,   2p=579.722513079003
> >> >>> osd/op_r_process_latency/sum  1p=3.48506945399276,  2p=42.6278494550061
> >> >
> >> > So, yep, the individual read ops are taking much longer in the
> >> > two-client case. Naively that's pretty odd.
> >> >
> >> >>> and some throttle-msgr_dispatch_throttler related counters
> >> >>> ----------------------------------------------------------
> >> >>> throttle-msgr_dispatch_throttler-client/get               1p=1337,    2p=1339,    diff=2
> >> >>> throttle-msgr_dispatch_throttler-client/get_sum           1p=222877,  2p=223088,  diff=211
> >> >>> throttle-msgr_dispatch_throttler-client/put               1p=1337,    2p=1339,    diff=2
> >> >>> throttle-msgr_dispatch_throttler-client/put_sum           1p=222877,  2p=223088,  diff=211
> >> >>> throttle-msgr_dispatch_throttler-hb_back_server/get       1p=58,      2p=134,     diff=76
> >> >>> throttle-msgr_dispatch_throttler-hb_back_server/get_sum   1p=2726,    2p=6298,    diff=3572
> >> >>> throttle-msgr_dispatch_throttler-hb_back_server/put       1p=58,      2p=134,     diff=76
> >> >>> throttle-msgr_dispatch_throttler-hb_back_server/put_sum   1p=2726,    2p=6298,    diff=3572
> >> >>> throttle-msgr_dispatch_throttler-hb_front_server/get      1p=58,      2p=134,     diff=76
> >> >>> throttle-msgr_dispatch_throttler-hb_front_server/get_sum  1p=2726,    2p=6298,    diff=3572
> >> >>> throttle-msgr_dispatch_throttler-hb_front_server/put      1p=58,      2p=134,     diff=76
> >> >>> throttle-msgr_dispatch_throttler-hb_front_server/put_sum  1p=2726,    2p=6298,    diff=3572
> >> >>> throttle-msgr_dispatch_throttler-hbclient/get             1p=168,     2p=252,     diff=84
> >> >>> throttle-msgr_dispatch_throttler-hbclient/get_sum         1p=7896,    2p=11844,   diff=3948
> >> >>> throttle-msgr_dispatch_throttler-hbclient/put             1p=168,     2p=252,     diff=84
> >> >>> throttle-msgr_dispatch_throttler-hbclient/put_sum         1p=7896,    2p=11844,   diff=3948
> >> >
> >> > IIRC these are just saying how many times the dispatch throttler was
> >> > accessed on each messenger -- nothing here is surprising. You're
> >> > doing basically the same number of messages on the client
> >> > messengers, and the heartbeat messengers are passing more because
> >> > the test takes longer.
> >> >
> >> > I'd go with Sage's idea for what is actually causing this, or try to
> >> > look at how the latency changes over time -- if you're going to
> >> > two pools instead of one, presumably you're doubling the amount of
> >> > metadata that needs to be read into memory during the run?  Perhaps
> >> > that's just a significant enough effect with your settings that
> >> > you're seeing a bunch of extra directory lookups impact your
> >> > throughput more than expected... :/
> >>
> >> FWIW, typically when I've seen an effect it's been the opposite, where
> >> multiple rados bench processes are slightly faster (maybe simply
> >> related to the client-side implementation).  Running collectl or
> >> iostat would show various interval statistics that would help
> >> diagnose whether this is related to slower accesses on the disks.
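(As a concrete starting point for that suggestion, something along these lines could be left running on the osd node during each test; the device names are placeholders and the option spellings should be checked against the local man pages.)

    # extended per-device statistics every second; await and %util are the
    # interesting columns for spotting seek-bound disks
    iostat -x 1 sdb sdc sdd sde sdf

    # roughly the same idea with collectl's per-disk detail subsystem
    collectl -sD -i 1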
> >> blktrace of course would give a more nuanced view.  Might be worth
> >> doing if there are extra metadata accesses.
> >>
> >> A couple of other random thoughts:
> >>
> >> - Are any of the reads coming from buffer cache?
> >> - Readahead not working well?
> >> - Were the pools recreated between tests?
> >> - If not, what were the total number of objects and PGs for each pool?
> >>   (Is it possible that the per-pg directory hierarchy was deeper for
> >>   the 2nd set of tests?)
> >> - If the pools still exist, what does the following script say about them?
> >>
> >>   https://github.com/ceph/cbt/blob/master/tools/readpgdump.py
> >>
> >
> > Thanks Sage, Greg and Mark for the suggestions --
> >
> > The objects in the pools were created by running two simultaneous
> > rados bench write clients, one to each pool.  I started with empty osd
> > disks.  The pools both cover the same osds, and since there was a
> > replication factor of 2, I would have expected that not all the
> > primary objects in each pool would be sequential on the disk.
> >
> > But when I changed the read test to do rados bench rand rather than
> > seq, both the 1p and 2p configs went down in bandwidth, so clearly the
> > disk layout had something to do with it.  With random reads, 1p is now
> > "only" 40% faster than 2p.
> >
> > Mark had mentioned that he's often seen the opposite effect, i.e. more
> > client processes increasing the bandwidth, and we have seen the same,
> > which is why I started using multiple clients.  In fact, if I do the
> > same configuration tests as above but with 4M objects rather than 40K,
> > I see 2p beat 1p by 30% or so (and I've seen 4p help even more).
> >
> > So there's something about smaller object sizes that gets hurt by
> > multiple clients vs. 1 client.
> >
> > I will try to look into this with some of the low-level stat tools
> > that Mark mentioned.
> >
> > -- Tom
> >
> >> > -Greg
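(For completeness, the populate/read sequence discussed in this thread could be scripted roughly as below. This is only a sketch: the pool names, object size and thread counts are taken from the thread, the exact rados bench options can vary between releases, and the ms type comment reflects Sam's simple-vs-async suggestion rather than anything that was actually run here.)

    # Sam's messenger experiment: put one of these in the [global] section of
    # ceph.conf on all daemons and clients, then restart them:
    #   ms type = simple
    #   ms type = async

    # populate both pools with 40k objects, keeping the objects for the read tests
    rados -p poolA bench 120 write -b 40960 -t 16 --no-cleanup &
    rados -p poolB bench 120 write -b 40960 -t 16 --no-cleanup &
    wait

    # drop caches on the osd node before each read test
    sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

    # Config 1 (1p): one client, 32 threads, seq (or rand) reads from poolA
    rados -p poolA bench 60 seq -t 32

    # Config 2 (2p): two concurrent clients, 16 threads each, one per pool
    rados -p poolA bench 60 seq -t 16 &
    rados -p poolB bench 60 seq -t 16 &
    wait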