> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
> Sent: Wednesday, January 27, 2016 12:08 PM
> To: Deneau, Tom
> Cc: ceph-devel@xxxxxxxxxxxxxxx; cbt@xxxxxxxxxxxxxx
> Subject: Re: [Cbt] rados benchmark, even distribution vs. random distribution
>
> On Wed, Jan 27, 2016 at 9:01 AM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
> >
> >> -----Original Message-----
> >> From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
> >> Sent: Tuesday, January 26, 2016 1:33 PM
> >> To: Deneau, Tom
> >> Cc: ceph-devel@xxxxxxxxxxxxxxx; cbt@xxxxxxxxxxxxxx
> >> Subject: Re: [Cbt] rados benchmark, even distribution vs. random distribution
> >>
> >> On Tue, Jan 26, 2016 at 11:11 AM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
> >> > Looking for some help with a Ceph performance puzzler...
> >> >
> >> > Background:
> >> > I had been experimenting with running rados benchmarks with a more
> >> > controlled distribution of objects across osds.  The main reason
> >> > was to reduce run-to-run variability, since I was running on fairly
> >> > small clusters.
> >> >
> >> > I modified rados bench itself to optionally take a file with a list
> >> > of names and use those instead of the usual randomly generated
> >> > names that rados bench uses,
> >> > "benchmark_data_hostname_pid_objectxxx".  It was then possible to
> >> > generate and use a list of names which hashed to an "even
> >> > distribution" across osds if desired.  The list of names was
> >> > generated by finding a set of pgs that gave an even distribution and
> >> > then generating names that mapped to those pgs.  So, for example,
> >> > with 5 osds on a replicated=2 pool we might find 5 pgs mapping to
> >> > [0,2] [1,4] [2,3] [3,1] [4,0], and then all the names generated
> >> > would map to only those 5 pgs.
> >> >
> >> > Up until recently, the results from these experiments were what I
> >> > expected:
> >> >   * better repeatability across runs
> >> >   * even disk utilization
> >> >   * generally higher total bandwidth
> >> >
> >> > Recently, however, I saw results on one platform that had much lower
> >> > bandwidth (less than half) with the "even distribution" run compared
> >> > to the default random distribution run.  Some notes:
> >> >
> >> >   * Showed up on an erasure-coded pool with k=2, m=1, with 4M-size
> >> >     objects.  Did not show the discrepancy on a replicated=2 pool on
> >> >     the same cluster.
> >> >
> >> >   * In general, larger objects showed the problem more than smaller
> >> >     objects.
> >> >
> >> >   * Showed up only on reads from the erasure pool.  Writes to the
> >> >     same pool had higher bandwidth with the "even distribution".
> >> >
> >> >   * Showed up on only one cluster; a separate cluster with the same
> >> >     number of nodes and disks but a different system architecture
> >> >     did not show this.
> >> >
> >> >   * Showed up only when the total number of client threads got "high
> >> >     enough".  For example, it showed up with 64 total client threads
> >> >     but not with 16.  The distribution of threads across client
> >> >     processes did not seem to matter.
> >> >
> >> > I tried looking at "dump_historic_ops" and did indeed see some read
> >> > ops logged with high latency in the "even distribution" case.  The
> >> > larger elapsed times in the historic ops were always in the
> >> > "reached_pg" and "done" steps.  But I saw similarly high latencies
> >> > and elapsed times for "reached_pg" and "done" for historic read ops
> >> > in the random case.
> >> >
> >> > I have perf counters from before and after the read tests.
> >> > I see big differences in op_r_out_bytes, which makes sense because
> >> > the higher-bandwidth run processed more bytes.  For some osds,
> >> > op_r_latency/sum is slightly higher in the "even" run, but I'm not
> >> > sure whether that is significant.
> >> >
> >> > Anyway, I will probably just stop doing these "even distribution"
> >> > runs, but I was hoping to get an understanding of why they might
> >> > have such reduced bandwidth in this particular case.  Is there
> >> > something about mapping to a smaller number of pgs that becomes a
> >> > bottleneck?
> >>
> >> There's a lot of per-pg locking and pipelining that happens within
> >> the OSD process.  If you're mapping to only a single PG per OSD, then
> >> you're basically forcing it to run single-threaded and to only handle
> >> one read at a time.  If you want to force an even distribution of
> >> operations across OSDs, you'll need to calculate names for enough PGs
> >> to exceed the sharding counts you're using in order to avoid
> >> "artificial" bottlenecks.
> >> -Greg
> >
> > Greg --
> >
> > Is there any performance counter which would show that we were
> > basically single-threading in the OSDs?
>
> I'm not aware of anything covering that.  It's probably not too hard to
> add counters on how many ops per shard have been performed; PRs and
> tickets welcome.
> -Greg

Greg --

What is the meaning of 'shard' in this context?  Would this tell us how
much parallelism was going on in the osd?

-- Tom
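
For reference, a minimal sketch of the name-selection step described
upthread: it asks the cluster, via "ceph osd map", which PG each candidate
rados-bench-style name maps to, and keeps only the names that land in a
chosen set of PGs.  The pool name, target PG ids, output file name, and the
"pgid" JSON field below are illustrative assumptions, not the exact tooling
used in these tests.

    #!/usr/bin/env python
    # Sketch: collect rados-bench-style object names that all map into a
    # chosen set of PGs, by asking the cluster (via "ceph osd map") where
    # each candidate name hashes to.  Requires a reachable cluster and the
    # ceph CLI; the "pgid" JSON field name may differ between releases.
    import json
    import socket
    import subprocess

    POOL = "testpool"                    # pool the benchmark will run against (assumed)
    TARGET_PGS = {"1.0", "1.3", "1.7"}   # PGs previously found to give an even OSD spread (assumed)
    NAMES_WANTED = 1000                  # how many matching names to collect

    def pg_of(pool, objname):
        """Return the PG id that objname maps to in pool, per the cluster's osdmap."""
        out = subprocess.check_output(
            ["ceph", "osd", "map", pool, objname, "--format", "json"])
        return json.loads(out.decode())["pgid"]

    def main():
        host, pid = socket.gethostname(), 12345   # fixed fake pid so the name list is reproducible
        kept, i = [], 0
        while len(kept) < NAMES_WANTED:
            name = "benchmark_data_%s_%d_object%d" % (host, pid, i)
            if pg_of(POOL, name) in TARGET_PGS:
                kept.append(name)
            i += 1
        with open("object_names.txt", "w") as f:  # one name per line for the modified rados bench
            f.write("\n".join(kept) + "\n")

    if __name__ == "__main__":
        main()

Per Greg's point about sharding counts, the chosen PG set should span more
PGs per OSD than the OSD's op-queue shard count (osd_op_num_shards, which
can be checked with "ceph daemon osd.N config get osd_op_num_shards");
otherwise reads can still serialize behind a single shard.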