> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
> Sent: Tuesday, January 26, 2016 1:33 PM
> To: Deneau, Tom
> Cc: ceph-devel@xxxxxxxxxxxxxxx; cbt@xxxxxxxxxxxxxx
> Subject: Re: [Cbt] rados benchmark, even distribution vs. random distribution
>
> On Tue, Jan 26, 2016 at 11:11 AM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
> > Looking for some help with a Ceph performance puzzler...
> >
> > Background:
> > I had been experimenting with running rados benchmarks with a more
> > controlled distribution of objects across osds.  The main reason was
> > to reduce run-to-run variability, since I was running on fairly small
> > clusters.
> >
> > I modified rados bench itself to optionally take a file with a list
> > of names and use those instead of the usual randomly generated names
> > that rados bench uses, "benchmark_data_hostname_pid_objectxxx".  It
> > was then possible to generate and use a list of names which hashed to
> > an "even distribution" across osds if desired.  The list of names was
> > generated by finding a set of pgs that gave an even distribution and
> > then generating names that mapped to those pgs.  So, for example,
> > with 5 osds on a replicated=2 pool we might find 5 pgs mapping to
> > [0,2] [1,4] [2,3] [3,1] [4,0], and then all the names generated would
> > map to only those 5 pgs.
> >
> > Up until recently, the results from these experiments were what I
> > expected:
> >   * better repeatability across runs
> >   * even disk utilization
> >   * generally higher total bandwidth
> >
> > Recently, however, I saw results on one platform that had much lower
> > bandwidth (less than half) with the "even distribution" run compared
> > to the default random-distribution run.  Some notes:
> >
> >   * It showed up in an erasure-coded pool with k=2, m=1, with 4M-size
> >     objects.  It did not show up on a replicated=2 pool on the same
> >     cluster.
> >
> >   * In general, larger objects showed the problem more than smaller
> >     objects.
> >
> >   * It showed up only on reads from the erasure-coded pool.  Writes
> >     to the same pool had higher bandwidth with the "even
> >     distribution".
> >
> >   * It showed up on only one cluster; a separate cluster with the
> >     same number of nodes and disks but a different system
> >     architecture did not show this.
> >
> >   * It showed up only when the total number of client threads got
> >     "high enough".  For example, it showed up with 64 total client
> >     threads but not with 16.  The distribution of threads across
> >     client processes did not seem to matter.
> >
> > I tried looking at "dump_historic_ops" and did indeed see some read
> > ops logged with high latency in the "even distribution" case.  The
> > larger elapsed times in the historic ops were always in the
> > "reached_pg" and "done" steps.  But I saw similarly high latencies
> > and elapsed times for "reached_pg" and "done" for historic read ops
> > in the random case.
> >
> > I have perf counters from before and after the read tests.  I see
> > big differences in op_r_out_bytes, which makes sense because the
> > higher-bandwidth run processed more bytes.  For some osds,
> > op_r_latency/sum is slightly higher in the "even" run, but I am not
> > sure whether that is significant.
> >
> > Anyway, I will probably just stop doing these "even distribution"
> > runs, but I was hoping to get an understanding of why they might have
> > such reduced bandwidth in this particular case.  Is there something
> > about mapping to a smaller number of pgs that becomes a bottleneck?
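(For reference, the name list was generated roughly along these lines -- a
simplified sketch, not the exact tooling I used; the pool name, the target
PG ids, and the object-name prefix below are just placeholders:)

#!/usr/bin/env python
# Brute-force candidate object names and keep the ones whose PG falls in a
# chosen "even" set, by asking the cluster itself via "ceph osd map".
# Slow but simple; fine as a one-time offline step.
import json
import subprocess

POOL = "testpool"
TARGET_PGS = {"1.0", "1.4", "1.9", "1.c", "1.11"}  # placeholder PG ids
NAMES_WANTED = 1000

def pg_of(pool, name):
    # "ceph osd map <pool> <obj> -f json" reports the PG a name hashes to;
    # the exact JSON keys can vary a bit between Ceph releases.
    out = subprocess.check_output(
        ["ceph", "osd", "map", pool, name, "-f", "json"])
    return json.loads(out)["pgid"]

names = []
i = 0
while len(names) < NAMES_WANTED:
    candidate = "obj%d" % i
    if pg_of(POOL, candidate) in TARGET_PGS:
        names.append(candidate)
    i += 1

with open("even_names.txt", "w") as f:
    f.write("\n".join(names) + "\n")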
>
> There's a lot of per-pg locking and pipelining that happens within the
> OSD process.  If you're mapping to only a single PG per OSD, then you're
> basically forcing it to run single-threaded and to only handle one read
> at a time.  If you want to force an even distribution of operations
> across OSDs, you'll need to calculate names for enough PGs to exceed the
> sharding counts you're using in order to avoid "artificial" bottlenecks.
> -Greg

I see.  So I guess a pool that has "too small" a number of pgs would have
the same problem...  It was curious that the single-PG-per-OSD mapping only
affected a limited number of test configurations, but maybe other
bottlenecks were taking over in the other configurations.  (I've appended a
rough per-OSD check for this below.)

-- Tom
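P.S.  For anyone else who trips over this: if I have the mechanism right,
the "sharding counts" Greg mentions are the OSD's sharded op work queue
settings (osd_op_num_shards / osd_op_num_threads_per_shard).  A rough
sketch of the check I'll add to my runs -- it assumes the same placeholder
pool as above, a name list in even_names.txt, and that it runs on the node
hosting osd.0:

#!/usr/bin/env python
# For each OSD, count how many distinct PGs the name list touches and
# compare that against the OSD's op-queue shard count.  Fewer PGs than
# shards on an OSD means some of its shards can never be kept busy.
import json
import subprocess
from collections import defaultdict

POOL = "testpool"

def osd_map(pool, name):
    out = subprocess.check_output(
        ["ceph", "osd", "map", pool, name, "-f", "json"])
    return json.loads(out)

# Ask one OSD for its shard count over the admin socket (assumes osd.0 is
# local); the value comes back as a string inside the JSON reply.
shards = int(json.loads(subprocess.check_output(
    ["ceph", "daemon", "osd.0", "config", "get", "osd_op_num_shards"])
)["osd_op_num_shards"])

pgs_per_osd = defaultdict(set)
with open("even_names.txt") as f:
    for name in (line.strip() for line in f if line.strip()):
        m = osd_map(POOL, name)
        for osd in m["acting"]:
            pgs_per_osd[osd].add(m["pgid"])

for osd, pgs in sorted(pgs_per_osd.items()):
    note = "" if len(pgs) >= shards else "  <-- fewer pgs than op-queue shards"
    print("osd.%d: %d distinct pgs (shards=%d)%s" % (osd, len(pgs), shards, note))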