On Wed, Jan 27, 2016 at 9:01 AM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
>
>> -----Original Message-----
>> From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
>> Sent: Tuesday, January 26, 2016 1:33 PM
>> To: Deneau, Tom
>> Cc: ceph-devel@xxxxxxxxxxxxxxx; cbt@xxxxxxxxxxxxxx
>> Subject: Re: [Cbt] rados benchmark, even distribution vs. random distribution
>>
>> On Tue, Jan 26, 2016 at 11:11 AM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
>> > Looking for some help with a Ceph performance puzzler...
>> >
>> > Background:
>> > I had been experimenting with running rados benchmarks with a more
>> > controlled distribution of objects across osds. The main reason was
>> > to reduce run-to-run variability, since I was running on fairly small
>> > clusters.
>> >
>> > I modified rados bench itself to optionally take a file with a list of
>> > names and use those instead of the usual randomly generated names that
>> > rados bench uses, "benchmark_data_hostname_pid_objectxxx". It was
>> > then possible to generate and use a list of names which hashed to an
>> > "even distribution" across osds if desired. The list of names was
>> > generated by finding a set of pgs that gave an even distribution and
>> > then generating names that mapped to those pgs. So, for example, with
>> > 5 osds on a replicated=2 pool we might find 5 pgs mapping to [0,2]
>> > [1,4] [2,3] [3,1] [4,0], and then all the names generated would map
>> > to only those 5 pgs.
>> >
>> > Up until recently, the results from these experiments were what I expected:
>> > * better repeatability across runs
>> > * even disk utilization
>> > * generally higher total bandwidth
>> >
>> > Recently, however, I saw results on one platform that had much lower
>> > bandwidth (less than half) with the "even distribution" run compared
>> > to the default random-distribution run. Some notes:
>> >
>> > * Showed up in an erasure-coded pool with k=2, m=1, with 4M-size
>> >   objects. Did not show the discrepancy on a replicated=2 pool on
>> >   the same cluster.
>> >
>> > * In general, larger objects showed the problem more than smaller
>> >   objects.
>> >
>> > * Showed up only on reads from the erasure pool. Writes to the
>> >   same pool had higher bandwidth with the "even distribution".
>> >
>> > * Showed up on only one cluster; a separate cluster with the same
>> >   number of nodes and disks but a different system architecture did
>> >   not show this.
>> >
>> > * Showed up only when the total number of client threads got "high
>> >   enough". For example, it showed up with 64 total client threads but
>> >   not with 16. The distribution of threads across client processes
>> >   did not seem to matter.
>> >
>> > I tried looking at "dump_historic_ops" and did indeed see some read
>> > ops logged with high latency in the "even distribution" case. The
>> > larger elapsed times in the historic ops were always in the
>> > "reached_pg" and "done" steps. But I saw similarly high latencies and
>> > elapsed times for "reached_pg" and "done" for historic read ops in the
>> > random case.
>> >
>> > I have perf counters captured before and after the read tests. I see
>> > big differences in op_r_out_bytes, which makes sense because the
>> > higher-bandwidth run processed more bytes. For some osds,
>> > op_r_latency/sum is slightly higher in the "even" run, but I am not
>> > sure whether that is significant.
>> >
>> > Anyway, I will probably just stop doing these "even distribution"
>> > runs, but I was hoping to get an understanding of why they might have
>> > such reduced bandwidth in this particular case. Is there something
>> > about mapping to a smaller number of pgs that becomes a bottleneck?
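
The name-generation step Tom describes above can be approximated with a short
script: pick a set of target pgs, then brute-force candidate object names and
keep the ones that "ceph osd map" reports as landing in those pgs. The sketch
below is an illustrative reconstruction, not Tom's actual rados bench change;
the pool name, target pg ids, name prefix, and counts are hypothetical
placeholders, and it assumes a working ceph CLI with client credentials.

  #!/usr/bin/env python
  # Illustrative sketch only (not the actual rados bench patch): collect
  # object names whose pg falls in a chosen "target" set, by asking
  # "ceph osd map" where each candidate name lands. Pool name, target pg
  # ids, name prefix, and counts are hypothetical placeholders.
  import re
  import subprocess

  POOL = "testpool"                     # hypothetical pool
  TARGET_PGS = {"1.0", "1.3", "1.7"}    # hypothetical pgs chosen for even osd coverage
  NAMES_PER_PG = 100                    # names to collect for each target pg

  def pg_of(pool, name):
      # "ceph osd map <pool> <obj>" prints e.g.
      #   ... object 'x' -> pg 1.6b8b4567 (1.7) -> up ([2,0], p2) acting ([2,0], p2)
      out = subprocess.check_output(["ceph", "osd", "map", pool, name]).decode()
      return re.search(r"-> pg \S+ \((\S+)\)", out).group(1)

  def generate_names(pool, target_pgs, per_pg):
      found = dict((pg, []) for pg in target_pgs)
      i = 0
      while any(len(v) < per_pg for v in found.values()):
          name = "evenobj_%d" % i       # hypothetical name pattern
          i += 1
          pg = pg_of(pool, name)
          if pg in found and len(found[pg]) < per_pg:
              found[pg].append(name)
      return [n for names in found.values() for n in names]

  if __name__ == "__main__":
      for n in generate_names(POOL, TARGET_PGS, NAMES_PER_PG):
          print(n)                      # feed the resulting list to the modified rados bench

One "ceph osd map" call per candidate name is slow; computing the object-name
hash locally would be much faster, but the CLI keeps the sketch short and
independent of hash implementation details.
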
>>
>> There's a lot of per-pg locking and pipelining that happens within the
>> OSD process. If you're mapping to only a single PG per OSD, then you're
>> basically forcing it to run single-threaded and to only handle one read
>> at a time. If you want to force an even distribution of operations
>> across OSDs, you'll need to calculate names for enough PGs to exceed the
>> sharding counts you're using in order to avoid "artificial" bottlenecks.
>> -Greg
>
> Greg --
>
> Is there any performance counter which would show the fact that we were
> basically single-threading in the OSDs?

I'm not aware of anything covering that. It's probably not too hard to add
counters on how many ops per shard have been performed; PRs and tickets
welcome.
-Greg
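
As Greg notes, there is no counter that directly shows per-shard activity, but
one can at least read back how many op shards each OSD is configured with
(osd_op_num_shards) and compare the existing op_r_* counters between the
"even" and "random" runs. The sketch below is a minimal illustration under
some assumptions: it runs on the OSD host with access to the admin sockets,
the osd id list is a placeholder, and the option/counter names shown may
differ across Ceph releases.

  #!/usr/bin/env python
  # Sketch: read the op-sharding config and the read-op counters mentioned
  # in the thread (op_r_out_bytes, op_r_latency) from local OSD admin
  # sockets, so the "even" and "random" runs can be compared side by side.
  # Assumes it runs on the OSD host; the osd id list is a placeholder, and
  # option/counter names may differ across Ceph releases.
  import json
  import subprocess

  OSD_IDS = [0, 1, 2, 3, 4]             # hypothetical local osd ids

  def admin_socket(osd_id, *args):
      cmd = ["ceph", "daemon", "osd.%d" % osd_id] + list(args)
      return json.loads(subprocess.check_output(cmd).decode())

  for osd_id in OSD_IDS:
      shards = admin_socket(osd_id, "config", "get", "osd_op_num_shards")
      osd_perf = admin_socket(osd_id, "perf", "dump")["osd"]
      lat = osd_perf["op_r_latency"]
      avg = lat["sum"] / lat["avgcount"] if lat["avgcount"] else 0.0
      print("osd.%d  op shards: %s  op_r_out_bytes: %d  avg op_r_latency: %.6fs" % (
          osd_id, shards["osd_op_num_shards"], osd_perf["op_r_out_bytes"], avg))

Greg's suggestion above amounts to generating names for enough pgs per OSD to
comfortably exceed that shard count, so the even-distribution runs are not
artificially serialized.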