Looking for some help with a Ceph performance puzzler...

Background: I had been experimenting with running rados benchmarks with a more controlled distribution of objects across OSDs. The main reason was to reduce run-to-run variability, since I was running on fairly small clusters. I modified rados bench itself to optionally take a file with a list of object names and use those instead of the randomly generated names rados bench normally uses ("benchmark_data_<hostname>_<pid>_object<NNN>"). It was then possible to generate and use a list of names that hashed to an "even distribution" across OSDs if desired. The list of names was generated by finding a set of PGs that gave an even distribution and then generating names that mapped to those PGs (a rough sketch of that selection step is at the end of this mail). For example, with 5 OSDs and a replicated=2 pool we might find 5 PGs mapping to [0,2] [1,4] [2,3] [3,1] [4,0], and then all the generated names would map to only those 5 PGs.

Up until recently, the results from these experiments were what I expected:

* better repeatability across runs
* even disk utilization
* generally higher total bandwidth

Recently, however, I saw results on one platform where the "even distribution" run had much lower bandwidth (less than half) than the default random-distribution run. Some notes:

* It showed up on an erasure-coded pool with k=2, m=1 and 4M objects. A replicated=2 pool on the same cluster did not show the discrepancy.
* In general, larger objects showed the problem more than smaller objects.
* It showed up only on reads from the erasure pool. Writes to the same pool had higher bandwidth with the "even distribution".
* It showed up on only one cluster; a separate cluster with the same number of nodes and disks but a different system architecture did not show it.
* It showed up only when the total number of client threads got "high enough". For example, it showed up with 64 total client threads but not with 16. The distribution of threads across client processes did not seem to matter.

I tried looking at "dump_historic_ops" and did indeed see some read ops logged with high latency in the "even distribution" case. The larger elapsed times in the historic ops were always in the "reached_pg" and "done" steps. But I saw similarly high latencies for "reached_pg" and "done" in historic read ops in the random case as well.

I have perf counters from before and after the read tests. I see big differences in op_r_out_bytes, which makes sense because the higher-bandwidth run processed more bytes. For some OSDs, op_r_latency/sum is slightly higher in the "even" run, but I am not sure whether that is significant.

Anyway, I will probably just stop doing these "even distribution" runs, but I was hoping to get an understanding of why they might show such reduced bandwidth in this particular case. Is there something about mapping to a smaller number of PGs that becomes a bottleneck?

-- Tom Deneau
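
P.S. In case it helps anyone follow along, here is a rough sketch of the kind of name selection I described above. It is not the actual rados bench patch; the pool name, target PG ids, hostname/pid in the names, and counts are just placeholders. It leans on "ceph osd map", which reports the PG a given object name maps to (the script assumes the "pgid" field in the JSON output).

#!/usr/bin/env python3
# Sketch: generate rados-bench-style object names and keep only the ones
# that map to a chosen set of PGs, writing the result to a file that a
# modified rados bench could read.  Placeholders: POOL, TARGET_PGS, the
# hostname/pid embedded in the candidate names, and WANTED.
import json
import subprocess

POOL = "bench_ec"                    # placeholder pool name
TARGET_PGS = {"7.0", "7.5", "7.a"}   # placeholder PG ids chosen for an even OSD spread
WANTED = 1000                        # how many names to collect

def pg_of(pool, objname):
    # Ask the cluster which PG this object name maps to via "ceph osd map".
    out = subprocess.check_output(
        ["ceph", "osd", "map", pool, objname, "--format", "json"])
    return json.loads(out)["pgid"]

names = []
i = 0
while len(names) < WANTED:
    candidate = "benchmark_data_host_12345_object%d" % i   # rados bench style name
    if pg_of(POOL, candidate) in TARGET_PGS:
        names.append(candidate)
    i += 1

with open("object_names.txt", "w") as f:
    f.write("\n".join(names) + "\n")

This does one monitor round trip per candidate name, so it is slow, but it only has to be run once per pool/PG selection to produce the name list used by the benchmark runs.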