Re: Cephalocon QA: Performance testing in teuthology

On Thu, Apr 5, 2018 at 12:44 PM, Mohamad Gebai <mgebai@xxxxxxx> wrote:
> Thanks for starting this thread. This is something that would be very
> useful to have. I had talked about this with a few people; here's what
> we had in mind:
>
> - Instead of a simple pass/fail against a baseline, track performance
> over time. We can have the suite run periodically, and trigger runs
> after significant code changes and at specific milestones.
>
> - Store performance values reported by the client; specifically, store
> the percentiles for a better understanding of how the performance changes.
>
> - Store performance values reported by Ceph (from the perf dump?); these
> might be less volatile than the ones reported by the client.
>
> - Use a database that allows easy querying of those metrics. Jan had
> suggested InfluxDB, which is a time-series DB and would make querying
> quite easy.
>
> - Graph performance across versions of Ceph (through commits) so we can
> find any regressions/improvements.
>
> Of course, for this to be relevant we'd need a setup and hardware that
> don't change.
>
> Does that fit with what's suggested here?

I think generally yes. The question is exactly which of the data points
you list we actually store and query, and how we identify the trends to
track and use as pass/fail criteria. We're going to need a pretty
substantial history in whatever database/monitoring system we choose
just to work out how much data we have to keep to get meaningful
results out of our ongoing individual tests. :/
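
To make the InfluxDB idea concrete, here is a minimal sketch of what
feeding one run's numbers into it could look like. The measurement,
tag, and field names below are made up for illustration (we have not
agreed on a schema); only the influxdb-python calls themselves are
real:

  from influxdb import InfluxDBClient

  def record_result(client, benchmark, sha1, machine_type, stats):
      # Write one benchmark run as a single time-series point.
      # `stats` is assumed to be a dict of client-reported numbers,
      # e.g. {"p50_lat_ms": 1.2, "p99_lat_ms": 9.8, "iops": 4200.0}.
      point = {
          "measurement": benchmark,          # e.g. "radosbench"
          "tags": {
              "sha1": sha1,                  # ceph commit under test
              "machine_type": machine_type,  # e.g. "smithi"
          },
          "fields": stats,
      }
      client.write_points([point])

  client = InfluxDBClient(host="localhost", port=8086, database="perf")
  record_result(client, "radosbench", "abc123", "smithi",
                {"p50_lat_ms": 1.2, "p99_lat_ms": 9.8, "iops": 4200.0})

  # Trend query: daily mean p99 latency over the last 30 days.
  res = client.query('SELECT MEAN("p99_lat_ms") FROM "radosbench" '
                     'WHERE time > now() - 30d GROUP BY time(1d)')

Tagging each point with the sha1 and machine type is what would let us
separate real regressions from hardware variation later on.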

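For the pass/fail side, once baselines live in a database the check
itself can stay simple. Here is a hypothetical sketch; the tolerance
and metric names are illustrative, and this is not the existing cbt
analysis tool:

  def check_regression(current, baseline, tolerance=0.10):
      # Fail if any metric regressed by more than `tolerance`.
      # `current` and `baseline` are dicts like {"p99_lat_ms": 9.8};
      # higher is worse. Returns a list of human-readable failures.
      failures = []
      for metric, base in baseline.items():
          cur = current.get(metric)
          if cur is None:
              continue  # metric not reported in this run
          if cur > base * (1.0 + tolerance):
              failures.append("%s: %.2f vs baseline %.2f (+%.1f%%)"
                              % (metric, cur, base,
                                 (cur / base - 1.0) * 100))
      return failures

  # 10.0 is within 10% of 9.8, so no failures are reported:
  assert not check_regression({"p99_lat_ms": 10.0}, {"p99_lat_ms": 9.8})

The hard part is not this check but where `baseline` comes from, which
is exactly the history question above.
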
>
> Mohamad
>
>
> On 04/04/2018 02:59 PM, Neha Ojha wrote:
>> With the aim of doing automated performance testing using teuthology,
>> we integrated a cbt task[1] into it. This task enables teuthology to
>> run benchmarks like radosbench and librbdfio, by making use of the
>> workloads and settings defined in the performance suite[2]. This suite
>> runs as a part of the rados suite and the test results are stored in
>> the teuthology archives in JSON format.
>>
>> The final aim is to pass or fail tests based on performance results,
>> but we have faced a few challenges in this process.
>>
>> Determining reasonable baseline values for tests is difficult.
>>
>> - Teuthology applies different combinations of configuration settings
>> each time it runs these workloads, which creates a large sample space
>> of configurations for which to track baselines.
>> - Variability of hardware on which the performance tests are run in the lab.
>> - Ensuring repeatability of tests under the same conditions.
>>
>> Storing performance results.
>>
>> - Currently, the test results are stored in the teuthology archives.
>> We have figured out a way to store these results longer than usual (~2
>> weeks), but in the long run this may not be an ideal location.
>> - +1 to Greg's idea of some kind of database system to feed these
>> results into and do better analysis.
>>
>>
>> We had a discussion at Cephalocon regarding the above, and based on
>> the ideas that came up, we have attempted to solve some of the issues.
>>
>> Last week, we merged a minimal performance suite[3] that runs 4 basic
>> jobs (a subset of the perf suite) outside of the rados suite.
>> This suite now runs as a part of the nightly teuthology runs on a
>> specific set of machines (smithi) in the sepia lab, on the ceph master
>> branch.
>> Our aim here is to reduce the sample space of tests, and the
>> variability around these tests, so that we can come up with baselines
>> for this smaller subset.
>> We already have a simple result analysis tool, which can be integrated
>> with the cbt task to do the analysis and pass or fail tests based on
>> configurable thresholds.
>>
>> We are also planning to expand the cbt task to cover rgw workloads.
>>
>> Something that will be very useful in the short term would be a way to
>> easily view/compare the data collected in these nightly runs.
>>
>> [1] https://github.com/ceph/ceph/pull/17583
>> [2] https://github.com/ceph/ceph/tree/master/qa/suites/rados/perf
>> [3] https://github.com/ceph/ceph/tree/master/qa/suites/perf-basic
>>
>> On Wed, Apr 4, 2018 at 12:55 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>> Performance testing is an area that teuthology does not currently
>>> address. Neha is doing some work around integrating cbt (ceph
>>> benchmark tool, from Mark and other performance-interested people)
>>> into teuthology so we can run some performance jobs. But there’s a lot
>>> more work to do if we want to make long-term use of these jobs to
>>> quantify our changes in performance, rather than micro-targeting
>>> specific patches
>>> in the lab. We’re concerned about random noise, machine variation, and
>>> reproducibility of results; and we have no way to identify trends. In
>>> short, we need some kind of database system to feed these results into
>>> and do analysis. This would be a whole new competency for the
>>> teuthology system and we’re not sure how best to go about it. But it’s
>>> becoming a major concern.
>>> PROBLEM TOPIC: how do we do performance testing and reliable analysis
>>> in the lab?