Re: Resource requirements for integration test cluster

On 4/6/20 10:24 AM, Sage Weil wrote:
Hi Ulrich,

On Wed, 25 Mar 2020, Ulrich Weigand wrote:
Hello,

we're currently investigating to set up a Teuthology cluster to run the
Ceph integration test suite on IBM Z, to improve test coverage on our
platform.

However, we're not sure what hardware resources are required to do so.  The
target configuration should be large enough to comfortably support running
an instance of the full Ceph integration tests.  Is there some data
available from your experience with such installations on how large this
cluster needs to be then?

In particular, what number of nodes, #cpus and memory per node, number
(type/size) of disks that should be attached?

Thanks for any data / estimates you can provide!
I can provide some estimates and general guidance.

First, the test suites are currently targeted to run on 'smithi' nodes,
which are relatively low-powered x86 1U machines with a single NVMe
divided into 4 scratch LVs (plus an HDD for boot and logs).  (This is
somewhat arbitrary; it's just the hardware we happened to pick, so the
tests are written to target it.)
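
For anyone sizing an equivalent cluster, here is a minimal sketch of that
per-node layout in Python (the NVMe capacity is a placeholder assumption,
not the actual smithi spec):

    # Sketch of the per-node disk layout the suites target: one NVMe split
    # into 4 scratch LVs, plus an HDD for boot and logs.  The 1 TB NVMe
    # capacity is an assumed placeholder, not the real smithi spec.
    NVME_CAPACITY_GB = 1000
    SCRATCH_LVS = 4

    node_layout = {
        "boot_and_logs": "1 x HDD",
        "scratch": [f"lv{i}: ~{NVME_CAPACITY_GB // SCRATCH_LVS} GB (NVMe)"
                    for i in range(SCRATCH_LVS)],
    }

    print(node_layout)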

Each test tends to take anywhere from 15m to 2h to run (with a few
outliers that take longer).  Each test suite is somewhere between 100 and
400 tests.  There are maybe 10 different suites we run with some
regularity, with a few (e.g., rados) taking up a larger portion of the
machine time.  We currently have about 175 healthy smithi nodes in
service to support developers.

I've been telling the aarch64 folks that we probably want at least 25-50
similarly sized nodes in order to run the test suites in a reasonable
amount of time (e.g., a minimal rados suite in about a day rather than days).
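
To put rough numbers on that, here is a back-of-the-envelope estimate in
Python using the figures above (the chosen averages and the one-test-per-node
scheduling model are assumptions, not measurements):

    # Idealized suite wall-clock estimate from the figures quoted above.
    AVG_TEST_HOURS = 1.0     # tests take ~15 min to ~2 h; assume ~1 h average
    TESTS_PER_SUITE = 250    # suites have roughly 100-400 tests; assume mid-range

    def suite_wall_clock_hours(nodes, tests=TESTS_PER_SUITE, avg_hours=AVG_TEST_HOURS):
        """Wall-clock time assuming one test per node at a time and perfect
        scheduling; real runs (multi-node jobs, setup/teardown) will be slower."""
        return tests * avg_hours / nodes

    for nodes in (25, 50, 175):
        print(f"{nodes:3d} nodes: ~{suite_wall_clock_hours(nodes):.1f} h per suite")
    # -> ~10 h on 25 nodes, ~5 h on 50, ~1.4 h on 175, for a single suite;
    #    a shared lab runs several suites, so capacity is spread thinner.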

I'm not really sure how this maps on the Z hardware, but hopefully this
provides some guidance!

sage


FWIW, the hardware wasn't purchased arbitrarily.  Before we bought the smithi nodes we analyzed the runtime and behavior of the existing teuthology runs, specifically to understand what hardware would complete tests as efficiently as possible.  You can see the report here:


https://drive.google.com/file/d/0B2gTBZrkrnpZYVpPb3VpTkw5aFk/view?usp=sharing


At the time, the RADOS and Upgrade suites were by far the biggest consumers of resources.  The majority of time in both of those suites was spent in the RADOS task (though data transfer and log compression did make up a fairly significant chunk of runtime).  One of the reasons we specifically went for a single fast NVMe drive in those nodes was that simulated "thrashing" rados workloads completed much faster on a single NVMe drive than on 4 independent HDDs.  In neither case did CPU usage appear to be the dominant factor.

While the data is quite old now, the general trends are likely still true.  Utilizing (enterprise-grade) flash should accelerate the tests significantly.  Beyond that, on x86 we expect to see roughly 200-500 MB/s and 1000-3000 IOPS per core, depending on the specifics of the hardware and the version of Ceph being tested.  I don't know how power cores would compare exactly, but that at least provides a rough ballpark.  Each OSD should have at least 4 GB of memory, with some extra reserved for temporary spikes or delayed kernel reclaim of released pages.
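
As a quick sizing illustration using those ballpark figures, a small Python
sketch (the 16-core / 8-OSD node shape and the 25% RAM headroom are assumed
examples, not recommendations):

    # Rough node sizing from the ballpark figures above:
    # 200-500 MB/s and 1000-3000 IOPS per core, >= 4 GB RAM per OSD.
    CORES_PER_NODE = 16      # assumed example node shape
    OSDS_PER_NODE = 8        # assumed example node shape

    MBPS_PER_CORE = (200, 500)
    IOPS_PER_CORE = (1000, 3000)
    GB_RAM_PER_OSD = 4
    RAM_HEADROOM = 1.25      # assumed margin for spikes / delayed kernel reclaim

    throughput = [CORES_PER_NODE * x for x in MBPS_PER_CORE]
    iops = [CORES_PER_NODE * x for x in IOPS_PER_CORE]
    min_ram_gb = OSDS_PER_NODE * GB_RAM_PER_OSD * RAM_HEADROOM

    print(f"aggregate throughput: {throughput[0]}-{throughput[1]} MB/s")
    print(f"aggregate IOPS:       {iops[0]}-{iops[1]}")
    print(f"minimum RAM for OSDs: ~{min_ram_gb:.0f} GB")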


Mark


_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
