Presenting the pre-flight checks project

Erwan Velu <evelu@xxxxxxxxxx> · Thu, 6 Sep 2018 11:36:58 -0400 (EDT)

Hi fellows,

I've been thinking about this idea for a long while and had a chance to pitch it during the Mountpoint.io event.
I think it's time to share it with all of you and present the idea & concepts to get your feedback.

Deploying Ceph, or more generally any distributed software, means running software on a given set of nodes to provide a particular service: storage in our case.

But how confident are the people deploying it that the platform is performing well before putting Ceph on it?
How much of the raw performance are you really using?
How far are you from what the platform is capable of?
Do you have any disk / interface / <place any hardware device here> slowing down the whole infra?

I'm pretty sure that people operating Ceph usually have no answer to those questions, and the classic one is "it works well enough that no one complains" or "someone prepared it, I trust what he did".

And what would you do if someone says: "That's pretty curious, the Ceph cluster seems slower since <a couple of days ago / the kernel upgrade / place any reason here>"?
I'm also pretty sure that separating Ceph's responsibilities from the platform's is almost impossible for many.

This is where the project starts.

What should be the set of pre-flight checks to ensure the platform doesn't have any misconfiguration or damaged device that would prevent it from delivering a good distributed service?

To my understanding of that topic, the tool should:
- be lightweight to be easily installed on hosts
- be application agnostic so it can be used for any distributed software: ceph-medic was made for detecting bad Ceph configuration, while this tool will focus on the platform
- check the status of network / storage / cpu / ram (bandwidth, latency, any specific metric)
- generate some load (network / storage / cpu / ram) to see the impact of one component on the whole platform
- detect non-homogeneous results / configuration (meaning that if a set of nodes is said to be identical, they have to be)
- offer a good interface so everyone can use it
- be automated as much as possible, to avoid needing complex CLI options or tuning to get a good result
- allow comparing results over time to analyze how much the platform has changed (install time vs incident time)
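
To make the "non-homogeneous results" and "comparing over time" points a bit more concrete, here is a minimal Python sketch. It only illustrates the idea and is not a proposed implementation: the function names, the 10% tolerance and the bandwidth numbers are all made up for the example, and a real tool would probably need something more robust than a simple deviation from the median.

```python
from statistics import median


def find_outliers(results, tolerance=0.10):
    """Return the nodes whose metric deviates from the group median
    by more than `tolerance` (relative deviation, 0.10 = 10%).

    `results` maps a node name to one measured value.
    The 10% default is an arbitrary assumption for this sketch.
    """
    ref = median(results.values())
    if ref == 0:
        return {}
    return {
        node: value
        for node, value in results.items()
        if abs(value - ref) / ref > tolerance
    }


def compare_runs(baseline, current):
    """Return the relative change per node between two runs,
    e.g. install time vs incident time (-0.25 = 25% slower)."""
    return {
        node: (current[node] - baseline[node]) / baseline[node]
        for node in baseline
        if node in current and baseline[node] != 0
    }


# Hypothetical sequential write bandwidth in MB/s on "identical" nodes.
install_time = {"node1": 510.0, "node2": 505.0, "node3": 498.0, "node4": 502.0}
incident_time = {"node1": 508.0, "node2": 310.0, "node3": 499.0, "node4": 503.0}

if __name__ == "__main__":
    print("outliers at install time: ", find_outliers(install_time))
    print("outliers at incident time:", find_outliers(incident_time))
    print("change since install:     ", compare_runs(install_time, incident_time))
```

With these invented numbers, nothing stands out at install time, while node2 is flagged at incident time, which would point the investigation at the platform rather than at Ceph.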

I do think this will be helpful for:
- users
- admins
- support teams (bug triage, support level 1/2/3)
- infra people who set up hardware configurations
- performance people

I'm sending this email to the Ceph project because:
- I'm working on this beautiful software
- Ceph's performance is very dependent on the quality of the platform
- I think it's the right place to bootstrap that project

If you are interested in that project, feel free to reply to this email and let's do it!

Erwan