Re: Presenting the pre-flight checks project

On Thu, Sep 6, 2018 at 4:37 PM Erwan Velu <evelu@xxxxxxxxxx> wrote:
>
> Hi fellows,
>
> I've been thinking about this for a long while and had a chance to pitch the idea during the Mountpoint.io event.
> I think it's time to share it with all of you and present the idea & concepts to get your feedback.
>
> Deploying Ceph, or more generally any distributed software, means running software on a given set of nodes to provide a particular service: storage, in our case.
>
> But how confident are the people deploying it that the platform performs well, before Ceph even runs on it?
> How much of the raw performance are you really using?
> How far are you from what the platform is capable of?
> Do you have any disk/interface/<place any hardware device here> slowing down the whole infra?
>
> I'm pretty sure that people operating Ceph usually have no answer to those questions, and the classical ones are "it works well enough, so no one complains" or "someone prepared it, and I trust what they did".
>
> And what would you do if someone said: "That's pretty curious, the Ceph cluster seems slower since /a couple of days ago/the kernel upgrade/<place any reason here>"?
> I'm still pretty sure that splitting responsibilities between Ceph & the platform is almost impossible for many.
>
> That is where this project starts.
>
> What should the set of pre-flight checks be to ensure the platform has no misconfiguration or damaged devices, so it can deliver a good distributed service?
>
> To my understanding of that topic, the tool should:
> - be lightweight, so it can easily be installed on hosts
> - be application-agnostic, so it can be used with any distributed software: ceph-medic was made for detecting bad Ceph configurations, while this tool will focus on the platform
> - check the status of network / storage / cpu / ram (bandwidth, latency, any specific metric)
> - generate load (network / storage / cpu / ram) to see the impact of one component on the whole platform
> - detect non-homogeneous results/configurations (meaning that if a set of nodes is said to be identical, it has to be)
> - offer a good interface, so everyone can use it
> - be as automated as possible, to avoid complex cli options/tuning to get a good result
> - allow comparing results over time, to analyze how much the platform changed (install time vs. incident time)
>
> I do think this will be helpful for:
> - users
> - admins
> - support teams (bug triage, support level 1/2/3)
> - infra people who set up hardware configurations
> - performance people
>
> I'm sending this email to the Ceph project because:
> - I'm working on this beautiful software,
> - Ceph's performance is highly dependent on the quality of the platform,
> - I think it's the right place to bootstrap the project.
>
> If you have some interest in this project, feel free to reply to this email, and let's do it!
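
As a rough illustration of the non-homogeneity detection from the list above: such a tool could collect per-node hardware "facts" and flag any fact whose value diverges across nodes that are supposed to be identical. The sketch below is purely hypothetical; the function name, the fact names, and the data layout are invented for illustration, not taken from any existing project:

```python
from collections import defaultdict

def find_outliers(node_facts):
    """Given {node: {fact_name: value}}, return the facts whose value
    differs between nodes, grouped by value so the odd ones stand out."""
    fact_names = set()
    for facts in node_facts.values():
        fact_names.update(facts)

    outliers = {}
    for name in sorted(fact_names):
        # Group nodes by the value they report for this fact.
        groups = defaultdict(list)
        for node, facts in node_facts.items():
            groups[facts.get(name)].append(node)
        # More than one group means the "identical" nodes disagree.
        if len(groups) > 1:
            outliers[name] = dict(groups)
    return outliers

# Hypothetical facts gathered from three supposedly identical nodes.
nodes = {
    "node1": {"cpu_model": "EPYC 7302", "ram_gb": 128, "nic_speed": "25G"},
    "node2": {"cpu_model": "EPYC 7302", "ram_gb": 128, "nic_speed": "25G"},
    "node3": {"cpu_model": "EPYC 7302", "ram_gb": 64,  "nic_speed": "25G"},
}
print(find_outliers(nodes))
# → {'ram_gb': {128: ['node1', 'node2'], 64: ['node3']}}
```

Comparing a saved snapshot of these facts against a fresh run would likewise cover the "install time vs. incident time" comparison mentioned above.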

Cool!  Perhaps this could take over the ceph-medic codebase as a
starting point, as that hasn't had any commits for a long time, and
there is some overlap in scope.

John

> Erwan,


