I have a much more evil plan that you force me to unveil ;)))

I'm considering using the Skydive project: http://skydive.network/

It has all I need to perform analysis & reporting, it's distributed, and it's developed by Red Hat people I know, who are happy for me to use it and enhance the tool.
Adding hardware reporting + storage will be very easy, the UI is nice & dynamic (adding LLDP will be a killer feature), they already have network replay (so we could replay real Ceph traffic as a reproducible test case), and adding new tests will be easy too.

Looks like a good place to start ;)

----- Original Message -----
From: "John Spray" <jspray@xxxxxxxxxx>
To: "Erwan Velu" <evelu@xxxxxxxxxx>
Cc: "Ceph Development" <ceph-devel@xxxxxxxxxxxxxxx>
Sent: Thursday, September 6, 2018 18:07:54
Subject: Re: Presenting the pre-flight checks project

On Thu, Sep 6, 2018 at 4:37 PM Erwan Velu <evelu@xxxxxxxxxx> wrote:
>
> Hi fellows,
>
> I've been thinking about it for a long while and had a chance to pitch the idea during the Mountpoint.io event.
> I think it's time to share it with all of you and present the idea & concepts to get your feedback.
>
> Deploying Ceph, or generally speaking any distributed software, means running software on a given set of nodes to provide a particular service: storage in our case.
>
> But how confident are the people deploying it that the platform is performing well before putting Ceph on it?
> How much of the raw performance are you really using?
> How far are you from what the platform is capable of?
> Do you have any disk/interface/<place here any hardware device> slowing down the whole infra?
>
> I'm pretty sure that people operating Ceph usually have no answer to those questions, and the classic ones are "it works well enough that no one complains" or "someone prepared it, I trust what he did".
>
> And what would you do if someone says: "That's pretty curious, the Ceph cluster seems slower since <a couple of days/kernel upgrade/place any reason here>"?
> I'm still pretty sure that splitting responsibilities between Ceph & the platform is almost impossible for many.
>
> This is where the project starts.
>
> What should be the set of pre-flight checks to ensure the platform doesn't have any misconfiguration or even damaged devices, so that it can deliver a good distributed service?
>
> To my understanding of the topic, the tool should:
> - be lightweight so it can be easily installed on hosts
> - be application agnostic so it can be used for any distributed software: ceph-medic was made for detecting bad Ceph configuration, while this tool will focus on the platform
> - check the status of network / storage / CPU / RAM (bandwidth, latency, any specific metric)
> - generate some load (network / storage / CPU / RAM) to see the impact of one component on the whole platform
> - detect non-homogeneous results / configuration (meaning that if a set of nodes is said to be identical, it has to be)
> - offer a good interface so everyone can use it
> - be automated as much as possible to avoid complex CLI/options/tuning to get a good result
> - allow comparing results over time to analyze how much the platform has changed (install time vs incident time)
>
> I do think this will be helpful for:
> - users
> - admins
> - support teams (bug triage, support level 1/2/3)
> - infra people who set up hardware configurations
> - performance people
>
> I'm sending this email to the Ceph project because:
> - I'm working on this beautiful software
> - Ceph's performance is very dependent on the quality of the platform
> - I think it's the right place to bootstrap the project
>
> If you have some interest in this project, feel free to reply to this email and let's do it!

Cool!
Perhaps this could take over the ceph-medic codebase as a starting point, as that hasn't had any commits for a long time, and there is some overlap in scope.

John

> Erwan,
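
To make the "detect non-homogeneous results" item from the list above a bit more concrete, here is a minimal sketch of what such a check might look like. It is not taken from Skydive or ceph-medic: the function name `find_outliers`, the 10% tolerance, the host names, and the bandwidth figures are all hypothetical, and comparing against the group median is just one possible approach.

```python
from statistics import median


def find_outliers(results, tolerance=0.10):
    """Return the hosts whose measurement deviates from the group median
    by more than `tolerance` (a relative deviation; 0.10 means 10%).

    `results` maps a host name to one measured value, for example a
    disk bandwidth in MB/s. Both the name and the threshold are
    illustrative assumptions, not part of any existing tool.
    """
    reference = median(results.values())
    if reference == 0:
        # Avoid dividing by zero: any non-zero value is then an outlier.
        return {host: value for host, value in results.items() if value != 0}
    return {
        host: value
        for host, value in results.items()
        if abs(value - reference) / reference > tolerance
    }


# Hypothetical sequential-write bandwidth (MB/s) on four nodes that are
# supposed to be identical.
disk_bandwidth = {
    "node1": 512.0,
    "node2": 508.0,
    "node3": 310.0,
    "node4": 515.0,
}

for host, value in find_outliers(disk_bandwidth).items():
    print(f"{host} is not homogeneous with its peers: {value} MB/s")
```

Storing the same per-host measurements at install time and again at incident time would allow the same comparison to be run over time rather than only across nodes.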