On Thu, 25 Oct 2018, Rick Chen wrote:
> Hi Sage:
> Thank you for your feedback.
> Below is my understanding. I have one last question, marked "*Q", that needs your advice.
> Did I miss anything? Please let me know.
>
> - devicehealth (acts as the device health manager)
> Handles configuration, and controls when diskprediction_local and diskprediction_cloud start.

See https://github.com/ceph/ceph/pull/24755 for the config option piece of this.

> Uses the ceph cluster configuration to store devicehealth settings.
> * A new PR will handle this.
> diskprediction_* are scraped by devicehealth.
> *Q: Currently a mgr plugin is enabled via 'ceph mgr module'; it cannot be enabled or triggered by another plugin. How do the two plugins communicate? Are both plugins enabled by default? Should each act as an API daemon that receives the devicehealth scrape, so that devicehealth can receive prediction results from both plugins?

For the moment, we can manually enable the right one. Probably we want devicehealth to look at the config setting and enable the right module for you, though (and disable any inactive ones). We can do that a bit later.

We can make python calls between modules with self.remote(), and devicehealth will know which one is active, so it can remote into the correct module to do whatever operation it wants...

> - diskprediction_local (acts as the device predictor, like a job executor)
> Generates prediction data when notified by the devicehealth plugin.
>
> - diskprediction_cloud (acts as the device predictor, like a job executor)
> # But it should have its own interval for posting metrics and control that interval itself. The metrics data is not based only on what the devicehealth plugin provides, because the cloud needs more data for analysis and for the cloud server's data display, depending on its conditions.
> Gets prediction data when notified by the devicehealth plugin.

Yeah, I think this one would have a serve() method that does the scraping of (non-SMART) metrics at the short intervals.
It can ignore the device metric scraping, let the normal piece do that part, and only deal with pushing those metrics to the cloud service on demand. Is that reasonable, or is there an alternative approach that makes more sense?

sage

> > >
> > > devicehealth loads the prediction_mode config value, which means the user
> > > uses devicehealth to configure prediction_mode and its arguments. How do
> > > devicehealth_local and devicehealth_cloud access this plugin's stored
> > > configuration? Do these plugins access the same mgr store value?
> >
> > I think we should make this a global ceph option, not a mgr-specific option, so
> > that users set it via a more familiar 'ceph config set
> > device_failure_prediction_mode local'. I can push a PR with this part of it as
> > IIRC there is a missing mgr_module method to access the cluster config.
> Great.
> >
> > >
> > > - generic function to get a prediction for a given device, that calls into
> > > the enabled module via self.remote()
> > > - called by 'device predict-life-expectancy'
> > > Does it depend on which devicehealth_* module is enabled?

Right.

> >
> > Right
> >
> > > This approach does not automatically set the device life expectancy (days)
> > > description. Is that still kept in each devicehealth_* plugin?
> >
> > I can't decide if it's useful to have both variants or not (one that just calculates
> > a prediction and shows you, vs one that also stores it).
> > Either way, I think both commands would live in devicehealth and
> > remote() into the enabled module to get the prediction, so the prediction
> > module doesn't have to worry about storing at all.
> >
> > > Currently the cloud plugin pushes metrics as follows:
> > > Performance metrics every 10 minutes, including ceph cluster status,
> > > per-object correlation, and OSD performance counters.
> > > Device SMART data metrics every 12 hours, based on the devicehealth
> > > shared metrics.
> > > Currently the cloud plugin gets each device's life expectancy (in days)
> > > from the cloud every 12 hours.
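Rick describes the cloud plugin running on its own cadences (performance metrics every 10 minutes, SMART data and life-expectancy fetches every 12 hours), and Sage suggests a serve() method that handles the short-interval scraping. A minimal sketch of such a loop follows. It assumes a hypothetical `IntervalScheduler` helper that is not part of Ceph; the task names and callbacks are placeholders, and only the two cadences come from the thread.

```python
import threading
import time
from typing import Callable, Dict, List


class IntervalScheduler:
    """Hypothetical helper (not a Ceph API): runs named tasks at fixed intervals."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # name -> [interval_seconds, callback, next_run_time]
        self._tasks: Dict[str, list] = {}

    def add(self, name: str, interval: float, fn: Callable[[], None]) -> None:
        # The first run of every task is due immediately.
        self._tasks[name] = [interval, fn, self._clock()]

    def run_due(self) -> List[str]:
        """Run every task whose next-run time has arrived; return their names."""
        now = self._clock()
        ran = []
        for name, task in self._tasks.items():
            interval, fn, next_run = task
            if now >= next_run:
                fn()
                # Reschedule from the time of this check, not from the old due time.
                task[2] = now + interval
                ran.append(name)
        return ran

    def seconds_until_next(self, default: float = 60.0) -> float:
        """Seconds until the earliest pending task (default if none are registered)."""
        if not self._tasks:
            return default
        soonest = min(task[2] for task in self._tasks.values())
        return max(0.0, soonest - self._clock())


def serve(scheduler: IntervalScheduler, stop: threading.Event) -> None:
    """Sketch of a module serve() loop: run due tasks, then sleep until the next one."""
    while not stop.is_set():
        scheduler.run_due()
        # Event.wait() returns early if another thread sets the stop event.
        stop.wait(scheduler.seconds_until_next())


# Cadences taken from the thread; the callbacks registered against them are placeholders.
TEN_MINUTES = 10 * 60
TWELVE_HOURS = 12 * 60 * 60
```

Rescheduling from the time of the check (rather than from the previous due time) means that if the loop is delayed past several intervals, a task runs once to catch up instead of firing in a burst.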
> >
> > Perhaps something like this:
> >
> > 1- devicehealth already has a health metrics scrape interval. Let it
> > scrape as it already does.
> > 2- Once it has scraped a device's metrics, it can remote() into the
> > enabled module to notify it that there are fresh metrics available.
> >   - The cloud module could then make an API call to push the latest values.
> >     The local module would do nothing from this hook.
> > 3- Later, devicehealth would refresh its life expectancies by calling
> > into the prediction module for each device. The cloud module would
> > make its API call then to get a new prediction.
> >
> > Step #2 isn't strictly needed in the above, since the module could push the
> > latest (or even all) metrics as part of #3 when it is asked for a prediction; up to
> > you!
> >
> > sage
> >
> > >
> > > -----Original Message-----
> > > From: Sage Weil <sage@xxxxxxxxxxxx>
> > > Sent: Tuesday, October 23, 2018 8:14 PM
> > > To: Rick Chen <rick.chen@xxxxxxxxxxxxxxx>
> > > Cc: Sheng-Lin Wu <shenglin.wu@xxxxxxxxxxxxxxx>
> > > Subject: Re: About separate the diskprediction plugin
> > >
> > > On Tue, 23 Oct 2018, Rick Chen wrote:
> > > > Hi Sage:
> > > > Do you have any suggestions about separating the diskprediction task?
> > > > Should we split diskprediction_cloud and diskprediction_local into
> > > > individual plugins? Or separate the local predictor and integrate it
> > > > with the devicehealth plugin? And do both plugins work simultaneously?
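Sage's three-step flow (devicehealth scrapes, notifies the active prediction module, and later asks it for a prediction per device) can be sketched as follows. This is an illustration under several assumptions: a `REGISTRY` dict and a module-level `remote()` function stand in for the mgr's `self.remote(module_name, method_name, ...)` call; the method names `notify_new_metrics` and `predict_life_expectancy` are hypothetical; the mode values "local", "cloud" and "none" are assumed for `device_failure_prediction_mode`; and the predictors return placeholder strings rather than real sklearn or cloud results.

```python
from typing import Any, Dict, Iterable, List, Optional

# Hypothetical stand-in for the mgr module registry; a real devicehealth
# module would call self.remote(module_name, method_name, ...) instead.
REGISTRY: Dict[str, Any] = {}


def remote(module_name: str, method_name: str, *args: Any, **kwargs: Any) -> Any:
    """Invoke method_name on the named module (stub for the mgr's remote call)."""
    return getattr(REGISTRY[module_name], method_name)(*args, **kwargs)


class LocalPredictor:
    """Stands in for diskprediction_local."""

    def notify_new_metrics(self, devid: str) -> None:
        pass  # Step 2: the local module does nothing from this hook.

    def predict_life_expectancy(self, devid: str) -> str:
        return f"local-estimate:{devid}"  # Placeholder for an sklearn model.


class CloudPredictor:
    """Stands in for diskprediction_cloud."""

    def __init__(self) -> None:
        self.pushed: List[str] = []

    def notify_new_metrics(self, devid: str) -> None:
        self.pushed.append(devid)  # Step 2: would push fresh metrics to the cloud API.

    def predict_life_expectancy(self, devid: str) -> str:
        return f"cloud-estimate:{devid}"  # Step 3: would query the cloud API.


class DeviceHealth:
    # Assumed mapping from device_failure_prediction_mode values to module names.
    MODULES = {"local": "diskprediction_local", "cloud": "diskprediction_cloud"}

    def __init__(self, mode: str) -> None:
        self.mode = mode
        self.life_expectancy: Dict[str, str] = {}

    def _active_module(self) -> Optional[str]:
        return self.MODULES.get(self.mode)  # None for a mode such as "none"

    def on_metrics_scraped(self, devid: str) -> None:
        """Called after step 1 (the existing scrape); performs the step 2 notify."""
        module = self._active_module()
        if module is not None:
            remote(module, "notify_new_metrics", devid)

    def refresh_life_expectancies(self, devids: Iterable[str]) -> None:
        """Step 3: ask the active module for each device; devicehealth stores the result."""
        module = self._active_module()
        if module is None:
            return
        for devid in devids:
            self.life_expectancy[devid] = remote(module, "predict_life_expectancy", devid)
```

Storing the result in `DeviceHealth` rather than in the predictor follows Sage's point that the prediction module does not have to worry about storing at all; as he notes, the notify in step 2 could also be folded into step 3.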
> > > >
> > >
> > > I suspect the best approach is something like:
> > >
> > > devicehealth
> > > - shared metrics
> > > - loads prediction_mode config value
> > > - later: something to auto-enable the right devicehealth_* module
> > > - generic function to get a prediction for a given device, that calls into
> > > the enabled module via self.remote()
> > > - called by 'device predict-life-expectancy'
> > >
> > > devicehealth_local
> > > - implement the predict method for a device w/ sklearn models
> > >
> > > devicehealth_cloud
> > > - additional metrics gathering
> > > - calls out to cloud to publish metrics
> > > - implement the predict method for a device by making a call to the cloud
> > >
> > > Does that work? I'm not completely clear what the current status of the
> > > cloud mode is with the metrics publish vs query to get life expectancy.
> > > If they're separate calls, I think the above makes sense?
> > >
> > > sage
> > >
> > > >
> > > > Current block diagram for your reference.
> > > > [image: block diagram]
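Sage's list includes "something to auto-enable the right devicehealth_* module", and his later reply suggests devicehealth should enable the module matching the config setting and disable any inactive ones. A minimal sketch of that decision logic follows; it only computes which modules to change and calls no Ceph API. It uses the later diskprediction_* names, and the mapping from mode values to module names is an assumption.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed mapping from device_failure_prediction_mode values to module names.
PREDICTION_MODULES: Dict[str, str] = {
    "local": "diskprediction_local",
    "cloud": "diskprediction_cloud",
}


def plan_module_changes(mode: str, enabled: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (to_enable, to_disable) so only the module for `mode` stays enabled.

    A mode with no entry in PREDICTION_MODULES (e.g. "none") disables all of them.
    Modules outside PREDICTION_MODULES (such as devicehealth itself) are never touched.
    """
    current = set(enabled)
    wanted = PREDICTION_MODULES.get(mode)
    to_enable = [wanted] if wanted is not None and wanted not in current else []
    to_disable = sorted(
        name for name in PREDICTION_MODULES.values()
        if name in current and name != wanted
    )
    return to_enable, to_disable
```

The resulting lists could then be applied with `ceph mgr module enable <name>` and `ceph mgr module disable <name>` (or an equivalent internal call); keeping the planning step free of side effects makes it easy to test on its own.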