Hi Sage:
Running "ceph config set global device_failure_prediction_mode local" fails
in my test. Should the definition
"Option("device_failure_prediction_mode", Option::TYPE_STR, Option::LEVEL_BASIC)"
also add ".set_flag(Option::FLAG_RUNTIME)"?

root@devcnode1:/usr/lib/ceph/mgr# ceph config set global device_failure_prediction_mode local
2018-11-05 12:06:50.534 7f9acb7fe700 -1 set_mon_vals failed to set device_failure_prediction_mode = local: Configuration option 'device_failure_prediction_mode' may not be modified at runtime

> -----Original Message-----
> From: Rick Chen <rick.chen@xxxxxxxxxxxxxxx>
> Sent: Wednesday, October 31, 2018 6:21 PM
> To: 'Sage Weil' <sage@xxxxxxxxxxxx>
> Cc: 'Sheng-Lin Wu' <shenglin.wu@xxxxxxxxxxxxxxx>; 'Jeremy Wei'
> <jeremycwei@xxxxxxxxx>; ceph-devel@xxxxxxxxxxxxxxx; 'Albert Lin'
> <Albert.Lin@xxxxxxxxxxxxxxx>; brian.huang@xxxxxxxxxxxxxxx
> Subject: RE: About separate the diskprediction plugin
>
> Hi Sage:
> I have started implementing the task of separating the diskprediction
> plugin. Here is a status update:
> 1. diskprediction_local: done.
> 2. diskprediction_cloud: needs 3~5 working days.
> I will create a new PR for this task next Monday.
> Should I create two PRs, one for each new plugin, or one PR that
> includes both plugins?
>
> > -----Original Message-----
> > From: Sage Weil <sage@xxxxxxxxxxxx>
> > Sent: Friday, October 26, 2018 3:32 AM
> > To: Rick Chen <rick.chen@xxxxxxxxxxxxxxx>
> > Cc: Sheng-Lin Wu <shenglin.wu@xxxxxxxxxxxxxxx>; 'Jeremy Wei'
> > <jeremycwei@xxxxxxxxx>; ceph-devel@xxxxxxxxxxxxxxx
> > Subject: RE: About separate the diskprediction plugin
> >
> > On Thu, 25 Oct 2018, Rick Chen wrote:
> > > Hi Sage:
> > > Thank you for your feedback.
> > > Below is my understanding. I have one last question, marked "*Q",
> > > that needs your advice.
> > > Have I omitted anything? Please let me know.
> > >
> > > - devicehealth: (acts as the device health manager)
> > >     Handles configuration, and controls when diskprediction_local
> > >     and diskprediction_cloud start.
> >
> > See https://github.com/ceph/ceph/pull/24755 for the config option
> > piece of this.
> >
> > >     Use the ceph cluster configuration to store the devicehealth
> > >     settings.
> > >     * A new PR will handle this.
> > >     diskprediction_* scraped by devicehealth
> > >     *Q: Currently an mgr plugin is enabled through "ceph mgr module";
> > >     it cannot be enabled or triggered by another plugin. How should
> > >     devicehealth communicate with both plugins? Should both plugins
> > >     be enabled by default and act as API daemons that receive the
> > >     devicehealth scrape, so that devicehealth can receive prediction
> > >     results from both plugins?
> >
> > For the moment, we can manually enable the right one.  Probably we
> > want the devicehealth to look at the config setting and enable the
> > right module for you, though (and disable any inactive ones).  We can
> > do that a bit later.
> >
> > We can make python calls between modules with self.remote(), and
> > devicehealth will know which is active, so it can remote into the
> > correct module to do whatever operation it wants...
> >
> > > - diskprediction_local (acts as the device predictor, like a job
> > >   executor)
> > >     Generates prediction data when notified by the devicehealth
> > >     plugin.
> > >
> > > - diskprediction_cloud (acts as the device predictor, like a job
> > >   executor)
> > >     # But it should have its own interval for posting metrics and
> > >     control that interval itself. The metrics data is not based only
> > >     on what the devicehealth plugin provides, because the cloud needs
> > >     more data for its analysis, and the cloud server's data display
> > >     is based on its own conditions.
> > >     Gets prediction data when notified by the devicehealth plugin.
> >
> > Yeah, I think this one would have a serve() method that does the
> > scraping of (non-smart) metrics at the short intervals.  It can ignore
> > the device metric scraping and let the normal piece do that part, and
> > only deal with the pushing of those metrics to the cloud service on
> > demand.
> >
> > Is that reasonable, or is there an alternative approach that makes
> > more sense?
> >
> > sage
> >
> > >
> > > > > The devicehealth loads the prediction_mode config value, which
> > > > > means the user configures prediction_mode and its arguments
> > > > > through devicehealth. How do devicehealth_local and
> > > > > devicehealth_cloud access the configuration stored by this
> > > > > plugin? Do these plugins access the same mgr store value?
> > > >
> > > > I think we should make this a global ceph option, not a
> > > > mgr-specific option, so that users set it via a more familiar
> > > > 'ceph config set device_failure_prediction_mode local'.  I can
> > > > push a PR with this part of it as IIRC there is a missing
> > > > mgr_module method to access the cluster config.
> > > Great.
> > >
> > > > >  - generic function to get a prediction for a given device, that
> > > > >    calls into the enabled module via self.remote()
> > > > >    - called by 'device predict-life-expectancy'
> > > > > Does this depend on which devicehealth_* module is enabled?
> > > > > Right?
> > > >
> > > > Right
> > > >
> > > > > This approach does not automatically set the device life
> > > > > expectancy day description. Does that still stay in each
> > > > > devicehealth_* plugin?
> > > >
> > > > I can't decide if it's useful to have both variants or not (one
> > > > that just calculates a prediction and shows you, vs one that also
> > > > stores it).  Either way, I think both commands would live in
> > > > devicehealth and remote() into the enabled module to get the
> > > > prediction, so the prediction module doesn't have to worry about
> > > > storing at all.
> > > >
> > > > > The current cloud plugin pushes metrics as follows:
> > > > >   Performance metrics every 10 minutes, which include ceph
> > > > >   cluster status / the correlation of each ceph object / osd
> > > > >   performance counters.
> > > > >   Device smart data metrics every 12 hours, which relate to the
> > > > >   devicehealth shared metrics.
> > > > > The current cloud plugin gets the device life expectancy day
> > > > > from the cloud every 12 hours.
> > > >
> > > > Perhaps something like this:
> > > >
> > > > 1- devicehealth already has a health metrics scrape interval.  let
> > > >    it scrape as it already does.
> > > > 2- once it has scraped a device's metrics, it can remote() into the
> > > >    enabled module to notify it that there are fresh metrics
> > > >    available.
> > > >    - the cloud module could then make an API call to push the
> > > >      latest values.  the local module would do nothing from this
> > > >      hook.
> > > > 3- later, devicehealth would refresh its life expectancies by
> > > >    calling into the prediction module for each device.  the cloud
> > > >    module would make its API call then to get a new prediction.
> > > >
> > > > The #2 step isn't strictly needed in the above, since the module
> > > > could push the latest (or even all) metrics as part of #3 when it
> > > > is asked for a prediction; up to you!
> > > >
> > > > sage
> > > >
> > > > > -----Original Message-----
> > > > > From: Sage Weil <sage@xxxxxxxxxxxx>
> > > > > Sent: Tuesday, October 23, 2018 8:14 PM
> > > > > To: Rick Chen <rick.chen@xxxxxxxxxxxxxxx>
> > > > > Cc: Sheng-Lin Wu <shenglin.wu@xxxxxxxxxxxxxxx>
> > > > > Subject: Re: About separate the diskprediction plugin
> > > > >
> > > > > On Tue, 23 Oct 2018, Rick Chen wrote:
> > > > > > Hi Sage:
> > > > > > Do you have any suggestions about the task of separating
> > > > > > diskprediction? Should we separate diskprediction_cloud and
> > > > > > diskprediction_local into individual plugins? Or separate out
> > > > > > the local predictor and integrate it with the devicehealth
> > > > > > plugin? And do both plugins work simultaneously?
> > > > >
> > > > > I suspect the best approach is something like:
> > > > >
> > > > > devicehealth
> > > > >  - shared metrics
> > > > >  - loads prediction_mode config value
> > > > >  - later: something to auto-enable the right devicehealth_*
> > > > >    module
> > > > >  - generic function to get a prediction for a given device, that
> > > > >    calls into the enabled module via self.remote()
> > > > >    - called by 'device predict-life-expectancy'
> > > > >
> > > > > devicehealth_local
> > > > >  - implement the predict method for a device w/ sklearn models
> > > > >
> > > > > devicehealth_cloud
> > > > >  - additional metrics gathering
> > > > >  - calls out to cloud to publish metrics
> > > > >  - implement the predict method for a device by making call to
> > > > >    cloud
> > > > >
> > > > > Does that work?  I'm not completely clear what the current
> > > > > status of the cloud mode is with the metrics publish vs query to
> > > > > get life expectancy.  If they're separate calls, I think the
> > > > > above makes sense?
> > > > >
> > > > > sage
> > > > >
> > > > > > Current block diagram for your reference.
> > > > > > [cid:image002.png@01D46AC6.AB38EB10]
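
The module split discussed in this thread (devicehealth scrapes, notifies the enabled prediction module, and later asks it for a prediction) can be sketched roughly as below. This is only an illustration under stated assumptions, not the real ceph-mgr code: FakeMgr stands in for the manager and merely imitates the shape of self.remote(); the class names, the method names notify_fresh_metrics and predict_life_expectancy, the "cloud" mode value, and the threshold rule in the local predictor are all hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

# Assumed mapping from the config value to the module serving it.
# Only the value "local" appears in the thread; "cloud" is a guess.
MODE_TO_MODULE = {
    "local": "diskprediction_local",
    "cloud": "diskprediction_cloud",
}


class FakeMgr:
    """Stand-in for the manager (hypothetical, not the ceph-mgr API)."""

    def __init__(self) -> None:
        self.config: Dict[str, str] = {}
        self.modules: Dict[str, Any] = {}

    def remote(self, module_name: str, method_name: str, *args: Any) -> Any:
        # Imitates a cross-module call: look up the module, call the method.
        return getattr(self.modules[module_name], method_name)(*args)


class LocalPredictor:
    def notify_fresh_metrics(self, devid: str, metrics: Dict[str, int]) -> None:
        # Step 2: the local module does nothing from this hook.
        return None

    def predict_life_expectancy(self, devid: str, metrics: Dict[str, int]) -> str:
        # Placeholder rule standing in for a trained model.
        return "bad" if metrics.get("reallocated_sectors", 0) > 100 else "good"


class CloudPredictor:
    def __init__(self) -> None:
        self.pushed: List[Tuple[str, Dict[str, int]]] = []

    def notify_fresh_metrics(self, devid: str, metrics: Dict[str, int]) -> None:
        # Step 2: a real module would push to the cloud service here.
        self.pushed.append((devid, metrics))

    def predict_life_expectancy(self, devid: str, metrics: Dict[str, int]) -> str:
        # Step 3: a real module would query the cloud service here.
        return "unknown"


class DeviceHealth:
    def __init__(self, mgr: FakeMgr) -> None:
        self.mgr = mgr
        self.metrics: Dict[str, Dict[str, int]] = {}
        self.life_expectancy: Dict[str, str] = {}

    def active_module(self) -> Optional[str]:
        mode = self.mgr.config.get("device_failure_prediction_mode", "")
        return MODE_TO_MODULE.get(mode)

    def scrape(self, devid: str, metrics: Dict[str, int]) -> None:
        # Steps 1 and 2: store the scrape, then notify the enabled module.
        self.metrics[devid] = metrics
        module = self.active_module()
        if module is not None:
            self.mgr.remote(module, "notify_fresh_metrics", devid, metrics)

    def refresh_life_expectancies(self) -> None:
        # Step 3: ask the enabled module for a prediction per device.
        module = self.active_module()
        if module is None:
            return
        for devid, metrics in self.metrics.items():
            self.life_expectancy[devid] = self.mgr.remote(
                module, "predict_life_expectancy", devid, metrics
            )
```

In this sketch the prediction modules never store anything themselves; devicehealth keeps both the metrics and the resulting life expectancies, which matches the suggestion above that the prediction module should not have to worry about storing.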