RE: About separate the diskprediction plugin

Rick Chen <rick.chen@xxxxxxxxxxxxxxx> · Mon, 5 Nov 2018 04:17:40 +0000

Hi Sage:
I tested "ceph config set global device_failure_prediction_mode local" and it failed.
Should the definition "Option("device_failure_prediction_mode", Option::TYPE_STR, Option::LEVEL_BASIC)" also add ".set_flag(Option::FLAG_RUNTIME)"?

root@devcnode1:/usr/lib/ceph/mgr# ceph config set global device_failure_prediction_mode local
2018-11-05 12:06:50.534 7f9acb7fe700 -1 set_mon_vals failed to set device_failure_prediction_mode = local: Configuration option 'device_failure_prediction_mode' may not be modified at runtime
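
The rejection above can be modeled with a small Python sketch. This is a toy model of the gating, not Ceph's actual C++ implementation: the Option and ConfigStore classes, the "none" default, and the FLAG_RUNTIME value are assumptions for illustration. Only the error text is taken from the log line above.

```python
from dataclasses import dataclass

FLAG_RUNTIME = 0x1  # toy flag value; stands in for Option::FLAG_RUNTIME


@dataclass
class Option:
    name: str
    value: str = "none"  # hypothetical default
    flags: int = 0

    def set_flag(self, flag):
        # Return self so the call can be chained onto the constructor.
        self.flags |= flag
        return self


class ConfigStore:
    """Toy config store that rejects runtime changes to unflagged options."""

    def __init__(self, options):
        self.options = {o.name: o for o in options}
        self.started = False  # True once the daemon is running

    def set(self, name, value):
        opt = self.options[name]
        if self.started and not (opt.flags & FLAG_RUNTIME):
            raise PermissionError(
                "Configuration option '%s' may not be modified at runtime" % name)
        opt.value = value


def try_set(store, name, value):
    """Return "ok" on success, or the error message on rejection."""
    try:
        store.set(name, value)
        return "ok"
    except PermissionError as e:
        return str(e)
```

In this model, an option created with .set_flag(FLAG_RUNTIME) accepts the change after startup, while the unflagged one reproduces the "may not be modified at runtime" message.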

> -----Original Message-----
> From: Rick Chen <rick.chen@xxxxxxxxxxxxxxx>
> Sent: Wednesday, October 31, 2018 6:21 PM
> To: 'Sage Weil' <sage@xxxxxxxxxxxx>
> Cc: 'Sheng-Lin Wu' <shenglin.wu@xxxxxxxxxxxxxxx>; 'Jeremy Wei'
> <jeremycwei@xxxxxxxxx>; ceph-devel@xxxxxxxxxxxxxxx; 'Albert Lin'
> <Albert.Lin@xxxxxxxxxxxxxxx>; brian.huang@xxxxxxxxxxxxxxx
> Subject: RE: About separate the diskprediction plugin
> 
> Hi Sage:
> I have started implementing the task of separating the diskprediction plugin.
> Here is a status update:
> 1. diskprediction_local: done.
> 2. diskprediction_cloud: needs 3~5 working days.
> I will create a new PR for this task next Monday.
> Should I create two PRs, one per new plugin, or a single PR that includes
> both plugins?
> 
> > -----Original Message-----
> > From: Sage Weil <sage@xxxxxxxxxxxx>
> > Sent: Friday, October 26, 2018 3:32 AM
> > To: Rick Chen <rick.chen@xxxxxxxxxxxxxxx>
> > Cc: Sheng-Lin Wu <shenglin.wu@xxxxxxxxxxxxxxx>; 'Jeremy Wei'
> > <jeremycwei@xxxxxxxxx>; ceph-devel@xxxxxxxxxxxxxxx
> > Subject: RE: About separate the diskprediction plugin
> >
> > On Thu, 25 Oct 2018, Rick Chen wrote:
> > > Hi Sage:
> > > 	Thank you for your feedback.
> > > 	Below is my understanding. I have one last question, marked "*Q",
> > > 	that needs your advice.
> > > 	Have I omitted anything? Please let me know.
> > >
> > > - devicehealth: (acts as the device health manager)
> > > 	Handles configuration, and controls the start of diskprediction_local
> > > 	and diskprediction_cloud.
> >
> > See https://github.com/ceph/ceph/pull/24755 for the config option
> > piece of this.
> >
> > > 		Use the ceph cluster configuration to store the devicehealth settings.
> > > 		* A new PR will handle this.
> > > 		diskprediction_* is scraped by devicehealth.
> > > 		*Q: Currently an mgr plugin is enabled through "ceph mgr module"; it
> > > 		cannot be enabled or triggered by another plugin. How should
> > > 		devicehealth communicate with both plugins? Should both plugins be
> > > 		enabled by default and act as API daemons that receive the
> > > 		devicehealth scrape, so that devicehealth can receive prediction
> > > 		results from both plugins?
> >
> > For the moment, we can manually enable the right one.  Probably we
> > want the devicehealth to look at the config setting and enable the
> > right module for you, though (and disable any inactive ones).  We can do
> > that a bit later.
> >
> > We can make python calls between modules with self.remote(), and
> > devicehealth will know which is active, so it can remote into the
> > correct module to do whatever operation it wants...
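
A minimal sketch of the dispatch described above, written against a toy stand-in for the mgr so that it runs on its own. ToyMgr, ToyModule, the mode-to-module mapping, and the predict_life_expectancy method name are all assumptions for illustration; in a real module the remote() call would be the one provided by the mgr module base class.

```python
class ToyMgr:
    """Hypothetical stand-in for the mgr's registry of enabled modules."""

    def __init__(self):
        self.modules = {}

    def register(self, name, module):
        self.modules[name] = module
        module.mgr = self


class ToyModule:
    mgr = None

    def remote(self, module_name, method_name, *args, **kwargs):
        # Look up another module by name and call one of its methods.
        target = self.mgr.modules[module_name]
        return getattr(target, method_name)(*args, **kwargs)


class DiskPredictionLocal(ToyModule):
    def predict_life_expectancy(self, devid):  # hypothetical method name
        return {"device": devid, "source": "local"}


class DiskPredictionCloud(ToyModule):
    def predict_life_expectancy(self, devid):  # hypothetical method name
        return {"device": devid, "source": "cloud"}


class DeviceHealth(ToyModule):
    # Assumed mapping from the config value to the module that serves it.
    MODE_TO_MODULE = {
        "local": "diskprediction_local",
        "cloud": "diskprediction_cloud",
    }

    def __init__(self, mode):
        self.mode = mode  # value of device_failure_prediction_mode

    def predict_life_expectancy(self, devid):
        module = self.MODE_TO_MODULE.get(self.mode)
        if module is None:
            return None  # prediction disabled or unknown mode
        return self.remote(module, "predict_life_expectancy", devid)


def build(mode):
    """Wire up a toy mgr with all three modules and return devicehealth."""
    mgr = ToyMgr()
    health = DeviceHealth(mode)
    mgr.register("devicehealth", health)
    mgr.register("diskprediction_local", DiskPredictionLocal())
    mgr.register("diskprediction_cloud", DiskPredictionCloud())
    return health
```

With this layout only devicehealth needs to know which mode is active; the prediction modules expose the same method and never call each other.
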
> >
> > > - diskprediction_local (acts as the device predictor, like a job executor)
> > > 	Generates prediction data when notified by the devicehealth plugin.
> > >
> > > - diskprediction_cloud (acts as the device predictor, like a job executor)
> > >  	# But it should have its own interval for posting metrics and control
> > >  	it by itself. The metrics data is not based only on what the
> > >  	devicehealth plugin provides, because the cloud needs more data for
> > >  	its analysis and the cloud server displays data based on its own
> > >  	conditions.
> > > 	Gets prediction data when notified by the devicehealth plugin.
> >
> > Yeah, I think this one would have a serve() method that does the
> > scraping of (non-SMART) metrics at the short intervals.  It can ignore
> > the device metric scraping and let the normal piece do that part, and
> > only deal with the pushing of those metrics to the cloud service on
> > demand.
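
A rough sketch of such a serve() loop, assuming the collection and push steps are passed in as plain callables. CloudPusher, collect, push, and interval_sec are hypothetical names; the 600-second default mirrors the "per 10 minutes" figure mentioned elsewhere in this thread.

```python
import threading


class CloudPusher:
    """Toy serve() loop: collect metrics, push them, sleep, repeat."""

    def __init__(self, collect, push, interval_sec=600):
        self.collect = collect        # callable returning a metrics dict
        self.push = push              # callable sending metrics to the cloud
        self.interval_sec = interval_sec
        self._stop = threading.Event()

    def serve(self):
        # Run until shutdown() is called. Event.wait() doubles as an
        # interruptible sleep, so shutdown does not block for a full interval.
        while not self._stop.is_set():
            self.push(self.collect())
            self._stop.wait(self.interval_sec)

    def shutdown(self):
        self._stop.set()
```

Using an Event rather than time.sleep() lets the module stop promptly when it is disabled, which matters with intervals measured in minutes.
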
> >
> > Is that reasonable, or is there an alternative approach that makes more
> > sense?
> >
> > sage
> >
> >
> > >
> > >
> > >
> > > > > The devicehealth loads the prediction_mode config value, which means
> > > > > the user uses devicehealth to configure prediction_mode and its
> > > > > arguments. How do devicehealth_local and devicehealth_cloud access
> > > > > the configuration stored by this plugin? Do these plugins access
> > > > > the same mgr store value?
> > > >
> > > > I think we should make this a global ceph option, not a
> > > > mgr-specific option, so that users set it via a more familiar
> > > > 'ceph config set device_failure_prediction_mode local'.  I can
> > > > push a PR with this part of it as IIRC there is a missing
> > > > mgr_module method to access the cluster config.
> > > Great.
> > >
> > > >
> > > > > - generic function to get a prediction for a given device, that calls into
> > > > >    the enabled module via self.remote()
> > > > >    - called by 'device predict-life-expectancy'
> > > > > Does it depend on which devicehealth_* module is enabled? Right?
> > > >
> > > > Right
> > > >
> > > > > This approach does not automatically set the device's
> > > > > life-expectancy description. Is that still kept in each
> > > > > devicehealth_* plugin?
> > > >
> > > > I can't decide if it's useful to have both variants or not (one
> > > > that just calculates a prediction and shows you, vs one that also stores it).
> > > > Either way, I think both commands would live in devicehealth and
> > > > remote() into the enabled module to get the prediction, so the
> > > > prediction module doesn't have to worry about storing at all.
> > > >
> > > > > The current cloud plugin pushes metrics as follows:
> > > > > 	Performance metrics every 10 minutes, including ceph cluster
> > > > > 	status / correlations between ceph objects / osd performance
> > > > > 	counters.
> > > > > 	Device SMART data metrics every 12 hours, related to the
> > > > > 	devicehealth shared metrics.
> > > > > The current cloud plugin gets the device life expectancy from the
> > > > > cloud every 12 hours.
> > > >
> > > > Perhaps something like this:
> > > >
> > > >  1- devicehealth already has a health metrics scrape interval.  let it
> > > >     scrape as it already does.
> > > >  2- once it has scraped a device's metrics, it can remote() into the
> > > >     enabled module to notify it that there are fresh metrics available.
> > > >     - the cloud module could then make an API call to push the latest
> > > >       values.  the local module would do nothing from this hook.
> > > >  3- later, devicehealth would refresh its life expectancies by calling
> > > >     into the prediction module for each device.  the cloud module
> > > >     would make its API call then to get a new prediction.
> > > >
> > > > The #2 step isn't strictly needed in the above, since the module
> > > > could push the latest (or even all) metrics as part of #3 when it
> > > > is asked for a prediction; up to you!
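
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the eventual implementation: the on_fresh_metrics and predict hook names, the predictor classes, and the returned strings are hypothetical, and direct method calls stand in for remote().

```python
class LocalPredictor:
    def on_fresh_metrics(self, devid, metrics):
        pass  # step 2: the local module does nothing from this hook

    def predict(self, devid, metrics):
        return "local prediction for %s" % devid


class CloudPredictor:
    def __init__(self):
        self.pushed = []  # stands in for API calls to the cloud service

    def on_fresh_metrics(self, devid, metrics):
        self.pushed.append((devid, metrics))  # step 2: push latest values

    def predict(self, devid, metrics):
        return "cloud prediction for %s" % devid  # step 3: query the cloud


class DeviceHealth:
    def __init__(self, predictor, scrape):
        self.predictor = predictor  # the enabled prediction module
        self.scrape = scrape        # callable: devid -> metrics dict
        self.metrics = {}
        self.life_expectancy = {}

    def scrape_all(self, devids):
        for devid in devids:
            self.metrics[devid] = self.scrape(devid)  # step 1
            self.predictor.on_fresh_metrics(devid, self.metrics[devid])  # step 2

    def refresh_life_expectancy(self):
        for devid, metrics in self.metrics.items():  # step 3
            self.life_expectancy[devid] = self.predictor.predict(devid, metrics)
```

Because devicehealth keeps the stored expectancies, the prediction modules stay stateless with respect to results, matching the point that they do not have to worry about storing at all.
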
> > > >
> > > > sage
> > > >
> > > >
> > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Sage Weil <sage@xxxxxxxxxxxx>
> > > > > Sent: Tuesday, October 23, 2018 8:14 PM
> > > > > To: Rick Chen <rick.chen@xxxxxxxxxxxxxxx>
> > > > > Cc: Sheng-Lin Wu <shenglin.wu@xxxxxxxxxxxxxxx>
> > > > > Subject: Re: About separate the diskprediction plugin
> > > > >
> > > > > On Tue, 23 Oct 2018, Rick Chen wrote:
> > > > > > Hi Sage:
> > > > > > Do you have any suggestions about the task of separating
> > > > > > diskprediction? Should we separate diskprediction_cloud and
> > > > > > diskprediction_local into individual plugins, or separate out the
> > > > > > local predictor and integrate it with the devicehealth plugin?
> > > > > > And do both plugins work simultaneously?
> > > > >
> > > > > I suspect the best approach is something like:
> > > > >
> > > > > devicehealth
> > > > >  - shared metrics
> > > > >  - loads prediction_mode config value
> > > > >  - later: something to auto-enable the right devicehealth_* module
> > > > >  - generic function to get a prediction for a given device, that calls into
> > > > >    the enabled module via self.remote()
> > > > >    - called by 'device predict-life-expectancy'
> > > > >
> > > > > devicehealth_local
> > > > >  - implement the predict method for a device w/ sklearn models
> > > > >
> > > > > devicehealth_cloud
> > > > >  - additional metrics gathering
> > > > >  - calls out to cloud to publish metrics
> > > > >  - implement the predict method for a device by making call to cloud
> > > > >
> > > > > Does that work?  I'm not completely clear what the current status
> > > > > of the cloud mode is with the metrics publish vs query to get life
> > > > > expectancy.  If they're separate calls, I think the above makes sense?
> > > > >
> > > > > sage
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > Current block diagram for your reference.
> > > > > > [cid:image002.png@01D46AC6.AB38EB10]
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > >
> > >
> > >
> > >
> 
> 

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus