RE: About separate the diskprediction plugin

Rick Chen <rick.chen@xxxxxxxxxxxxxxx> · Tue, 6 Nov 2018 01:12:55 +0000

Hi Sage:
Sorry, I have already included this fix in PR https://github.com/ceph/ceph/pull/24925.
Please review it. Thanks.

> -----Original Message-----
> From: Sage Weil <sage@xxxxxxxxxxxx>
> Sent: Monday, November 5, 2018 10:20 PM
> To: Rick Chen <rick.chen@xxxxxxxxxxxxxxx>
> Cc: Sheng-Lin Wu <shenglin.wu@xxxxxxxxxxxxxxx>; 'Jeremy Wei'
> <jeremycwei@xxxxxxxxx>; ceph-devel@xxxxxxxxxxxxxxx; Albert Lin
> <albert.lin@xxxxxxxxxxxxxxx>; Brian Huang <brian.huang@xxxxxxxxxxxxxxx>
> Subject: RE: About separate the diskprediction plugin
> 
> On Mon, 5 Nov 2018, Rick Chen wrote:
> > Hi Sage:
> > I tested "ceph config set global device_failure_prediction_mode local" and it failed.
> > Should the definition Option("device_failure_prediction_mode", Option::TYPE_STR,
> > Option::LEVEL_BASIC) also add .set_flag(Option::FLAG_RUNTIME)?
> 
> Yes, that will fix it!
> 
> s
> 
> >
> > root@devcnode1:/usr/lib/ceph/mgr# ceph config set global
> > device_failure_prediction_mode local
> > 2018-11-05 12:06:50.534 7f9acb7fe700 -1 set_mon_vals failed to set
> > device_failure_prediction_mode = local: Configuration option
> > 'device_failure_prediction_mode' may not be modified at runtime
> >
> >
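The failure above can be pictured with a toy model of a runtime-updatable flag. This is only a sketch in Python, not Ceph's C++ Option class: ToyOption, set_at_runtime and the bit value of FLAG_RUNTIME are hypothetical names chosen for illustration, and only the error text is taken from the log above.

```python
from dataclasses import dataclass

FLAG_RUNTIME = 0x1  # hypothetical bit value, for illustration only


@dataclass
class ToyOption:
    """Toy stand-in for a config option definition (not Ceph's Option class)."""
    name: str
    flags: int = 0
    value: str = ""

    def can_update_at_runtime(self) -> bool:
        # An option is runtime-updatable only if the flag bit is set.
        return bool(self.flags & FLAG_RUNTIME)


def set_at_runtime(opt: ToyOption, new_value: str) -> None:
    """Apply a new value, refusing options that lack the runtime flag."""
    if not opt.can_update_at_runtime():
        raise PermissionError(
            f"Configuration option '{opt.name}' may not be modified at runtime"
        )
    opt.value = new_value
```

In this model, setting the flag on the option definition is what turns the rejected update into an accepted one, which is the effect the proposed set_flag() change is expected to have.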
> > > -----Original Message-----
> > > From: Rick Chen <rick.chen@xxxxxxxxxxxxxxx>
> > > Sent: Wednesday, October 31, 2018 6:21 PM
> > > To: 'Sage Weil' <sage@xxxxxxxxxxxx>
> > > Cc: 'Sheng-Lin Wu' <shenglin.wu@xxxxxxxxxxxxxxx>; 'Jeremy Wei'
> > > <jeremycwei@xxxxxxxxx>; ceph-devel@xxxxxxxxxxxxxxx; 'Albert Lin'
> > > <Albert.Lin@xxxxxxxxxxxxxxx>; brian.huang@xxxxxxxxxxxxxxx
> > > Subject: RE: About separate the diskprediction plugin
> > >
> > > Hi Sage:
> > > I have started implementing the task of separating the diskprediction plugin.
> > > Here is the current status:
> > > 1. diskprediction_local: done.
> > > 2. diskprediction_cloud: needs 3-5 more working days.
> > > I will create a new PR for this task next Monday.
> > > Should I create two PRs, one per new plugin, or one PR that
> > > includes both plugins?
> > >
> > > > -----Original Message-----
> > > > From: Sage Weil <sage@xxxxxxxxxxxx>
> > > > Sent: Friday, October 26, 2018 3:32 AM
> > > > To: Rick Chen <rick.chen@xxxxxxxxxxxxxxx>
> > > > Cc: Sheng-Lin Wu <shenglin.wu@xxxxxxxxxxxxxxx>; 'Jeremy Wei'
> > > > <jeremycwei@xxxxxxxxx>; ceph-devel@xxxxxxxxxxxxxxx
> > > > Subject: RE: About separate the diskprediction plugin
> > > >
> > > > On Thu, 25 Oct 2018, Rick Chen wrote:
> > > > > Hi Sage:
> > > > > 	Thank you for your feedback.
> > > > > 	Below is my understanding. I have one last question, marked "*Q", that
> > > > > 	needs your advice.
> > > > > 	Have I omitted anything? Please let me know.
> > > > >
> > > > > - devicehealth (acts as the device health manager):
> > > > > 	Handles configuration and controls when diskprediction_local and
> > > > > 	diskprediction_cloud start.
> > > >
> > > > See https://github.com/ceph/ceph/pull/24755 for the config option
> > > > piece of this.
> > > >
> > > > > 		Use the Ceph cluster configuration to store the devicehealth settings.
> > > > > 		* A new PR will handle this.
> > > > > 		diskprediction_* consume what devicehealth scrapes.
> > > > > 		*Q: Currently an mgr plugin is enabled through "ceph mgr module"; it
> > > > > 		cannot be enabled or triggered by another plugin. How should
> > > > > 		devicehealth communicate with both plugins? Should both plugins be
> > > > > 		enabled by default and act as API daemons that receive the
> > > > > 		devicehealth scrape, so that devicehealth can receive prediction
> > > > > 		results from both plugins?
> > > >
> > > > For the moment, we can manually enable the right one.  Probably we
> > > > want the devicehealth to look at the config setting and enable the
> > > > right module for you, though (and disable any inactive ones).  We
> > > > can do that a bit later.
> > > >
> > > > We can make python calls between modules with self.remote(), and
> > > > devicehealth will know which is active, so it can remote into the
> > > > correct module to do whatever operation it wants...
> > > >
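The remote() dispatch described above can be sketched roughly as follows. This is a self-contained toy, not the real mgr API: ToyMgr stands in for the mgr's module lookup, the method name predict_life_expectancy and the "cloud" mode value are assumptions, and only the module names and the "local" mode value come from this thread.

```python
from typing import Any, Dict


class ToyMgr:
    """Toy registry standing in for the mgr's module lookup (hypothetical)."""

    def __init__(self) -> None:
        self.modules: Dict[str, Any] = {}

    def remote(self, module_name: str, method_name: str, *args: Any) -> Any:
        # Look up the target module and call the named method on it.
        module = self.modules[module_name]
        return getattr(module, method_name)(*args)


class LocalPredictor:
    def predict_life_expectancy(self, devid: str) -> str:
        return f"local prediction for {devid}"


class CloudPredictor:
    def predict_life_expectancy(self, devid: str) -> str:
        return f"cloud prediction for {devid}"


class DeviceHealth:
    # Config value -> module name; the "cloud" value is an assumption.
    MODE_TO_MODULE = {
        "local": "diskprediction_local",
        "cloud": "diskprediction_cloud",
    }

    def __init__(self, mgr: ToyMgr, mode: str) -> None:
        self.mgr = mgr
        self.mode = mode  # would come from device_failure_prediction_mode

    def predict(self, devid: str) -> str:
        module_name = self.MODE_TO_MODULE.get(self.mode)
        if module_name is None:
            raise ValueError(f"no prediction module for mode {self.mode!r}")
        # Only the active module is called; the inactive one is never touched.
        return self.mgr.remote(module_name, "predict_life_expectancy", devid)
```

Because devicehealth alone knows which mode is active, the prediction modules never need to know about each other.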
> > > > > - diskprediction_local (acts as the device predictor, like a job executor):
> > > > > 	Generates prediction data when notified by the devicehealth plugin.
> > > > >
> > > > > - diskprediction_cloud (acts as the device predictor, like a job executor):
> > > > > 	# But it should have its own metrics posting interval, controlled by
> > > > > 	itself. The metrics data is not based only on what the devicehealth
> > > > > 	plugin provides, because the cloud needs more data for its analysis,
> > > > > 	and the data the cloud server displays depends on it.
> > > > > 	Gets prediction data when notified by the devicehealth plugin.
> > > >
> > > > Yeah, I think this one would have a serve() method that does the
> > > > scraping of (non-smart) metrics at the short intervals.  It can ignore
> > > > the device metric scraping and let the normal piece do that part, and
> > > > only deal with the pushing of those metrics to the cloud service on
> > > > demand.
> > > >
> > > > Is that reasonable, or is there an alternative approach that makes
> > > > more sense?
> > > >
> > > > sage
> > > >
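One way to picture the serve() loop suggested above is an interruptible wait around a collect-and-push step. This is a minimal sketch, assuming hypothetical collect and push callbacks and a hypothetical ToyCloudPusher class; it does not use the real mgr module base class.

```python
import threading
from typing import Callable


class ToyCloudPusher:
    """Toy sketch of a serve()/shutdown() pair with an interruptible wait."""

    def __init__(self, collect: Callable[[], dict],
                 push: Callable[[dict], None],
                 interval_sec: float) -> None:
        self.collect = collect        # gathers the non-SMART metrics
        self.push = push              # sends them to the cloud service
        self.interval_sec = interval_sec
        self._stop = threading.Event()

    def serve(self) -> None:
        # Loop until shutdown() is called. Event.wait() returns early
        # once the event is set, so shutdown does not have to wait out
        # a full interval.
        while not self._stop.is_set():
            self.push(self.collect())
            self._stop.wait(self.interval_sec)

    def shutdown(self) -> None:
        self._stop.set()
```

Using an Event rather than time.sleep() keeps the interval short-circuitable, which matters when the interval is minutes long.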
> > > >
> > > > >
> > > > >
> > > > >
> > > > > > > devicehealth loads the prediction_mode config value, which means
> > > > > > > the user configures prediction_mode and its arguments through
> > > > > > > devicehealth. How do devicehealth_local and devicehealth_cloud
> > > > > > > access the configuration stored by this plugin? Do these plugins
> > > > > > > access the same mgr store value?
> > > > > >
> > > > > > I think we should make this a global ceph option, not a
> > > > > > mgr-specific option, so that users set it via a more familiar
> > > > > > 'ceph config set device_failure_prediction_mode local'.  I can
> > > > > > push a PR with this part of it as IIRC there is a missing
> > > > > > mgr_module method to access the cluster config.
> > > > > Great.
> > > > >
> > > > > >
> > > > > > > - generic function to get a prediction for a given device, that calls
> > > > > > >    into the enabled module via self.remote()
> > > > > > >    - called by 'device predict-life-expectancy'
> > > > > > > Does it depend on which devicehealth_* module is enabled? Right?
> > > > > >
> > > > > > Right
> > > > > >
> > > > > > > This approach does not automatically set the device life
> > > > > > > expectancy (in days). Does that part stay in each devicehealth_* plugin?
> > > > > >
> > > > > > I can't decide if it's useful to have both variants or not
> > > > > > (one that just calculates a prediction and shows you, vs one that
> > > > > > also stores it).
> > > > > > Either way, I think both commands would live in devicehealth
> > > > > > and remote() into the enabled module to get the prediction, so the
> > > > > > prediction module doesn't have to worry about storing at all.
> > > > > >
> > > > > > > The current cloud plugin pushes metrics as follows:
> > > > > > > 	Performance metrics every 10 minutes, including Ceph cluster
> > > > > > > 	status, correlations between Ceph objects, and OSD performance
> > > > > > > 	counters.
> > > > > > > 	Device SMART metrics every 12 hours, which relate to the
> > > > > > > 	devicehealth shared metrics.
> > > > > > > The current cloud plugin gets the device life expectancy (in days)
> > > > > > > from the cloud every 12 hours.
> > > > > >
> > > > > > Perhaps something like this:
> > > > > >
> > > > > >  1- devicehealth already has a health metrics scrape interval.  let it
> > > > > >     scrape as it already does.
> > > > > >  2- once it has scraped a device's metrics, it can remote() into the
> > > > > >     enabled module to notify it that there are fresh metrics available.
> > > > > >     - the cloud module could then make an API call to push the latest
> > > > > >       values.  the local module would do nothing from this hook.
> > > > > >  3- later, devicehealth would refresh its life expectancies by calling
> > > > > >     into the prediction module for each device.  the cloud module would
> > > > > >     make its API call then to get a new prediction.
> > > > > >
> > > > > > The #2 step isn't strictly needed in the above, since the
> > > > > > module could push the latest (or even all) metrics as part of
> > > > > > #3 when it is asked for a prediction; up to you!
> > > > > >
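The three steps above could look roughly like this. It is a toy sketch that uses direct method calls in place of remote(); the class names and the notify_fresh_metrics and predict method names are hypothetical, and the cloud API calls are replaced by in-memory stand-ins.

```python
from typing import Dict, List


class ToyLocalModule:
    def notify_fresh_metrics(self, devid: str, metrics: dict) -> None:
        pass  # step 2: the local module does nothing from this hook

    def predict(self, devid: str) -> str:
        return "local-prediction"


class ToyCloudModule:
    def __init__(self) -> None:
        self.pushed: List[str] = []

    def notify_fresh_metrics(self, devid: str, metrics: dict) -> None:
        # step 2: stand-in for an API call pushing the latest values
        self.pushed.append(devid)

    def predict(self, devid: str) -> str:
        # step 3: stand-in for an API call fetching a new prediction
        return "cloud-prediction"


class ToyDeviceHealth:
    def __init__(self, predictor) -> None:
        self.predictor = predictor
        self.metrics: Dict[str, dict] = {}
        self.life_expectancy: Dict[str, str] = {}

    def scrape(self, devid: str, metrics: dict) -> None:
        self.metrics[devid] = metrics                        # step 1
        self.predictor.notify_fresh_metrics(devid, metrics)  # step 2

    def refresh_life_expectancies(self) -> None:
        for devid in self.metrics:                           # step 3
            self.life_expectancy[devid] = self.predictor.predict(devid)
```

In this layout the prediction modules only push and predict; storing the results stays in devicehealth, matching the point that the prediction module does not have to worry about storage.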
> > > > > > sage
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Sage Weil <sage@xxxxxxxxxxxx>
> > > > > > > Sent: Tuesday, October 23, 2018 8:14 PM
> > > > > > > To: Rick Chen <rick.chen@xxxxxxxxxxxxxxx>
> > > > > > > Cc: Sheng-Lin Wu <shenglin.wu@xxxxxxxxxxxxxxx>
> > > > > > > Subject: Re: About separate the diskprediction plugin
> > > > > > >
> > > > > > > On Tue, 23 Oct 2018, Rick Chen wrote:
> > > > > > > > Hi Sage:
> > > > > > > > > Do you have any suggestions about the task of separating
> > > > > > > > > diskprediction? Should we split diskprediction_cloud and
> > > > > > > > > diskprediction_local into individual plugins, or separate out
> > > > > > > > > the local predictor and integrate it with the devicehealth
> > > > > > > > > plugin? And should both plugins work simultaneously?
> > > > > > >
> > > > > > > I suspect the best approach is something like:
> > > > > > >
> > > > > > > devicehealth
> > > > > > >  - shared metrics
> > > > > > >  - loads prediction_mode config value
> > > > > > > >  - later: something to auto-enable the right devicehealth_* module
> > > > > > > >  - generic function to get a prediction for a given device, that calls
> > > > > > > >    into the enabled module via self.remote()
> > > > > > >    - called by 'device predict-life-expectancy'
> > > > > > >
> > > > > > > devicehealth_local
> > > > > > > >  - implement the predict method for a device w/ sklearn models
> > > > > > >
> > > > > > > devicehealth_cloud
> > > > > > > >  - additional metrics gathering
> > > > > > > >  - calls out to cloud to publish metrics
> > > > > > > >  - implement the predict method for a device by making a call to cloud
> > > > > > >
> > > > > > > > Does that work?  I'm not completely clear what the current
> > > > > > > > status of the cloud mode is with the metrics publish vs query to
> > > > > > > > get life expectancy.
> > > > > > > > If they're separate calls, I think the above makes sense?
> > > > > > >
> > > > > > > sage
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > The current block diagram, for your reference:
> > > > > > > > [cid:image002.png@01D46AC6.AB38EB10]
> > > > > > > >
> > > > > > > >
> > > > > > [https://ipmcdn.avast.com/images/icons/icon-envelope-tick-roun
> > > > > > d-or
> > > > > > an
> > > > > > ge-ani
> > > > > >
> > > >
> mated-no-repeat-v1.gif]<https://www.avast.com/sig-email?utm_medium
> > > > =e
> > > > > > m
> > > > > >
> > > >
> ail&utm_source=link&utm_campaign=sig-email&utm_content=emailclient
> > > > >
> > > > > > 不含病毒。
> > > > > >
> > > >
> > >
> www.avast.com<https://www.avast.com/sig-email?utm_medium=email&utm
> > > > > > _source=link&utm_campaign=sig-email&utm_content=emailclient>
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ---
> > > > > > > Avast 防毒軟體已檢查此封電子郵件的病毒。
> > > > > > > https://www.avast.com/antivirus
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > > > > ---
> > > > > Avast 防毒軟體已檢查此封電子郵件的病毒。
> > > > > https://www.avast.com/antivirus
> > > > >
> > > > >
> > >
> > >
> > > ---
> > > Avast 防毒軟體已檢查此封電子郵件的病毒。
> > > https://www.avast.com/antivirus
> >
> >
> >
> > ---
> > Avast 防毒軟體已檢查此封電子郵件的病毒。
> > https://www.avast.com/antivirus
> >
> >

---
Avast 防毒軟體已檢查此封電子郵件的病毒。
https://www.avast.com/antivirus