On Fri, Jun 18, 2021 at 11:53 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
>
> Hi Kefu,
>
> On Thu, Jun 17, 2021 at 9:24 PM kefu chai <tchaikov@xxxxxxxxx> wrote:
> >
> > On Wed, Jun 16, 2021 at 10:23 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> > >
> > > Introduced by [1] for the Quincy release. This builds on work in [2] to
> > > add RADOS-backed sqlite3 support to Ceph (available in Pacific).
> > >
> > > The MgrModule API for accessing your module's database is introduced
> > > in [3]. An example of a module ("devicehealth") using the API can be
> > > seen in [4].
> > >
> > > Please let me know if you have any questions or feedback.
> >
> > Hi Patrick,
> >
> > My concern is that, without careful planning of how the pool storing
> > the health data is segmented from the pools being monitored, we could
> > interfere with the system being monitored by mutating its state.
> >
> > For instance, if a cluster is experiencing large-scale slow ops and
> > emitting lots of warning messages and/or structured performance
> > metrics, some mgr module might want to collect this information from
> > the health monitoring subsystem and persist it into the sqlite3
> > database. But that database is in turn backed by the same cluster.
> > Without careful planning, the objects stored in the .mgr pool could be
> > mapped to the same set of OSDs and monitors that are suffering from
> > the performance issue; in the worst case, this could even worsen the
> > situation. Yet allocating dedicated OSDs and creating a CRUSH rule
> > that selects them just for the .mgr pool might be difficult, or
> > overkill from a maintainability point of view.
> >
> > We actually had the same issue when adding the cluster log back to
> > OSDs for recording slow requests: the large amount of clog puts more
> > burden on the shoulders of the monitors, and if the slow requests are
> > caused by a monitor, these clogs in turn slow the monitors down
> > further.
> >
> > Shall we switch to a (local) backup sqlite backend if we identify a
> > performance issue, and restore / backfill the records once the issue
> > is resolved?
>
> Thanks for bringing this up. I think it would be reasonable to decide
> this depending on what the mgr module is doing. For example, I think
> devicehealth and snap_schedule are innocuous enough that we don't need
> to give special consideration for the system potentially being under
> load. Also these modules' mutations of the databases do not depend on
> the cluster state, healthy or degraded. OTOH, a module that is

Okay, that's a relief. Just as a note for future developers: we should
not use the sqlite backend for storing alerts created while the whole
cluster is unhealthy.

> collecting large streams of data into the database might first ingest
> that data into a local in-memory database and only backup [1] that
> in-memory database to RADOS when the cluster is healthy. If the
> database is very large then a backup would not be desirable as the
> in-memory database would be too large. In that case I would suggest
> streaming batch updates in large transactions.

Thanks. Glad that we have a plan B for that case.

> What do you think?
>
> [1] https://www.sqlite.org/backup.html
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Principal Software Engineer
> Red Hat Sunnyvale, CA
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

--
Regards
Kefu Chai
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
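P.S. For future readers: the fallback Patrick describes above (ingest into a
local in-memory database, then copy it out with SQLite's backup API [1] once
the cluster is healthy) can be sketched with plain Python sqlite3. The table
schema and the temp-file destination below are made up for illustration; in a
real mgr module the destination would be the RADOS-backed database.

```python
import os
import sqlite3
import tempfile

# Collect metrics locally in an in-memory database while the cluster is
# degraded, so writes don't hit the (already struggling) RADOS backend.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE metrics (ts INTEGER, osd TEXT, latency_ms REAL)")
mem.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?)",
    [(1, "osd.0", 12.5), (2, "osd.1", 8.0)],  # made-up sample rows
)
mem.commit()

# Once the cluster is healthy again, copy everything out in one pass using
# the backup API. A temp file stands in for the RADOS-backed database here.
path = os.path.join(tempfile.mkdtemp(), "metrics_backup.db")
dest = sqlite3.connect(path)
with dest:
    mem.backup(dest)  # sqlite3.Connection.backup(), available since Python 3.7

rows = dest.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
print(rows)  # prints 2
```

For very large datasets, where holding everything in memory is not an option,
Patrick's alternative of streaming batched inserts inside large transactions
would replace the single backup call.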