On Fri, 26 Jun 2020 09:43:54 -0700 Kevin Fenzi <kevin@xxxxxxxxx> wrote:

> > I'd be very much in favour of having an Infra managed Prometheus
> > instance (+ grafana and alertmanager on Openshift), it's something I
> > hoped to work on within CPE sustaining, in fact.
>
> You know, I'm not in love with that stack. It could well be that I
> just haven't used it enough or know enough about it, but it seems just
> needlessly complex. ;(
>
> I'd prefer we start out at a lower level... what are our requirements?
> Then, see how we can set up something to meet those.
>
> Off the top of my head (I'm sure I can think of more):
>
> * Ability to collect/gather rsyslog output from all our machines.
> * Ability to generate reports of 'variances' from all that (ie, what
> odd messages should a human look at?)
> * Handle all the logs from openshift, possibly multiple clusters?
> * Ability to easily drill down and look at some specific historical
> logs (ie, show me the logs for the bodhi-web pods from last week when
> there was an issue).
>
> Perhaps prometheus/grafana/alertmanager is the solution, but there's
> also tons of other open source projects out there too that we might
> look into.

I appreciate this reply because it contains very relevant technical
points, but it highlights that we are looking at different problems at
different levels.

Your list of requirements basically describes an "ingestion + storage +
anomaly inference for logs" solution, where the infrastructure and the
applications running on top of it are blended together. This would be
handy to have and it's indeed a good problem to work on (I don't have
specific answers/suggestions for it), but that's not the space I'm
looking at, nor the one that Prometheus applies to.

Instead, the gap/use case is narrower and more akin to "SNMP counters
for containerized web services". Logs are surely useful to drill down
into problems and investigate root causes, but that comes after being
able to answer "are any of those web services experiencing
non-transient issues?". Instrumented applications and a
metrics-gathering system enable this kind of observation before sorting
through logs, in the same way that SNMP counters let you gauge whether
any switch port on your network is seeing an eyebrow-raising amount of
traffic, without having to log every single frame-processing event and
then infer traffic levels from that. (A couple of rough sketches
further down illustrate what this looks like from the application
side.)

So, for a developer/operator of an instrumented service running as
"openshift-apps", the requirements would be:

 * Existence of an in-cluster component which periodically collects
   application-level metrics
 * Short-term retention of those metrics as time series (2 to 4 weeks)
 * Ability to interactively query those metrics, at a point in time
 * Ability to interactively graph those metrics, over a recent timespan
 * (optional) Ability to persist those graphs and organize them in
   dashboards
 * (optional²) Ability to define thresholds for anomalous metrics and
   get alerted on such events

That is, most of the value is in collecting/recording/querying metrics
to answer "is everything fine today?" (Prometheus). Dashboarding helps
cut down the amount of manual interactive querying (optional, Grafana).
Alerting on metrics helps skip active observation, but formalizing the
right anomaly thresholds is hard (optional, AlertManager; but for my
personal needs I honestly wouldn't venture into that now).
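
To make the "instrumented application" side more concrete, here is a
minimal sketch of what exposing application-level metrics could look
like with the Python prometheus_client library (the metric names and
the handler are made up for illustration, not taken from any of our
services):

    from prometheus_client import Counter, Histogram, start_http_server
    import random
    import time

    # Hypothetical metrics, purely for illustration.
    REQUESTS = Counter("myapp_requests_total",
                       "Requests handled, by status", ["status"])
    LATENCY = Histogram("myapp_request_seconds",
                        "Request processing time")

    @LATENCY.time()
    def handle_request():
        # Stand-in for real request handling.
        time.sleep(random.random() / 10)
        REQUESTS.labels(status="200").inc()

    if __name__ == "__main__":
        # Expose /metrics on :8000 for the in-cluster collector to scrape.
        start_http_server(8000)
        while True:
            handle_request()

The application only keeps its own counters up to date; collection,
retention and querying happen outside of it, which is exactly the split
described above.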
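
And for the "interactively query those metrics" point, a rough sketch
of a point-in-time query against the Prometheus HTTP API (the endpoint
URL and the metric name are assumptions, not an existing deployment):

    import requests

    # Hypothetical in-cluster endpoint; the real URL depends on how the
    # Prometheus instance would be exposed.
    PROM = "http://prometheus.example.internal:9090"

    # Per-status request rate over the last 5 minutes, as of right now.
    resp = requests.get(
        f"{PROM}/api/v1/query",
        params={"query": "sum by (status) (rate(myapp_requests_total[5m]))"},
    )
    resp.raise_for_status()
    for sample in resp.json()["data"]["result"]:
        print(sample["metric"], sample["value"])

The same kind of query evaluated over a time range (/api/v1/query_range)
is what would back the interactive graphs and, eventually, any Grafana
dashboards.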
As a real world example, this[0] is a buggy client behavior we noticed
by observing metrics exposed by one of our services running under
openshift-apps. Detecting and adjusting that behavior in turn removed a
potential thundering-herd effect on the service before it could
actually cascade onto other infra components. Possibly this was a lucky
case, but not a single line of logs was involved in this investigation,
as the clients are instrumented too and can be observed in the same
way[1].

[0] https://github.com/coreos/zincati/issues/139
[1] https://github.com/coreos/zincati/pull/312

Ciao, Luca