On Fri, 26 Jun 2020 09:43:54 -0700 Kevin Fenzi <kevin@xxxxxxxxx> wrote:

> > I'd be very much in favour of having an Infra managed Prometheus
> > instance (+ grafana and alertmanager on Openshift), it's something I
> > hoped to work on within CPE sustaining, in fact.
>
> You know, I'm not in love with that stack. It could well be that I
> just haven't used it enough or know enough about it, but it seems just
> needlessly complex. ;(
>
> I'd prefer we start out at a lower level... what are our requirements?
> Then, see how we can set up something to meet those.
>
> Off the top of my head (I'm sure I can think of more):
>
> * Ability to collect/gather rsyslog output from all our machines.
> * Ability to generate reports of 'variances' from all that (ie, what
> odd messages should a human look at?)
> * Handle all the logs from openshift, possibly multiple clusters?
> * Ability to easily drill down and look at some specific historical
> logs (ie, show me the logs for the bodhi-web pods from last week when
> there was an issue).
>
> Perhaps prometheus/grafana/alertmanager is the solution, but there's
> also tons of other open source projects out there too that we might
> look into.

I appreciate this reply because it contains very relevant technical
points, but it highlights that we are looking at different problems at
different levels.

Your list of requirements basically describes an "ingestion + storage +
anomaly inference for logs" solution, where the infrastructure and the
applications running on top of it are blended together. This would be
handy to have and it's indeed a good problem to work on (I don't have
specific answers/suggestions for it), but that's not the space I'm
looking at, nor the one that Prometheus applies to.

Instead, the gap/use case is narrower and more akin to "SNMP counters
for containerized web services". Logs are surely useful to drill down
into problems and investigate root causes, but that comes after being
able to answer "are any of those web services experiencing
non-transient issues?". Instrumented applications and a
metrics-gathering system enable this kind of observation before sorting
through logs, in the same way that SNMP counters let you gauge whether
any switch port on your network is seeing an eyebrow-raising amount of
traffic, without having to log every single frame-processing event and
then infer traffic levels from that. (A couple of rough sketches
further down illustrate what this looks like from the application
side.)

So, for a developer/operator of an instrumented service running as
"openshift-apps", the requirements would be:

 * Existence of an in-cluster component which periodically collects
   application-level metrics
 * Short-term retention of those metrics as time series (2 to 4 weeks)
 * Ability to interactively query those metrics, at a point in time
 * Ability to interactively graph those metrics, over a recent timespan
 * (optional) Ability to persist those graphs and organize them in
   dashboards
 * (optional²) Ability to define thresholds for anomalous metrics and
   get alerted on such events

That is, most of the value is in collecting/recording/querying metrics
to answer "is everything fine today?" (Prometheus). Dashboarding helps
cut down the amount of manual interactive querying (optional, Grafana).
Alerting on metrics helps skip active observation, but formalizing the
right anomaly thresholds is hard (optional, AlertManager; but for my
personal needs I honestly wouldn't venture into that now).
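
To make the "instrumented application" side more concrete, here is a
minimal sketch of what exposing application-level metrics could look
like with the Python prometheus_client library (the metric names and
the handler are made up for illustration, not taken from any of our
services):

    from prometheus_client import Counter, Histogram, start_http_server
    import random
    import time

    # Hypothetical metrics, purely for illustration.
    REQUESTS = Counter("myapp_requests_total",
                       "Requests handled, by status", ["status"])
    LATENCY = Histogram("myapp_request_seconds",
                        "Request processing time")

    @LATENCY.time()
    def handle_request():
        # Stand-in for real request handling.
        time.sleep(random.random() / 10)
        REQUESTS.labels(status="200").inc()

    if __name__ == "__main__":
        # Expose /metrics on :8000 for the in-cluster collector to scrape.
        start_http_server(8000)
        while True:
            handle_request()

The application only keeps its own counters up to date; collection,
retention and querying happen outside of it, which is exactly the split
described above.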
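
And for the "interactively query those metrics" point, a rough sketch
of a point-in-time query against the Prometheus HTTP API (the endpoint
URL and the metric name are assumptions, not an existing deployment):

    import requests

    # Hypothetical in-cluster endpoint; the real URL depends on how the
    # Prometheus instance would be exposed.
    PROM = "http://prometheus.example.internal:9090"

    # Per-status request rate over the last 5 minutes, as of right now.
    resp = requests.get(
        f"{PROM}/api/v1/query",
        params={"query": "sum by (status) (rate(myapp_requests_total[5m]))"},
    )
    resp.raise_for_status()
    for sample in resp.json()["data"]["result"]:
        print(sample["metric"], sample["value"])

The same kind of query evaluated over a time range (/api/v1/query_range)
is what would back the interactive graphs and, eventually, any Grafana
dashboards.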
As a real world example, this[0] is a buggy client behavior we noticed
by observing metrics exposed by one of our services running under
openshift-apps. Detecting and adjusting that behavior in turn removed a
potential thundering-herd effect on the service before it could
actually cascade onto other infra components. Possibly this was a lucky
case, but not a single line of logs was involved in this investigation,
as the clients are instrumented too and can be observed in the same
way[1].

[0] https://github.com/coreos/zincati/issues/139
[1] https://github.com/coreos/zincati/pull/312

Ciao, Luca