NOTE: The above approach is already implemented in the
'experimental' branch, excluding the handling of [6].
b) where to measure the latency and fop counts?
One possible way is to load io-stats between all the
nodes, but that has its own limitations. Mainly: how do we
configure options in each of these translators, and will having
too many translators slow down operations? (i.e., it creates one
extra 'frame' for every fop, so in a graph of 20 xlators that is
20 extra frame creations for a single fop.)
I propose we handle this in the 'STACK_WIND/UNWIND' macros
themselves, and provide a placeholder to store all this data in
the translator structure itself. This will be cleaner, and no
changes are required in the code base other than in 'stack.h'
(and some in 'xlator.h').
Also, we can provide an 'option monitoring enable' (or
disable) option as a default option for every translator, and
handle it at xlator_init() time itself. (This is not a
blocker for 4.0, but good to have.) The idea was proposed at
github #304 [7].
NOTE: this approach is already working pretty well on the
'experimental' branch, excluding [7]. Depending on feedback,
we can improve it further.
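To make the idea concrete, here is a minimal sketch of counting fops and latency inside wind/unwind macros, with the data stored in the translator structure. This is not the actual 'stack.h' code from the 'experimental' branch: every name here (`toy_xlator_t`, `TOY_STACK_WIND`, `TOY_STACK_UNWIND`, `toy_do_fop`, and the two fop ids) is hypothetical, the `monitoring` flag only models the proposed 'option monitoring enable', and the counters are plain integers where real multi-threaded code would presumably need atomics.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical fop ids, for illustration only. */
enum toy_fop { TOY_FOP_LOOKUP, TOY_FOP_WRITE, TOY_FOP_MAX };

/* Placeholder for per-fop data kept in the translator structure. */
struct toy_fop_stats {
    uint64_t count;      /* fops wound into this xlator */
    uint64_t latency_ns; /* total time between wind and unwind */
};

typedef struct toy_xlator {
    const char *name;
    int monitoring;      /* models 'option monitoring enable' */
    struct toy_fop_stats stats[TOY_FOP_MAX];
} toy_xlator_t;

/* Minimal stand-in for a call frame: remembers when the fop was wound. */
typedef struct toy_frame {
    struct timespec begin;
} toy_frame_t;

static uint64_t toy_elapsed_ns(const struct timespec *a,
                               const struct timespec *b)
{
    int64_t ns = (int64_t)(b->tv_sec - a->tv_sec) * 1000000000LL
               + (int64_t)(b->tv_nsec - a->tv_nsec);
    return ns > 0 ? (uint64_t)ns : 0; /* clamp if the clock stepped back */
}

/* Count the fop and note the start time. timespec_get() is C11 and
 * reads the wall clock; a monotonic clock would be a better fit. */
#define TOY_STACK_WIND(frame, xl, fop)                     \
    do {                                                   \
        if ((xl)->monitoring) {                            \
            (xl)->stats[(fop)].count++;                    \
            timespec_get(&(frame)->begin, TIME_UTC);       \
        }                                                  \
    } while (0)

/* Add the time spent below this xlator to its latency total. */
#define TOY_STACK_UNWIND(frame, xl, fop)                   \
    do {                                                   \
        if ((xl)->monitoring) {                            \
            struct timespec end_;                          \
            timespec_get(&end_, TIME_UTC);                 \
            (xl)->stats[(fop)].latency_ns +=               \
                toy_elapsed_ns(&(frame)->begin, &end_);    \
        }                                                  \
    } while (0)

/* Wind and unwind one fop through a single xlator. */
static void toy_do_fop(toy_xlator_t *xl, enum toy_fop fop)
{
    toy_frame_t frame = {0};
    TOY_STACK_WIND(&frame, xl, fop);
    /* ... the child translator would do its work here ... */
    TOY_STACK_UNWIND(&frame, xl, fop);
}
```

The point of the sketch is that no extra frame is created per xlator: the accounting piggybacks on the wind/unwind that already happens, and an xlator with monitoring disabled pays only for one branch.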
c) framework for xlators to provide private metrics
One possible solution is to use the statedump functions. But
to cause the least disruption to existing code, I propose 2 new
xlator methods, 'dump_metrics()' and 'reset_metrics()', which
can be loaded into the xlator structure from the shared object
(dlopen()/dlsym()).
'dump_metrics()' dumps the private metrics in the expected
format, and will be called from the global dump-metrics
framework. 'reset_metrics()' would be called from a CLI
command when someone wants to restart metrics from 0 to check
or validate a few things in a running cluster. This helps
debuggability.
Further feedback welcome.
NOTE: sample code is already implemented in the
'experimental' branch, and the protocol/server xlator uses this
framework to dump metrics from the rpc layer and client
connections.
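As an illustration of the two proposed methods, the sketch below shows one way the hooks could hang off the xlator structure and be driven by a global dump/reset loop. The signatures, the `sketch_*` names, and the two server counters are assumptions made for this example; they are not taken from the 'experimental' branch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct sketch_xlator sketch_xlator_t;

/* Hypothetical xlator with the two optional method pointers. */
struct sketch_xlator {
    const char *name;
    void *priv;
    int (*dump_metrics)(sketch_xlator_t *xl, FILE *out);
    int (*reset_metrics)(sketch_xlator_t *xl);
};

/* Private counters of a made-up server-like xlator. */
struct sketch_server_priv {
    uint64_t rpc_requests;
    uint64_t client_connects;
};

/* Writes the private metrics as "<key> <value>" lines. */
static int sketch_server_dump_metrics(sketch_xlator_t *xl, FILE *out)
{
    struct sketch_server_priv *priv = xl->priv;
    fprintf(out, "%s.rpc_requests %llu\n", xl->name,
            (unsigned long long)priv->rpc_requests);
    fprintf(out, "%s.client_connects %llu\n", xl->name,
            (unsigned long long)priv->client_connects);
    return 0;
}

/* Restarts the private metrics from 0. */
static int sketch_server_reset_metrics(sketch_xlator_t *xl)
{
    struct sketch_server_priv *priv = xl->priv;
    priv->rpc_requests = 0;
    priv->client_connects = 0;
    return 0;
}

/* Global dump: call the hook of every xlator that provides one.
 * Returns how many xlators dumped successfully. */
static int sketch_dump_all(sketch_xlator_t **graph, size_t n, FILE *out)
{
    int dumped = 0;
    for (size_t i = 0; i < n; i++) {
        if (graph[i]->dump_metrics &&
            graph[i]->dump_metrics(graph[i], out) == 0)
            dumped++;
    }
    return dumped;
}

/* Global reset, e.g. triggered by a CLI command.
 * Returns how many xlators reset successfully. */
static int sketch_reset_all(sketch_xlator_t **graph, size_t n)
{
    int reset = 0;
    for (size_t i = 0; i < n; i++) {
        if (graph[i]->reset_metrics &&
            graph[i]->reset_metrics(graph[i]) == 0)
            reset++;
    }
    return reset;
}
```

Because both pointers are optional, an xlator that has no private metrics simply leaves them NULL and is skipped, which is what keeps the change non-disruptive for existing translators.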
d) format of the 'metrics' file.
If you want any data that can be plotted on a graph, you
need a key (should be a string) and a value (should be a
number), collected over time. So, this file should output data
for monitoring systems, and not exactly for debuggability. We
have 'statedump' for debuggability.
So, I propose a plain text file, where data would be dumped
like below.
```
# anything starting from # would be treated as comment.
<key><space><value>
# anything after the value would be ignored.
```
Any better solutions are welcome. Ideally, we should keep
this friendly for external projects to consume, like tendrl
[8], graphite, prometheus, etc. Also note that once we agree
on the format, it would be very hard to change, as external
projects would use it.
I would like to hear the feedback from people who are
experienced with monitoring systems here.
NOTE: the above format works fine with the 'glustermetrics'
project [9] and is working decently on the 'experimental' branch.
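For anyone who wants to consume the file, a line parser for the proposed format could look roughly like the sketch below. It assumes the value is read as a floating-point number and that keys contain no spaces; the function name `metrics_parse_line` and the sample key used with it are made up for illustration, and this is not the parser used by 'glustermetrics'.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one line of the proposed '<key><space><value>' format.
 * Returns 1 if a sample was found, 0 for a comment or empty line,
 * and -1 if the line is malformed or the key does not fit.
 */
static int metrics_parse_line(const char *line, char *key,
                              size_t key_len, double *value)
{
    /* Comments and empty lines carry no sample. */
    if (line[0] == '#' || line[0] == '\n' || line[0] == '\0')
        return 0;

    /* The key is everything up to the first space. */
    const char *space = strchr(line, ' ');
    if (space == NULL || space == line)
        return -1;

    size_t n = (size_t)(space - line);
    if (n >= key_len)
        return -1;
    memcpy(key, line, n);
    key[n] = '\0';

    /* The value is the number right after the space. */
    char *end = NULL;
    double v = strtod(space + 1, &end);
    if (end == space + 1)
        return -1; /* no number after the key */

    *value = v;    /* anything after the value is ignored */
    return 1;
}
```

With this sketch, a line such as `server.rpc_requests 42 some trailing note` yields the key `server.rpc_requests` and the value 42, while a line starting with `#` is skipped.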
------
* Discussions:
Let me know how you all want to take the discussion
forward.
Should we move to github and discuss each issue there? Or
should I rebase and send the current patches from experimental
to the 'master' branch and discuss in our review system? Or
should we continue over email here?
Regards,
Amar
References: