Dear Community,
(following is a summary of a discussion in the RGW refactoring meeting from 6-Apr-22)
Recently I've been debugging multisite issues in the RGW on a large scale. This poses several challenges: some require significant improvements, but others could be addressed with little to no extra development. The most challenging issue is the ability to correlate logs for asynchronous actions and across daemons (e.g. RGWs in two different zones, or RGWs and OSDs). This is currently being addressed by introducing a tracing framework (OpenTelemetry) to the RGW and OSD.
However, there are 3 other issues for which we already have working solutions that are misused in many areas of our code:
log levels
when dealing with huge log files, focusing only on errors is critical. currently, the numeric value of the log level can be any number between -1 and 20:
* level -1: this is a log level that cannot be turned off. not sure that we should even be using it (especially on the fast path), or maybe just use it for FATAL errors (currently it is used in 303 places in the RGW, including places marked as WARNINGs)
* level 0 (default level): most commonly used for "ERROR" but could be found as "WARNING", "NOTICE" or in some cases just describing some behavior that is not erroneous in any way
* the other levels are used for pretty much everything (including 29 ERRORs at level 20).
Would be great to converge on 5 log levels (as in many other systems): FATAL, ERROR, WARNING, INFO, DEBUG.
This won't require any extra development, mainly discipline from developers and reviewers.
Making sure that the right text appears at the right level would also make the logs more "greppable".
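To make the idea concrete, here is a rough sketch (not existing Ceph code) of how five named severities could sit on top of the current numeric range. The `Severity` enum, the numeric values chosen for WARNING, INFO and DEBUG, and the helper names are all assumptions made for illustration. It also assumes that a message is emitted when its level is at or below the configured level.

```cpp
#include <cassert>
#include <string>

// Hypothetical severity names. The numeric values are an assumed mapping
// onto the existing -1..20 range, not something Ceph defines today.
enum class Severity : int {
  Fatal   = -1,  // cannot be turned off
  Error   = 0,   // the default level
  Warning = 1,   // assumed value
  Info    = 5,   // assumed value
  Debug   = 20,  // assumed value
};

inline int to_level(Severity s) { return static_cast<int>(s); }

// Fixed text tag, so that "grep ERROR" finds errors and nothing else.
inline const char* to_tag(Severity s) {
  switch (s) {
    case Severity::Fatal:   return "FATAL";
    case Severity::Error:   return "ERROR";
    case Severity::Warning: return "WARNING";
    case Severity::Info:    return "INFO";
    case Severity::Debug:   return "DEBUG";
  }
  return "UNKNOWN";
}

// Assumption: a message is emitted when its numeric level does not
// exceed the level configured for the subsystem.
inline bool should_log(Severity s, int configured_level) {
  assert(configured_level >= -1 && configured_level <= 20);
  return to_level(s) <= configured_level;
}
```

With a mapping like this, a message tagged "WARNING" could never end up at level -1 or 20, because the tag and the level come from the same value.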
subsystems
when debugging certain areas of the code, it would be very useful to set only this area with a higher debug level, while keeping other areas less verbose.
currently, we set the "dout_subsys" to "ceph_subsys_rgw" in all the RGW files. having one flag at RGW level is useful, but it would be even more useful if we could have more fine-grained subsystems under the RGW (ideally not set as a file-level macro)
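One possible shape for finer-grained subsystems is a hierarchical lookup, where a sub-area inherits the RGW level unless it is set explicitly. The sketch below is only an illustration: the `SubsysLevels` class and names such as "rgw.sync" are hypothetical and do not exist in Ceph.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical registry of per-subsystem log levels. Dotted names such as
// "rgw.sync" are illustrative only.
class SubsysLevels {
  std::map<std::string, int> levels_;
  int default_level_;

public:
  explicit SubsysLevels(int default_level) : default_level_(default_level) {}

  void set(const std::string& subsys, int level) {
    assert(level >= -1 && level <= 20);
    levels_[subsys] = level;
  }

  // Look up the most specific match: "rgw.sync.data", then "rgw.sync",
  // then "rgw", and finally the global default.
  int get(std::string subsys) const {
    for (;;) {
      auto it = levels_.find(subsys);
      if (it != levels_.end()) return it->second;
      auto dot = subsys.rfind('.');
      if (dot == std::string::npos) return default_level_;
      subsys.erase(dot);  // drop the last component and retry
    }
  }

  bool should_log(const std::string& subsys, int msg_level) const {
    return msg_level <= get(subsys);
  }
};
```

For example, setting "rgw" to 1 and "rgw.sync" to 20 would make the sync code verbose while every other RGW area stays at level 1.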
context
until we have a fully-fledged tracing solution, we would need more context in the logs:
* dpp - tells you where the code was invoked from. recently there was a huge effort to add dpp to many areas of the RGW, but there are still places where the work was not finished. also, there are places where a more fine-grained dpp (similar to the per-request prefix we have in the frontend) would be useful
* add "correlative info" to the log: object and bucket names, shard ids, request-id etc. to allow correlating different log messages in the log and across logs
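The two points above could be combined by chaining prefixes, so that every message carries both the request and the object it refers to. The sketch below uses a stand-in `PrefixProvider` interface that is only loosely modeled on the idea of dpp. It is not Ceph's actual class, and the type names, field names and output format are assumptions.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Stand-in for a prefix-provider interface (hypothetical, not Ceph's class).
struct PrefixProvider {
  virtual ~PrefixProvider() = default;
  virtual std::ostream& gen_prefix(std::ostream& out) const = 0;
};

// Per-request prefix, similar in spirit to the prefix used in the frontend.
struct RequestPrefix : PrefixProvider {
  std::string req_id;
  explicit RequestPrefix(std::string id) : req_id(std::move(id)) {}
  std::ostream& gen_prefix(std::ostream& out) const override {
    return out << "req=" << req_id << ' ';
  }
};

// Wraps a parent provider and appends the "correlative info".
struct ObjectPrefix : PrefixProvider {
  const PrefixProvider& parent;
  std::string bucket;
  std::string object;
  int shard_id;
  ObjectPrefix(const PrefixProvider& p, std::string b, std::string o, int s)
    : parent(p), bucket(std::move(b)), object(std::move(o)), shard_id(s) {}
  std::ostream& gen_prefix(std::ostream& out) const override {
    return parent.gen_prefix(out) << "bucket=" << bucket
                                  << " object=" << object
                                  << " shard=" << shard_id << ' ';
  }
};

// Builds one log line: the full prefix chain followed by the message.
std::string format_line(const PrefixProvider& dpp, const std::string& msg) {
  std::ostringstream out;
  dpp.gen_prefix(out) << msg;
  return out.str();
}
```

Because the keys are fixed ("req=", "bucket=", "shard="), the same grep pattern would work on the logs of different daemons.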
Your feedback is much appreciated!
Yuval