Re: Logging future direction and ideas.

Ludwig <lkrispen@xxxxxxxxxx> · Wed, 15 May 2019 11:37:54 +0200

Hi William,

this is a good starting point to discuss and decide how to move forward 
with logging (although it looks like you are already some steps ahead).

I have no strong ties to existing logging format and if we do it in rust 
or not I don't care, but I do not want to use existing functionality for 
the sake af a new unified approach.

First we need to define what is the purpose and usage of our logs, what 
do we want to keep or extend. I see these main area:

- statistics

- auditing (authentications, modifications)

. debugging (for my usage the most important aspect)

Next: what are the real problems

- performance: yes, it would be nice to have less logging impact on 
performance, but I haven't seen real complaints and doing it differently 
does not guarantee better performance, we have debug levels (like 
tracing) which are not usable because they slow down everything, I do 
not think this will be resolved by just replacing the library

- information:  we continue to add information to the logs, and there 
are still open requests

So: If I look at your suggestions I do like some and have concerns with 
others, and I am not sure if the priorities are right.

You list a couple of tickets related to logging, but forgot others, eg:

https://pagure.io/389-ds-base/issue/47822 - improve logging of ACL decisions

https://pagure.io/389-ds-base/issue/48680 - enable logging for 3rd party 
libs

https://pagure.io/389-ds-base/issue/50002 - improve password policy logging

https://pagure.io/389-ds-base/issue/50187 - improve replication logging

these requests for improvements of logging exist for quite some time, 
they were accepted as useful and good to have, but we didn't have the 
capacity to work on them. The work is tedious, going thru ACL or 
replication code and decide what and how to log better requires good 
knowledge of the areas and is not fancy - but I wonder if the effort 
wouldn't be spent better to improve the "what" and not the "how".

Now: some comments on your suggestions.

- perf: I think everything we can do to reduce locking in logging is a 
good thing, if we do logging more asynchronous it is ok for me as long 
as we can see the "real time" of the event and get the logs ordered. If 
replacing the "log" component by a rust implemention helps it is good. 
We could in a first step keep the slapi_log_* api and only do it more 
efficient

- merging of logs: I do not like the idea to have everything in one log 
file, if you say we can provide tools to split, we can also provide 
tools to merge. The logs have a different format and for good reasons, 
eg the audit log is ldif like and can easily be used to replay changes, 
why force it into another format.

Also, in my opinion, the different logs have a different life time. You 
might want to keep all audit logs, keep the error log for the main 
events for some longer time than the fast growing access log, defining 
log rotation based on the log type is a benfit in my opinion

- operation orientation: there are logging events which are independent 
of operations, especially in replicatio. I would not want to 
artificially enforce a "operation context" eg for changelog processing, 
tombstone purging,.....

- format: for machines it might be good that logs are machine readable, 
but I did spend a lot of times reading logs and lokkinng for patterns 
and clues, I think as soon as they are in json I will stop doing this.

This may sound too negative, I agree that logging deserves attention and 
improvement, but we need to agree on what we want to achieve.

Best regards,

Ludwig

On 05/10/2019 05:13 AM, William Brown wrote:
Hi all,

So I think it's time for me to write some logging code to improve the situation. Relevant links before we start:

https://pagure.io/389-ds-base/issue/49415
http://www.port389.org/docs/389ds/design/logging-performance-improvement.html
https://pagure.io/389-ds-base/issue/50350
https://pagure.io/389-ds-base/issue/50361

All of these links touch on issues around logging, and I think they all combine to create three important points:

* The performance of logging should be improved
* The amount of details (fine grain) and information in logs should improve
* The structure of the log content should be improved to aid interaction (possibly even machine parsable)

I will turn this into a design document, but there are some questions I would like some input to help answer as part of this process to help set the direction and tasks to achieve.

-- Should our logs as they exist today, continue to exist?

I think that my view on this is "no". I think if we make something better, we have little need to continue to support our legacy interfaces. Of course, this would be a large change and it may not sit comfortably with people.

A large part of this thinking is that the "new" log interface I want to add is focused on *operations* rather than auditing accesses or changes, or over looking at errors. The information of both the current access/audit/error would largely be melded into a single operation log, and then with tools like logconv, we
could parse and extract information that would behave the same way as access/error/audit.

At the same time - I can see how people *may* want a "realtime" audit of operations as they occur (IE access log), but this still today is already limited by having to "wait" for operations to complete.

In a crash scenario, we would be able to still view the logs that are queued, so I think there are not so many concerns about losing information in these cases (in fact we'd probably have more).

-- What should the operation log look like?

I think it should be structured, and should be whole-units of information, related to a single operation. IE only at the conclusion of the operation is it logged (thus the async!). It should support arbitrary, nested timers, and would *not* support log levels - it's a detailed log of the process each query goes through.

An example could be something like:

[timestamp] - [conn=id op=id] - start operation
[timestamp] - [conn=id op=id] - start time = time ...
[timestamp] - [conn=id op=id] - started internal search '(some=filter)'
[timestamp] - [conn=id op=id parentop=id] - start nested operation
[timestamp] - [conn=id op=id parentop=id] - start time = time ...
...
[timestamp] - [conn=id op=id parentop=id] - end time = time...
[timestamp] - [conn=id op=id parentop=id] - duration = diff end - start
[timestamp] - [conn=id op=id parentop=id] - end nested operation - result -> ...
[timestamp] - [conn=id op=id] - ended internal search '(some=filter)'
...
[timestamp] - [conn=id op=id] - end time = time
[timestamp] - [conn=id op=id] - duration = diff end - start

Due to the structured - blocked nature, there would be no interleaving of operation messages. therefor the log would appear as:

[timestamp] - [conn=00 op=00] - start operation
[timestamp] - [conn=00 op=00] - start time = time ...
[timestamp] - [conn=00 op=00] - started internal search '(some=filter)'
[timestamp] - [conn=00 op=00 parentop=01] - start nested operation
[timestamp] - [conn=00 op=00 parentop=01] - start time = time ...
...
[timestamp] - [conn=00 op=00 parentop=01] - end time = time...
[timestamp] - [conn=00 op=00 parentop=01] - duration = diff end - start
[timestamp] - [conn=00 op=00 parentop=01] - end nested operation - result -> ...
[timestamp] - [conn=00 op=00] - ended internal search '(some=filter)'
...
[timestamp] - [conn=00 op=00] - end time = time
[timestamp] - [conn=00 op=00] - duration = diff end - start
[timestamp] - [conn=22 op=00] - start operation
[timestamp] - [conn=22 op=00] - start time = time ...
[timestamp] - [conn=22 op=00] - started internal search '(some=filter)'
[timestamp] - [conn=22 op=00 parentop=01] - start nested operation
[timestamp] - [conn=22 op=00 parentop=01] - start time = time ...
...
[timestamp] - [conn=22 op=00 parentop=01] - end time = time...
[timestamp] - [conn=22 op=00 parentop=01] - duration = diff end - start
[timestamp] - [conn=22 op=00 parentop=01] - end nested operation - result -> ...
[timestamp] - [conn=22 op=00] - ended internal search '(some=filter)'
...
[timestamp] - [conn=22 op=00] - end time = time
[timestamp] - [conn=22 op=00] - duration = diff end - start

An alternate method for structuring could be a machine readable format like json:

{
     'timestamp': 'time',
     'duration': ....,
     'bind': 'dn of who initiated operation',
     'events': [
         'debug': 'msg',
         'internal_search': {
              'timestamp': 'time',
              'duration': ....,
         }
     ],
}

This would make writing tools like logconv much easier, as we wouldn't need regex or others to parse and understand these logs.

-- Should our log interface from slapi_log_err etc change?

This is a super tricky question ... I think the answer is yes - we are using log_err for messages that should be associated to an operation - and additionally, we often disable or lower the level of error messages due to the low performance. It would be hard to shift behaviour and expectations around the behaviour of this log, so a better idea would be to fix things oourselves by adjusting how we log in plugins. We'll probably always need to support log_err, but then those developers would not gain the benefit of our improvements.

We could translate log_err to emit simple text messages into the events section of the log, but for full benefit, using the operation log interfaces would be best.

-- Should we still improve the performance of logs as they exist today?

Probably yes - but I think there is a good reason to approach this differently.

I think we should be log_async.c and log_operation.c in parallel to existing logging, and have a similer feature gating flag in use like we have done for other features. We can then build this in parallel and gain confidence and experience with it. Once complete, we could then change the log_access, log_audit and log_err macros/functions to call related log_op calls, or rewrite those logs to have log_async.c execution backends.

SUMMARY:

* Make slapi_log_op and an async log system in parallel to existing log infra
* Determine a deprecation or improvement plan for other log types
* Port log types to async, or remove them.

--
Sincerely,

William
_______________________________________________
389-devel mailing list -- 389-devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to 389-devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-devel@xxxxxxxxxxxxxxxxxxxxxxx
_______________________________________________
389-devel mailing list -- 389-devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to 389-devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-devel@xxxxxxxxxxxxxxxxxxxxxxx