Re: Q: non-ASCII in syslog

Lennart Poettering <lennart@xxxxxxxxxxxxxx> · Wed, 27 Apr 2022 13:10:43 +0200

On Wed, 27.04.22 09:09, Ulrich Windl (Ulrich.Windl@xxxxxxxxxxxxxxxxxxxx) wrote:

> Hi!
>
> Having written an RFC 3164 compatible syslog daemon, I noticed that systemd
> created syslog messages with non-ASCII characters.
> The problem is that a remote syslogd can hardly guess the correct character
> set (I'm using rsyslog to forward local messages to a remote
> server).

It's 2022. I think at this point, software should always assume the
charset is UTF-8 if it doesn't have any reason to believe otherwise.

It's kinda what we started to do all across our codebase really. We
use UTF-8 for everything by default. For some things where people
complain sufficiently loudly we'll conditionalize them so that we have
some fallback in place if we know for sure UTF-8 is not OK, but the
default is always and everywhere UTF-8.

> Example of such message:
> systemd-tmpfiles[3311]: [/usr/lib/tmpfiles.d/svnserve.conf:1] Line references
> path below legacy directory /var/run/, updating /var/run/svnserve →
> /run/svnserve; please update the tmpfiles.d/ drop-in file accordingly.
>
> (The arrow is encoded as three bytes (\xe2\x86\x92))
>
> RFC 5424 syslog messages require the use of a BOM (%xEF.BB.BF) at the
> beginning of a message if the message uses UTF-8:

We do not implement RFC 5424, as glibc doesn't support that. In fact
we don't even implement RFC 3164 in full, since glibc generates the
messages in one very specific format only.

>
>       MSG             = MSG-ANY / MSG-UTF8
>       MSG-ANY         = *OCTET ; not starting with BOM
>       MSG-UTF8        = BOM UTF-8-STRING
>       BOM             = %xEF.BB.BF
>
> Wouldn't it make sense to add such a BOM for RFC 3164 syslog messages also if
> non-ASCII (i.e.: UTF-8) encoded characters are used?

There's plenty of software that doesn't support RFC 5424, and putting a
BOM first is certainly not implemented in any of those. I think BOM is
hideous and defaulting to UTF-8 generally safe. If we put a BOM first,
these messages would likely no longer be compatible with a large variety
of consumers, because they can't handle a BOM. That would be worse
than the status quo, I am sure: if we just send UTF-8, things
should generally just work for any software that either a) also
defaults to UTF-8 when encountering an 8-bit char or b) is agnostic to
charsets and just passes data through.
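
To illustrate case a), here is a minimal sketch in Python of how a
consumer could default to UTF-8. It is only an illustration, not code
from systemd, glibc or rsyslog: the function name
decode_syslog_payload and the Latin-1 fallback are assumptions made
for the example.

```python
import codecs


def decode_syslog_payload(payload: bytes, fallback: str = "latin-1") -> str:
    """Decode a syslog message body, assuming UTF-8 unless that fails.

    Hypothetical helper: the name and the Latin-1 fallback are
    assumptions for this sketch, not part of any syslog implementation.
    """
    # Tolerate a leading UTF-8 BOM (EF BB BF) if a sender adds one,
    # but do not require it.
    if payload.startswith(codecs.BOM_UTF8):
        payload = payload[len(codecs.BOM_UTF8):]
    try:
        # Default assumption: 8-bit data is UTF-8.
        return payload.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to a legacy single-byte charset.
        return payload.decode(fallback)


# The arrow from the tmpfiles message, encoded as \xe2\x86\x92.
print(decode_syslog_payload(b"/var/run/svnserve \xe2\x86\x92 /run/svnserve"))
```

A consumer written like this handles messages with and without a BOM,
so the sender does not need to add one.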

So, yeah, we might be stretching standards and tradition a bit, but
it has actually worked out quite well so far.

Lennart

--
Lennart Poettering, Berlin