Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)

Markus Heiser <markus.heiser@xxxxxxxxxxx> · Thu, 6 May 2021 19:53:25 +0200

On 06.05.21 at 19:27, Mauro Carvalho Chehab wrote:
On Thu, 6 May 2021 19:04:44 +0200,
Markus Heiser <markus.heiser@xxxxxxxxxxx> wrote:

On 06.05.21 at 18:46, Mauro Carvalho Chehab wrote:
On Thu, 6 May 2021 17:57:15 +0200,
Markus Heiser <markus.heiser@xxxxxxxxxxx> wrote:

On 06.05.21 at 12:39, Michal Suchánek wrote:
When building HTML documentation I get this output:
...
[  412s] UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
...

It does not say which input file contains the offending character so I can't tell which file is broken.

Any idea how to debug?

I guess the build host is a very simple container. What do

     echo $LC_ALL
     echo $LANG

print?  If it is latin-1, change it to something using UTF-8 (I recommend
'en_US.utf8').
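
The same check can be made from inside Python, which is closer to what Sphinx itself sees. This is only a minimal sketch: the helper name `describe_encoding_environment` is made up for illustration, and it reports just the two variables mentioned above plus the encodings Python derived from them.

```python
import locale
import os
import sys


def describe_encoding_environment() -> dict:
    """Collect the locale settings and the encodings Python derived from them."""
    return {
        # Raw environment variables, None if unset (as in a bare container).
        "LC_ALL": os.environ.get("LC_ALL"),
        "LANG": os.environ.get("LANG"),
        # Encoding Python uses by default for text files and pipes.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding used when print() writes to stdout (may be missing
        # if stdout was replaced by an object without this attribute).
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


for name, value in describe_encoding_environment().items():
    print(f"{name}: {value!r}")
```

If `stdout_encoding` is not a UTF-8 variant, any non-Latin symbol written to stdout is a candidate for the error above.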

A UnicodeEncodeError can occur anywhere characters are
encoded from (internal) unicode to the encoding of a stream.

For example:

A print or log statement which writes to stdout needs to encode
from unicode to stdout's encoding.  If there is a single unicode symbol
which cannot be encoded in the stream's encoding, a UnicodeEncodeError
is raised.
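
A small sketch of that situation, assuming an in-memory byte buffer stands in for stdout (the helper `write_text` and the sample string are made up for illustration, not taken from the kernel build):

```python
import io


def write_text(text: str, encoding: str) -> bytes:
    """Write text through a text stream with the given encoding, the way
    print() writes to stdout, and return the bytes that were produced."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)  # the str -> bytes encoding happens here
    stream.flush()
    return raw.getvalue()


sample = "kernel docs: 内核文档"

# A UTF-8 stream can encode every unicode symbol.
print(write_text(sample, "utf-8"))

# A latin-1 stream cannot encode the Chinese symbols.
try:
    write_text(sample, "latin-1")
except UnicodeEncodeError as exc:
    print("failed:", exc)
```

Note that the exception only names the codec and the position in the string being written, not the input file, which matches the unhelpful message in the original report.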

Hi Markus,

The builder's locale shouldn't matter when building the Kernel
documentation (or any other documents built from other git trees
on other open source projects), as the Kernel's *.rpm document charset
won't change, no matter in which part of the globe it is built.

I vaguely remember a change we made a couple of years ago
to address this issue.

Hi Mauro :)

Sure? .. What if the logger wants to log some symbols from the
Chinese-translated parts to stdout and the encoding of stdout is
latin-1?

In Python the logger will raise a UnicodeEncodeError; this is
what I know .. but I'm often wrong ;)

Yeah, Python (and almost all python apps) has a mad behavior when
it finds an unexpected character: instead of ignoring it, it

Hi Mauro,

it is not comfortable but is it mad? ..

Most languages (or applications) do not handle the encoding
of strings; they just pipe a binary stream, while Python
decodes / encodes strings.

"The Zen of Python" [1] says

   Explicit is better than implicit.

If a stream can't encode some symbols and these symbols should be ignored,
you have to explicitly configure the stream's encoding to ignore
such symbols.
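
As a rough sketch of what "explicit" can look like, assuming again an in-memory buffer in place of stdout (the helper `encode_via_stream` is a made-up name): the `errors` argument of `io.TextIOWrapper` selects what happens to symbols the encoding can't represent.

```python
import io


def encode_via_stream(text: str, encoding: str, errors: str = "strict") -> bytes:
    """Write text through a text stream with an explicit error handler
    and return the bytes that reached the underlying buffer."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


sample = "docs: 内核"

# "ignore" silently drops symbols that latin-1 cannot encode.
print(encode_via_stream(sample, "latin-1", errors="ignore"))

# "replace" substitutes a "?" for each symbol that cannot be encoded.
print(encode_via_stream(sample, "latin-1", errors="replace"))
```

For the real stdout, `sys.stdout.reconfigure(errors="replace")` (Python 3.7 or later) applies the same choice to the existing stream. Either way symbols are lost, which is why UTF-8 remains the simpler escape.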

I guess these encoding discussions will haunt me for the rest of my
life.  My escape strategy is to use UTF-8 wherever possible.

[1] https://www.python.org/dev/peps/pep-0020/

  -- Markus --