On 28/04/2024 01:49, Herbert Xu wrote:
> On Sat, Apr 27, 2024 at 11:31:43PM +0200, Christoph Anton Mitterer wrote:
>> Long story short:
>> The recommended way was to add a sentinel character '.' at the end of
>> the output within the command substitution and strip that off later
>> with parameter expansion.
>> But despite the very special properties[0] of '.', it's apparently
>> still required to set LC_ALL=C when stripping the sentinel, because the
>> pattern matching notation in ${foo%.} is defined only on strings of
>> characters, not on strings of bytes.
>
> Are you talking about a theoretical undefined condition, or an
> actual one? Which shell doesn't deal with ${foo%.} correctly?
The way you are implementing it, once you get to pmatch(), arguably you
will not handle ${foo%.} correctly.
Consider a UTF-8 locale, where the lone byte '\303' is not a valid
multibyte character. In this locale, consider
    foo=$(printf '\303.')
    foo=${foo%.}
This is something I expect to set foo to '\303', and it does in all
shells I know of, despite POSIX not saying this needs to work. If I am
reading your multibyte character support right, as long as a full
multibyte character has not been read, the next byte is taken as part of
that multibyte character, meaning you will take '\303.' as a single
invalid multibyte character and the '.' will no longer match.
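
To make this case easy to reproduce, here is a minimal sketch. The helper
names hex, strip_sentinel and strip_sentinel_c are made up for this mail,
and the locale name in the comment is an assumption that varies between
systems. It prints the bytes left after stripping the sentinel, once in
the current locale and once with LC_ALL=C as suggested earlier in the
thread.

```shell
#!/bin/sh
# Run under a UTF-8 locale, e.g.  LC_ALL=C.UTF-8 sh thisfile
# (the locale name is an assumption; check `locale -a` on your system).

# hex: print the bytes of $1 as lowercase hex, independent of the locale.
hex() {
    printf '%s' "$1" | LC_ALL=C od -An -tx1 | tr -d ' \n'
}

# strip_sentinel: remove one trailing '.' using the current locale.
strip_sentinel() {
    hex "${1%.}"
}

# strip_sentinel_c: same, but with pattern matching forced to the C locale.
# The function body is a subshell so the LC_ALL change does not leak.
strip_sentinel_c() (
    LC_ALL=C
    hex "${1%.}"
)

foo=$(printf '\303.')
echo "current locale: $(strip_sentinel "$foo")"
echo "LC_ALL=C:       $(strip_sentinel_c "$foo")"
```

A shell that behaves as I expect prints c3 on both lines. One that treats
'\303.' as a single invalid character would presumably leave the '.' in
place on the first line, giving c32e.
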
At the same time, '\303\251' is a valid multibyte character, and '\251'
on its own is not. So also consider
    foo=$(printf '\303\251')
    foo=${foo%$(printf '\251')}
Here, it is not clear what the correct result is, and indeed, shells
disagree. bosh, ksh, zsh, and my shell do not break up characters, which
I believe to be the most sensible behaviour. bash and mksh do break them
up.
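
A small sketch for comparing shells on this second case follows. Again,
hex and strip_suffix are hypothetical helper names, not anything from an
existing shell, and the script is meant to be run under a UTF-8 locale
with each shell you want to compare.

```shell
#!/bin/sh
# hex: print the bytes of $1 as lowercase hex, independent of the locale.
hex() {
    printf '%s' "$1" | LC_ALL=C od -An -tx1 | tr -d ' \n'
}

# strip_suffix: remove the shortest suffix of $1 matching pattern $2
# and print what is left as hex.
strip_suffix() {
    hex "${1%$2}"
}

foo=$(printf '\303\251')    # one valid two-byte UTF-8 character
tail=$(printf '\251')       # its second byte, invalid on its own
echo "result: $(strip_suffix "$foo" "$tail")"
```

A result of c3a9 means the shell kept the character whole; c3 means it
split the character and removed the trailing byte.
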
The corner cases need to be carefully considered in order to figure out
how to write the multibyte character support core functionality.
Cheers,
Harald van Dijk