Re: [PATCH 0/8] Add multi-byte support

Hey.


On Sat, 2024-04-27 at 19:03 +0800, Herbert Xu wrote:
> This patch series adds multi-byte support to dash.  For now only
> fnmatch is supported as the native pmatch function has not been
> modified to support multi-byte characters.

Nothing against the functionality per se, but I think that for all
scripts that assumed dash's (and thus, on many systems, /bin/sh's)
current behaviour of being C locale only, even without explicitly
setting LC_ALL=C, this may cause quite a few subtle issues.


AFAIU, in the C locale every byte is a character, and thus in
particular the pattern matching notation is defined for every possible
outcome of command substitution and for every possible content of
variables (which, in every(!) locale, may be any bytes other than NUL).


For example:
************
A while ago I asked on the Austin Group mailing list for a portable
way to get command substitution without stripping of trailing newlines.

Long story short:
The recommended way was to add a sentinel character '.' at the end of
the output within the command substitution and strip that off later
with parameter expansion.
But despite the very special properties[0] of '.', it's apparently
still required to set LC_ALL=C when stripping the sentinel, because the
pattern matching notation in ${foo%.} is defined only on strings of
characters, not on strings of bytes.
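
To illustrate, here is a minimal sketch of that sentinel idiom. The
variable name `out` and the printf command producing the output are
just placeholders, and the plain LC_ALL=C assignment is deliberately
left in effect to show the side effect:

```shell
# Append a sentinel '.' inside the command substitution, so that the
# trailing newlines are no longer at the very end and are not stripped.
out=$(printf 'foo\n\n'; printf '.')

# Strip the sentinel again. LC_ALL=C is set first so that the pattern
# in ${out%.} is matched on bytes. Note that this override stays in
# effect for the rest of the script.
LC_ALL=C
out=${out%.}

# Dump the result byte by byte; the two trailing newlines are preserved.
printf '%s' "$out" | od -c
```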

Back then, Harald van Dijk had some ideas on how that might be resolved
for good, but IIRC none of the shell implementors seemed to be really
interested.

My goal was to make a portable function like
   command_subst_with_newlines "eval-ed-command-string" "target-variable-name"
which, with the requirement of setting LC_ALL, proved more or less
impossible if the function is to have no side effects (like leaving
LC_ALL overridden, or possibly overwriting some existing variable like
OLD_LC_ALL).
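
A rough sketch of what such a function might look like, mainly to show
where the side effects come from rather than to solve them. The helper
variables with the _cswn_ prefix are hypothetical names; since POSIX sh
has no local variables, they are global and would clobber any existing
variables of the same name. The sketch also assumes that the second
argument is a valid variable name, and it does not preserve the exit
status of the eval-ed command:

```shell
command_subst_with_newlines() {
    # $1: command string that gets eval-ed
    # $2: name of the variable that receives the output
    _cswn_out=$(eval "$1"; printf '.')

    # Remember whether LC_ALL was set at all, and its previous value.
    _cswn_lc_set=${LC_ALL+set}
    _cswn_lc_old=${LC_ALL-}

    # Strip the sentinel with byte-wise pattern matching.
    LC_ALL=C
    _cswn_out=${_cswn_out%.}

    # Restore the previous state of LC_ALL.
    if [ -n "$_cswn_lc_set" ]; then
        LC_ALL=$_cswn_lc_old
    else
        unset LC_ALL
    fi

    # Assign the result to the variable named by $2.
    eval "$2=\$_cswn_out"
}

# Usage: the trailing newlines end up in $result.
command_subst_with_newlines 'printf "a\n\n"' result
printf '%s' "$result" | od -c
```

Even with LC_ALL restored, the three _cswn_ variables remain set after
the call, which is exactly the kind of side effect meant above.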


Anyway... I could imagine that, if dash becomes multi-byte aware, there
might be more or less subtle surprises.


Cheers,
Chris.


[0] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html
"The encoded values associated with <period>, <slash>, <newline>, and
<carriage-return> shall be invariant across all locales supported by
the implementation."




