On Mon, Jan 27, 2025 at 12:22:49PM +0100, Alejandro Colomar wrote: > Hi Jason, > > I think the recommendation to use the current locale for handling > pathnames isn't good. > > If I use the C locale (and I do have systems with the C locale), then > programs running on that system would corrupt files that go through that > system. I agree. I think that this is just a limitation of the design of UNIX-like systems. As long as users are allowed to choose different locale codesets, mojibake will always be possible. Sometimes, you just have to temporarily switch to a different locale in order to make things work. For example, I was trying to run some old Japanese software a while ago, and I had to add a Shift-JIS locale to my system in order to get it to work. > Let's say you send me María.song, and I download it on a system > using the C locale. Programs would fail to copy the file. Not necessarily. pathname(7) says “Paths should be encoded and decoded using the current locale’s codeset in order to help prevent mojibake.” In many cases, you don’t need to encode or decode a pathname. Here’s a program that copies a file without encoding or decoding any pathnames: #include <stdio.h> int main(int argc, char *argv[]) { FILE *src = NULL, *dest = NULL; int exit_status = 0; if (argc != 3) { fprintf(stderr, "USAGE: %s <src> <dest>\n", argv[0]); exit_status = 1; goto clean_up; }; // Everything in argv[] should already use the correct character encoding, // so we don’t need to do any encoding or decoding here. // See <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>. src = fopen(argv[1], "r"); if (src == NULL) { fprintf(stderr, "ERROR: Failed to open %s\n", argv[1]); exit_status = 1; goto clean_up; } dest = fopen(argv[2], "w"); if (dest == NULL) { fprintf(stderr, "ERROR: Failed to open %s\n", argv[2]); exit_status = 1; goto clean_up; } int c; while((c = fgetc(src)) != EOF) { if (fputc(c, dest) == EOF) { fprintf(stderr, "ERROR: Error while writing to %s\n", argv[2]); exit_status = 1; goto clean_up; } } clean_up: if (src != NULL) { if (fclose(src) == EOF) { fprintf(stderr, "ERROR: Failed to close %s\n", argv[1]); exit_status = 1; } } if (dest != NULL) { if (fclose(dest) == EOF) { fprintf(stderr, "ERROR: Failed to close %s\n", argv[2]); exit_status = 1; } } return exit_status; } > Instead, I think a good recommendation would be to behave in one of the > following ways: > > - Accept only the POSIX Portable Filename Character Set. This one isn’t quite a complete recommendation. The POSIX Portable Filename Character Set is just a character set. It’s not a character encoding. If we go with this one, then we would need to say something along the lines of “Encode and decode paths using ASCII and only accept characters that are in the POSIX Protable Filename Character Set.” > - Assume UTF-8, but reject control characters. > - Assume UTF-8. > - Accept anything, but reject control characters. > - Accept anything, just like the kernel. These last two also aren’t quite complete recommendations. If a GUI program wants to display a pathname on the screen, then what character encoding should it use when decoding the bytes? > The current locale should actively be ignored when handling pathnames. > > I've modified the example in the manual page to use a filename that's > non-ASCII, to make it more interesting. See how it fails: > > alx@devuan:~/tmp/gcc$ cat path.c > #include <err.h> > #include <iconv.h> > #include <langinfo.h> > #include <locale.h> > #include <stdio.h> > #include <stdlib.h> > #include <uchar.h> > > #define NELEMS(a) (sizeof(a) / sizeof(a[0])) > > int > main(void) > { > char *locale_pathname; > char *in, *out; > FILE *fp; > size_t size; > size_t inbytes, outbytes; > iconv_t cd; > char32_t utf32_pathname[] = U"María"; > > if (setlocale(LC_ALL, "") == NULL) > err(EXIT_FAILURE, "setlocale"); > > size = NELEMS(utf32_pathname) * MB_CUR_MAX; > locale_pathname = malloc(size); > if (locale_pathname == NULL) > err(EXIT_FAILURE, "malloc"); > > cd = iconv_open(nl_langinfo(CODESET), "UTF-32"); > if (cd == (iconv_t)-1) > err(EXIT_FAILURE, "iconv_open"); > > in = (char *) utf32_pathname; > inbytes = sizeof(utf32_pathname); > out = locale_pathname; > outbytes = size; > if (iconv(cd, &in, &inbytes, &out, &outbytes) == (size_t) -1) > err(EXIT_FAILURE, "iconv"); > > if (iconv_close(cd) == -1) > err(EXIT_FAILURE, "iconv_close"); > > fp = fopen(locale_pathname, "w"); > if (fp == NULL) > err(EXIT_FAILURE, "fopen"); > > fputs("Hello, world!\n", fp); > if (fclose(fp) == EOF) > err(EXIT_FAILURE, "fclose"); > > free(locale_pathname); > exit(EXIT_SUCCESS); > } > > alx@devuan:~/tmp/gcc$ cc -Wall -Wextra path.c > alx@devuan:~/tmp/gcc$ ls > a.out path.c > alx@devuan:~/tmp/gcc$ ./a.out ; echo $? > 0 > alx@devuan:~/tmp/gcc$ ls > María a.out path.c > alx@devuan:~/tmp/gcc$ cat María > Hello, world! > alx@devuan:~/tmp/gcc$ LC_ALL=C ./a.out ; echo $? > a.out: iconv: Invalid or incomplete multibyte or wide character > 1 > > What do you think? I honestly don’t know what the recommendation should be. Here’s what I would need to know in order to figure out what the recommendation should be. A while ago, I asked this question on the Unix & Linux Stack Exchange [1]: > What does a locale’s codeset get used for? > > According to glibc’s manual: > > > Most locale names follow XPG syntax and consist of up to four parts: > > > > ``` > > language[_territory[.codeset]][@modifier] > > ``` > > For example, you could have a locale named zh_CN.GB18030 which would > use the GB 18030 character encoding, or you could have a locale named > zh_CN.UTF-8 which would use the UTF-8 character encoding. > > Here’s where I’m confused: let’s say that I switch from a zh_CN.UTF-8 > locale to a zh_CN.GB18030 locale. In that situation, some things that > used to be encoded in UTF-8 are now going to be encoded in GB 18030. > Which things will now be encoded in GB 18030? Will stdin, stdout and > stderr use GB 18030? What about argv? What about filesystem paths? > > Technically, a program can do whatever it wants and ignore the locale > completely, but let’s assume that programs are doing the correct thing > here. What is supposed to be encoded in GB 18030 if I use a > zh_CN.GB18030 locale? I didn’t get an answer to that question, so I asked it again on the libc-help mailing list [2]. I got one response that was super helpful [3]. That response clearly said that paths should be encoded using the locale’s codeset. If you think that answer was incorrect, then I would like a very specific list of things that should and should not be encoded using the locale’s codeset so that I can add it to the glibc manual (and maybe the POSIX standard if I can figure out how to contribute to that). [1]: <https://unix.stackexchange.com/q/780404/316181> [2]: <https://sourceware.org/pipermail/libc-help/2024-August/006736.html> [3]: <https://sourceware.org/pipermail/libc-help/2024-August/006737.html> > Have a lovely day! > Alex > > -- > <https://www.alejandro-colomar.es/>