On Mon, Jan 27, 2025 at 07:27:59PM +0100, наб wrote:
> Skimming the thread: UNIX paths are sequences of non-NUL bytes.
>
> It is never correct to expect to be able to have a (parse, unparse)
> operation pair for which unparse(parse(x)) = x for path x.
>
> It's obviously wrong to reject a pathname just because you dont like it.
>
> Thus, when displaying a path, either (a) dump it directly to the output
> (the user has configured their display device to understand the paths they use),
> or if that's not possible (b) setlocale(LC_ALL, "") + mbrtowc() loop
> and render the result (applying usual ?/� substitutions for mbrtowc()
> errors makes sense here).
>
> There are very few operations on paths that are actually reasonable
> to do, ever; those are: appending stuff, prepending stuff
> (this is just appending stuff with the arguments backwards),
> and cleaving at /es;
> the "stuff" better be copied whole-sale from some other path
> or an unprocessed argument (or, sure, the PFCS).
>
> If you're getting bytes to append to a path, do that directly.
>
> If you're getting characters to append to a path,
> then wctomb(3) is the only non-invalid solution,
> since that (obviously) turns characters into bytes in the current
> locale, which (ex def) is the operation desired.
>
> I don't understand what the UTF-32 dance is supposed to be.
>
> If you're recommending transcoding paths, don't.
>
> To re-iterate: paths are not character sequences.
> They do not represent characters.
> You can't meaningfully coerce them thusly without loss of precision
> (this is ok to do for display! and nothing else).
> If at any point you find yourself turning wchar_t -> char
> you are doing something wrong;
> if you find yourself doing char -> wchar_t for anything beside display
> you should probably reconsider.
>
> This is different under Win32 of course. But that concerns us naught.
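For illustration only, the two conversions the quoted message allows (bytes to characters for display, and characters to bytes when appending) might look roughly like the sketch below. The helper names display_path() and path_append_wc() are hypothetical and not part of the patch or of any library; the sketch assumes the caller has already run setlocale(LC_ALL, ""), and it substitutes '?' for each byte that mbrtowc(3) rejects.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: write a display-only rendering of PATH to OUT.
 * The stored bytes are never modified; undecodable bytes show up as '?'.
 * Assumes the caller has already called setlocale(LC_ALL, "").
 */
int
display_path(FILE *out, const char *path)
{
    size_t     left = strlen(path);
    mbstate_t  st;

    memset(&st, 0, sizeof(st));
    while (left > 0) {
        wchar_t  wc;
        size_t   n = mbrtowc(&wc, path, left, &st);

        if (n == (size_t) -1 || n == (size_t) -2) {
            /* Invalid or truncated sequence: substitute one '?',
               skip a single byte, and restart from a clean state. */
            if (fputc('?', out) == EOF)
                return -1;
            memset(&st, 0, sizeof(st));
            path++;
            left--;
            continue;
        }
        if (n == 0)  /* Defensive: a path holds no embedded NUL. */
            n = 1;
        if (fprintf(out, "%lc", (wint_t) wc) < 0)
            return -1;
        path += n;
        left -= n;
    }
    return 0;
}

/*
 * Hypothetical helper: append the character WC to the NUL-terminated
 * path in BUF (capacity BUFSZ), encoded with wctomb(3) in the current
 * locale.  Returns 0 on success, -1 if WC cannot be encoded or if the
 * result would not fit.
 */
int
path_append_wc(char *buf, size_t bufsz, wchar_t wc)
{
    char    mb[MB_LEN_MAX];
    size_t  len = strlen(buf);
    int     n;

    if (wc == L'\0' || len >= bufsz)
        return -1;
    n = wctomb(mb, wc);
    if (n <= 0)
        return -1;
    if ((size_t) n >= bufsz - len)  /* Keep room for the final NUL. */
        return -1;
    memcpy(buf + len, mb, (size_t) n);
    buf[len + n] = '\0';
    return 0;
}
```

A caller would typically use display_path(stdout, argv[1]) for messages only, and keep passing the original, untouched bytes to open(2) and friends.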

Suggested-by: наб <nabijaczleweli@xxxxxxxxxxxxxxxxxx>
Cc: Jason Yundt <jason@jasonyundt.email>
Cc: Florian Weimer <fweimer@xxxxxxxxxx>
Cc: "G. Branden Robinson" <branden@xxxxxxxxxx>
Signed-off-by: Alejandro Colomar <alx@xxxxxxxxxx>
---

Hi наб!

Thanks for the detailed response.  I applied this patch based on it.
Does it sound good to you?  Please review.

Have a lovely day!
Alex

 man/man7/pathname.7 | 87 ++-------------------------------------------
 1 file changed, 2 insertions(+), 85 deletions(-)

diff --git a/man/man7/pathname.7 b/man/man7/pathname.7
index 59650ef6e..996436606 100644
--- a/man/man7/pathname.7
+++ b/man/man7/pathname.7
@@ -17,7 +17,7 @@ .SH DESCRIPTION
 The kernel stores pathnames as C strings,
 that is,
 sequences of non-null bytes terminated by a null byte.
-The kernel has a few general rules that apply to all pathnames:
+There are a few general rules that apply to all pathnames:
 .IP \[bu] 3
 The last byte in the sequence needs to be a null byte.
 .IP \[bu]
@@ -59,17 +59,8 @@ .SH DESCRIPTION
 .P
 Some filesystems or APIs may apply further restrictions,
 such as requiring shorter filenames,
-or restricting the allowed characters in a filename.
+or restricting the allowed bytes in a filename.
 .P
-User-space programs treat pathnames differently.
-They typically expect pathnames to
-use a consistent character encoding.
-For maximum interoperability,
-programs should use
-.BR nl_langinfo (3)
-to determine the current locale's codeset.
-Pathnames should be encoded and decoded using the current locale's codeset
-in order to help prevent mojibake.
 For maximum interoperability,
 programs and users should also limit
 the characters that they use for their own pathnames to
@@ -77,83 +68,9 @@ .SH DESCRIPTION
 .UR https://pubs.opengroup.org/\:onlinepubs/\:9799919799/\:basedefs/\:V1_chap03.html#tag_03_265
 Portable Filename Character Set
 .UE .
-.SH EXAMPLES
-The following program demonstrates
-how to ensure that a pathname uses the proper encoding.
-The program starts with a UTF-32 encoded pathname.
-It then calls
-.BR nl_langinfo (3)
-in order to determine what the current locale's codeset is.
-After that, it uses
-.BR iconv (3)
-to convert the UTF-32-encoded pathname into a locale-codeset-encoded pathname.
-Finally,
-the program uses the locale-codeset-encoded pathname
-to create a file that contains the message \[lq]Hello, world!\[rq].
-.SS Program source
-.\" SRC BEGIN (pathname_encoding_example.c)
-.EX
-#include <err.h>
-#include <iconv.h>
-#include <langinfo.h>
-#include <locale.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <uchar.h>
-\&
-#define NELEMS(a) (sizeof(a) / sizeof(a[0]))
-\&
-int
-main(void)
-{
-    char      *locale_pathname;
-    char      *in, *out;
-    FILE      *fp;
-    size_t    size;
-    size_t    inbytes, outbytes;
-    iconv_t   cd;
-    char32_t  utf32_pathname[] = U"María";
-\&
-    if (setlocale(LC_ALL, "") == NULL)
-        err(EXIT_FAILURE, "setlocale");
-\&
-    size = NELEMS(utf32_pathname) * MB_CUR_MAX;
-    locale_pathname = malloc(size);
-    if (locale_pathname == NULL)
-        err(EXIT_FAILURE, "malloc");
-\&
-    cd = iconv_open(nl_langinfo(CODESET), "UTF\-32");
-    if (cd == (iconv_t)\-1)
-        err(EXIT_FAILURE, "iconv_open");
-\&
-    in = (char *) utf32_pathname;
-    inbytes = sizeof(utf32_pathname);
-    out = locale_pathname;
-    outbytes = size;
-    if (iconv(cd, &in, &inbytes, &out, &outbytes) == (size_t) \-1)
-        err(EXIT_FAILURE, "iconv");
-\&
-    if (iconv_close(cd) == \-1)
-        err(EXIT_FAILURE, "iconv_close");
-\&
-    fp = fopen(locale_pathname, "w");
-    if (fp == NULL)
-        err(EXIT_FAILURE, "fopen");
-\&
-    fputs("Hello, world!\[rs]n", fp);
-    if (fclose(fp) == EOF)
-        err(EXIT_FAILURE, "fclose");
-\&
-    free(locale_pathname);
-    exit(EXIT_SUCCESS);
-}
-.EE
-.\" SRC END
 .SH SEE ALSO
 .BR limits.h (0p),
 .BR open (2),
 .BR fpathconf (3),
-.BR iconv (3),
-.BR nl_langinfo (3),
 .BR path_resolution (7),
 .BR mount (8)

Range-diff against v0:
-:  --------- > 1:  b9f5079f6 man/man7/pathname.7: Pathnames are opaque C strings
-- 
2.47.2