On 02/14/2014 12:42 PM, Colin Watson wrote:
> On Fri, Feb 14, 2014 at 11:43:30AM +0100, Michael Kerrisk (man-pages) wrote:
>> At https://bugzilla.kernel.org/show_bug.cgi?id=60807 is a proposal to
>> convert the pages of the "man-pages" project to UTF-8. I thought
>> it worthwhile bringing that topic to the list, and CCing a few people
>> who may have some ideas about this step, since I'm not too sure of the
>> implications.
>>
>> Peter Schiffer has kindly written some scripts to do the
>> conversion, which would touch about 40 files. However, as far as I can
>> tell, many of the pages that have non-ASCII characters have them inside
>> groff comments (authors' names, etc.). The only pages that have
>> non-ASCII characters in the rendered source are various man7 pages on
>> character sets. These were the pages to which I added a groff encoding
>> marker in response to Colin Watson's input on this Debian bug:
>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=519209
>>
>> Moving to UTF-8 for the pages seems like a good idea, at least at some
>> point. However, I'm wondering whether there are any backward
>> compatibility issues that I need to worry about. As far as I
>> know, groff added UTF-8 support back in Jan 2009, so, just over 5
>> years ago. Perhaps that's long enough ago now that any backward
>> compatibility issues with old versions of groff would be minimal.
>> (I.e., the number of people installing new man-pages on systems with
>> old groff is likely to be very small, and anyway, only a dozen or so
>> pages in Section 7 are affected. Furthermore, I'm assuming that Linux
>> distros have been shipping groff v1.20+ for quite a long time now.)
>
> I think for characters in comments you're probably fine, and any
> problems you might have had should be gone as of groff 1.20. Debian
> switched to that in July 2009, and I think we were late to the party
> because we had some difficult historical baggage to clean up at the same
> time.
> I'm not aware of anyone shipping older versions of groff any more.
>
> When you convert characters that show up in rendered source, I suspect
> systems using the other man package (1.6g or similar versions) may
> render them poorly, because it invokes nroff in some fairly naïve and
> hardcoded ways. However, they already break in various related ways,
> and most distributions have switched to man-db now, or dealt with things
> some other way. My rough survey of the major distributions for this is:
>
>   Arch has been good since about 2009
>
>   Debian and descendants are good as of late 2007 / early 2008 (addition
>   of manconv to man-db)
>
>   Fedora is definitely good as of 2010 (switch to man-db), and I think
>   was good before that as IIRC they did a flag day to switch everything
>   to UTF-8 with man
>
>   Gentoo switched to man-db at the end of 2013, so should be good now
>
>   Mageia has a current groff, but uses man 1.6g with a stack of patches
>   (some encoding-related)
>
>   openSUSE has been fine for about the same length of time as Debian
>
>   Slackware has a current groff, but uses man 1.6g without much in the
>   way of special patches (just one to make things work for UTF-8
>   *output*)
>
> My guess is that Mageia and Slackware may find that things only work
> properly for users in UTF-8 locales, but most other major distributions
> should be fine. You won't be the first author to switch to UTF-8 manual
> pages; all you'll be doing is making existing shortcomings perhaps
> marginally more obvious. In any case, the pages currently encoded in
> ISO-8859-1 won't be very seriously affected, and users of problematic
> systems will only have been able to read the other pages with good luck
> and a following wind anyway; switching to UTF-8 will probably actually
> improve things for them if they're using a UTF-8 locale. (That is, the
> problems that the affected systems have generally relate to attempting
> to read pages whose encoding doesn't match that of their locale.)
> They might possibly need to add the -k option to their nroff invocation in
> man.conf.
>
> If I were you I would just go ahead.
>
> Regarding your questions in the bug, please do keep the "coding:" tag in
> there; man-db will figure this out by brute force, but if left to its
> own devices I think groff's preconv will default to the locale's
> encoding, so it will only work for some people.

Hello Colin,

Thanks for the extensive reply! One final point. For the pages that have
non-ASCII characters only in source comments, not in rendered input
source, does it matter whether or not the "coding:" tag is added? I ask
because, simply for documentary purposes, I'm wondering whether we
should add that tag only in the pages that have UTF-8 in the rendered
input.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
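
For reference, the distinction drawn above between pages with non-ASCII
characters only in groff comments and pages with non-ASCII characters in
the rendered source can be checked mechanically. What follows is only a
rough sketch in Python, not Peter Schiffer's conversion scripts. The
function names (rendered_part, classify, latin1_to_utf8) are made up for
illustration, and the comment detection is a simplification: it treats
everything from the first \" on a line as a comment and does not handle
an escaped backslash (\\) directly before a double quote.

```python
def rendered_part(line: str) -> str:
    """Return the part of a roff source line before any \\" comment.

    Simplification (assumption): the first occurrence of backslash
    followed by a double quote is taken as the start of a comment.
    """
    idx = line.find('\\"')
    return line if idx < 0 else line[:idx]


def classify(text: str) -> str:
    """Classify a page as 'ascii', 'comments-only', or 'rendered'."""
    if text.isascii():
        return "ascii"
    for line in text.splitlines():
        if not rendered_part(line).isascii():
            # Non-ASCII text outside a comment: groff has to decode it.
            return "rendered"
    return "comments-only"


def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode the bytes of an ISO-8859-1 page as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")


if __name__ == "__main__":
    import sys

    # Each argument is assumed to be an ISO-8859-1 encoded page.
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            text = f.read().decode("iso-8859-1")
        print(f"{path}: {classify(text)}")
```

Run over the affected files, this would list which pages fall into the
"rendered" group, that is, the ones where the "coding:" tag actually
changes what the reader sees.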