Re: Converting man-pages to UTF-8

On 02/14/2014 12:42 PM, Colin Watson wrote:
> On Fri, Feb 14, 2014 at 11:43:30AM +0100, Michael Kerrisk (man-pages) wrote:
>> At https://bugzilla.kernel.org/show_bug.cgi?id=60807 is a proposal to
>> convert the pages of the "man-pages" project to UTF-8. I thought
>> it worthwhile bringing that topic to the list, and CCing a few people
>> who may have some ideas about this step, since I'm not too sure of the
>> implications.
>>
>> Peter Schiffer has kindly written some scripts to do the
>> conversion, which would touch about 40 files. However, as far as I can
>> tell, many of the pages that have non-ASCII characters have them inside
>> groff comments (authors' names, etc.). The only pages that have
>> non-ASCII characters in the rendered source are various man7 pages on
>> character sets. These were the pages to which I added a groff encoding
>> marker in response to Colin Watson's input on this Debian bug:
>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=519209
>>
>> Moving to UTF-8 for the pages seems like a good idea, at least at some
>> point. However, I'm wondering whether there are any backward
>> compatibility issues that I need to worry about. As far as I
>> know, groff added UTF-8 support back in Jan 2009, so, just over 5
>> years ago. Perhaps that's long enough ago now, that any backward
>> compatibility issues with old versions of groff would be minimal.
>> (I.e., the number of people installing new man-pages on systems with
>> old groff is likely to be very small, and anyway, only a dozen or so
>> pages in Section 7 are affected. Furthermore, I'm assuming that Linux
>> distros have been shipping groff v1.20+ for quite a long time now.)
> 
> I think for characters in comments you're probably fine, and any
> problems you might have had should be gone as of groff 1.20.  Debian
> switched to that in July 2009, and I think we were late to the party
> because we had some difficult historical baggage to clean up at the same
> time.  I'm not aware of anyone shipping older versions of groff any
> more.
> 
> When you convert characters that show up in rendered source, I suspect
> systems using the other man package (1.6g or similar versions) may
> render them poorly, because it invokes nroff in some fairly naïve and
> hardcoded ways.  However, they already break in various related ways,
> and most distributions have switched to man-db now, or dealt with things
> some other way.  My rough survey of the major distributions for this is:
> 
>   Arch has been good since about 2009
> 
>   Debian and descendants are good as of late 2007 / early 2008 (addition
>   of manconv to man-db)
> 
>   Fedora is definitely good as of 2010 (switch to man-db), and I think
>   was good before that as IIRC they did a flag day to switch everything
>   to UTF-8 with man
> 
>   Gentoo switched to man-db at the end of 2013, so should be good now
> 
>   Mageia has a current groff, but uses man 1.6g with a stack of patches
>   (some encoding-related)
> 
>   openSUSE has been fine for about the same length of time as Debian
> 
>   Slackware has a current groff, but uses man 1.6g without much in the
>   way of special patches (just one to make things work for UTF-8
>   *output*)
> 
> My guess is that Mageia and Slackware may find that things only work
> properly for users in UTF-8 locales, but most other major distributions
> should be fine.  You won't be the first author to switch to UTF-8 manual
> pages; all you'll be doing is making existing shortcomings perhaps
> marginally more obvious.  In any case, the pages currently encoded in
> ISO-8859-1 won't be very seriously affected, and users of problematic
> systems will only have been able to read the other pages with good luck
> and a following wind anyway; switching to UTF-8 will probably actually
> improve things for them if they're using a UTF-8 locale.  (That is, the
> problems that the affected systems have generally relate to attempting
> to read pages whose encoding doesn't match that of their locale.)  They
> might possibly need to add the -k option to their nroff invocation in
> man.conf.
> 
> If I were you I would just go ahead.
> 
> Regarding your questions in the bug, please do keep the "coding:" tag in
> there; man-db will figure this out by brute force, but if left to its
> own devices I think groff's preconv will default to the locale's
> encoding, so it will only work for some people.
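
To make the "coding:" point concrete, here is a minimal sketch of what a
conversion step might look like. It is not Peter's script. The function
name to_utf8, the ISO-8859-1 fallback for untagged pages, and the choice
to add a separate comment line when the page has no '\" first line are
all assumptions made for illustration.

```python
import re

# Emacs-style tag as it appears on a page's first line,
# e.g.  '\" t -*- coding: ISO-8859-1 -*-
CODING_TAG = re.compile(r"-\*-\s*coding:\s*([A-Za-z0-9._-]+)\s*-\*-")
UTF8_TAG = "-*- coding: UTF-8 -*-"


def to_utf8(raw: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Re-encode a page as UTF-8 and make its first line say so.

    Hypothetical helper: untagged pages are assumed to be in `fallback`.
    """
    first, _, _ = raw.partition(b"\n")
    found = CODING_TAG.search(first.decode("ascii", errors="replace"))
    encoding = found.group(1) if found else fallback

    text = raw.decode(encoding)
    first_line, sep, rest = text.partition("\n")
    if found:
        # Rewrite the existing tag in place.
        first_line = CODING_TAG.sub(UTF8_TAG, first_line, count=1)
    elif first_line.startswith("'\\\""):
        # Keep the preprocessor line first and append the tag to it.
        first_line = first_line.rstrip() + " " + UTF8_TAG
    else:
        # Assumption: a leading comment line is an acceptable place for it.
        first_line = '.\\" ' + UTF8_TAG + "\n" + first_line
    return (first_line + sep + rest).encode("utf-8")
```

Whether untagged pages really are all ISO-8859-1 would need checking
against the actual files before running anything like this on the tree.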

Hello Colin,

Thanks for the extensive reply! One final point. For the pages that
have non-ASCII characters only in source comments, not in rendered
input source, does it matter whether or not the "coding:" tag is added?
I ask because, simply for documentary purposes, I'm wondering whether
we should add that tag only in the pages that have UTF-8 in the rendered
input.
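
For what it's worth, separating the two kinds of page could be done
mechanically. The following is only a rough sketch: the function name
classify is made up, and it treats everything after the first \" on a
line as a comment, which is a simplification of groff's real parsing
(for example, an escaped backslash followed by a quote would be
misread).

```python
def classify(text: str) -> str:
    """Report where a page's non-ASCII characters live.

    Returns "rendered" if any appear outside comments, "comments-only"
    if they appear only after a \\" comment marker, and "ascii" otherwise.
    """
    in_rendered = False
    in_comments = False
    for line in text.splitlines():
        # Covers full-line comments (.\" and '\") and trailing \" comments.
        code, _, comment = line.partition('\\"')
        in_rendered = in_rendered or not code.isascii()
        in_comments = in_comments or not comment.isascii()
    if in_rendered:
        return "rendered"
    if in_comments:
        return "comments-only"
    return "ascii"
```

Pages reported as "rendered" would be the candidates for the tag if we
go with the documentary approach.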

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/