Re: MHonArc and multi-byte characters in HTML



[ Refer to http://www.xray.mpe.mpg.de/mailing-lists/mhonarc/1998-04/msg00129.html 
  for the history behind this message ]

I just upgraded to v2.5.0b2 and was surprised to find I'm still having the 
same problem (see link above) years later!  Is the "dirty workaround" shown 
below still the best way to solve it?


Koichi Nakatani <nakatani@konica.co.jp> writes:

> Jason R Mastaler wrote:
> > &gt;&gt; &gt; ESC$B<i2,ESC(B ESC$BCNI'ESC(B / MORIOKA Tomohiko ...
> >                    ^
> > See the unescaped open bracket there?  I don't know enough about
> > encodings to say whether the bracket is legal there, but it doesn't
> > look like legal HTML to me.  My understanding is that the
> > wilma_striphtml program requires legal HTML for correct operation.
> 
> It is a complicated problem, and I cannot say what the right solution is.
> However, I can show you a dirty workaround.
>   The following patch sets the MSB of each byte of the Japanese characters
> and strips ESC$B and ESC(B.  The result is EUC-JP character encoding.  This
> way, Japanese characters will not affect wilma.
> 
> diff -urN MHonArc2.2.0/lib/mhtxtplain.pl MHonArc2.2.0-jp0/lib/mhtxtplain.pl
> --- MHonArc2.2.0/lib/mhtxtplain.pl      Wed Mar  4 09:12:54 1998
> +++ MHonArc2.2.0-euc-jp/lib/mhtxtplain.pl  Fri Mar 20 21:19:12 1998
> @@ -174,7 +174,7 @@
>  sub jp2022 {
>      local(*body) = shift;
>      local(@lines) = split(/\r?\n/,$body);
> -    local($ret, $ascii_text);
> +    local($ret, $ascii_text, $jp_text);
>      local($_);
> 
>      $ret = "<PRE>\n";
> @@ -205,7 +205,7 @@
>         # Process Each Segment
>         while(1) {
>             if (s/^(\033\([BJ])//) { # Single Byte Segment
> -               $ret .= $1;
> +               # $ret .= $1;
>                 while(1) {
>                     if (s/^([^\033]+)//) {      # ASCII plain text
>                         $ascii_text = $1;
> @@ -228,10 +228,12 @@
>                     }
>                 }
>             } elsif (s/^(\033\$[\@AB]|\033\$\([CD])//) { # Double Byte Segment
> -               $ret .= $1;
> +               # $ret .= $1;
>                 while (1) {
>                     if (s/^([!-~][!-~]+)//) { # Double Char plain text
> -                       $ret .= $1;
> +                       $jp_text = $1;
> +                       $jp_text =~ tr/\041-\176/\241-\376/;
> +                       $ret .= $jp_text;
>                     } elsif (s/(\033\.[A-F])//) { # G2 Designate Sequence
>                         $ret .= $1;
>                     } elsif (s/(\033N[ -^?])//) { # Single Shift Sequence
> End of patch.
> -- 
> Koichi Nakatani
> Graphic Arts Center, Konica Corporation
> 
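
For readers who want to see the effect of the quoted patch outside of MHonArc, here is a minimal sketch of the same byte-level idea: drop the ISO-2022-JP escape sequences and set the high bit on every byte in the range 0x21-0x7E inside a double-byte segment, which is what the patch's tr/\041-\176/\241-\376/ does. It is written in Python rather than Perl so the result can be compared against the standard codecs. The function name iso2022jp_to_eucjp is hypothetical, and the sketch assumes only the single-byte and double-byte designations matched by the patch appear in the input; G2 designations and single shifts are not handled.

```python
import re

# Escape sequences matched by the two main branches of the patch.
# Assumption: no G2 designations or single-shift sequences in the input.
SINGLE_BYTE = re.compile(rb"\x1b\([BJ]")                 # ESC ( B, ESC ( J
DOUBLE_BYTE = re.compile(rb"\x1b\$[@AB]|\x1b\$\([CD]")   # ESC $ @/A/B, ESC $ ( C/D


def iso2022jp_to_eucjp(data: bytes) -> bytes:
    """Strip designation escapes and set the MSB in double-byte segments."""
    out = bytearray()
    double = False  # True while inside a double-byte segment
    i = 0
    while i < len(data):
        m = SINGLE_BYTE.match(data, i)
        if m:
            double = False
            i = m.end()  # drop the escape sequence itself
            continue
        m = DOUBLE_BYTE.match(data, i)
        if m:
            double = True
            i = m.end()
            continue
        byte = data[i]
        if double and 0x21 <= byte <= 0x7E:
            out.append(byte | 0x80)  # 0x21-0x7E becomes 0xA1-0xFE
        else:
            out.append(byte)         # ASCII, spaces, newlines pass through
        i += 1
    return bytes(out)


if __name__ == "__main__":
    # The "<" byte from the quoted example no longer appears in the output.
    print(iso2022jp_to_eucjp(b"\x1b$B<i\x1b(B"))
```

After conversion, no byte of a Japanese character falls in the ASCII range, so characters such as "<" can no longer show up inside double-byte text. Note that setting the MSB yields correct EUC-JP only for JIS X 0208 text (ESC $ @ and ESC $ B); JIS X 0212 (ESC $ ( D) needs an extra 0x8F prefix byte in EUC-JP, which neither the patch nor this sketch adds.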

