Re: On CJK font selection (was Re: [Fwd: Re: Request for review and advice on wqy-bitmap-fonts fontconfig settings])

Behdad Esfahbod <behdad@xxxxxxxxxx> · Sun, 16 Dec 2007 18:22:01 -0500

On Thu, 2007-12-13 at 12:13 -0500, Qianqian Fang wrote:
> hi Behdad

Hi,

> I would have agreed with you if you clearly tell me why this change SHOULD
> be done in the fonts, or in the font selection, not in the layout engine.
> Your previous replies, either to the bug reports or to my email, simply
> refused to make this change by saying this is "technically impossible",
> but you do not tell me based on what model that you made the statement.
> If you can give me a diagram or document to illustrate that this is not
> the business of layout engine, I would not insist to continue this
> discussion.

You've kept saying it should be different for CJK and I've always asked
you to describe how exactly it should behave to no avail.

Here is the set of assumptions that best describes the problem:

  A1. The layout engine is not provided any hints whatsoever on which of
the CJK languages to prefer.

  A2. Any font available on the system is suitable for (aka "supports")
at most one CJK language, not more.

  A3. For every CJK language, there exists a positive number of
characters solely used in that CJK language and not any other one.

  A4. There exists a positive number of Unicode characters that are used
in more than one CJK language.

That's enough to prove that you can't fix both of these bugs at the same
time:

  B1. "multiple CJK fonts on the same line"

  B2. "font face changes when more text is typed"

This is what we will prove: "for any layout engine with font fallback
support [1], there exists some CJK text that when typed on a line by the
user, either results in more than one CJK font being used, or a font
change for the already typed text happens", where font fallback support
means that a character is assigned a font that is known to *support*
that character, if any such font is available on the system.  We prove
by constructing such a piece of text.  Here's a sketch:

  - Pick a Unicode character that is used in more than one CJK language.
This is possible because of A4.  Call it c[0].

  - Let the layout engine choose a font to render this character.  Let
f[0] be the font used to render it.

  - Find the CJK language l[0] that font f[0] supports.  By A2 we know
that there can't be more than one such language.  If no such language
exists, the layout engine suffers from the bug "no CJK font is chosen".
Abort.

  - Let l[1] be any CJK language other than l[0].

  - Choose c[1] to be any CJK character used in language l[1] and l[1]
only.  That's possible because of A3.

  - Pass text c[0]c[1] to the layout engine, let f'[0]f[1] be the two
fonts chosen to render characters c[0] and c[1] respectively.

  - Observe that:

    * if f'[0] == f[0]: We know f[0] supports l[0], and that l[0] !=
l[1].  By A2, it follows that f[0] does not support l[1].  Since c[1] is
used in l[1] only, f[0] does not support c[1] and cannot be chosen for
it.  As a result, f'[0] != f[1], that is, multiple fonts are chosen to
render the text.

    * if f'[0] != f[0]: Typing character c[1] on the line containing
text c[0] caused the chosen font for c[0] to change.

End of proof ∎

[1] I'm tempted to say deterministic Turing machine here, but I pass :)
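
The construction above can be run against toy layout engines.  The
following is a minimal sketch, not Pango code: the font names, the
three stand-in characters "S", "J" and "Z", and both engines are
hypothetical and exist only to satisfy A1-A4.

```python
# Hypothetical toy system: every font supports exactly one CJK language (A2).
FONT_LANGUAGE = {"FontJA": "ja", "FontZH": "zh-cn"}

# Hypothetical characters: "S" is shared by both languages (A4);
# "J" and "Z" are each used in a single language only (A3).
CHAR_LANGUAGES = {"S": {"ja", "zh-cn"}, "J": {"ja"}, "Z": {"zh-cn"}}


def supports(font, ch):
    """A font supports a character if the character is used in the font's language."""
    return FONT_LANGUAGE[font] in CHAR_LANGUAGES[ch]


def first_match_engine(text):
    """Per-character fallback: the first font, in a fixed order, that supports it."""
    return [next(f for f in FONT_LANGUAGE if supports(f, ch)) for ch in text]


def guessing_engine(text):
    """Guess the language from the whole line, then prefer that language's font."""
    votes = {font: sum(CHAR_LANGUAGES[ch] == {lang} for ch in text)
             for font, lang in FONT_LANGUAGE.items()}
    order = sorted(FONT_LANGUAGE, key=lambda f: -votes[f])  # stable on ties
    return [next(f for f in order if supports(f, ch)) for ch in text]


def find_counterexample(engine):
    """Follow the proof steps; return the text and the bug (B1 or B2) it triggers."""
    # c[0]: a character used in more than one CJK language (A4).
    c0 = next(ch for ch, langs in CHAR_LANGUAGES.items() if len(langs) > 1)
    f0 = engine(c0)[0]
    l0 = FONT_LANGUAGE[f0]  # unique by A2
    # l[1]: any other CJK language; c[1]: a character used in l[1] only (A3).
    l1 = next(lang for lang in FONT_LANGUAGE.values() if lang != l0)
    c1 = next(ch for ch, langs in CHAR_LANGUAGES.items() if langs == {l1})
    f0_new, f1 = engine(c0 + c1)
    if f0_new != f0:
        return c0 + c1, "B2"  # font of already typed text changed
    assert f0_new != f1  # f[0] cannot render c[1]
    return c0 + c1, "B1"  # multiple CJK fonts on the same line


if __name__ == "__main__":
    print(find_counterexample(first_match_engine))  # ('SZ', 'B1')
    print(find_counterexample(guessing_engine))     # ('SZ', 'B2')
```

The per-character engine keeps the font for c[0] stable and so ends up
with two fonts on the line; the guessing engine keeps one font on the
line and so has to change the font of text that was already typed.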

Similar proofs can be constructed for other CJK "bugs" (those involving
Latin text, ASCII digits, etc), but I've already exceeded my time limit
for this message.

> Secondly, you said that "contextual font selection" is a "cool"
> feature, I am wondering which languages benefit from this feature?
> (I believe there are, but just want to know).

Pretty much every non-Latin script.  In some situations even the Latin
script.

Take the Unicode character U+002E FULL STOP, aka ASCII period.  It is
used in more than just Latin: in Arabic for example, in Hebrew, possibly
in Indic and many other scripts.  If it were not grouped with neighboring
characters for font selection purposes, all those people would get
their Arabic/Hebrew/... text assigned an Arabic/Hebrew/... font while
the periods at the end of sentences were assigned a different (default
Latin, for example) font.

The same happens for Latin under a document tagged as non-Latin.  It's
not a luxury thing.  It's just how things are supposed to work.
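
To illustrate the grouping, here is a minimal sketch of resolving
script-neutral characters to the script of their neighbors before any
font is picked.  It is not Pango's actual itemization code: script_of()
is a hypothetical, very rough lookup covering only a few ranges, and
the real Unicode Script property is far more detailed.

```python
def script_of(ch):
    """Very rough script lookup (assumption: only these ranges matter here)."""
    cp = ord(ch)
    if 0x0590 <= cp <= 0x05FF:
        return "Hebrew"
    if 0x0600 <= cp <= 0x06FF:
        return "Arabic"
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return "Common"  # digits, punctuation, spaces, everything else


def resolve_scripts(text):
    """Assign each Common character the script of its neighbors."""
    scripts = [script_of(ch) for ch in text]

    # Forward pass: a Common character inherits the preceding real script.
    last = None
    for i, s in enumerate(scripts):
        if s == "Common":
            if last is not None:
                scripts[i] = last
        else:
            last = s

    # Backward pass: leading Common characters take the first real script
    # that follows them.
    following = None
    for i in range(len(scripts) - 1, -1, -1):
        if scripts[i] == "Common":
            if following is not None:
                scripts[i] = following
        else:
            following = scripts[i]
    return scripts


if __name__ == "__main__":
    # Hebrew word followed by U+002E: the period joins the Hebrew run.
    print(resolve_scripts("\u05e9\u05dc\u05d5\u05dd."))
    # The same period after Latin text joins the Latin run.
    print(resolve_scripts("abc."))
```

With the period resolved to the surrounding script, a font chosen for
the Hebrew run is also used for the sentence-final period instead of a
default Latin one.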

> As I said in the previous email, this creates more troubles for CJK
> languages than benefits. Particularly this ruins the text alignment in
> monospace environment (see attachment). I doubt anyone see it would say
> "cool", rather, they would feel annoyed.

That's not true.  If you have Chinese text and Latin text in the same
line, and your Latin and Chinese monospace fonts have different widths,
you are screwed no matter what.

There are situations in which the particular bug you are referencing
here can be improved, and that's why I filed bug 345386, but you already
knew that.

> In addition, you seem to underestimate the difficulties of ripping out
> part of a CJK font. This is not possible for commercial fonts. Even if it
> is doable for open fonts (very few choices though), the incompatibility of
> the resulting fonts will make it totally unusable on most platforms.

I've put three different ways in front of you.  The fontconfig one is
not hard at all for anyone willing to put their fingers where their
mouth is.  You on the other hand, seem to ignore the impossibility (not
difficulty) of what you are asking for.

> I want to add that on Windows, CJK users had never had such a problem,
> all known CJK fonts have their Latin glyphs (some are crappy), but the text
> rendering are "normal" (nothing like in the attachment). How does Windows
> structure the style propagation for COMMON characters?

Windows does no font fallback.  You choose which font to use.  But you
want your Latin characters in a different font than your Chinese
characters AND you want to keep the crappy glyphs.  They don't mix.

> Qianqian

behdad

> Behdad Esfahbod wrote:
> > Hi Qianqian,
> >
> > [CC'ing to gtk-i18n-list, so hopefully this is the last time I have to
> > repeat this.]
> >
> > On Mon, 2007-12-10 at 18:01 -0500, Qianqian Fang wrote:
> >   
> >> Go back to the digit font change issue as we discussed earlier, I
> >> spent some time in the past few days, trying to get myself a more
> >> clear picture on this. I dug out some bug reports from various
> >> bugzillas (Mozilla, Red Hat, Gnome) and gathered a list of similar
> >> reports (see the bottom of the email). These reports were filed from
> >> simplified and traditional Chinese users and Japanese users (I
> >> believed Korean experienced the same problem).  So, one thing that
> >> can be said from this list is that the "contextual font selection"
> >> does seem to be bothering CJK users in text formatting.
> >>     
> >
> > Yes, you have identified the problem very accurately.
> >
> >
> >   
> >> I understand that "contextual shaping" is one of the techniques for
> >> rendering complex scripts. I am not sure how tight is the connection
> >> between "contextual shaping" and the "contextual format propagation",
> >> but one thing that I think may put some light to the complaints of the
> >> CJK users is that Chinese (maybe Japanese as well) scripts are not
> >> contextual sensitive. Chinese characters are relatively independent
> >> and self-consistent in shapes (while, this statement is not true for
> >> Chinese calligraphy, where strokes may connect between characters
> >> depending on layout direction, but the current OSs and font
> >> technologies are not ready to handle this IMO). The only complexities
> >> may come from the fact that Hanzi for printing are mostly equal-width,
> >> and the punctuations among the Hanzi are expected to match the width
> >> of the surrounding Hanzi. As the full-width punctuations being encoded
> >> separately by Unicode, together with the contextual punctuation
> >> support of the input-methods, this seems to be handled very well. So,
> >> in short, for Chinese text layout, users are generally not expected to
> >> see contextual-based changes, either encoding/glyph or font faces
> >> (this may not include some extreme cases). 
> >>     
> >
> > And Pango supports those all perfectly fine.  Even vertical writing
> > using the correct substituted punctuation glyphs.  See:
> >
> >   http://www.pango.org/ScriptGallery
> >
> >
> > The main font issue though, is that Chinese (Simplified, Traditional),
> > Korean, and Japanese share some Unicode code points, but they require
> > slightly different renderings.  Now if you don't tell Pango which
> > version is preferred, how can it know which font to choose?  It
> > explicitly doesn't prefer any one over the others to avoid cultural
> > problems.
> >
> > The symptoms of this problem are "multiple fonts used in the same line".
> > Solution is: Either run under a CJK locale, or give hints to Pango about
> > your preferred CJK locale using the env var PANGO_LANGUAGE.
> >
> > Note that theoretically Pango can do text analysis to come up with a
> > best guess, but doing that would then introduce another bug with
> > symptoms "changes font when typing a few characters on the same line".
> >
> >
> >   
> >> Now go back to pango, from what I read from the bug reports, pango
> >> uses PANGO_SCRIPT_COMMON to represent language-independent symbols. I
> >> have no complaint about that. It is a good classification based on the
> >> semantics of the symbols.
> >>     
> >
> > Good.  Let me also note that there's no way to change that.  It's
> > hardcoded in the Unicode standard.
> >
> >
> >   
> >> What I, and most CJK users, are not satisfied with is the
> >> contextual-sensitivity of those common scripts when formatting text
> >> under cjk locales. I know that you have advocated to stick with the
> >> "face" meaning of SCRIPT_COMMON, which is supposedly to be rendered by
> >> local languages. But IMO, the face meaning is misleading here. From a
> >> Chinese user perspective, the difference between the SCRIPT_COMMON to
> >> Latin is negligible,
> >>     
> >
> > Lemme correct you here, "From a Chinese user perspective, the ASCII
> > digits are considered Latin".  There's sure a lot more than ASCII digits
> > to SCRIPT_COMMON.  Helps to be precise.
> >
> >
> >   
> >> compared with its difference to CJK characters. Therefore, using CJK
> >> fonts to render SCRIPT_COMMON is quite odd. Using Latin fonts for
> >> COMMON is most preferred; even specifying no face ( i.e. using system
> >> fall-back) is better than assigning Chinese fonts for these scripts
> >> for that most Chinese fonts have low-quality Latin/common glyphs, even
> >> the commercial ones.
> >>     
> >
> > And this problem has a name: "crappy glyphs and multiple scripts in a
> > font".  Tell me about it...
> >
> > I already pointed out a few solutions to it previously:
> >
> >   - Rip the crap out and everyone will feel better.
> >
> >   - Use TrueType containers (even for bitmap-only fonts) and put each
> > script's glyphs into its own face, with all faces having the same name
> > and put into the same TrueType Collection file.
> >
> >   - Finish patch for fontconfig to allow configuration to disable
> > certain Unicode codepoints per font.  Then write such a configuration
> > for the crappy glyphs.
> >
> > Pick whichever you prefer and just do it.
> >
> >
> > Another symptom, "digits change font after typing character" is in fact
> > a very cool Pango feature, just badmouthed by the above problem.  Fix
> > the problem.
> >
> >
> >   
> >> As you see from the bug lists, this problem has existed for many
> >> years, and I am pretty sure that it will come back again and again, as
> >> long as the expected rendering is not achieved. If the current pango
> >> formatting logic is not sufficient to handle the CJK preferences as
> >> said above, I think to refine the logic to take it into consideration
> >> is better than stick with a fixed but incomplete logic. 
> >>     
> >
> > I consider patches improving Pango's font selection algorithm, but none
> > that I've seen so far had been an improvement (from my point of view).
> > If it has words like CJK or "special case", I'm most probably not
> > interested.  Of the bugs you listed, only the one I opened myself is
> > valid IMO.  The rest is just left open because no matter how many times
> > I close them, they will be reopened... Oh well.
> >
> >
> >   
> >> please let me know your thoughts and reasoning on whether this is
> >> feasible or not, if yes, where to get start.
> >>     
> >
> > Does the above make sense?  I understand that it's easier to apply a two
> > line patch to Pango instead of doing one of the things I listed above,
> > but that just doesn't fit in the design, and it introduces other
> > problems you don't see right now.
> >
> >
> >   
> >> thank you for paying attention to this issue.
> >>
> >> Qianqian
> >>     
> >
> > Regards,
> >
> > behdad
> >
> >
> >   
> >> =============================================================== 
> >> Bug 321113 - Wrong glyph subsituation algorithm for digital characters
> >> and punctuations
> >> http://bugzilla.gnome.org/show_bug.cgi?id=321113
> >>
> >>
> >> Bug 345072 - changes font when typing different scripts on the same
> >> line 
> >> http://bugzilla.gnome.org/show_bug.cgi?id=345072
> >>
> >>
> >> Bug 345386 - Language and direction propagation in and between
> >> PangoLayouts
> >> http://bugzilla.gnome.org/show_bug.cgi?id=345386  (opened by yourself)
> >> https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=103679
> >>
> >>
> >> Bug 481210 - [All lang] [firefox] - Face of the number is changing
> >> when enter number + Char, in any Locale
> >> http://bugzilla.gnome.org/show_bug.cgi?id=481210
> >>
> >>
> >> Bug 481188 - ascii text space too narrow for Chinese encodings
> >> http://bugzilla.gnome.org/show_bug.cgi?id=481188
> >>
> >>
> >> Bugzilla Bug 129541: changes font when typing different scripts on the
> >> same line 
> >> https://bugzilla.redhat.com/show_bug.cgi?id=129541
> >>
> >>
> >> Bugzilla Bug 131218: [RHEL4] Characters get truncated in new pango
> >> https://bugzilla.redhat.com/show_bug.cgi?id=131218
> >>
> >>
> >> Bugzilla Bug 149991: [CJK pango] digits and punctuation in textbox
> >> give bad eol rendering and cursor placement
> >> https://bugzilla.redhat.com/show_bug.cgi?id=149991 (filed by Jens
> >> Petersen)
> >>
> >>
> >> https://bugzilla.redhat.com/show_bug.cgi?id=220885 (broken link)
> >>
> >>
> >> Bugzilla Bug 228804: [All lang] [firefox] - Face of the number is
> >> changing when enter number + Char, in any Locale
> >> https://bugzilla.redhat.com/show_bug.cgi?id=228804
> >>
> >>
> >> Bugzilla Bug 221361: [pango] ascii text space and punctuation is
> >> narrow for CJK
> >> https://bugzilla.redhat.com/show_bug.cgi?id=221361
> >>
> >>
> >> Bug 379125 - chinese punctuations after english letters are wrongly
> >> displayed
> >> https://bugzilla.mozilla.org/show_bug.cgi?id=379125
> >> https://bugzilla.mozilla.org/attachment.cgi?id=263185
> >> ===============================================================
> >>     
> >
> >   
> 
-- 
behdad
http://behdad.org/

...very few phenomena can pull someone out of Deep Hack Mode, with two
noted exceptions: being struck by lightning, or worse, your *computer*
being struck by lightning.  -- Matt Welsh

_______________________________________________
Fedora-fonts-list mailing list
Fedora-fonts-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/fedora-fonts-list