Regularizing contains operator semantics

keithp at keithp.com (Keith Packard) · Fri Jul 11 19:03:05 2003

"Contains" matching issues.

The contains operator is currently used in font listing and can be used in
match/edit rules.

LISTING FONTS

When listing fonts, contains should have "obvious" semantics; I suggest
that those semantics depend on the type of the value:

	string, number, boolean:

The font has an equal value for every value in the pattern.  This means
that using 'times,courier' for the family will result in no fonts
being listed, as no font has both times and courier family names.  In fact, I
can't see a good use for multiple values here as it would require multiple
values in the fonts; let's see if that is broken.  For strings, the change
here is that 'contains' does not mean substring: list 'courier' and you
won't see 'courier 10 pitch'.  I think strings should be treated as atomic
values in this context; fontconfig doesn't have string operators, which
is at least consistent.
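
As a rough illustration of the rule above (a sketch in Python, not fontconfig
code), listing can be modelled as "every pattern value equals some font
value", with strings compared whole.  The function name and the use of plain
lists are assumptions made for the example; case folding is ignored.

```python
def listing_contains(font_values, pattern_values):
    """Hypothetical helper: True when every value in the pattern is equal
    to some value in the font.  Strings are compared as atomic values, so
    there is no substring matching."""
    return all(p in font_values for p in pattern_values)


# 'courier' does not list a font whose only family is 'courier 10 pitch'.
print(listing_contains(["courier 10 pitch"], ["courier"]))

# 'times,courier' lists nothing unless a font carries both family names.
print(listing_contains(["times"], ["times", "courier"]))
```

Both calls print False; a font whose family list is exactly what the pattern
asks for is the only kind that gets listed.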

	charset:

The font contains the listed Unicode codepoints; in other words, the charset
provided by the font 'contains' all of the glyphs requested by the application.

	lang:

(Remember that 'lang' is a composite value consisting of a language value and
 a territory value.  The list of lang values in a font is computed from
 Unicode coverage ranges based on orthographies.  Except for Chinese, all of
 these coverage ranges are (currently) associated only with a language and not
 a territory.  Chinese is (currently) split into three territory groups
 (mainland China and Singapore, Hong Kong, Taiwan and Macau).  So, most
 language comparisons will be done with a language/territory pair supplied by
 the application (often from the current locale) against fonts which know
 only languages and not territories.  However, applications will also provide
 only languages at times to be matched against fonts which have languages and
 territories.)

The font supports all of the langs requested by the application.  I think
this means that the font 'contains' all of the langs requested by the
application (remember, we're talking about LISTING here).  Now, the tricky
part is defining what 'support' means for a specific lang entry.  When
the application provides a language/territory pair, then the font must
either provide a matching language/territory pair, or a bare language entry.
When the application provides a bare language, the font must either provide
a matching bare language entry or a language/territory pair with *any*
territory:

	application	font		"supports"
	-----------	----		----------
	zh		zh_cn		YES
	zh_tw		zh_cn		NO
	en_gb		en		YES
	en		en		YES
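
The table can be reproduced with a small sketch (Python, not fontconfig
code).  The helper names and the tag parsing are assumptions for the example:
tags are lower-cased and split at the first '_' or '-'.

```python
def split_lang(tag):
    """Hypothetical parser: 'zh_TW' -> ('zh', 'tw'), 'en' -> ('en', None)."""
    lang, _, territory = tag.lower().replace("-", "_").partition("_")
    return lang, territory or None


def lang_supports(app_tag, font_tag):
    """True when a single font lang entry 'supports' the application's lang."""
    app_lang, app_terr = split_lang(app_tag)
    font_lang, font_terr = split_lang(font_tag)
    if app_lang != font_lang:
        return False
    # A bare language on either side matches any territory on the other.
    if app_terr is None or font_terr is None:
        return True
    return app_terr == font_terr


def font_supports_all(app_tags, font_tags):
    """LISTING rule: every requested lang is supported by some font lang."""
    return all(any(lang_supports(a, f) for f in font_tags) for a in app_tags)


for app, font in [("zh", "zh_cn"), ("zh_tw", "zh_cn"), ("en_gb", "en"), ("en", "en")]:
    print(app, font, "YES" if lang_supports(app, font) else "NO")
```

The loop prints the same YES/NO column as the table.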

MATCHING

The LISTING algorithm is designed to sharply restrict the set of provided
fonts, so an empty list is often the result of an overspecified pattern.  That
matches the expected usage of providing precise information to users about
what actual fonts are available, rather than what font will be used when a
specific pattern is matched.  In contrast, MATCHING is designed to always
provide a font, and in fact to provide a score measuring how accurate that
match is so that the set of available fonts can be sorted by this metric
and returned to the application.

When matching fonts, we're not using the boolean 'contains' operators, but
rather measuring distance from the pattern to the font (in CS terms, LISTING
is a constraint satisfaction problem while MATCHING is a constraint
optimization problem).

	string, boolean:

Distance in these objects is measured with only two values -- matching and
nonmatching -- matching strings or booleans have distance 0 while
mismatching values have distance 1.

	number:

Distance between two numbers is just the absolute value of their difference
(the obvious value).  This is used for things like weight and slant; the
numeric values for those constants were carefully chosen to prefer reasonable
substitutions (italic and oblique are closer together than either is to
roman).
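
The two simple distance measures above (string/boolean and number) can be
sketched as follows.  This is illustrative Python, not fontconfig code; the
function names are made up for the example, while the slant values are the
ones defined in fontconfig.h.

```python
# Slant constants as defined in fontconfig.h.
FC_SLANT_ROMAN = 0
FC_SLANT_ITALIC = 100
FC_SLANT_OBLIQUE = 110


def atomic_distance(pattern_value, font_value):
    """Strings and booleans: 0 when the values match, 1 when they don't."""
    return 0 if pattern_value == font_value else 1


def number_distance(pattern_value, font_value):
    """Numbers: absolute value of the difference."""
    return abs(pattern_value - font_value)


# Asking for italic: an oblique font is a much closer substitute than roman.
print(number_distance(FC_SLANT_ITALIC, FC_SLANT_OBLIQUE))
print(number_distance(FC_SLANT_ITALIC, FC_SLANT_ROMAN))
```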

	charset:

Distance between two charsets is the count of characters requested by
the pattern but not provided by the font.  This means that a font which
fully covers the requested characters has distance '0'.
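
A minimal sketch of this measure, modelling a charset as a Python set of
characters (an assumption for illustration; fontconfig's own charset
representation is different):

```python
def charset_distance(pattern_chars, font_chars):
    """Count the characters the pattern requests that the font lacks."""
    return len(set(pattern_chars) - set(font_chars))


# A font covering everything requested is at distance 0; extra characters
# in the font do not count against it.
print(charset_distance("abc", "abcdef"))

# One requested character ('\u00e9') is missing from the font.
print(charset_distance("caf\u00e9", "abcdef"))
```

Distance 0 here coincides with the LISTING notion of the font's charset
containing the requested one.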

	lang: 

Distance has three values:

	0:	Pattern and font have equal language/territory,
		or pattern has only language and font has language with
		any territory.

	1:	Pattern and font have equal language and different
		territory (zh_CN vs zh_TW).

	2:	Pattern and font have different languages.
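
The three-valued lang distance might be sketched like this (Python, not
fontconfig code; names and tag parsing are assumptions).  One case is not
spelled out above: a pattern with language and territory against a font with
only a language.  The sketch ASSUMES distance 0 there, mirroring the LISTING
rule.

```python
def split_lang(tag):
    """Hypothetical parser: 'zh_TW' -> ('zh', 'tw'), 'en' -> ('en', None)."""
    lang, _, territory = tag.lower().replace("-", "_").partition("_")
    return lang, territory or None


def lang_distance(pattern_tag, font_tag):
    p_lang, p_terr = split_lang(pattern_tag)
    f_lang, f_terr = split_lang(font_tag)
    if p_lang != f_lang:
        return 2                      # different language
    if p_terr is None:
        return 0                      # bare language matches any territory
    if f_terr is None:
        return 0                      # ASSUMPTION: not defined in the text above
    return 0 if p_terr == f_terr else 1


print(lang_distance("zh_CN", "zh_TW"))
print(lang_distance("zh", "zh_CN"))
print(lang_distance("en", "fr"))
```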

EDITING

The EDITING algorithm needs a method for matching patterns for each edit
operation; this is another constraint satisfaction problem as the edit rules
are either applied or not applied.

Match rules in edit instructions can use many different operators to
constrain pattern selection:

	eq
	not_eq
	less
	less_eq
	more
	more_eq
	contains
	not_contains

Each of these operators behaves differently for each datatype.  For
datatypes which aren't ordered, I've defined the ordered operators to always
return false.

	string:

I think these should be treated as unordered objects so that collation
isn't visible to the user.  The remaining question is whether the 'contains'
operator should be used to detect substring presence.  The LISTING
operation above should not do this as the operator is not selectable, but
allowing 'contains' to do substring detection in an EDITING context means
that LISTING won't use Contains, but rather some Contains-like analog which
is actually Equal for strings.  Hmm.  Permitting Contains for EDITING would
probably be useful, especially for FC_STYLE pattern elements.

	boolean, number:

These have obvious semantics for all of the operators if
contains/not_contains are allowed to be synonyms for eq/not_eq.

	charset, lang:

I think the semantics described above for LISTING should apply here.
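
To make the EDITING rules concrete for strings, booleans and numbers, here is
a hedged sketch in Python (not fontconfig code).  The function name and the
operand order ("pattern value OP test value") are assumptions; charset and
lang are left out since they follow the LISTING semantics.

```python
import operator

ORDERED_OPS = {
    "less": operator.lt,
    "less_eq": operator.le,
    "more": operator.gt,
    "more_eq": operator.ge,
}


def edit_match(op, pattern_value, test_value):
    """Hypothetical evaluator for one match rule."""
    is_string = isinstance(pattern_value, str)
    if op == "eq":
        return pattern_value == test_value
    if op == "not_eq":
        return pattern_value != test_value
    if op in ORDERED_OPS:
        # Strings are unordered, so ordered operators always return false.
        if is_string:
            return False
        return ORDERED_OPS[op](pattern_value, test_value)
    if op in ("contains", "not_contains"):
        if is_string:
            hit = test_value in pattern_value    # substring detection
        else:
            hit = pattern_value == test_value    # synonym for eq
        return hit if op == "contains" else not hit
    raise ValueError(f"unknown operator: {op}")


print(edit_match("contains", "Bold Italic", "Bold"))   # style substring test
print(edit_match("less", "a", "b"))                    # unordered type
```

The first call matches and the second does not, even though "a" would sort
before "b" under ordinary collation.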

PROPOSED CHANGES

I believe the only changes necessary to implement these semantics are:

1)	Use a Contains-like operator for LISTING which does exact
	matching for strings, and permit Contains for EDITING to do
	substring matching

2)	Change lang Contains semantics to make ll_xx contain ll and
	ll contain ll_xx (currently, I believe ll_xx does not contain ll)