dvb-apps: charset support

Mauro Carvalho Chehab <mchehab@xxxxxxxxxx> · Wed, 06 Apr 2011 09:27:57 -0300

Hi,

I added some patches to dvb-apps/util/scan.c in order to properly support the EN 300 468 charsets.
Before the patches, scan was producing invalid UTF-8 sequences here for ISO-8859-15 text, as
scan was simply filling the service/provider names with whatever non-control characters were
there. So, if your computer uses the same charset as your service provider, you're lucky.
Otherwise, invalid characters will appear in the scan tables.

After the changes, scan gets the charset from the locale environment, and uses it as the output
charset for the output files.
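
As a rough illustration of that step (not the actual patch), the locale charset can be queried with
setlocale() plus nl_langinfo(CODESET), and a name can be converted with iconv(). The helper names
output_charset() and convert_name() are hypothetical, and the use of iconv() is an assumption, as
the message above doesn't say how scan does the conversion.

```c
#include <assert.h>
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: returns the charset name of the current locale,
 * for example "UTF-8" or "ISO-8859-15". */
const char *output_charset(void)
{
	/* Adopt the LANG / LC_* settings from the environment. */
	setlocale(LC_ALL, "");
	return nl_langinfo(CODESET);
}

/* Hypothetical helper: converts src (src_len bytes, encoded in from_cs)
 * into to_cs and writes a NUL-terminated result to dst.
 * Returns 0 on success, -1 on any failure. */
int convert_name(const char *from_cs, const char *to_cs,
		 const char *src, size_t src_len,
		 char *dst, size_t dst_size)
{
	assert(src != NULL && dst != NULL && dst_size > 0);

	iconv_t cd = iconv_open(to_cs, from_cs);
	if (cd == (iconv_t)-1)
		return -1;

	char *in = (char *)src;		/* iconv() takes a non-const pointer */
	size_t in_left = src_len;
	char *out = dst;
	size_t out_left = dst_size - 1;	/* keep room for the terminator */

	size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
	iconv_close(cd);
	if (rc == (size_t)-1)
		return -1;

	*out = '\0';
	return 0;
}
```

A caller would pass output_charset() as to_cs, so that the names written to the output files match
what the user's terminal and editor expect.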

The TS info may indicate the charset in use in the first byte of the provider name and service name,
if that byte is < 0x20. If it is not provided, the spec says that character table 00 should be
assumed (a modified version of the ISO 6937 charset). However, in my tests, local carriers here
don't fill it, yet they use the ISO-8859-15 charset instead of ISO-6937. So, a new optional parameter
allows changing the default charset.
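
The selection logic can be sketched roughly as below. This is not the code from scan.c: the function
name dvb_text_charset() is hypothetical, the charset names are just ones that iconv implementations
commonly accept, and the 0x01 to 0x0B mapping is my reading of Annex A, so please check it against
the spec. The 3-byte 0x10 selector is left out to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper. Looks at the first byte of a DVB text field and
 * returns the charset name to use, or NULL if the selector isn't handled.
 * *skip receives the number of leading selector bytes to drop.
 * default_cs is returned when the text carries no selector byte. */
const char *dvb_text_charset(const unsigned char *buf, size_t len,
			     const char *default_cs, size_t *skip)
{
	/* Selector bytes 0x01..0x0B; 0x00 and 0x08 are reserved. */
	static const char *const iso8859[] = {
		NULL, "ISO-8859-5", "ISO-8859-6", "ISO-8859-7",
		"ISO-8859-8", "ISO-8859-9", "ISO-8859-10", "ISO-8859-11",
		NULL, "ISO-8859-13", "ISO-8859-14", "ISO-8859-15",
	};

	assert(buf != NULL && skip != NULL);

	*skip = 0;
	if (len == 0 || buf[0] >= 0x20)
		return default_cs;	/* table 00, or the user's override */

	*skip = 1;
	if (buf[0] <= 0x0B)
		return iso8859[buf[0]];

	switch (buf[0]) {
	case 0x11: return "UCS-2BE";	/* ISO-10646 */
	case 0x12: return "EUC-KR";	/* Korean Character Set */
	case 0x13: return "GB2312";
	case 0x14: return "BIG5";
	case 0x15: return "UTF-8";	/* ISO-10646/UTF-8 */
	default:   return NULL;		/* 0x10 and reserved values */
	}
}
```

Passing "ISO-8859-15" as default_cs would reproduce the override described above for carriers that
don't send a selector byte.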

Also, the spec provides 2 tables with control character codes, one for 1-byte character tables,
and another for 2-byte character tables. Before the patch, the 1-byte control character table
was applied to all character sets. Now, the table is applied only to ISO-8859* and ISO-6937,
as it doesn't seem to make sense for the other character sets. However, the 2-byte control
character table has not been implemented yet, for a few reasons:
	1) I'm not familiar with 2-byte charsets;
	2) I don't have any environment here that would allow me to test it;
	3) The spec is not very clear about what character tables use 2-byte control codes.
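
For reference, the 1-byte handling can be sketched as below. This is only an illustration, not the
code in scan.c: strip_one_byte_controls() is a hypothetical name, and it assumes that the 1-byte
control codes occupy 0x80 to 0x9F, which is my reading of table A.1. It simply drops them, while a
real implementation may prefer to map some of them (a line break, for instance) to something else.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper. Removes the single-byte DVB control codes
 * (assumed range 0x80..0x9F) from buf in place and returns the new length.
 * Only meant for the ISO-8859* and ISO-6937 tables. */
size_t strip_one_byte_controls(unsigned char *buf, size_t len)
{
	size_t out = 0;

	assert(buf != NULL);

	for (size_t i = 0; i < len; i++) {
		if (buf[i] >= 0x80 && buf[i] <= 0x9F)
			continue;	/* control code: drop it */
		buf[out++] = buf[i];
	}
	return out;
}
```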

EN 300 468 Annex A says, just before the 2-byte control code table:
	"For two-byte character tables, the codes in the range 0xE080 to 0xE09F 
	 are assigned to control functions as shown in table A.2."

So, it seems that the 2-byte control character table refers to character tables 0x11 to 0x14
(ISO-10646 + Korean Character Set + GB2312 + BIG5).

However, the table A.2 is described as just:
	"Table A.2: DVB codes within private use area of ISO/IEC 10646"

So, one may assume that it refers only to ISO-10646 (character table 0x11), or to this one
plus BIG5 (table 0x14), as BIG5 is a subset of ISO-10646.

The spec is even less clear about what should be done with character table 0x15 (ISO-10646/UTF-8),
as UTF-8 codes have a variable length, from 1 to 4 bytes.

I _suspect_ that all character tables that are not ISO-8859 or ISO-6937 should be using table
A.2 (that means, character tables 0x11 to 0x15).

The code change to implement 2-byte control codes should be trivial, though. A placeholder for such
code is there in the scan code, with a short comment.
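
A minimal sketch of what such a filter could look like follows. It is not what is in the
placeholder: strip_two_byte_controls() is a hypothetical name, and it assumes the text is a sequence
of big-endian 16-bit code units, which would fit character table 0x11 but not necessarily the other
tables discussed above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper. Removes the two-byte DVB control codes
 * (0xE080..0xE09F, per the Annex A quote above) from buf in place and
 * returns the new length in bytes. Assumes big-endian 16-bit code units;
 * a trailing odd byte is dropped. */
size_t strip_two_byte_controls(unsigned char *buf, size_t len)
{
	size_t out = 0;

	assert(buf != NULL);

	for (size_t i = 0; i + 1 < len; i += 2) {
		unsigned int code = ((unsigned int)buf[i] << 8) | buf[i + 1];

		if (code >= 0xE080 && code <= 0xE09F)
			continue;	/* control code: drop both bytes */
		buf[out++] = buf[i];
		buf[out++] = buf[i + 1];
	}
	return out;
}
```

For UTF-8 (table 0x15), the same code points would show up as 3-byte sequences instead, so this
byte-pair scan would not apply there as is.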

It would be great to have some feedback about it. So, comments are welcome.

Thanks,
Mauro.