Re: advice on schema for multilingual text

Michael Glaesemann <grzm@xxxxxxxxxxxxx> · Mon, 10 Apr 2006 09:49:35 +0900

On Apr 9, 2006, at 0:31 , Daniel McBrearty wrote:

Hi

I have a website that has multilingual text stored in the database.  
Currently I just have a flat table (lets called it "translations"),  
one row per text item, one column per language. This works OK, for  
now, but I am looking at a redesign. Mostly I want to keep  
information about the languages in teh db as well, so that look  
like an extra table, one row per lang.

The problem that now arises is that there is an expected  
correlation between the "languages" and "translations" tables -  
there should be a row in languages for each column of translations.  
AFAIK (could well be wrong, I am no expert in db theory) there is  
no real way to express in the ddl. Not ideal.

An alternative layout would now be to lose the "translations"  
table, and have two tables in place; one called "base_text"  
containing the text to be translated, and another called, say,  
"tx_text" which contains the translations. Each row of "tx_text"  
references both "base_text" and also "languages".

This looks like a nice layout, as there is an abstract rep of the  
languages, and we lose the "translations" table which can get very  
wide. It's nice that the schema doesn't actually change to add a  
new language.

BUT there are certain invariants that need to be enforced. The main  
one is this:

There must only be one row in "site_text" for any given language  
referencing a given row of "base_text".

You'd want a unique index on (base_text, language), like:

create table tx_text
(
	base_text text not null
		references base_text(base_text)
	, language text not null
		references languages (language)
	, primary key (base_text, language)
	, tx_text text not null
);

I would also like to have a column in "languages", type boolean,  
called "is_base" - this says what the base language is. Here, only  
ONE row can have a true value. (Obviously it has default value of  
false and is not null).

Here you want a partial unique index on languages where is_base is true
create table languages
(
	language text primary key
	, is_base boolean not null
);

create unique index languages_only_one_true_base_idx
on languages (is_base)
where is_base;

Another invariant now comes in - the language referenced by every  
row of "site_text" MUST have "is_base" set to FALSE.

I can think of two ways to do this: one is to write a trigger to  
enforce this, something like:

create function non_base_language_translations_check
returns trigger()
language plpgsql as $$
begin
	if exists (
		select *
		from tx_text
		natural join languages
		where not is_base
		)
	then
		raise exception
		'Language of translation text must not be a base language.';
	end if;
end;
$$;

Then use this function on triggers that fire on insert and update on  
tx_text and on update on languages.

Another is to include the is_base column in tx_text (with both  
language and is_base referencing languages) and use a check  
constraint to make sure is_base is false:

create table tx_text
(
	base_text text not null
		references base_text(base_text)
	, language text not null
	, is_base boolean not null check not is_base
	, foreign key (language, is_base) references languages (language,  
is_base)
		on update cascade
	, primary key (base_text, language)
	, tx_text text not null
);

The latter method is denormalized, which is not something I usually  
recommend. I don't know how the constraint checking overhead of using  
a trigger compares with using the foreign key and check constraint.

A third idea would be to have two languages tables: one with a single  
row for the base language and another for the target languages (You  
might even do this using table inheritance, though I haven't thought  
this completely through). tx_text would reference the  
target_languages table (or child table, as the case may be).

I'm sure others have opinions on this as well.

Hope this helps.

Michael Glaesemann
grzm myrealbox com