2011/1/23 Attila Nagy <bra@xxxxxx>:
> Hello,
>
> I'm looking for a database backend for a dictionary coder project.
> It would have three major tasks:
> - take a text corpus, get its words and substitute each word with a
> 64-bit integer (the word:integer mapping is always constant) and
> store the result (encoding)

OK. PostgreSQL allows you to do that easily.

> - take the previous result and substitute the integers with words
> (decoding)

Likewise.

> - the words should be reference counted, so if a word can no longer
> be found in any of the encoded messages, delete it (and optionally
> free its integer ID, but 64 bits are believed to be enough for a
> long time, although smaller IDs result in smaller encoded files).
> This could be achieved by informing the database of the words of a
> deleted message, so it could decrement those refcounts and delete
> the records if needed.

Yes, like what depesz does here:
http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/

> I can easily do this with any RDBMS, with a table of three columns:
> an auto-incremented ID, word and refcount, with a unique index on
> word. The challenge could be:
> - that it should scale to several TBs of size and several (hundred)
> billion records. One scenario would be to store about 40 TB of
> words with an average word length of about 50-60 bytes (that's
> about 800*10^9 records). It should work well both for inserting
> and searching (encoding and decoding) words.

I strongly suggest you have a look at the intarray contrib module
(it ships with PostgreSQL). Note that intarray works on 4-byte
integer arrays, so with 64-bit IDs you would use plain bigint[]
arrays instead.

> - I need atomicity and durability, but having them at the word
> (record) level takes too many IOPS and serves no purpose, so it
> would be good to have an interface for inserting about 1000-500000
> words in one call, assigning a unique ID to each unique word and
> storing them (if the word already has an ID, increment its
> refcount), and giving back the ID for each word. This could be
> committed as one transaction, so the transactions could be big,
> sparing IOPS.

Arrays allow very good compression of the data per row. (Using
arrays like that is not the classic relational approach, but it is
good for performance.)

> - I need concurrency, so when the above happens from two sources
> at the same time, the same word in the two transactions must get
> the same ID

One transaction will finish before the other, which allows exactly
that (but they can start at the same time).

> Is PostgreSQL a good choice for doing this and, if yes, what would
> be the optimal (for both time and space efficiency at encoding and
> decoding) use case?

PostgreSQL should work for that, yes. You'll have to compensate for
the size with good hardware and good SQL (and probably some
optimizations, like using arrays). I have appended a few rough SQL
sketches at the end of this mail.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/ ; PostgreSQL : Expertise, Formation et Support
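PS: here is a minimal sketch of the dictionary table and the batch
upsert I have in mind (table and column names are my own; the
INSERT ... ON CONFLICT form needs PostgreSQL 9.5 or later, on older
releases you would wrap the same logic in a plpgsql function that
loops and retries on unique_violation):

CREATE TABLE words (
    id       bigserial PRIMARY KEY,      -- the 64-bit code for the word
    word     text      NOT NULL UNIQUE,
    refcount bigint    NOT NULL DEFAULT 0
);

-- Encode one batch: insert the unseen words, bump the refcount of
-- the known ones, and hand back the word -> id mapping, all in one
-- statement.  GROUP BY collapses duplicates inside the batch (ON
-- CONFLICT may not touch the same row twice), and ORDER BY gives
-- concurrent encoders a consistent lock order, which reduces the
-- risk of deadlocks.
INSERT INTO words (word, refcount)
SELECT w, count(*)
  FROM unnest(ARRAY['the','quick','brown','fox','the']) AS u(w)
 GROUP BY w
 ORDER BY w
ON CONFLICT (word)
DO UPDATE SET refcount = words.refcount + EXCLUDED.refcount
RETURNING id, word;

RETURNING gives you the id for every word of the batch, new or
already known, and the whole batch commits as a single transaction,
which is exactly what you want to spare IOPS.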
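For the decoding side, storing each encoded message as one bigint[]
keeps the row count and the per-row overhead down, and large arrays
get compressed via TOAST. A sketch (names are mine again; unnest ...
WITH ORDINALITY needs PostgreSQL 9.4 or later):

CREATE TABLE messages (
    id    bigserial PRIMARY KEY,
    codes bigint[]  NOT NULL      -- the encoded text, one id per word
);

-- Decode one message: expand the array, join each code back to its
-- word, and restore the original word order with the ordinality.
SELECT w.word
  FROM messages m
 CROSS JOIN LATERAL unnest(m.codes) WITH ORDINALITY AS t(code, ord)
  JOIN words w ON w.id = t.code
 WHERE m.id = 42
 ORDER BY t.ord;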
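And the delete path, which only touches the words the deleted
message actually used (message id 42 is a placeholder; this only
needs unnest(), so any reasonably recent release will do):

BEGIN;

-- Decrement the refcount of every word the message used, counting
-- duplicates inside the message.
UPDATE words w
   SET refcount = w.refcount - d.cnt
  FROM (SELECT code, count(*) AS cnt
          FROM (SELECT unnest(codes) AS code
                  FROM messages
                 WHERE id = 42) s
         GROUP BY code) d
 WHERE w.id = d.code;

-- Garbage-collect: drop the touched words that are no longer
-- referenced by any message.
DELETE FROM words w
 USING messages m
 WHERE m.id = 42
   AND w.id = ANY (m.codes)
   AND w.refcount <= 0;

DELETE FROM messages WHERE id = 42;

COMMIT;

Running the three statements in one transaction keeps the refcounts
consistent for everybody else.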