Hello,
I'm looking for a database backend for a dictionary coder project. It
would have three major tasks:
- take a text corpus, get their words and substitute each word by a 64
bit integer (the word:integer is always constant) and store the result
(encoding)
- take the previous result and substitute the integers with words (decoding)
- the words should be reference counted, so if a word can be no longer
found in any of the encoded messages, delete it (and optionally free
it's integer ID, but 64 bit is believed to be enough for a long time,
although having smaller IDs result smaller encoded files). This could be
achieved by informing the database of the words of a deleted message, so
it could decrement those refcounts and delete the records if needed.
I can easily do this with any RDBMS, with a table of three columns: auto
incremented ID, word and refcount, with a unique index on word.
The challenge could be:
- that it should scale to several TBs of size and several (hundred)
billion of records. One scenario would be to store about 40 TBs of words
and the average word length would be about 50-60 bytes (that's about
800*10^9 records). It should work well both for inserting and searching
(encoding and decoding) words.
- I need atomicity and durability, but having these on a word (record)
level takes too much IOPS and have no use, so it would be good to have
an interface for inserting about 1000-500000 words in one call, assign a
unique ID to each unique words and store them (if the word has had
already an ID, increment its refcount) and give back the IDs for each
words. This transaction could be committed as one, so the transactions
could be big, sparing IOPS.
- I need concurrency, so when the above happens from two sources at the
same time, the same word in the two transactions must get the same ID
Is postgresql a good choice for doing this and if yes, what would be the
optimal (for both time and space efficiency at encoding and decoding)
use case?
Thanks,
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general