
Re: Postgresql as a dictionary coder backend?

 On 01/24/2011 05:27 AM, Fredric Fredricson wrote:

>> I can easily do this with any RDBMS, with a table of three columns: an auto-incremented ID, word, and refcount, with a unique index on word (sketched below).
>> The challenge could be:
>> - that it should scale to several TBs of size and several (hundred) billion records. One scenario would be to store about 40 TBs of words with an average word length of about 50-60 bytes (that's about 800*10^9 records). It should work well both for inserting and searching (encoding and decoding) words.
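For reference, a minimal sketch of such a table in PostgreSQL (the names and the bytea type are illustrative, not from the thread):

    CREATE TABLE words (
        id       bigserial PRIMARY KEY,       -- auto-incremented ID
        word     bytea  NOT NULL UNIQUE,      -- 50-60 byte entries, not necessarily text
        refcount bigint NOT NULL DEFAULT 1    -- how many times the word has been seen
    );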
> 50-60 bytes! Oh, so we are not talking about natural language here. Sorry, I just assumed that.
Yes, sorry.
> Still, I think performance will be a big issue here. I have never tried PostgreSQL on anything faster than a fast PC, but in my humble experience an insert takes at least one ms. At that speed, 800*10^9 records would take 25 years to insert. I think you have to think bigger than a single server (OK, that was stating the obvious).
Well, to be honest, that would be for a single server, and I would like to have more than one. :) One ms per insert is 1000 TPS only if you can't drive the DB harder from multiple threads. I haven't really got a fast PC yet, but I could easily achieve 12k TPS with some multicore hardware and BBWC RAID controllers.
It's only 771 days then. :)
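(Worked out: 800*10^9 records at 12,000 TPS is about 6.7*10^7 seconds, i.e. roughly 771 days, versus the ~25 years at 1,000 TPS.)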
>> - I need atomicity and durability, but having these at the word (record) level takes too many IOPS and serves no purpose, so it would be good to have an interface for inserting about 1000-500000 words in one call: assign a unique ID to each unique word and store it (if the word already has an ID, increment its refcount), then give back the IDs for all the words. This could be committed as one big transaction, sparing IOPS (a sketch follows below).
>> - I need concurrency, so when the above happens from two sources at the same time, the same word in the two transactions must get the same ID.
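For what it's worth, on PostgreSQL 9.5 or later (well after this thread) the batched insert-or-increment could be sketched in one statement against the words table above. The batch must be de-duplicated first, with per-batch counts aggregated into cnt, because ON CONFLICT cannot touch the same row twice in one statement; the values shown are placeholders:

    -- One batch = one transaction; returns the ID for every word in the batch.
    INSERT INTO words (word, refcount)
    SELECT batch.word, batch.cnt
    FROM (VALUES ('foo'::bytea, 1::bigint),
                 ('bar'::bytea, 2::bigint)) AS batch(word, cnt)
    ON CONFLICT (word)
    DO UPDATE SET refcount = words.refcount + EXCLUDED.refcount
    RETURNING id, word;

Under concurrent batches, the second writer of a brand-new word blocks on the unique index until the first commits and then takes the UPDATE path, so both sources end up with the same ID, as the second requirement asks.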
> Unless the ID is some kind of hash you will have to serialize inserts of new words.
I'm starting to think this too, as long as the unique constraint is there...
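One way around that serialization, following the hash idea above, would be to derive the ID from the word itself, so that concurrent writers compute identical IDs with no coordination. A hypothetical variant using pgcrypto (not something proposed in the thread):

    CREATE EXTENSION IF NOT EXISTS pgcrypto;

    -- Both sources compute the same 20-byte ID independently.
    SELECT digest('someword'::bytea, 'sha1') AS id;

The trade-off is a wider ID (20 bytes for SHA-1 against 8 for a bigserial), which is significant at 800*10^9 rows.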

>> Is PostgreSQL a good choice for doing this, and if yes, what would be the optimal use case (for both time and space efficiency at encoding and decoding)?
> I might be wrong, but this kind of project should probably not rely on a "standard" RDBMS. The data structure itself does not seem too complex, and the performance requirements are quite demanding.
I've done some experiments with BerkeleyDB and Tokyo/Kyoto Cabinet, but I was curious about the opinion of the RDBMS masters.

Thanks for sharing.

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

