2011/1/23 Attila Nagy <bra@xxxxxx>:
> Hello,
>
> I'm looking for a database backend for a dictionary coder project.
> It would have three major tasks:
> - take a text corpus, get its words and substitute each word with a
> 64-bit integer (the word:integer mapping is always constant) and
> store the result (encoding)

OK. PostgreSQL allows you to do that easily.

> - take the previous result and substitute the integers with words
> (decoding)

Likewise.

> - the words should be reference counted, so if a word can no longer
> be found in any of the encoded messages, delete it (and optionally
> free its integer ID, but 64 bits are believed to be enough for a
> long time, although smaller IDs result in smaller encoded files).
> This could be achieved by informing the database of the words of a
> deleted message, so it could decrement those refcounts and delete
> the records if needed.

Yes, like what depesz does here:
http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/

> I can easily do this with any RDBMS, with a table of three columns:
> an auto-incremented ID, word and refcount, with a unique index on
> word. The challenge could be:
> - that it should scale to several TBs of size and several (hundred)
> billion records. One scenario would be to store about 40 TB of
> words with an average word length of about 50-60 bytes (that's
> about 800*10^9 records). It should work well both for inserting
> and searching (encoding and decoding) words.

I strongly suggest you have a look at the intarray contrib module
(it ships with PostgreSQL). Note that intarray works on 4-byte
integer arrays, so with 64-bit IDs you would use plain bigint[]
arrays instead.

> - I need atomicity and durability, but having them at the word
> (record) level takes too many IOPS and serves no purpose, so it
> would be good to have an interface for inserting about 1000-500000
> words in one call, assigning a unique ID to each unique word and
> storing them (if the word already has an ID, increment its
> refcount), and giving back the ID for each word. This could be
> committed as one transaction, so the transactions could be big,
> sparing IOPS.

Arrays allow very good compression of the data per row. (Using
arrays like that is not the classic relational approach, but it is
good for performance.)

> - I need concurrency, so when the above happens from two sources
> at the same time, the same word in the two transactions must get
> the same ID

One transaction will finish before the other, which allows exactly
that (but they can start at the same time).

> Is PostgreSQL a good choice for doing this and, if yes, what would
> be the optimal (for both time and space efficiency at encoding and
> decoding) use case?

PostgreSQL should work for that, yes. You'll have to compensate for
the size with good hardware and good SQL (and probably some
optimizations, like using arrays). I have appended a few rough SQL
sketches at the end of this mail.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/ ; PostgreSQL : Expertise, Formation et Support
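PS: here is a minimal sketch of the dictionary table and the batch
upsert I have in mind (table and column names are my own; the
INSERT ... ON CONFLICT form needs PostgreSQL 9.5 or later, on older
releases you would wrap the same logic in a plpgsql function that
loops and retries on unique_violation):

CREATE TABLE words (
    id       bigserial PRIMARY KEY,      -- the 64-bit code for the word
    word     text      NOT NULL UNIQUE,
    refcount bigint    NOT NULL DEFAULT 0
);

-- Encode one batch: insert the unseen words, bump the refcount of
-- the known ones, and hand back the word -> id mapping, all in one
-- statement.  GROUP BY collapses duplicates inside the batch (ON
-- CONFLICT may not touch the same row twice), and ORDER BY gives
-- concurrent encoders a consistent lock order, which reduces the
-- risk of deadlocks.
INSERT INTO words (word, refcount)
SELECT w, count(*)
  FROM unnest(ARRAY['the','quick','brown','fox','the']) AS u(w)
 GROUP BY w
 ORDER BY w
ON CONFLICT (word)
DO UPDATE SET refcount = words.refcount + EXCLUDED.refcount
RETURNING id, word;

RETURNING gives you the id for every word of the batch, new or
already known, and the whole batch commits as a single transaction,
which is exactly what you want to spare IOPS.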
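For the decoding side, storing each encoded message as one bigint[]
keeps the row count and the per-row overhead down, and large arrays
get compressed via TOAST. A sketch (names are mine again; unnest ...
WITH ORDINALITY needs PostgreSQL 9.4 or later):

CREATE TABLE messages (
    id    bigserial PRIMARY KEY,
    codes bigint[]  NOT NULL      -- the encoded text, one id per word
);

-- Decode one message: expand the array, join each code back to its
-- word, and restore the original word order with the ordinality.
SELECT w.word
  FROM messages m
 CROSS JOIN LATERAL unnest(m.codes) WITH ORDINALITY AS t(code, ord)
  JOIN words w ON w.id = t.code
 WHERE m.id = 42
 ORDER BY t.ord;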
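And the delete path, which only touches the words the deleted
message actually used (message id 42 is a placeholder; this only
needs unnest(), so any reasonably recent release will do):

BEGIN;

-- Decrement the refcount of every word the message used, counting
-- duplicates inside the message.
UPDATE words w
   SET refcount = w.refcount - d.cnt
  FROM (SELECT code, count(*) AS cnt
          FROM (SELECT unnest(codes) AS code
                  FROM messages
                 WHERE id = 42) s
         GROUP BY code) d
 WHERE w.id = d.code;

-- Garbage-collect: drop the touched words that are no longer
-- referenced by any message.
DELETE FROM words w
 USING messages m
 WHERE m.id = 42
   AND w.id = ANY (m.codes)
   AND w.refcount <= 0;

DELETE FROM messages WHERE id = 42;

COMMIT;

Running the three statements in one transaction keeps the refcounts
consistent for everybody else.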