On Sun, Feb 15, 2015 at 06:48:36PM +0100, Francisco Olarte wrote:
> One thing that strikes me is you are either at the beginning of your usage
> of this or you have A LOT of already present lines in the patch ( I mean,
> patch has one fourth of the lines of dictionary ). Which is the case?

Right now I'm at the very beginning stage, yes, and I expect a "cache miss"
ratio for the dictionary of at least 70%, which will eventually drop to 5-10%.

> When doing bulk operations ( like this, with a record count of 25% of the
> whole file ) indexing is nearly always slower than sorting and merging. As
> I said, I do not know the distribution of your data size, but assuming the
> worst case ( 200k 4k entries dict, which is 800Mb data, 50k 4k patches,
> which is 200Mb data), you can do it with text files in 5 minutes by sorting
> and merging in 8Mb RAM ( I've done similar things, in that time, and the
> disks were much slower, I've even done this kind of thing with half inch
> tapes and it wasn't that slow ). You just sort the dictionary by series,
> sort patch by series, read both comparing keys and write a result file and
> a new dictionary, and, as the new dictionary is already sorted, you do not
> need to read it the next time. It was the usual thing to do for updating
> accounts on the tape days.

So you suggest taking this out of Postgres? That's interesting. Simply put,
I'll dump the dictionary, sorted by series, to a file. Then I'll sort the
patch file by series. Then I'll merge the dictionary (left) and the patch
(right) files. During the merge, if a (right) line doesn't have a
corresponding (left) line, I'll put a nextval expression for the sequence as
its ID parameter. Then I'll truncate the existing dictionary table and COPY
the data from the merged file into it.

Is that what you meant?

Thank you!

--
Eugene Dzhurinsky
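To make the merge step above concrete, here is a minimal sketch in Python. It is
only an illustration under a few assumptions of mine, not something taken from
the thread: the function name merge_dictionary and the (id, series) row layout
are hypothetical, both inputs are assumed to be sorted by series in the same
order Python uses to compare strings (for example a byte-wise "C" collation),
and new IDs come from a local counter rather than from nextval, because COPY
loads literal values and does not evaluate expressions. In practice the counter
would have to start above the highest ID already in use, and the sequence would
need to be advanced afterwards.

```python
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[int, str]  # hypothetical layout: (id, series)


def merge_dictionary(
    dictionary: Iterable[Row],
    patch: Iterable[str],
    next_id: int,
) -> Iterator[Row]:
    """Merge a sorted dictionary with a sorted patch, streaming the result.

    Both inputs must already be sorted by series with the same ordering that
    Python uses for str comparison. Existing rows keep their IDs; series that
    appear only in the patch receive IDs counted up from next_id (assumption:
    the caller picked next_id above every existing ID).
    """
    dict_iter = iter(dictionary)
    current: Optional[Row] = next(dict_iter, None)
    last_new: Optional[str] = None

    for series in patch:
        # Emit every dictionary row whose key sorts before this patch key.
        while current is not None and current[1] < series:
            yield current
            current = next(dict_iter, None)

        if current is not None and current[1] == series:
            # Already in the dictionary; the row is emitted later by the
            # loop above or by the final drain.
            continue
        if series == last_new:
            # Duplicate key inside the patch; an ID was already assigned.
            continue

        yield (next_id, series)
        last_new = series
        next_id += 1

    # Drain whatever is left of the dictionary.
    while current is not None:
        yield current
        current = next(dict_iter, None)
```

Because the function only ever holds one row from each input, memory use stays
constant regardless of file size, and the output comes out sorted by series,
so it can be reused as the sorted dictionary for the next run.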