Search Postgresql Archives

Fwd: Import large data set into a table and resolve duplicates?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Forwarding to the list, as I forgot ( as usual, think I did the same with my original message ). Sorry for the duplicates.
---------- Forwarded message ----------
From: Francisco Olarte <folarte@xxxxxxxxxxxxxx>
Date: Sun, Feb 15, 2015 at 6:48 PM
Subject: Re: Import large data set into a table and resolve duplicates?
To: Eugene Dzhurinsky <jdevelop@xxxxxxxxx>


Hi Eugene:

On Sun, Feb 15, 2015 at 6:29 PM, Eugene Dzhurinsky <jdevelop@xxxxxxxxx> wrote:
On Sun, Feb 15, 2015 at 01:06:02PM +0100, Francisco Olarte wrote:
> You state below 200k rows, 50k lines per path. That is not huge unless
> "series" really big, is it?

series data is in between of 100-4096 chars

​Well, then it depends on the average length. With 1k avg you'll have 200Mb for the series. Not having stats on your setup, that is easily manageable by today standard machines.

One thing that strikes me is you are either at the beginning of your usage of this or you have A LOT of already present lines in the path ( I mean, path has one fourth of the lines of dictionary ). Which is the case?​

That sounds cool, but I'm a bit worried about the performance of a lookup over
the series column and the time to create index for the "temp" table on
"series" column. But perhaps it's better to try this and if a performance will
go really bad - then do some optimizations, like partitioning etc.

Who said create an index on the temp tables?​​ They ​are small. And, anywhere, If i would be doing this with files I'll NEVER do it indexing, I'll sort, merge, go on ( this is Why I suggested temp files, postgres does sort and merges really well if it needs to and the optimizer has other tricks available ).

When doing bulk operations ( like this, with a record count of 25% of the whole file ) indexing is nearly always slowed than sorting and merging. As I said, I do not know the distribution of your data size, but assuming the worst case ( 200k 4k entries dict, which is 800Mb data, 50k 4k patches, which is 200Mb data), you can do it with text files in 5 minutes by sorting and merging in 8Mb RAM ( I've done similar things, in that time, and the disk where much slower, I've even done this kind of things with half inch tapes and it wasn't thar slow ). You just sort the dictionary by series, sort patch by series, read both comparing keys and write a result file and a new dictionary, and, as the new dictionary is already sorted, you do not need to read it the next time. It was the usual thing to do for updating accounts on the tape days.

And, as they always said, measure, then optimize. I would try it ( use some little tables, like say 20 entries dict and 5 lines patches to debug the queries ), then measure ( if you don't mind hogging the server, or have a test one launch the whole thing and measure times, if you get it done in a reasonable time, and you may be surprised, some times simple algorithms are fast ).

​The only other thing I would try maybe doing the things without temp tables, something like:

select series, id from dictionary join patches_in using series
UNION ALL ( faster, as you know there will be no duplicates do not bother checking them )
insert into dictionary (series) select p.series from patches where patches not in (select series from dictionary) returning series, id

( you may need to experiment a little with the queries and use some with expresions, but you see the logic ).

Francisco Olarte.






 

Thank you!

--
Eugene Dzhurinsky



[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux