Hi Seb:

On Thu, May 1, 2014 at 8:50 PM, Seb <spluque@xxxxxxxxx> wrote:
> Thanks, I'm glad to hear you've used this approach successfully.

Well, this is always successful if you are able to develop a moderately
complex script.

> It seems as though the best solution is to do a single SELECT to get the
> data out of the server (it is a view with a very complex query plan
> joining several other similar views), and then pipe the output through
> say awk to break down into chunks for writing the files, as you
> describe.

If you've got enough disk on the target machine for two copies of the data
(several Gb, even a hundred, shouldn't be a problem with today's, or even
yesterday's, machines), I'd recommend just doing a COPY of the view to a
file and then processing that file. It lets you get away with very simple
scripts for the chunking, and workstation disk is normally way cheaper than
server disk.

Given what you describe, what I would normally do is spool the whole result
to a file, maybe gzipping it along the way if it's really big (in my
experience some current machines, fast multicores with not-so-fast disks,
are faster gzipping and reading/writing a third of the data, which is
easily achievable with gzip when your data are numbers and timestamps, than
writing the full set), and then write a set of small scripts that work on
that file.

I do not grok awk (I began using perl in the 4.019 era and substituted it
for awk/sed in all kinds of one-liners and similar tasks), but with perl,
on a moderate machine (like the one I'm using, a Core Duo with 2Gb, quite
old by today's standards), you can filter the data and split it into
several hundred files at nearly full disk speed.

As I commented before, if your query results exhibit some locality on the
key, you can keep an LRU cache of open files and do the split in one pass.
I wouldn't try to force ordering in the db; I've found disk sorts work
better for this kind of problem.

Ah, another hint. I've had to do somewhat similar tasks (dumping a big
query which joins a lot) in the past. In some of those cases the job was
sweeping and concatenating a slew of really big tables while joining them
with a lot of small ones, something like sweeping 20Gb while joining it
with 15 more tables totalling 512Mb among them, generating a 100Gb
denormalized result. For those I wrote a program which just slurped all the
small tables into RAM and did the join itself before writing, which greatly
improved the speed (among other things, Postgres only had to send me 20.5Gb
over the net). Tricks like this can sometimes help because workstation RAM
is plentiful and cheap and you have all of it for a single task, while db
server RAM is scarcer.

Francisco Olarte.
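
PS: In case it helps, here's a rough, untested sketch in perl of the
"split with an LRU cache of filehandles" idea. Everything in it is made up
for the example (my_view, dump.csv.gz, the out/ directory, first column as
the chunk key), and the naive split() only copes with simple CSV that has
no quoted commas, so adapt it to your data:

#!/usr/bin/perl
# Split a COPY dump into one file per key, keeping a small LRU cache of
# open filehandles so it can be done in a single pass even when the key
# is only "mostly" grouped.  Assumes a dump produced with something like:
#   psql -c "\copy (SELECT * FROM my_view) TO STDOUT WITH (FORMAT csv)" mydb \
#     | gzip > dump.csv.gz
# and that the first column is the chunk key (and is filename-safe).
use strict;
use warnings;

my $max_open = 64;   # output filehandles to keep open at once
my %fh;              # key => open filehandle
my @lru;             # keys ordered from least to most recently used

mkdir 'out' unless -d 'out';

open my $in, '-|', 'gzip -dc dump.csv.gz' or die "gzip: $!";
while (my $line = <$in>) {
    # naive CSV split; use Text::CSV if fields can contain quoted commas
    my ($key) = split /,/, $line, 2;

    unless ($fh{$key}) {
        if (keys %fh >= $max_open) {
            my $victim = shift @lru;            # least recently used key
            close(delete $fh{$victim}) or warn "close $victim: $!";
        }
        # '>>' so a key that was evicted and re-opened keeps appending
        open my $out, '>>', "out/$key.csv" or die "out/$key.csv: $!";
        $fh{$key} = $out;
    }
    @lru = grep { $_ ne $key } @lru;            # move $key to the MRU end
    push @lru, $key;

    print { $fh{$key} } $line;
}
close $in;
close $_ for values %fh;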
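
PPS: And a similarly rough sketch of the "slurp the small tables into RAM
and join on the client" trick, with DBI/DBD::Pg. Again, all table and
column names (small_t, big_t, id, label, payload) are invented; for a
really big table you'd want a cursor or a COPY instead of the plain SELECT
so the driver doesn't buffer the whole result set on the client:

#!/usr/bin/perl
# Load the small lookup table into memory once, then stream the big table
# and emit the denormalized rows ourselves, so the server only has to
# ship the big table over the net.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=mydb', '', '', { RaiseError => 1 });

# 1. slurp the small table into a hash: id => label
my %label = map { $_->[0] => $_->[1] }
            @{ $dbh->selectall_arrayref('SELECT id, label FROM small_t') };

# 2. sweep the big table and do the "join" here while writing it out
my $sth = $dbh->prepare('SELECT id, payload FROM big_t');
$sth->execute;
while (my ($id, $payload) = $sth->fetchrow_array) {
    print join(',', $id, $label{$id} // '', $payload), "\n";
}

$dbh->disconnect;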