Hi:

On Thu, May 1, 2014 at 7:50 PM, Seb <spluque@xxxxxxxxx> wrote:
> I've been looking for a way to write a table into multiple files, and am
> wondering if there are some clever suggestions. Say we have a table
> that is too large (several Gb) to write to a file that can be used for
> further analyses in other languages. The table consists of a timestamp
> field and several numeric fields, with records every 10th of a second.
> It could be meaningfully broken down into subsets of say 20 minutes
> worth of records. One option is to write a shell script that loops
> through the timestamp, selects the corresponding subset of the table,
> and writes it as a unique file. However, this would be extremely slow
> because each select takes several hours, and there can be hundreds of
> subsets. Is there a better way?

As you mention looping and a shell, I suppose you are on something unix-like, with pipes et al.

You can pipe COPY ( either with the pipe options for copy, or piping a psql command, or whichever thing you like ) through a script which spits each data chunk into its corresponding file. If your data is somehow clustered in the table ( by chance or by design ) you don't even need to sort the data, just use an open file pool. I did that once with call files, chunked them into day-sized files and it worked like a charm ( and if you need the files sorted, you can then use sort on each of them, which normally is quite fast ).

For your description of the data, with a pipe, you could read a line, extract a key for the record ( the timestamp rounded down to 20 minutes would be a good one ), get an open output file for append ( using a small caching layer ), and write the line to it; a rough sketch of such a splitter is included below.

Depending on how many files you expect, how much RAM you have and how many files your OS allows you to open, other solutions exist. And if you do not have enough ram / open files / clustering for any of them, there is a multitude of tricks ( if, say, you have 3 years' worth, no correlation, and can only open/buffer about 1000 files, you could split from the db into day-sized chunks and then split each of those into 20-minute ones ).

Regards.

Francisco Olarte.
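
A minimal sketch of such a splitter, assuming Python 3, a COPY ... TO STDOUT stream in CSV format piped on stdin with the timestamp as the first column in PostgreSQL's default text form ( "2014-05-01 19:50:00.1,..." ); the 256-handle pool size, the file naming and the script name split_by_20min.py are illustrative, not from the original message:

    #!/usr/bin/env python3
    # Usage (illustrative):
    #   psql -c "COPY big_table TO STDOUT (FORMAT csv)" | ./split_by_20min.py
    # Appends each input line to a per-20-minute file, keeping only a
    # limited pool of output files open at any time.
    import sys
    from collections import OrderedDict

    MAX_OPEN = 256                 # keep below your OS open-file limit
    pool = OrderedDict()           # bucket key -> open file, in LRU order

    def out_file(key):
        # Small caching layer: reuse an already-open file, evict the
        # least recently used one when the pool is full.
        fh = pool.pop(key, None)
        if fh is None:
            if len(pool) >= MAX_OPEN:
                pool.popitem(last=False)[1].close()
            fh = open('chunk-%s.csv' % key, 'a')
        pool[key] = fh             # mark as most recently used
        return fh

    for line in sys.stdin:
        # "2014-05-01 19:50:00.1,..." -> minutes rounded down to a
        # multiple of 20 -> key "2014-05-01_1940"
        minute = int(line[14:16])
        key = line[:13].replace(' ', '_') + '%02d' % (minute - minute % 20)
        out_file(key).write(line)

    for fh in pool.values():
        fh.close()

Because the files are opened in append mode, a bucket can be evicted from the pool and reopened later without losing rows, so the input does not need to be sorted; if you want each chunk file ordered, run sort on the individual files afterwards, as suggested above.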