Re: Creating table with data from a join

Julien Rouhaud <julien.rouhaud@xxxxxxxxxx> · Wed, 15 Jul 2015 00:39:52 +0200

On 14/07/2015 18:50, Igor Stassiy wrote:
> Julien, I would gladly provide more information, I am just not sure what
> to add.
> 

Well, was your concern about why option #2 is the quickest, or is this
runtime with option #2 still too slow for you ?

> I would be willing to leave the server compromised for things like
> corrupts or data losses during the time of this import, but the server
> has to be up and running before and after the import, if it is
> successful (so I can't take it down then change some parameters and
> start it up with again).

Check http://www.postgresql.org/docs/current/static/sql-createtable.html
and the "UNLOGGED" part to check if an unlogged table is suitable for you.

> 
> On Tue, Jul 14, 2015 at 6:37 PM Julien Rouhaud
> <julien.rouhaud@xxxxxxxxxx <mailto:julien.rouhaud@xxxxxxxxxx>> wrote:
> 
>     On 14/07/2015 18:21, Igor Stassiy wrote:
>     > Julien, I have the following setting for WAL level: #wal_level =
>     minimal
>     > (which defaults to minimal anyway)
>     >
> 
>     Sorry, I sent my mail too early :/
> 
>     So, option #2 is winner by design. You didn't say anything about your
>     needs, so it's hard to help you much more.
> 
>     If you don't care about losing data on this table if your server
>     crashes, you can try option #3 with an unlogged table.
> 
> 
>     > On Tue, Jul 14, 2015 at 6:19 PM Julien Rouhaud
>     > <julien.rouhaud@xxxxxxxxxx <mailto:julien.rouhaud@xxxxxxxxxx>
>     <mailto:julien.rouhaud@xxxxxxxxxx
>     <mailto:julien.rouhaud@xxxxxxxxxx>>> wrote:
>     >
>     >     On 14/07/2015 11:12, Igor Stassiy wrote:
>     >     > Hello,
>     >     >
>     >     > I am benchmarking different ways of putting data into table
>     on table
>     >     > creation:
>     >     >
>     >     > 1. INSERT INTO c SELECT * FROM a JOIN b on a.id
>     <http://a.id> <http://a.id>
>     >     <http://a.id> = b.id <http://b.id> <http://b.id>
>     >     > <http://b.id>;
>     >     > 2. CREATE TABLE c AS SELECT * FROM a JOIN b on a.id
>     <http://a.id> <http://a.id>
>     >     <http://a.id> = b.id <http://b.id> <http://b.id>
>     >     > <http://b.id>;
>     >     > 3. psql -c "COPY (SELECT * FROM a JOIN b on a.id
>     <http://a.id> <http://a.id>
>     >     <http://a.id> = b.id <http://b.id> <http://b.id>
>     >     > <http://b.id>) TO STDOUT" |
>     >     > parallel --block 128M --jobs 4 --pipe psql -c "COPY c FROM
>     STDIN";
>     >     >
>     >     > (the parallel command is available as part of parallel deb
>     package in
>     >     > Ubuntu for example, it splits the stdin by newline character
>     and feeds
>     >     > it to the corresponding command)
>     >     >
>     >     > Both tables a and b have ~16M records and one of the columns
>     in a is
>     >     > geometry (ranging from several KB in size to several MB).
>     Columns in b
>     >     > are mostly integers.
>     >     >
>     >     > The machine that I am running these commands on has the
>     following
>     >     > parameters:
>     >     >
>     >     > default_statistics_target = 50 # pgtune wizard 2012-06-06
>     >     > maintenance_work_mem = 1GB # pgtune wizard 2012-06-06
>     >     > constraint_exclusion = on # pgtune wizard 2012-06-06
>     >     > checkpoint_completion_target = 0.9 # pgtune wizard 2012-06-06
>     >     > effective_cache_size = 48GB # pgtune wizard 2012-06-06
>     >     > work_mem = 80MB # pgtune wizard 2012-06-06
>     >     > wal_buffers = 8MB # pgtune wizard 2012-06-06
>     >     > checkpoint_segments = 16 # pgtune wizard 2012-06-06
>     >     > shared_buffers = 16GB # pgtune wizard 2012-06-06
>     >     > max_connections = 400 # pgtune wizard 2012-06-06
>     >     >
>     >     > One would expect the 3rd option to be faster than 1 and 2,
>     however 2
>     >     > outperforms both by a large margin (sometimes x2). This is
>     especially
>     >     > surprising taking into account that COPY doesn't acquire a
>     global lock
>     >     > on the table, only a RowExclusiveLock
>     >     > (according
>     >     > to
>     >   
>      http://www.postgresql.org/message-id/10611.1014867684@xxxxxxxxxxxxx)
>     >     >
>     >
>     >     What is wal_level value? I think this is because of an
>     optimisation
>     >     happening with wal_level = minimal:
>     >
>     >     "In minimal level, WAL-logging of some bulk operations can be
>     safely
>     >     skipped, which can make those operations much faster"
>     >
>     >     see
>     >   
>      http://www.postgresql.org/docs/current/static/runtime-config-wal.html
>     >
>     >     > So is option 2 a winner by design? Could you please suggest
>     other
>     >     > alternatives to try (if there are any)? And what might be
>     the reason
>     >     > that 3 is not outperforming the other 2?
>     >     >
>     >     > Thank you,
>     >     > Igor
>     >     >
>     >     >
>     >
>     >
>     >     --
>     >     Julien Rouhaud
>     >     http://dalibo.com - http://dalibo.org
>     >
> 
> 
>     --
>     Julien Rouhaud
>     http://dalibo.com - http://dalibo.org
> 

-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general