Thanks! Will try.
Is there a way to speed up Spark ingestion (postgres -> spark -> snowflake) which does not have issues such as alignment or inefficient json file format? I favor spark since it simplifies the ingest path of postgres -> local json.gz -> s3 -> snowflake stage -> snowflake table. Hope this makes sense.On Tue, May 30, 2023 at 10:17 PM David Rowley <dgrowleyml@xxxxxxxxx> wrote:
On Wed, 31 May 2023 at 16:26, Lian Jiang <jiangok2006@xxxxxxxxx> wrote:
> I am using psql to periodically dump the postgres tables into json files which are imported into snowflake. For large tables (e.g. 70M rows), it takes hours for psql to complete. Using spark to read the postgres table seems not to work as the postgres read only replication is the bottleneck so spark cluster never uses >1 worker node and the working node timeout or out of memory.
>
> Will vertical scaling the postgres db speed up psql? Or any thread related parameter of psql can help? Thanks for any hints.
This is probably slow due to psql's alignment. It needs to read the
entire result to know how much to whitespace to pad columns with
before it can output anything. Naturally, that requires quite a bit of
memory when large tables are queried and also lots of length checking.
As mentioned by Rob, you'll probably want to use COPY, or you could
look at using pg_dump. Something like:
pg_dump --inserts --rows-per-insert=100 --table=<name of your table to
export> --data-only <database name>
should give you something that you might have some luck directly
importing to Snowflake without any transformation tool required.
However, if you do need to do some transformation before loading, then
you might want to do that in PostgreSQL. For that, something like:
psql -c "copy (select col1,col2,col3+col4 from your_table) to stdout"
<name of database>
would allow you to run a query, which you could maybe do your
transformations in before importing into Snowflake
David
--