On 01/06/2012 03:55 PM, Phoenix Kiula wrote:
...
In general, when you have data scrubbing issues like this, grep/sed/awk/...
are your friends. Clean it up then import it.
Thanks Steve.
The file has 350 million lines. Sed, awk etc. are a little painful when
the file is 18GB with that many lines.
I'd want Postgresql to ignore the line altogether when something is
missing. Is this an option we can use, or are rules hoisted on us?
I've found grep, sed and friends to be quite effective, and proper
pre-cleaning to have a relatively minor impact on performance. Done
properly, you will just be piping the data through a very simple
grep/sed/awk/... into psql. No extra disk-reads, minimal memory use and
a bit of CPU. And you will be in charge of deciding how suspect data is
handled.
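To illustrate the kind of filter I mean (a minimal sketch; the file name,
delimiter and field count are hypothetical, not from your data), awk can
drop any line that doesn't have the expected number of fields before the
data ever reaches psql's \copy:

```shell
# Build a tiny sample file with one malformed line (hypothetical data):
# line "b" is missing its third tab-separated field.
printf 'a\t1\tx\nb\t2\nc\t3\ty\n' > /tmp/sample.tsv

# Keep only lines with exactly three tab-separated fields.
# On the real 18GB file you would pipe this straight into
#   psql -c "\copy mytable FROM STDIN"
# so nothing is written to disk twice.
awk -F'\t' 'NF == 3' /tmp/sample.tsv
```

This is a single streaming pass: awk reads each line once, so memory use
stays constant no matter how many of the 350 million lines there are.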
If by "rules hoisted" you mean "will PostgreSQL make arbitrary and
possibly incorrect assumptions to attempt to force bad data into a
table" then the answer is "no". In fact, it has become more and more
picky over time. Trust me, at some point you will thank it for doing so.
For example the following used to "work":
select current_date < 2020-01-01;
But it returned "false" which was probably not what the user wanted.
(2020-01-01 evaluates to the integer 2018, which PostgreSQL interpreted
as being less than current_date.) But it sure looks OK at first glance
when you really meant '2020-01-01'::date.
In current versions, that statement will throw an error, just as
PostgreSQL does with dates like February 31 and a variety of other
things that certain other DBMSs deem good.
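The arithmetic trap above isn't specific to SQL; the same sketch in
shell arithmetic makes it obvious why the unquoted literal silently
became an integer:

```shell
# Without quotes, 2020-01-01 is just subtraction, not a date literal:
# 2020 minus 1 minus 1 is 2018.
echo $(( 2020 - 01 - 01 ))
```

Which is exactly why quoting the literal ('2020-01-01'::date) matters:
the quotes are what tell the parser it is a date, not an expression.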
Cheers,
Steve
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general