This might also help:
Bulk loading tables from text files is "log free" in almost all RDBMSs (Postgres' COPY can be one of them).
The reason is that the database doesn't need to waste resources writing the log, because there is no risk of data loss: if the COPY operation fails, your data still lives in the text files you are bulk loading from.
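For Postgres specifically, that "log free" path only applies under certain conditions. A minimal sketch, assuming wal_level = minimal and made-up table and file names (this matches what Laurenz describes further down: WAL can be skipped only when the table is created or truncated in the same transaction):

    BEGIN;
    CREATE TABLE big_table_load (LIKE big_table);    -- table created in this transaction
    COPY big_table_load FROM '/data/big_table.txt';  -- server-side path, text format
    COMMIT;                                          -- on failure the table vanishes and
                                                     -- the source file is untouched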
2014-08-27 5:42 GMT-03:00 Albe Laurenz <laurenz.albe@xxxxxxxxxx>:
Alex Goncharov wrote:
> Thank you, Kevin -- this is helpful.
>
> But it still leaves questions for me.
>> Alex Goncharov <alex.goncharov.usa@xxxxxxxxx> wrote:
>
>>> The whole thing is aborted then, and the good 99 records are not
>>> making it into the target table.
>>
>> Right. This is one reason people often batch such copies or check
>> the data very closely before copying in.
>
> How do I decide, before starting a COPY data load, whether such a load
> protection ("complexity") makes sense ("is necessary")?
>
> Clearly not needed for 1 MB of data in a realistic environment.
>
> Clearly is needed for loading 1 TB in a realistic environment.
>
> To put it differently: If I COPY 1 TB of data, what criteria should I
> use for choosing the size of the chunks to split the data into?
>
> For INSERT-loading, for the database client interfaces offering the
> array mode, the performance difference between loading 100 or 1000
> rows at a time is usually negligible if any. Therefore 100- and
> 1000-row's array sizes are both reasonable choices.
>
> But what is a reasonable size for a COPY chunk? It can't even be
> measured in rows.
>
> Note that if you have a 1 TB record-formatted file to load, you can't
> just split it in 1 MB chunks and feed them to COPY -- the file has to
> be split on the record boundaries.
>
> So, splitting the data for COPY is not a trivial operation, and if
> such splitting can be avoided, a reasonable operator will avoid it.
>
> But then again: when can it be avoided?
You don't need to split the data at all if you make sure that they are correct.
If you cannot be certain, and you want to avoid having to restart a huge
load with corrected data, the batch size is pretty much a matter of taste:
How much overhead does it generate to split the data in N parts?
How much time are you ready to wait for (re)loading a single part?
You'll probably have to experiment to find a solution that fits you.
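A rough sketch of that batching approach (table and part-file names are made up): COPY's default text format keeps one record per line, so a 1 TB input can be split on line boundaries outside the database and each part loaded by its own \copy, which runs in its own transaction under psql's autocommit; a bad row then only forces a reload of the one part it lives in (CSV with embedded newlines would need a record-aware splitter):

    \copy big_table from 'part_0001'
    \copy big_table from 'part_0002'
    -- ... one \copy per pre-split part; if part 0042 fails, fix and
    -- re-run only that part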
>>> My question is: Where have these 99 records been living, on
>>> the database server, while the 100-th one hasn't come yet, and
>>> the need to throw the previous data accumulation away has not
>>> come yet?
>>
>> They will have been written into the table. They do not become
>> visible to any other transaction until and unless the inserting
>> transaction successfully commits. These slides may help:
>>
>> http://momjian.us/main/writings/pgsql/mvcc.pdf
>
> Yeah, I know about the MVCC model... The question is about the huge
> data storage to be reserved without a commitment while the load is not
> completed, about the size constraints in effect here.
I don't understand that question.
You need the space anyway to complete the load.
If the load fails, you simply reclaim the space (VACUUM) and reuse it.
There is no extra storage needed.
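A small sketch of that in practice (the table name is made up): the rows from the aborted COPY were never visible to anyone else, but they still occupy space on disk until the table is vacuumed:

    -- space currently used by the target table, dead rows included
    SELECT pg_size_pretty(pg_total_relation_size('big_table'));
    -- a plain VACUUM makes that space reusable for the next load attempt;
    -- it does not shrink the file (VACUUM FULL or TRUNCATE would)
    VACUUM big_table;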
>>> There have to be some limits to the space and/or counts taken by
>>> the new, uncommitted, data, while the COPY operation is still in
>>> progress. What are they?
>>
>> Primarily disk space for the table.
>
> How can that be found? Is "df /mount/point" the deciding factor? Or
> some 2^32 or 2^64 number?
Disk space can be measured with "df".
>> If you are not taking advantage of the "unlogged load" optimization,
>> you will have written Write Ahead Log (WAL) records, too -- which
>> (depending on your configuration) you may be archiving. In that
>> case, you may need to be concerned about the archive space required.
>
> "... may need to be concerned ..." if what? Loading 1 MB? 1 GB? 1 TB?
>
> If I am always concerned, and check something before a COPY, what
> should I be checking? What are the "OK-to-proceed" criteria?
That means "you should consider", not "you should be worried".
Unless you are loading into a table created in the same transaction,
"redo" information will be generated and stored in "WAL files", which
end up in your WAL archive.
This needs extra storage, proportional to the storage necessary
for the data itself.
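If you want to see whether that applies to your setup, the relevant settings and (on PostgreSQL 10 or later) the current WAL directory size can be checked from SQL; the queries below are just an illustration:

    SHOW wal_level;        -- 'minimal' permits the WAL-skip optimization for
                           -- tables created/truncated in the loading transaction
    SHOW archive_mode;     -- if 'on', archived WAL needs space of its own
    SHOW archive_command;  -- where the archived segments end up
    SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();  -- PostgreSQL 10+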
>> If you have foreign keys defined for the table, you may get into
>> trouble on the RAM used to track pending checks for those
>> constraints. I would recommend adding any FKs after you are done
>> with the big bulk load.
>
> I am curious about the simplest case where only the data storage is to
> be worried about. (As an aside: the CHECK and NOT NULL constraints are
> not a storage factor, right?)
Right.
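Following Kevin's suggestion above, the foreign key would then be added only after the bulk load has finished; a sketch with made-up table and column names:

    -- validate the FK once, over the fully loaded table, instead of
    -- tracking every pending check row-by-row during the COPY
    ALTER TABLE order_items
        ADD CONSTRAINT order_items_order_fk
        FOREIGN KEY (order_id) REFERENCES orders (id);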
>> PostgreSQL does *not* have a "rollback log" which will impose a
>> limit.
>
> Something will though, right? What would that be? The available disk
> space on a file system? (I would be surprised.)
You can find something on the limitations here:
http://wiki.postgresql.org/wiki/FAQ#What_is_the_maximum_size_for_a_row.2C_a_table.2C_and_a_database.3F
>>> Say, I am COPYing 100 TB of data and the bad records are close
>>> to the end of the feed -- how will this all error out?
>>
>> The rows will all be in the table, but not visible to any other
>> transaction.
>
> I see. How much data can I fit there while doing COPY? Not 1 TB?
Sure, why not?
Yours,
Laurenz Albe