Re: constant crashing

Jack:

On Sun, 14 Apr 2024 at 18:20, jack <jack4pg@xxxxxxx> wrote:
> The full error reads:
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> error: connection to server was lost

This is a CLIENT-side error. What people are asking you for is the
corresponding error in the SERVER log. When the server has a severe
error ( probably a resource exhaustion, which normally leads to
abnormal termination, as has been pointed out previously ) the
connection gets closed, and you should go to the server to look for
the root cause.



> PostgreSQL 16.2
> I also believe it is a resource issue which can be rectified with a setting, but which setting?

Not all resource issues can be solved; sometimes one asks for
something which is too much for the available hardware. Anyway, as
pointed out, posting the configuration, the server error, and maybe
monitoring memory ( the usual culprit ) with top may help someone.
Look at memory because a common cause of this kind of thing on Linux
is having overcommit on ( overcommit is nice, but it basically lies to
postgres: it tells the server it can have X GB and then, when the
server tries to use them, kills it with -9 ).
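
If you want to check that on the box, the relevant knobs are the
vm.overcommit_* sysctls ( the postgres docs discuss setting
vm.overcommit_memory=2 on dedicated servers ). A tiny sketch, in
python only because my other examples below are, reading the standard
Linux procfs paths:

    # Peek at the current Linux overcommit settings. Standard procfs
    # paths; what to do with the values is a separate discussion.
    from pathlib import Path

    for knob in ("overcommit_memory", "overcommit_ratio"):
        path = Path("/proc/sys/vm") / knob
        print(knob, "=", path.read_text().strip())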

> If you were updating 100 million records what settings would you adjust?

From what you told earlier ( loading from CSV ) and with ....

> Here are the updates I am performing on the 100 million records:
> UPDATE table SET category_modified = UPPER(category);
> UPDATE table SET category_modified = REGEXP_REPLACE(REPLACE(REPLACE(category_modified, '''','-'), '`', '-'), '\s{2,}', ' ', 'g') WHERE LENGTH(category_modified)>1 AND POSITION('--' IN category_modified)>0;
> UPDATE table SET category_modified = REPLACE(category_modified,' ','-');
> UPDATE table SET category_modified = CASE WHEN category_modified IS NOT NULL THEN regexp_replace(category_modified, '[^a-zA-Z]$', '') ELSE NULL END;
> UPDATE table SET category_modified = regexp_replace(category_modified, '-{2,}', '-', 'g');
> UPDATE table SET category_modified = SUBSTRING(category_modified FROM 1 FOR LENGTH(category_modified) - 1) WHERE LENGTH(category_modified)>1 AND category_modified LIKE '%-';

this kind of update, I would recommend stream-filtering the data on
load. I do not know how it is done presently, but those are six
hundred million updates ( unless some are suppressed, which is what I
do not remember now ), which makes for huge transactions and is asking
for problems ( or maybe only 100_000_000 per transaction if you are
autocommitting ).

If postgres does not suppress redundant updates ( lots of people can
answer that ) I would start by conditioning the updates ( UPDATE table
SET category_modified = UPPER(category) WHERE category_modified IS NOT
NULL AND category_modified <> UPPER(category); ); this can shave time
and resource usage.
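
A minimal sketch of that idea in python ( psycopg2, a table called
my_table and an integer key column id are all assumptions here, not
something you told us ), conditioning the UPDATE and committing per
key-range batch so no single transaction has to touch all 100 million
rows:

    # Conditioned, batched UPDATEs: one modest transaction per key range
    # instead of one huge transaction per statement.
    import psycopg2

    BATCH = 500_000                         # rows per transaction, tune it

    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    cur = conn.cursor()
    cur.execute("SELECT min(id), max(id) FROM my_table")
    lo, hi = cur.fetchone()

    start = lo
    while start is not None and start <= hi:
        cur.execute(
            """
            UPDATE my_table
               SET category_modified = UPPER(category)
             WHERE id >= %s AND id < %s
               AND category IS NOT NULL
               AND category_modified IS DISTINCT FROM UPPER(category)
            """,
            (start, start + BATCH),
        )
        conn.commit()                       # commit each batch
        start += BATCH

    conn.close()

The same loop works for the other five statements, and the extra
conditions mean rows which are already clean are not rewritten at all.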

But if your updates are of this kind ( transformation of data on the
current row ( your sample is on a single value of a row, even easier )
without touching other things in the database ), I would encourage you
to use a small perl/awk/python/whatever program to filter and
transform the data before loading it. It has several advantages: one
is speed, another is minimizing server load, another is testability
( pipe sample lines to the filter, check the result, and when
satisfied pipe the full file and pipe the result to psql ), and
choosing the right language ( I would say perl, it was designed
practically for this sort of thing ) makes the filtering really easy.
And it seems you are forcing it: I see null handling in some updates
but not in others, which hints at relying on strict behaviour
( functions returning null on null input ). The only apparent
difficulty is parsing CSV, but there are millions of modules which
parse the format understood by postgres ( if you generate your data,
the default text format for postgres is much easier to parse, i.e., in
perl normally a simple split// call does the trick ).
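
To make that concrete, a minimal python sketch of such a filter
( python only because I already used it above, perl would be just as
good, and the column position of category is a pure assumption ): it
reads CSV on stdin and writes the same rows on stdout with a cleaned
category appended, roughly mirroring your six UPDATEs.

    #!/usr/bin/env python3
    # Streaming filter: CSV in on stdin, CSV out on stdout with a cleaned
    # category column appended. Column 0 as the category is an assumption.
    import csv
    import re
    import sys

    def clean_category(cat):
        if not cat:
            return ""                                 # mirror NULL handling
        c = cat.upper()
        c = c.replace("'", "-").replace("`", "-")     # quotes/backticks -> dash
        c = re.sub(r"\s{2,}", " ", c)                 # collapse whitespace runs
        c = c.replace(" ", "-")                       # spaces -> dashes
        c = re.sub(r"[^A-Za-z]$", "", c)              # drop a trailing non-letter
        c = re.sub(r"-{2,}", "-", c)                  # collapse dash runs
        if len(c) > 1 and c.endswith("-"):
            c = c[:-1]                                # drop a trailing dash
        return c

    reader = csv.reader(sys.stdin)
    writer = csv.writer(sys.stdout)
    for row in reader:
        if not row:
            continue
        row.append(clean_category(row[0]))
        writer.writerow(row)

Pipe a handful of sample lines through it, eyeball the output, and
when you are happy feed the full filtered file to COPY; the server
then only sees clean data and a single load, no mass updates at all.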

Doing it with a filtering stream would allow you to easily process
gigabytes of data using a few megabytes of RAM in the client. I'm not
sure about the server side, but stream filtering lends itself to very
easy batching of the COPYs, and from what I read your server is beefy.
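
For the batching, the same kind of sketch applies ( again psycopg2 and
the table/column names are assumptions ): read the already filtered
lines and push them to the server in COPY chunks, one transaction
each.

    # Batched COPY from a stream: send the filtered CSV to the server in
    # chunks so each COPY is its own reasonably sized transaction.
    import io
    import sys
    import psycopg2

    BATCH = 100_000                         # rows per COPY, tune it

    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    cur = conn.cursor()

    def flush(lines):
        if lines:
            cur.copy_expert(
                "COPY my_table (category, category_modified) "
                "FROM STDIN WITH (FORMAT csv)",
                io.StringIO("".join(lines)),
            )
            conn.commit()

    buf = []
    for n, line in enumerate(sys.stdin, start=1):
        buf.append(line)
        if n % BATCH == 0:
            flush(buf)
            buf = []
    flush(buf)                              # last partial batch
    conn.close()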

Francisco Olarte.




