Thanks for your quick response.
We do have backups and a couple of spare servers running in parallel so we're safe in that sense. Thanks for your advice anyway.
To run memtest86 we have to go to the datacenter and that will take us a few days. Is there anything we can do remotely?
Could it be anything different than memory HW problem?
Btw, we are not using that server at the moment :)
Thanks
On Mon, Nov 15, 2010 at 12:03 PM, Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> wrote:
Stop using the system immediately, since many things inserted to the DB might simply be garbage.
Inspect your memory with memtest86.
I would even suggest moving to a new HW if available, and start working into two parallel directions:
a) try to bring your DB into a sane state
b) try to fix your HW server, which apparently has problems.
Files are not "disappearing" like that, in normal situations.
You cannot trust your current HW to run postgresql. Its a guarantee method to lose your data, if you
go on by this start/stop/restore/pray circle.
IMPORTANT: Handle your old (pre-memory upgrade) backups like expensive chinese vases.
Start from those good and trustworthy backups to do any database "replay" (or similar) work which
would lead to a viable situation.
ÎÏÎÏ Monday 15 November 2010 11:55:38 Î/Î Pablo Delgado DÃaz-Pache ÎÎÏÎÏÎ:
--> Hi all,
>
> We've been using postgres for 9 years without a problem until now! Two
> problems in a very short time!
> The first one is described in
> http://postgresql.1045698.n5.nabble.com/Autovacuum-seems-to-block-database-WARNING-worker-took-too-long-to-start-td3264261.html
> This is another one (not related I think) ...
>
> Postgres server is usually working fine. All of a sudden we start getting
> these errors ...
>
> * **2010-11-09 11:49:15.320
> CET|2|database1|10.19.0.51(18895)|20929|SELECT|4cd926fd.51c1|2010-11-09
> 11:48:29 CET|10/417796|1390150|postgres| LOG: Âduration: 1518.422 ms
> Âexecute <unnamed>: SELECT id_token_fk,xxxxxxxxx ORDER BY avadate*
> * **2010-11-09 11:52:25.364
> CET|1|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> 11:51:59 CET|7/430041|0|postgres| ERROR: Âcould not open relation
> base/273198960/273198979: No such file or directory*
> * **2010-11-09 11:52:25.364
> CET|2|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> 11:51:59 CET|7/430041|0|postgres| STATEMENT: ÂSELECT id_token_fkxxxxxxxxxxx
> ÂORDER BY avadate*
> * **2010-11-09 11:52:29.981
> CET|3|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> 11:51:59 CET|7/430049|0|postgres| ERROR: Âcould not open relation
> base/273198960/273199235: No such file or directory*
> * **2010-11-09 11:52:30.988
> CET|6|database1|10.19.0.51(23286)|21566|PARSE|4cd927cf.543e|2010-11-09
> 11:51:59 CET|7/430050|0|postgres| STATEMENT: ÂSELECT max(avadate) xxxxxxxx
> 32036)*
> * **2010-11-09 11:53:36.346
> CET|16|database2|10.19.0.42(44916)|22107|SELECT|4cd9280e.565b|2010-11-09
> 11:53:02 CET|94/516004|0|postgres| STATEMENT: ÂSELECT * FROM "photos"
> xxxxxxxxxxxxxx LIMIT 1*
> * **2010-11-09 11:53:37.956
> CET|17|database2|10.19.0.42(44916)|22107|SELECT|4cd9280e.565b|2010-11-09
> 11:53:02 CET|94/516025|0|postgres| ERROR: Âcould not open relation
> base/271253899/271254075: No such file or directory*
> * **................*
> * **................*
> * **2010-11-09 11:53:55.560 CET|111|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| ERROR: Âcould not open relation base/273198960/273199235:
> No such file or directory*
> * **2010-11-09 11:53:55.560 CET|112|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| CONTEXT: Âwriting block 8866 of relation
> base/273198960/273199235*
> * **2010-11-09 11:53:55.560 CET|113|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| WARNING: Âcould not write block 8866 of
> base/273198960/273199235*
> * **2010-11-09 11:53:55.560 CET|114|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| DETAIL: ÂMultiple failures --- write error might be
> permanent.*
> * **2010-11-09 11:53:56.590 CET|115|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| ERROR: Âcould not open relation base/273198960/273199235:
> No such file or directory*
> * **2010-11-09 11:53:56.590 CET|116|||26090||4cc6e970.65ea|2010-10-26
> 16:45:04 CEST||0|| CONTEXT: Âwriting block 8866 of relation
> base/273198960/273199235*
>
> See that there are 2 different databases involved. (database1 and
> database2).
>
> Looking for distinct errors (among the many we have in the log) I find there
> are only 4 involved ...
>
> * **base/271253899/271254075*
> * **base/273198960/273198979*
> * **base/273198960/273199235*
> * **base/273198960/273199253*
>
> and those files are not in the postgres base directory.
>
> To fix it we have no option but to restart postgres (which restarts fine
> with a /etc/init.d/postgresql stop & start)
> However, once we restarted postgres some data was corrupted. Tables that
> used to have 4,5 million rows had only 60 rows. As a consequence we had to
> restore from file system backup.
> Once we did that, it worked fine for a few days until it happened again.
> We're worried it can happen again!
>
> Could this error be a hardware problem?
> We recently increased the memory from 8GB to 28GB, although it was working
> fine for more than 3 weeks.
> We also recently upgraded from postgres 8.3.6 to 8.4.5, althought it also
> worked fine for a few months.
> Upgrading to postgres 9 is easy for us. however, not sure that would help.
>
> Some info of our server:
>
> OS: Centos 5.5
> Kernel: 2.6.18-194.1.el5
> Postgres version: 8.4.5 (installation out-of-the-box using yum)
> Server memory: 28GB
>
>
> Any help would be appreciated
>
> Pablo
>
Achilleas Mantzios
--
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin