Re: Transaction-Overflow

pingu.freak@xxxxxx · Wed, 08 Aug 2007 11:38:11 +0200

Hi,

First, thanks for your answers.

I have now found some ECC errors in the kernel log:

EDAC MC0: CE page 0xfeb8e, offset 0x0, grain 4096, syndrome 0xb32, row 4, channel 0, label "": e7xxx CE
EDAC MC0: CE - no information available: e7xxx CE log register overflow
EDAC MC0: CE page 0xfeb8e, offset 0x0, grain 4096, syndrome 0xb32, row 4, channel 0, label "": e7xxx CE
EDAC MC0: CE - no information available: e7xxx CE log register overflow
EDAC MC0: CE page 0xfeb8e, offset 0x0, grain 4096, syndrome 0xb32, row 4, channel 0, label "": e7xxx CE
EDAC MC0: CE - no information available: e7xxx CE log register overflow
EDAC MC0: CE page 0xfeb8e, offset 0x0, grain 4096, syndrome 0xb32, row 4, channel 0, label "": e7xxx CE
EDAC MC0: CE - no information available: e7xxx CE log register overflow
EDAC MC0: CE page 0xfeb8e, offset 0x0, grain 4096, syndrome 0xb32, row 4, channel 0, label "": e7xxx CE
EDAC MC0: CE - no information available: e7xxx CE log register overflow
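
To see whether these all point at the same place, a quick tally along the following lines might help. This is only a sketch: the regex is fitted to the e7xxx lines quoted above and may not match other EDAC drivers, and `tally_ce` and `sample` are names made up for illustration.

```python
import re
from collections import Counter

# Pattern fitted to the e7xxx CE lines quoted above (an assumption:
# other EDAC drivers or kernel versions may format the message differently).
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page 0x(?P<page>[0-9a-f]+), "
    r"offset 0x(?P<offset>[0-9a-f]+), grain (?P<grain>\d+), "
    r"syndrome 0x(?P<syndrome>[0-9a-f]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def tally_ce(lines):
    """Count corrected-error lines per (controller, row, channel, page)."""
    counts = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match is None:
            # The "log register overflow" lines carry no location, skip them.
            continue
        key = (
            int(match["mc"]),
            int(match["row"]),
            int(match["channel"]),
            int(match["page"], 16),
        )
        counts[key] += 1
    return counts


# Two kinds of lines copied from the log above.
sample = [
    'EDAC MC0: CE page 0xfeb8e, offset 0x0, grain 4096, syndrome 0xb32, '
    'row 4, channel 0, label "": e7xxx CE',
    "EDAC MC0: CE - no information available: e7xxx CE log register overflow",
    'EDAC MC0: CE page 0xfeb8e, offset 0x0, grain 4096, syndrome 0xb32, '
    'row 4, channel 0, label "": e7xxx CE',
]

for (mc, row, channel, page), n in sorted(tally_ce(sample).items()):
    print(f"MC{mc} row {row} channel {channel} page 0x{page:x}: {n} CE")
```

Fed the complete dmesg output instead of `sample` (for example the lines of a saved text file), this would show whether all corrected errors fall on one row and channel.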

This happens on both servers, production and backup. Right now, I'm updating the kernel
to 2.6.22.1. Hopefully this helps :/, though I doubt it will.

There are also traces in dmesg:

Code: f3 a5 89 c1 f3 a4 eb 21 89 c8 83 f9 07 76 18 89 f9 f7 d9 83 e1 07 29 c8 f3 a4 89 c1 c1 e9 02 83 e0 03 90 f3 a5 89 c1 f3 a4 5e 89 <c8> 5f c3 57 85 c9 56 89 c7 89 d6 79 08 0f 0b 0a 03 71 ce 2c c0
EIP: [<c01c3a2c>] __copy_from_user_ll_nozero+0xd7/0xda SS:ESP 0068:dca2fd94
 <4>EDAC MC0: CE page 0xb4124, offset 0x0, grain 4096, syndrome 0xfc1, row 4, channel 0, label "": e7xxx CE
EDAC MC0: CE page 0xb4124, offset 0x0, grain 4096, syndrome 0xfc1, row 4, channel 0, label "": e7xxx CE
EDAC MC0: CE page 0xb4124, offset 0x0, grain 4096, syndrome 0xfc1, row 4, channel 0, label "": e7xxx CE
EDAC MC0: CE page 0xb4124, offset 0x0, grain 4096, syndrome 0xfc1, row 4, channel 0, label "": e7xxx CE
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000002
 printing eip:
c01c3a2c
*pde = 2f39e001
Oops: 0000 [#3]
SMP 
last sysfs file: /devices/pci0000:00/0000:00:00.0/class
Modules linked in: nfs lockd nfs_acl sunrpc iptable_filter ip_tables x_tables lp parport_pc parport af_packet joydev st sr_mod ipv6 button battery ac apparmor aamatch_pcre loop dm_mod e1000 ide_cd cdrom i2c_i801 e7xxx_edac edac_mc i2c_core ext3 mbcache jbd edd fan sg gdth aic79xx scsi_transport_spi piix thermal processor sd_mod scsi_mod ide_disk ide_core
CPU:    0
EIP:    0060:[<c01c3a2c>]    Tainted: G     U VLI
EFLAGS: 00010206   (2.6.18.2-34-bigsmp #1) 
EIP is at __copy_from_user_ll_nozero+0xd7/0xda
eax: e5f17dbc   ebx: 00000001   ecx: 00000006   edx: bff0ef9a
esi: 00000000   edi: 01de802f   ebp: 00000006   esp: e5f17d94
ds: 007b   es: 007b   ss: 0068
Process postmaster (pid: 13414, ti=e5f16000 task=e93710b0 task.ti=e5f16000)
Stack: c01a81f2 00000000 00003466 0000000e e5f17dbc 00000000 00000000 00000001 
       00000002 00000000 ffff0002 c0100000 00000000 b6b86840 b6b85000 c1d9df20 
       00001000 d4f6ae9c 21741707 46b611a0 c0125770 3b9aca00 00000163 80000000 
Call Trace:
 [<c01a81f2>] exit_sem+0x58/0x14c
 [<c0125770>] current_fs_time+0x4f/0x5b
 [<c014ca56>] get_page_from_freelist+0x2f1/0x371
 [<c01487f7>] find_lock_page+0x1a/0x77
 [<c015f3b5>] shmem_getpage+0x4f2/0x552
 [<c0160375>] shmem_nopage+0xa4/0xb6
 [<c0154076>] __handle_mm_fault+0x63e/0xb9c
 [<c01325aa>] autoremove_wake_function+0x0/0x35
 [<c0108567>] sys_ipc+0x5e/0x1bb
 [<c0103ddd>] sysenter_past_esp+0x56/0x79
Code: f3 a5 89 c1 f3 a4 eb 21 89 c8 83 f9 07 76 18 89 f9 f7 d9 83 e1 07 29 c8 f3 a4 89 c1 c1 e9 02 83 e0 03 90 f3 a5 89 c1 f3 a4 5e 89 <c8> 5f c3 57 85 c9 56 89 c7 89 d6 79 08 0f 0b 0a 03 71 ce 2c c0
EIP: [<c01c3a2c>] __copy_from_user_ll_nozero+0xd7/0xda SS:ESP 0068:e5f17d94
 <6>device eth0 left promiscuous mode

The hardware is 5 years old... It was not possible to get new hardware
for this project. :/

Regards,

Martin

-----Original Message-----
From: Tom Lane <tgl@xxxxxxxxxxxxx>
Sent: 07.08.07 20:41:08
To: pingu.freak@xxxxxx
CC: pgsql-admin@xxxxxxxxxxxxxx
Subject: Re: Transaction-Overflow

pingu.freak@xxxxxx writes:
> On the top in the log file is this, do you know why the pid is killed with
> 11? I'm a little bit confused :(.

>  LOG:  server process (PID 30399) was terminated by signal 11

SIG 11 (ie SIGSEGV) is pretty much the typical "generic crash"
indication.  It most likely means you ran into a software bug or
corrupted data.  There is no reason at all to think that it's got
anything to do with transaction ID wraparound --- that message is
only coming out because it always comes out at a database restart.

What you ought to look into is what *did* cause the crash.  Did it
produce a core file, and if so can you get a gdb stack trace from
the core?

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

                http://www.postgresql.org/about/donate

