I am trying to debug a postgresql server crash and am out of my depth. The server is sent a segmentation fault when accessing certain data. I've tracked down one such place (though there may be others). I'm wondering if its possible for a more experienced person to see from the below if I'm looking at table corruption. pg_dump also crashes the server. The filesystem purports to be consistent and I haven't yet found bad blocks. Thanks much to anyone who reads on and the many whose questions and answers have been so helpful to this lurker in the past. Cheers, --clemens ====== Set up: ====== FreeBSD hog6 5.3-RELEASE dual amd64 amr(RAID10) MegaRAID postgresql 7.4.5 from ports ======= The table ======= catfang_de_broken=# \d imp_product Table "public.imp_product" Column | Type | Modifiers -------------+---------+------------------------------------------------------------- product_id | integer | not null default (nextval('product_id_seq'::text))::integer category_id | integer | not null name | text | not null created | date | not null current | boolean | active | boolean | not null default true action | integer | not null Indexes: "imp_product_pkey" primary key, btree (product_id) "imp_product_action_idx" btree ("action") "imp_product_category_idx" btree (category_id) Check constraints: "imp_product_category_id" CHECK (category_id <> 0) "imp_product_product_id" CHECK (product_id <> 0) Foreign-key constraints: "$1" FOREIGN KEY (category_id) REFERENCES imp_category(category_id) ===================================================== Selecting works for records where offset < 129833, but is broken for the next record. ===================================================== catfang_de_broken=# select * from imp_product offset 129832 limit 1; product_id | category_id | name | created | current | active | action ------------+-------------+----------+------------+---------+--------+-------- 761377 | 11996 | KGN 7060 | 2000-11-21 | | t | 0 (1 row) catfang_de_broken=# select * from imp_product offset 129831 limit 1; FATAL: terminating connection due to administrator command server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Succeeded. ====================================================== The same two queries from the log. No error messages prior to the crash. ====================================================== Feb 18 13:02:11 hog6 postgres[94548]: [130033-3] STATEMENT: select * from imp_product offset 129832 limit 1; Feb 18 13:02:11 hog6 postgres[94548]: [130034-1] LOG: 00000: statement: select * from imp_product offset 129833 limit 1; Feb 18 13:02:11 hog6 postgres[94548]: [130034-2] LOCATION: pg_parse_query, postgres.c:464 Feb 18 13:02:11 hog6 postgres[94548]: [130034-3] STATEMENT: select * from imp_product offset 129833 limit 1; Feb 18 13:02:34 hog6 postgres[94330]: [1-1] LOG: 00000: server process (PID 94548) was terminated by signal 11 Feb 18 13:02:34 hog6 postgres[94330]: [1-2] LOCATION: LogChildExit, postmaster.c:2087 Feb 18 13:02:34 hog6 postgres[94330]: [2-1] LOG: 00000: terminating any other active server processes Feb 18 13:02:34 hog6 postgres[94330]: [2-2] LOCATION: CleanupProc, postmaster.c:2008 Feb 18 13:02:34 hog6 postgres[94330]: [3-1] LOG: 00000: all server processes terminated; reinitializing ============================ Running truss shows the file it's reading when it crashes. ("/dbdata/pgsql/data/base/17142/17234") Is there anything to be infered from the mmap or munmap? ============================ 2431: open("/dbdata/pgsql/data/base/17142/17234",0x2,0600) = 35 (0x23) 2431: lseek(35,0x0,0x343332) = 144023552 (0x895a000) 2431: open("/dbdata/pgsql/data/base/17142/16652",0x2,0600) = 36 (0x24) 2431: read(0x24,0x801e83680,0x2000) = 8192 (0x2000) 2431: read(0x24,0x801e85680,0x2000) = 8192 (0x2000) 2431: break(0x8b0000) = 0 (0x0) 2431: open("/dbdata/pgsql/data/base/17142/1247",0x2,0600) = 37 (0x25) 2431: read(0x25,0x801e87680,0x2000) = 8192 (0x2000) 2431: read(0x25,0x801e89680,0x2000) = 8192 (0x2000) 2431: lseek(35,0x0,0x3734) = 0 (0x0) 2431: read(0x23,0x801e8b680,0x2000) = 8192 (0x2000) 2431: read(0x23,0x801e8d680,0x2000) = 8192 (0x2000) 2431: read(0x23,0x801e8f680,0x2000) = 8192 (0x2000) ... 2431: read(0x23,0x801c87680,0x2000) = 8192 (0x2000) 2431: read(0x23,0x801c89680,0x2000) = 8192 (0x2000) 2431: break(0x20cd5000) = 0 (0x0) 2431: mmap(0x0,1064960,(0x3)PROT_READ|PROT_WRITE,(0x1002)MAP_ANON| MAP_PRIVATE,4294967295,0x0) = 34394505216 (0x802128000) 2431: munmap(0x80075b000,0x1000) = 0 (0x0) SIGNAL 11 (SIGSEGV) SIGNAL 11 (SIGSEGV) Process stopped because of: 16 process exit, rval = 139 SIGNAL 20 (SIGCHLD) 2391: select(6,{3 4 5},0x0,0x0,{120 0}) ERR#4 'Interrupted system call' ======================= The system tables confirm the file above is related to the data we've isolated. (/dbdata/pgsql/data/base/17142/17234, imp_product, 129833) ======================= catfang_de_broken=# select relfilenode from pg_class where relname='imp_product'; relfilenode ------------- 17234 catfang_de_broken=# select ctid from imp_product offset 129832 limit 1; ctid ----------- (4743,60) (1 row) catfang_de_broken=# select ctid from imp_product offset 129833 limit 1; ctid ----------- (4743,61) (1 row) ================================= Copying the file to a new inode does not change anything. So am I right to conclude table corruption and start developing a deep mistrust of this relatively new server? Or do I need to look more widely? I tried to look at the block indicated above using pg_dump, but I don't know enough to see anything obviously wrong. I'm hoping one of you might. ================================= ./pg_filedump -R 4743 -i -f /dbdata/pgsql/data/base/17142/17234 Block 4743 ******************************************************** <Header> ----- Block Offset: 0x0250e000 Offsets: Lower 320 (0x0140) Block: Size 8192 Version 1 Upper 344 (0x0158) LSN: logid 19 recoff 0xca5ad618 Special 8192 (0x2000) Items: 75 Free Space: 24 Length (including item array): 324 0000: 13000000 18d65aca 0f000000 40015801 ......Z.....@.X. 0010: 00200120 b09f9800 d89ea801 70826001 . . ........p.`. 0020: 589e0001 089e9800 709d3001 289d8800 X.......p.0.(... 0030: b89cd800 689ca000 109ca800 d09b8000 ....h........... 0040: 689bd000 089bc000 989ad800 509a8800 h...........P... 0050: e899c800 9899a000 3899c000 c098f000 ........8....... 0060: 6898a800 8897b801 2097d000 b096d800 h....... ....... 0070: 4896c800 c0816001 e095c800 8895a800 H.....`......... 0080: 1095f000 b094b800 4894d000 d893e000 ........H....... 0090: 6893d800 0893b800 a092c800 3892c800 h...........8... 00a0: d891c000 8891a000 1091f000 b090c000 ................ 00b0: 4890d000 e08fc800 688fe800 008fd000 H.......h....... 00c0: a08ec000 388ec800 d08dc800 608dd800 ....8.......`... 00d0: 008dc000 a08cb800 388cd000 d08bc800 ........8....... 00e0: 708bc000 188bb000 d88a8000 908a8800 p............... 00f0: 508a8000 088a8800 c8898000 88898000 P............... 0100: 48898000 00898800 b8888800 78888000 H...........x... 0110: 0888d800 9887d800 00873001 50866001 ..........0.P.`. 0120: a8855001 3885e000 5881c800 b8840001 ..P.8...X....... 0130: 5084c800 d0830001 7883a800 2083b000 P.......x... ... 0140: 00000000 .... <Data> ------ Item 1 -- Length: 76 Offset: 8112 (0x1fb0) Flags: USED XID: min (733) CMIN|XMAX: 14342668 CMAX|XVAC: 1 Block Id: 14787 linp Index: 21 Attributes: 7 Size: 32 infomask: 0x0513 (HASNULL|HASVARWIDTH|HASOID|XMIN_COMMITTED|XMAX_COMMITTED) t_bits: [0]: 0x6f 1fb0: dd020000 0cdada00 01000000 0000c339 ...............9 1fc0: 15000700 1305206f 00000000 10193d06 ...... o......=. 1fd0: 06740600 8e2e0000 16000000 41736d61 .t..........Asma 1fe0: 72202d20 59616972 2044616c 616c0000 r - Yair Dalal.. 1ff0: 9b060000 01000000 00000000 ............ ... Item 60 -- Length: 64 Offset: 2376 (0x0948) Flags: USED XID: min (733) CMIN|XMAX: 0 CMAX|XVAC: 0 Block Id: 4743 linp Index: 60 Attributes: 7 Size: 32 infomask: 0x0913 (HASNULL|HASVARWIDTH|HASOID|XMIN_COMMITTED|XMAX_INVALID) t_bits: [0]: 0x6f 0948: dd020000 00000000 00000000 00008712 ................ 0958: 3c000700 1309206f 00000000 4b193d06 <..... o....K.=. 0968: 219e0b00 dc2e0000 0c000000 4b474e20 !...........KGN 0978: 37303630 45010000 01000000 00000000 7060E........... Item 61 -- Length: 68 Offset: 2304 (0x0900) Flags: USED XID: min (733) CMIN|XMAX: 0 CMAX|XVAC: 0 Block Id: 4743 linp Index: 61 Attributes: 7 Size: 32 infomask: 0x0913 (HASNULL|HASVARWIDTH|HASOID|XMIN_COMMITTED|XMAX_INVALID) t_bits: [0]: 0x6f 0900: dd020000 00000000 00000000 00008712 ................ 0910: 3d000700 1309206f 00000000 4c193d06 =..... o....L.=. 0920: f29c0b00 dc2e0000 0d000041 46414220 ...........AFAB 0930: 33325853 33000000 52050000 01000000 32XS3...R....... 0940: 00000000 .... -- Clemens Wehrmann ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq