"Lane, Tom" <tgl@xxxxxxxxxxxxx> writes:> > "O'Shea, Brendan" <boshea@xxxxxxxxxx> writes: > >> "Lane, Tom" <tgl@xxxxxxxxxxxxx> writes: > >> Ugh. Does it work if you do > >> export PGOPTIONS="--ignore_system_indexes=1" > > > I tried that, but unfortunately pg_dump still fails to run > and the error > > message is identical to previous attempts. > > Huh. So it's not index corruption then. Table corruption is still a > possibility but it seems likely that you'd be getting other errors > during the seqscan that looks for the rows. This suggests > that the rows > actually disappeared from pg_attribute, which is a bit hard to credit. > The only mechanisms I can think of are that VACUUM decided they were > dead or something physically truncated the table. The latter would > probably have zapped a lot of other rows though. > The auto vacuum daemon is running pretty frequently on the database since one of our tables is very heavily updated, about 50-200 per second, and the rows themselves are large, about 1k in size on average and about 7000 rows in the table. We also run a nightly 'vacuum full' operation on this table since it can sometimes grow fairly large (around 50-100MB or more) if there are any long running transactions that last 20-30 minutes or more. Not sure if this is related but thought it was worth a mention. > Have you checked to see if pg_roles is the only relation with this > problem? Try > select c.relname from pg_class c left join pg_attribute a > on a.attrelid = c.oid and a.attnum > 0 > group by c.oid,c.relname,c.relnatts having count(*) != c.relnatts; > This SQL produces the output: "pg_roles" "pg_group" "pg_rules" "pg_user" "pg_shadow" Querying each table produces the error: "pg_roles" => ERROR: catalog is missing 9 attribute(s) for relid 10297 "pg_group" => ERROR: catalog is missing 3 attribute(s) for relid 10303 "pg_rules" => ERROR: invalid attribute number 0 for pg_rules "pg_user" => ERROR: catalog is missing 8 attribute(s) for relid 10306 "pg_shadow" => ERROR: catalog is missing 8 attribute(s) for relid 10300 > > Any suggestions for additional logging we might turn on to help > > determine the cause of this issue? > > Maybe VACUUM VERBOSE on pg_attribute? Although you'd have to > get lucky > enough to catch the time that it zapped the rows, if that's what the > problem is. > So are you saying that on our newly installed system we could periodically run "VACUUM VERBOSE pg_attribute", append the output to a log file, and then after the catalog error starts happening again we could go to this log file to look for potential clues? Running "VACUUM VERBOSE pg_attribute" now produces: INFO: vacuuming "pg_catalog.pg_attribute" INFO: scanned index "pg_attribute_relid_attnam_index" to remove 30 row versions DETAIL: CPU 0.00s/0.00u sec elapsed 0.03 sec. INFO: scanned index "pg_attribute_relid_attnum_index" to remove 30 row versions DETAIL: CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: "pg_attribute": removed 30 row versions in 1 pages DETAIL: CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: index "pg_attribute_relid_attnam_index" now contains 2670 row versions in 48 pages DETAIL: 30 index row versions were removed. 11 index pages have been deleted, 11 are currently reusable. CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: index "pg_attribute_relid_attnum_index" now contains 2670 row versions in 13 pages DETAIL: 30 index row versions were removed. 2 index pages have been deleted, 2 are currently reusable. CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: "pg_attribute": found 30 removable, 2670 nonremovable row versions in 62 pages DETAIL: 0 dead row versions cannot be removed yet. There were 965 unused item pointers. 20 pages contain useful free space. 0 pages are entirely empty. CPU 0.00s/0.00u sec elapsed 0.04 sec. Query returned successfully with no result in 500 ms. Thanks for your help, Brendan ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster