Greetings! We've got a moderately busy DB host running PostgreSQL 8.3.7 from Debian Lenny with WAL archiving. It's been rsyncing its WAL files to a backup host for the last year or so, with the following archive_command: ssh pgbackup@backup test ! -f pgbackup-cirkus/%f && rsync -az %p pgbackup@backup:pgbackup-cirkus/%f Recently, WAL archiving begain failing on the test which checks wether the file exists. This first occured two hours after an incident where someone edited pg_hba.conf and left it with permissions denying Postgres read access to it. Upon SIGHUP the cluster naturally shut down. It was discovered promptly, and according to this person, there were "some processes named postgres still running". He ran "/etc/init.d/postgresql-8.3 start" anyway, which brought up the cluster: 2009-09-07 20:39:55 CEST 5782 LOG: received SIGHUP, reloading configuration files 2009-09-07 20:39:55 CEST 5782 FATAL: could not open configuration file "/etc/postgresql/8.3/main/pg_hba.conf": Permission denied 2009-09-07 20:40:07 CEST 14398 mdb2 mdb2web LOG: could not receive data from client: Connection reset by peer 2009-09-07 20:40:07 CEST 14398 mdb2 mdb2web LOG: unexpected EOF on client connection 2009-09-07 20:40:15 CEST 14485 mdb2 billig_web LOG: could not receive data from client: Connection reset by peer 2009-09-07 20:40:15 CEST 14485 mdb2 billig_web LOG: unexpected EOF on client connection 2009-09-07 20:41:29 CEST 16197 LOG: could not load root certificate file "root.crt": no SSL error reported 2009-09-07 20:41:29 CEST 16197 DETAIL: Will not verify client certificates. 2009-09-07 20:41:29 CEST 16197 FATAL: could not open configuration file "/etc/postgresql/8.3/main/pg_hba.conf": Permission denied 2009-09-07 20:42:04 CEST 16748 LOG: could not load root certificate file "root.crt": no SSL error reported 2009-09-07 20:42:04 CEST 16748 DETAIL: Will not verify client certificates. 2009-09-07 20:42:04 CEST 16749 LOG: database system was interrupted; last known up at 2009-09-07 20:37:38 CEST 2009-09-07 20:42:04 CEST 16749 LOG: database system was not properly shut down; automatic recovery in progress 2009-09-07 20:42:04 CEST 16749 LOG: redo starts at 65/F00718D0 2009-09-07 20:42:04 CEST 16749 LOG: record with zero length at 65/F1039488 2009-09-07 20:42:04 CEST 16749 LOG: redo done at 65/F1039458 2009-09-07 20:42:04 CEST 16749 LOG: last completed transaction was at log time 2009-09-07 20:39:52.010594+02 2009-09-07 20:42:04 CEST 16749 LOG: checkpoint starting: shutdown immediate 2009-09-07 20:42:04 CEST 16750 [unknown] [unknown] LOG: connection received: host=[local] 2009-09-07 20:42:04 CEST 16750 [unknown] [unknown] LOG: incomplete startup packet 2009-09-07 20:42:04 CEST 16749 LOG: checkpoint complete: wrote 43 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=0.001 s, sync=0.429 s, total=0.464 s 2009-09-07 20:42:04 CEST 16756 [unknown] [unknown] LOG: connection received: host=2001:700:300:1800::1917 port=50140 2009-09-07 20:42:04 CEST 16749 LOG: recovering prepared transaction 809084 2009-09-07 20:42:04 CEST 16756 mdb2 uka_web FATAL: the database system is starting up 2009-09-07 20:42:04 CEST 16757 [unknown] [unknown] LOG: connection received: host=2001:700:300:1800::1917 port=50141 2009-09-07 20:42:04 CEST 16757 mdb2 uka_web FATAL: the database system is starting up 2009-09-07 20:42:05 CEST 16762 [unknown] [unknown] LOG: connection received: host=2001:700:300:1800::1919 port=36523 2009-09-07 20:42:05 CEST 16762 greylisting exim FATAL: the database system is starting up 2009-09-07 20:42:05 CEST 16763 [unknown] [unknown] LOG: connection received: host=2001:700:300:1800::1919 port=36524 2009-09-07 20:42:05 CEST 16763 greylisting exim FATAL: the database system is starting up 2009-09-07 20:42:05 CEST 16764 [unknown] [unknown] LOG: connection received: host=2001:700:300:1800::1919 port=36525 2009-09-07 20:42:05 CEST 16764 greylisting exim FATAL: the database system is starting up 2009-09-07 20:42:05 CEST 16765 [unknown] [unknown] LOG: connection received: host=2001:700:300:1800::1919 port=36526 2009-09-07 20:42:05 CEST 16765 greylisting exim FATAL: the database system is starting up 2009-09-07 20:42:05 CEST 16768 [unknown] [unknown] LOG: connection received: host=[local] 2009-09-07 20:42:05 CEST 16768 postgres postgres FATAL: the database system is starting up 2009-09-07 20:42:05 CEST 16776 LOG: autovacuum launcher started 2009-09-07 20:42:05 CEST 16748 LOG: database system is ready to accept connections 2009-09-07 20:42:05 CEST 16786 [unknown] [unknown] LOG: connection received: host=[local] 2009-09-07 20:42:05 CEST 16786 postgres postgres LOG: connection authorized: user=postgres database=postgres The cluster recovered, which strikes me as odd - shouldn't Postgres at least rollback uncommited transactions and shut down cleanly upon an unreadable configuration file? Anyway, the cluster ran fine, handling several transactions per second, until one and a half hour later, where WAL archiving suddenly stopped. (The cluster runs on happily anyway, though.) 2009-09-07 22:12:53 CEST 2691 mdb2_ukadev uka_mdb LOG: statement: ABORT 2009-09-07 22:12:54 CEST 16774 LOG: checkpoint complete: wrote 279 buffers (1.4%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=48.665 s, sync=0.631 s, total=49.339 s 2009-09-07 22:12:54 CEST 16777 LOG: archive command failed with exit code 1 2009-09-07 22:12:54 CEST 16777 DETAIL: The failed archive command was: ssh pgbackup@backup test ! -f pgbackup-cirkus/0000000100000065000000F1 && rsync -az pg_xlog/0000000100000065000000F1 pgbackup@backup:pgbackup-cirkus/0000000100000065000000F1 [..] 2009-09-07 22:12:55 CEST 16777 LOG: archive command failed with exit code 1 2009-09-07 22:12:55 CEST 16777 DETAIL: The failed archive command was: ssh pgbackup@backup test ! -f pgbackup-cirkus/0000000100000065000000F1 && rsync -az pg_xlog/0000000100000065000000F1 pgbackup@backup:pgbackup-cirkus/0000000100000065000000F1 [..] 2009-09-07 22:12:56 CEST 16777 LOG: archive command failed with exit code 1 2009-09-07 22:12:56 CEST 16777 DETAIL: The failed archive command was: ssh pgbackup@backup test ! -f pgbackup-cirkus/0000000100000065000000F1 && rsync -az pg_xlog/0000000100000065000000F1 pgbackup@backup:pgbackup-cirkus/0000000100000065000000F1 2009-09-07 22:12:56 CEST 16777 WARNING: transaction log file "0000000100000065000000F1" could not be archived: too many failures On the backup host (receiving the WAL files), named kolje: pgbackup@kolje:~$ ls -la pgbackup-cirkus/0000000100000065000000EE pgbackup-cirkus/0000000100000065000000EF pgbackup-cirkus/0000000100000065000000F0 pgbackup-cirkus/0000000100000065000000F1 pgbackup-cirkus/0000000100000065000000F2 pgbackup-cirkus/0000000100000065000000F4 -rw------- 1 pgbackup pgbackup 16777216 2009-09-07 20:27 pgbackup-cirkus/0000000100000065000000EE -rw------- 1 pgbackup pgbackup 16777216 2009-09-07 20:32 pgbackup-cirkus/0000000100000065000000EF -rw------- 1 pgbackup pgbackup 16777216 2009-09-07 20:37 pgbackup-cirkus/0000000100000065000000F0 -rw------- 1 pgbackup pgbackup 16777216 2009-09-07 20:47 pgbackup-cirkus/0000000100000065000000F1 -rw------- 1 pgbackup pgbackup 16777216 2009-09-07 20:52 pgbackup-cirkus/0000000100000065000000F2 -rw------- 1 pgbackup pgbackup 16777216 2009-09-07 21:02 pgbackup-cirkus/0000000100000065000000F4 On the DB host, called cirkus: cirkus:~# ls -la /var/lib/postgresql/8.3/main/pg_xlog/0000000100000065000000F1 /var/lib/postgresql/8.3/main/pg_xlog/archive_status/0000000100000065000000F1.ready -rw------- 1 postgres postgres 16777216 2009-09-07 22:03 /var/lib/postgresql/8.3/main/pg_xlog/0000000100000065000000F1 -rw------- 1 postgres postgres 0 2009-09-07 22:12 /var/lib/postgresql/8.3/main/pg_xlog/archive_status/0000000100000065000000F1.ready There's no 0000000100000065000000F0 or 0000000100000065000000F2: cirkus:~# ls -lart /var/lib/postgresql/8.3/main/pg_xlog/ | head total 7365516 -rw------- 1 postgres postgres 16777216 2009-09-07 22:03 0000000100000065000000F1 -rw------- 1 postgres postgres 16777216 2009-09-07 22:17 000000010000006600000004 -rw------- 1 postgres postgres 16777216 2009-09-07 22:22 000000010000006600000005 -rw------- 1 postgres postgres 16777216 2009-09-07 22:27 000000010000006600000006 -rw------- 1 postgres postgres 16777216 2009-09-07 22:32 000000010000006600000007 -rw------- 1 postgres postgres 16777216 2009-09-07 22:37 000000010000006600000008 -rw------- 1 postgres postgres 16777216 2009-09-07 22:42 000000010000006600000009 -rw------- 1 postgres postgres 16777216 2009-09-07 22:47 00000001000000660000000A -rw------- 1 postgres postgres 16777216 2009-09-07 22:52 00000001000000660000000B The files in question do not share MD5 sum: pgbackup@kolje:~$ md5sum pgbackup-cirkus/0000000100000065000000F1 341f308f3269176873f93b19496b2923 pgbackup-cirkus/0000000100000065000000F1 cirkus:~# md5sum /var/lib/postgresql/8.3/main/pg_xlog/0000000100000065000000F1 8010976a8a95480916ff2bae9203a2cf /var/lib/postgresql/8.3/main/pg_xlog/0000000100000065000000F1 How do I continue debugging this? How did a file name collision occur? -B Server configuration: name | setting ---------------------------------+----------------------------------------------------------------------------------------------------- add_missing_from | off allow_system_table_mods | off archive_command | ssh pgbackup@backup test ! -f pgbackup-cirkus/%f && rsync -az %p pgbackup@backup:pgbackup-cirkus/%f archive_mode | on archive_timeout | 300 array_nulls | on authentication_timeout | 60 autovacuum | on autovacuum_analyze_scale_factor | 0.3 autovacuum_analyze_threshold | 100 autovacuum_freeze_max_age | 200000000 autovacuum_max_workers | 3 autovacuum_naptime | 60 autovacuum_vacuum_cost_delay | -1 autovacuum_vacuum_cost_limit | -1 autovacuum_vacuum_scale_factor | 0.4 autovacuum_vacuum_threshold | 400 backslash_quote | safe_encoding bgwriter_delay | 200 bgwriter_lru_maxpages | 100 bgwriter_lru_multiplier | 2 block_size | 8192 bonjour_name | check_function_bodies | on checkpoint_completion_target | 0.5 checkpoint_segments | 15 checkpoint_timeout | 1800 checkpoint_warning | 30 client_encoding | UTF8 client_min_messages | notice commit_delay | 0 commit_siblings | 5 config_file | /etc/postgresql/8.3/main/postgresql.conf constraint_exclusion | off cpu_index_tuple_cost | 0.005 cpu_operator_cost | 0.0025 cpu_tuple_cost | 0.01 custom_variable_classes | data_directory | /var/lib/postgresql/8.3/main DateStyle | ISO, MDY db_user_namespace | off deadlock_timeout | 1000 debug_assertions | off debug_pretty_print | on debug_print_parse | off debug_print_plan | on debug_print_rewritten | off default_statistics_target | 10 default_tablespace | default_text_search_config | pg_catalog.simple default_transaction_isolation | read committed default_transaction_read_only | off default_with_oids | off dynamic_library_path | $libdir effective_cache_size | 393216 enable_bitmapscan | on enable_hashagg | on enable_hashjoin | on enable_indexscan | on enable_mergejoin | on enable_nestloop | on enable_seqscan | on enable_sort | on enable_tidscan | on escape_string_warning | on explain_pretty_print | on external_pid_file | /var/run/postgresql/8.3-main.pid extra_float_digits | 0 from_collapse_limit | 8 fsync | on full_page_writes | on geqo | on geqo_effort | 5 geqo_generations | 0 geqo_pool_size | 0 geqo_selection_bias | 2 geqo_threshold | 12 gin_fuzzy_search_limit | 0 hba_file | /etc/postgresql/8.3/main/pg_hba.conf ident_file | /etc/postgresql/8.3/main/pg_ident.conf ignore_system_indexes | off integer_datetimes | on join_collapse_limit | 8 krb_caseins_users | off krb_realm | krb_server_hostname | krb_server_keyfile | /etc/sql.keytab krb_srvname | postgres lc_collate | en_US.UTF-8 lc_ctype | en_US.UTF-8 lc_messages | lc_monetary | C lc_numeric | C lc_time | C listen_addresses | * local_preload_libraries | log_autovacuum_min_duration | -1 log_checkpoints | on log_connections | on log_destination | stderr log_directory | pg_log log_disconnections | off log_duration | off log_error_verbosity | default log_executor_stats | off log_filename | postgresql-%Y-%m-%d_%H%M%S.log log_hostname | off log_line_prefix | %t %p %d %u log_lock_waits | on log_min_duration_statement | 100 log_min_error_statement | FATAL log_min_messages | INFO log_parser_stats | off log_planner_stats | off log_rotation_age | 1440 log_rotation_size | 10240 log_statement | ddl log_statement_stats | off log_temp_files | 1 log_timezone | localtime log_truncate_on_rotation | off logging_collector | off maintenance_work_mem | 16384 max_connections | 256 max_files_per_process | 1000 max_fsm_pages | 2000000 max_fsm_relations | 35000 max_function_args | 100 max_identifier_length | 63 max_index_keys | 32 max_locks_per_transaction | 64 max_prepared_transactions | 100 max_stack_depth | 2048 password_encryption | on port | 5432 post_auth_delay | 0 pre_auth_delay | 0 random_page_cost | 4 regex_flavor | advanced search_path | "$user",public seq_page_cost | 1 server_encoding | UTF8 server_version | 8.3.7 server_version_num | 80307 session_replication_role | origin shared_buffers | 20000 shared_preload_libraries | silent_mode | off sql_inheritance | on ssl | on ssl_ciphers | ALL:!ADH:!LOW:!EXP:!MD5:@STRENGTH standard_conforming_strings | off statement_timeout | 0 superuser_reserved_connections | 3 synchronize_seqscans | on synchronous_commit | on syslog_facility | LOCAL0 syslog_ident | postgres tcp_keepalives_count | 0 tcp_keepalives_idle | 0 tcp_keepalives_interval | 0 temp_buffers | 1024 temp_tablespaces | TimeZone | localtime timezone_abbreviations | Default trace_notify | off trace_sort | off track_activities | on track_counts | on transaction_isolation | read committed transaction_read_only | off transform_null_equals | off unix_socket_directory | /var/run/postgresql unix_socket_group | unix_socket_permissions | 511 update_process_title | on vacuum_cost_delay | 0 vacuum_cost_limit | 200 vacuum_cost_page_dirty | 20 vacuum_cost_page_hit | 1 vacuum_cost_page_miss | 10 vacuum_freeze_min_age | 100000000 wal_buffers | 8 wal_sync_method | open_sync wal_writer_delay | 200 work_mem | 81920 xmlbinary | base64 xmloption | content zero_damaged_pages | off (189 rows) -- Berge Schwebs Bjørlo Alegría! -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general