Hi All, This is my first postgres query to admin list, so if I am not following the right standards for asking the question, pls let me know
😊 The problem: I have a postgres cluster as A (primary)-> streaming replication -> B(hot_standby=on) We had a power outage in one of the data centers, and when we got back, one of the databases servers (B the standby node) seem to show weird errors and is not starting up. A recovered fine, and it running
fine. --------- logs 2018-06-25 10:57:04 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command. 2018-06-25 10:57:04 UTC WARNING: terminating connection because of crash of another server process 2018-06-25 10:57:04 UTC DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared
memory. 2018-06-25 10:57:04 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command. 2018-06-25 10:57:04 UTC LOG: database system is shut down 2018-06-27 16:59:28 UTC LOG: listening on IPv4 address "0.0.0.0", port 5432 2018-06-27 16:59:28 UTC LOG: listening on IPv6 address "::", port 5432 2018-06-27 16:59:28 UTC LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 2018-06-27 16:59:28 UTC LOG: database system was interrupted while in recovery at log time 2018-06-25 10:52:21 UTC 2018-06-27 16:59:28 UTC HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. 2018-06-27 16:59:28 UTC LOG: entering standby mode 2018-06-27 16:59:28 UTC LOG: recovered replication state of node 1 to 63/89B52A98 2018-06-27 16:59:28 UTC LOG: redo starts at 9A/6F1B3888 2018-06-27 16:59:28 UTC LOG: consistent recovery state reached at 9A/6F25BFF0 2018-06-27 16:59:28 UTC FATAL: invalid memory alloc request size 1768185856 2018-06-27 16:59:28 UTC LOG: database system is ready to accept read only connections 2018-06-27 16:59:28 UTC LOG: startup process (PID 11829) exited with exit code 1 2018-06-27 16:59:28 UTC LOG: terminating any other active server processes 2018-06-27 16:59:28 UTC LOG: database system is shut down pg_ctl: could not start server Examine the log output. -------what I tried I did a lot of searching around “invalid memory alloc request” it mostly points to some data error,
https://www.postgresql.org/message-id/AANLkTik88wGQ6h-xFHJBU_Fj_CnXK5LX7aBws_2oANKP%40mail.gmail.com mem checks and disks are ok. I took a filesystem snapshot, synced it onto another server and tried to reload it, it gave the same error. Now to ensure there was not real data causing this issue, I took a full dump of the primary db (A) using pg_dumpall and restored in onto the test server. I was able to load the db and was able to query the
tables without issues. So I am confused why the standby is behaving weird after the power outage. Just FYI, I spun up a new node ( C ) as a replacement for the problem standby(B) with pg_baseback and it is working fine. Can query all tables without problems. My only concern was do I need to worry about
this error showing up again? Most the googling points to some operation and finding out errors in some operation performed etc, https://blog.dob.sk/2012/05/19/fixing-pg_dump-invalid-memory-alloc-request-size/ - 1 https://confluence.atlassian.com/jirakb/invalid-memory-alloc-request-size-440107132.html - 2 I ran that function (in 1), could not find bad data rows. Also, I tried to bump up (memory to 32GB, cpu to 8 cpu and shmmax to 18446744073692774399 to figure out is there was any resource constraint that I could temp bump up and then debug what went wrong, but it
did not start up. $ /usr/lib/postgresql/10/bin/postgres -V postgres (PostgreSQL) 10.4 (Ubuntu 10.4-2.pgdg16.04+1) $ grep -v "#" /etc/postgresql/10/main/postgresql.conf | sed -e '/^$/d' ssl = on ssl_cert_file = '/etc/ssl/certs/ssl-cert-snakeoil.pem' ssl_key_file = '/etc/ssl/private/ssl-cert-snakeoil.key' log_timezone = UTC stats_temp_directory = '/var/run/postgresql/10-main.pg_stat_tmp' datestyle = 'iso, mdy' timezone = UTC default_text_search_config = 'pg_catalog.english' log_directory = pg_log archive_mode = on maintenance_work_mem = 16MB max_wal_size = 1GB max_stack_depth = 2MB autovacuum_analyze_scale_factor = 0.05 autovacuum_vacuum_scale_factor = 0.1 shared_preload_libraries = pg_stat_statements work_mem = 4MB max_wal_senders = 10 temp_buffers = 16MB archive_command = '/bin/true' effective_cache_size = 256MB max_replication_slots = 10 hot_standby = on wal_keep_segments = 64 wal_level = hot_standby listen_addresses = '*' log_min_duration_statement = 10000 max_worker_processes = 18 sysctl: kernel.shmall = 18446744073692774399 kernel.shmmax = 18446744073692774399 $ free -m total used free shared buff/cache available Mem: 16047 134 14469 8 1442 15620 Swap: 975 0 975 $ nproc 4 Thanks, Vijay |