Hi,

The problem: Postgres is becoming slower day after day, and only a full vacuum fixes the problem.

Information you may need to evaluate:

The problem affects all tables and queries as far as I can tell, but we can focus on a single table for better comprehension. The queries I am running to test the speed are:

INSERT INTO "AWAITINGSTATUSSMPP" VALUES('143428', '1111', 1, '2012-06-16 13:39:19', '111');
DELETE FROM "AWAITINGSTATUSSMPP" WHERE "SMSCMSGID" = '1111' AND "CONNECTIONID" = 1;
SELECT * FROM "AWAITINGSTATUSSMPP" WHERE "SMSCMSGID" = '1111' AND "CONNECTIONID" = 1;

After a full vacuum, they run in about 100ms. Today, before the full vacuum, they were taking around 500ms.

Below is an EXPLAIN ANALYZE of the commands AFTER a full vacuum. I did not run it before, so I cannot post the corresponding output from before the vacuum.

So, after the full vacuum:

explain analyze INSERT INTO "AWAITINGSTATUSSMPP" VALUES('143428', '1111', 1, '2012-06-16 13:39:19', '111');
"Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)"
"Trigger for constraint FK_AWAITINGSTATUSSMPP_MESSAGES: time=0.131 calls=1"
"Trigger bucardo_add_delta: time=0.454 calls=1"
"Trigger bucardo_triggerkick_MassSMs: time=0.032 calls=1"
"Total runtime: 0.818 ms"

explain analyze DELETE FROM "AWAITINGSTATUSSMPP" WHERE "SMSCMSGID" = '1111' AND "CONNECTIONID" = 1;
"Seq Scan on "AWAITINGSTATUSSMPP"  (cost=0.00..2.29 rows=1 width=6) (actual time=0.035..0.035 rows=0 loops=1)"
"  Filter: ((("SMSCMSGID")::text = '1111'::text) AND ("CONNECTIONID" = 1))"
"Trigger bucardo_triggerkick_MassSMs: time=0.066 calls=1"
"Total runtime: 0.146 ms"

explain analyze SELECT * FROM "AWAITINGSTATUSSMPP" WHERE "SMSCMSGID" = '1111' AND "CONNECTIONID" = 1;
"Seq Scan on "AWAITINGSTATUSSMPP"  (cost=0.00..2.29 rows=1 width=557) (actual time=0.028..0.028 rows=0 loops=1)"
"  Filter: ((("SMSCMSGID")::text = '1111'::text) AND ("CONNECTIONID" = 1))"
"Total runtime: 0.053 ms"

Below are the metadata of the table:
=====================================
CREATE TABLE "AWAITINGSTATUSSMPP"
(
  "MESSAGEID" bigint NOT NULL,
  "SMSCMSGID" character varying(50) NOT NULL,
  "CONNECTIONID" smallint NOT NULL,
  "EXPIRE_TIME" timestamp without time zone NOT NULL,
  "RECIPIENT" character varying(20) NOT NULL,
  "CLIENT_MSG_ID" character varying(255),
  CONSTRAINT "PK_AWAITINGSTATUSSMPP" PRIMARY KEY ("SMSCMSGID", "CONNECTIONID"),
  CONSTRAINT "FK_AWAITINGSTATUSSMPP_MESSAGES" FOREIGN KEY ("MESSAGEID")
      REFERENCES "MESSAGES" ("ID") MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE CASCADE
)
WITH (
  OIDS=FALSE
);
ALTER TABLE "AWAITINGSTATUSSMPP" OWNER TO postgres;
GRANT ALL ON TABLE "AWAITINGSTATUSSMPP" TO "MassSMsUsers";

CREATE INDEX "IX_AWAITINGSTATUSSMPP_MSGID_RCP"
  ON "AWAITINGSTATUSSMPP"
  USING btree ("MESSAGEID", "RECIPIENT");

CREATE TRIGGER bucardo_add_delta
  AFTER INSERT OR UPDATE OR DELETE
  ON "AWAITINGSTATUSSMPP"
  FOR EACH ROW
  EXECUTE PROCEDURE bucardo."bucardo_add_delta_SMSCMSGID|CONNECTIONID"();

CREATE TRIGGER "bucardo_triggerkick_MassSMs"
  AFTER INSERT OR UPDATE OR DELETE OR TRUNCATE
  ON "AWAITINGSTATUSSMPP"
  FOR EACH STATEMENT
  EXECUTE PROCEDURE bucardo."bucardo_triggerkick_MassSMs"();
=====================================

The table only has about 200 records because it is used as temporary storage and records are constantly inserted and deleted. BUT please don't get hung up on this fact, because, as I already said, the speed problem is not restricted to this table.
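In case it helps, the next time the slowdown appears (before I run the full vacuum) I can capture dead-tuple and size statistics for this table so they can be compared with the post-vacuum state. A minimal sketch of what I would run, assuming the standard pg_stat_user_tables view and size functions are enough for this:

-- Dead-tuple count, last (auto)vacuum time, and on-disk size for the table,
-- to be captured both before and after the full vacuum for comparison.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM   pg_stat_user_tables
WHERE  relname = 'AWAITINGSTATUSSMPP';

I have not collected this output yet; I only mention it as the kind of before/after evidence I can provide if needed.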
The same problem appears with the following query:

UPDATE "MESSAGES" SET "SENT" = "SENT" + 1 WHERE "ID" = 143447;

and the MESSAGES table mainly gets inserts and only a few deletes...

My postgresql.conf file:
======================
port = 5433                             # (change requires restart)
max_connections = 100                   # (change requires restart)
shared_buffers = 256MB                  # min 128kB. DoubleIP - Default was 32MB
synchronous_commit = off                # immediate fsync at commit. DoubleIP - Default was on
effective_cache_size = 512MB            # DoubleIP - Default was 128MB
log_destination = 'stderr'              # Valid values are combinations of
logging_collector = on                  # Enable capturing of stderr and csvlog
silent_mode = on                        # Run server silently.
log_line_prefix = '%t %d %u '           # special values:
log_autovacuum_min_duration = 0         # -1 disables, 0 logs all actions and
autovacuum_naptime = 28800              # time between autovacuum runs. DoubleIP - default was 1min
autovacuum_vacuum_threshold = 100       # min number of row updates before
autovacuum_vacuum_scale_factor = 0.0    # fraction of table size before vacuum. DoubleIP - default was 0.2
datestyle = 'iso, mdy'
lc_messages = 'en_US.UTF-8'             # locale for system error message
lc_monetary = 'en_US.UTF-8'             # locale for monetary formatting
lc_numeric = 'en_US.UTF-8'              # locale for number formatting
lc_time = 'en_US.UTF-8'                 # locale for time formatting
default_text_search_config = 'pg_catalog.english'
=======================

As you can see, I have altered the shared_buffers and synchronous_commit values. shared_buffers had the default value of 32MB; when I changed it to 256MB the problem still appeared, but it took longer to show up (3-4 days). With 32MB, it appeared faster, probably within 24 hours.

I have also changed the autovacuum daemon to run every 8 hours, but tuned its thresholds so that it vacuums pretty much all tables (every table for which at least 100 rows have changed). Please note, though, that the problem existed even before I played around with autovacuum; that is why I tried to change its values in the first place.

The server is synchronized with another server using Bucardo. The Bucardo process runs on the other server. The same problem appears on the second server too: after 3-4 days, postgres runs slower and slower.
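Regarding the autovacuum settings above, my understanding (from the documented formula) is that a table becomes a vacuum candidate once its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples, which with my values is 100 + 0.0 * reltuples = 100 dead rows; however, the daemon only wakes up every autovacuum_naptime = 28800 seconds (8 hours). As a sanity check on my side, this query should confirm which values the running server actually uses:

-- Show the live autovacuum-related settings (plus shared_buffers) as the
-- server sees them, to rule out a config file that was edited but not reloaded.
SELECT name, setting, unit
FROM   pg_settings
WHERE  name LIKE 'autovacuum%' OR name = 'shared_buffers';

Please correct me if I have misunderstood how these values interact.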
Our server configuration:

DELL PowerEdge T610 Tower Chassis for up to 8x 3.5" HDDs
2x Intel Xeon E5520 Processor (2.26GHz, 8M Cache, 5.86 GT/s QPI, Turbo, HT), 1066MHz Max Memory
8GB Memory, 1333MHz
2 x 146GB SAS 15k 3.5" HD Hot Plug
6 x 1TB SATA 7.2k 3.5" Additional HD Hot Plug
PERC 6/i RAID Controller Card 256MB PCIe, 2x4 Connectors
SUSE Linux Enterprise Server 10, SP2

The 2 SAS HDs are set up as RAID-1 and the 6 SATA HDs as RAID-5. Linux runs on the RAID-1 array; Postgres runs on the RAID-5 array.

Finally, a top before and after the full vacuum:

top - 11:27:44 up 72 days, 13:27, 37 users,  load average: 1.05, 1.31, 1.45
Tasks: 279 total,   3 running, 276 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.6%us,  0.8%sy,  0.0%ni, 95.5%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   8166432k total,  7963116k used,   203316k free,   115344k buffers
Swap:  2097144k total,  2097020k used,      124k free,  2337636k cached

top - 11:30:58 up 72 days, 13:31, 38 users,  load average: 1.53, 1.59, 1.53
Tasks: 267 total,   2 running, 265 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.3%us,  0.4%sy,  0.0%ni, 98.0%id,  0.3%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   8166432k total,  6016268k used,  2150164k free,    61092k buffers
Swap:  2097144k total,  2010204k used,    86940k free,  2262896k cached

I hope I have provided enough info and that someone can point me in the right direction. Thank you very much even for reading this far!

Best regards,
Kiriakos