In our Django app we have messages (currently about 7
million in table msgs_message) and labels (about 300), and a
join table to associate messages with labels (about 500,000 in
msgs_message_labels). Not sure you'll need them, but here are
the relevant table schemas:
CREATE TABLE msgs_message
(
id INTEGER PRIMARY KEY NOT NULL,
type VARCHAR NOT NULL,
text TEXT NOT NULL,
is_archived BOOLEAN NOT NULL,
created_on TIMESTAMP WITH TIME ZONE NOT NULL,
contact_id INTEGER NOT NULL,
org_id INTEGER NOT NULL,
case_id INTEGER,
backend_id INTEGER NOT NULL,
is_handled BOOLEAN NOT NULL,
is_flagged BOOLEAN NOT NULL,
is_active BOOLEAN NOT NULL,
has_labels BOOLEAN NOT NULL,
CONSTRAINT
msgs_message_contact_id_5c8e3f216c115643_fk_contacts_contact_id
FOREIGN KEY (contact_id) REFERENCES contacts_contact (id),
CONSTRAINT
msgs_message_org_id_81a0adfcc99151d_fk_orgs_org_id FOREIGN
KEY (org_id) REFERENCES orgs_org (id),
CONSTRAINT
msgs_message_case_id_51998150f9629c_fk_cases_case_id FOREIGN
KEY (case_id) REFERENCES cases_case (id)
);
CREATE UNIQUE INDEX msgs_message_backend_id_key ON
msgs_message (backend_id);
CREATE INDEX msgs_message_6d82f13d ON msgs_message
(contact_id);
CREATE INDEX msgs_message_9cf869aa ON msgs_message
(org_id);
CREATE INDEX msgs_message_7f12ca67 ON msgs_message
(case_id);
CREATE TABLE msgs_message_labels
(
id INTEGER PRIMARY KEY NOT NULL,
message_id INTEGER NOT NULL,
label_id INTEGER NOT NULL,
CONSTRAINT
msgs_message_lab_message_id_1dfa44628fe448dd_fk_msgs_message_id
FOREIGN KEY (message_id) REFERENCES msgs_message (id),
CONSTRAINT
msgs_message_labels_label_id_77cbdebd8d255b7a_fk_msgs_label_id
FOREIGN KEY (label_id) REFERENCES msgs_label (id)
);
CREATE UNIQUE INDEX
msgs_message_labels_message_id_label_id_key ON
msgs_message_labels (message_id, label_id);
CREATE INDEX msgs_message_labels_4ccaa172 ON
msgs_message_labels (message_id);
CREATE INDEX msgs_message_labels_abec2aca ON
msgs_message_labels (label_id);
Users can search for messages, and they are returned page
by page in reverse chronological order. There are several
partial multi-column indexes on the message table, but the one
used for the example queries below is
CREATE INDEX msgs_inbox ON msgs_message(org_id,
created_on DESC)
WHERE is_active = TRUE AND is_handled = TRUE AND
is_archived = FALSE AND has_labels = TRUE;
SELECT "msgs_message".*
FROM "msgs_message"
WHERE ("msgs_message"."org_id" = 7
AND "msgs_message"."is_active" = true
AND "msgs_message"."is_handled" = true
AND "msgs_message"."has_labels" = true
AND "msgs_message"."is_archived" = false
AND "msgs_message"."created_on" < '2016-06-10T07:11:
06.381000+00:00'::timestamptz
) ORDER BY "msgs_message"."created_on" DESC LIMIT 50
But users can also search for messages that have one or
more labels, leading to queries that look like:
SELECT DISTINCT "msgs_message".*
FROM "msgs_message"
INNER JOIN "msgs_message_labels" ON ( "msgs_message"."id"
= "msgs_message_labels"."message_id" )
WHERE ("msgs_message"."org_id" = 7
AND "msgs_message"."is_active" = true
AND "msgs_message"."is_handled" = true
AND "msgs_message_labels"."label_id" IN (127, 128,
135, 136, 137, 138, 140, 141, 143, 144)
AND "msgs_message"."has_labels" = true
AND "msgs_message"."is_archived" = false
AND "msgs_message"."created_on" <
'2016-06-10T07:11:
06.381000+00:00'::timestamptz
) ORDER BY "msgs_message"."created_on" DESC LIMIT 50
And if you run it again, it'll be fast again. Am I correct
in interpreting that second explain as being slow because
msgs_message_pkey isn't cached? It looks like it read from
that index 3556 times, and each time took 18.559 (?) ms, and
that adds up to 65,996ms. The database server says it has lots
of free memory so is there something I should be doing to keep
that index in memory?
Generally speaking, is there a good strategy for optimising
queries like these which involve two tables?
- I tried moving the label references into an int array on
msgs_message, and then using btree_gin to create a
multi-column index involving the array column, but that
doesn't appear to be very useful for these ordered queries
because it's not an ordered index.
- I tried adding created_on to msgs_message_labels table
but I couldn't find a way of avoiding the in-memory sort.
- Have thought about dynamically creating partial indexes
for each label using an array column on msgs_message to
hold label ids, and index condition like WHERE label_ids
&& ARRAY[123] but not sure what other problems
I'll run into with hundreds of indexes on the same table.
Server is an Amazon RDS instance with default settings
and Postgres 9.3.10, with one other database in the
instance.
All advice very much appreciated, thanks
--