I've been thinking hard on this one, and I tried to keep the details of the application requirements to a minimum in my question, to focus on the postgres performance aspect.
Though it may be off topic, let me try to describe the reason I'm trying not to delete rows. Originally, my idea was exactly what you've described: to move data to an archive table, but here is the problem:
The rows contain metadata for subgraphs (partitions) of a huge graph structure. Each partition has its own local coordinate system for its nodes, and a parallel processing framework walks through partitions looking for patterns. When patterns are found, the matching coordinates need to be converted to global coordinates, and the partition metadata makes that conversion possible. However, a partition's metadata is built via calculations based on the previous partitions' metadata. If I delete or update a single row, it means the graph structure has changed, and I have to rebuild the metadata for all partitions that follow the deleted/updated one, which means updating a large number of rows in the DB.
If I leave the dead/disabled partition in place and simply add a copy of it as if it were a new partition, then all is fine. All I have to do is skip the dead/disabled partition during graph pattern search.
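In SQL terms, the scheme looks roughly like this. A minimal sketch only: the table name "partition_meta" and its columns are made-up placeholders for illustration, not the actual schema:

```sql
-- Hypothetical schema illustrating the disable-and-copy scheme described above.
BEGIN;

-- Mark the old partition row as disabled instead of deleting or updating it,
-- so the derived metadata of all following partitions stays valid.
UPDATE partition_meta
SET disabled = TRUE
WHERE id = 42;

-- Append a fresh row with the corrected metadata, as if it were a new partition.
INSERT INTO partition_meta (id, origin_x, origin_y, disabled)
VALUES (1001, 100, 250, FALSE);

COMMIT;
```

The pattern search then simply ignores rows where disabled is true, and no downstream partition metadata ever needs to be recomputed.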
I am not happy about having the enabled/disabled status leak into other layers, and a view may be a good solution; thanks for that! I am also not happy about hiding dirt under the carpet rather than putting it in the rubbish bin, but the performance penalty of the alternative is simply too big.
Best regards
Seref
On Thu, Aug 30, 2012 at 8:09 PM, David Johnston <polobo@xxxxxxxxx> wrote:
From: pgsql-general-owner@xxxxxxxxxxxxxx
[mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of Seref Arikan
Sent: Thursday, August 30, 2012 12:48 PM
To: PG-General Mailing List
Subject: Performance implications of adding a "disabled" column to
a table
============================================================================
Greetings,
I have a large number of rows (up to 3-4 millions) that I'll either be
fetching into ram (if it is a few thousand rows), or scrolling through a
cursor.
Deletions or updates on the content of these rows lead to expensive operations
in my business logic, so I simply need to mark them as disabled.
I have two options here: exclude the disabled rows from my query results
via "... WHERE NOT disabled", or skip disabled rows in the business
logic.
There will be very few disabled rows compared to the total rows I'll have to
return. This table will see very few queries, so including "disabled = false"
or "enabled = true" in every query would not be an issue.
I have a feeling that including this criterion in the queries may add
unnecessary processing, since the table is expected to grow to
100 million or more rows. So should I deal with this in the DB, or at the
application layer?
Kind regards
Seref
=
So you are willing to live with the fact that whatever, apparently
important, business logic processing occurs when a row is updated or deleted
is in fact not important, since you are now considering keeping a class of data
that is not useful but still present and reflected elsewhere in the system.
If the proportion of rows that are disabled is small, and you don't intend
to query "where disabled = TRUE", then adding the column is sufficient. No
index required since it will not be used. Whether you alter queries to add
the "disabled = FALSE" or read and deal with it in the application doesn't
really matter (unless individual records are large). Personally, I would
suggest creating VIEWs that do not output "disabled" and that always filter on
"disabled = FALSE". Call it something like "tablename_active" and then use
that view, thus sending only enabled records to the application layer.
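A minimal sketch of that approach (the table and column names here are placeholders, not taken from the original post):

```sql
-- Hypothetical base table with the soft-delete flag.
CREATE TABLE tablename (
    id       bigint PRIMARY KEY,
    payload  text,
    disabled boolean NOT NULL DEFAULT FALSE
);

-- View exposing only enabled rows, with the flag column itself omitted.
CREATE VIEW tablename_active AS
SELECT id, payload
FROM tablename
WHERE disabled = FALSE;
```

The application then queries tablename_active and never sees the flag at all, keeping the enabled/disabled concern inside the database layer.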
I would also make sure I am very confident about the implications of doing this
versus deleting the data outright (though possibly moving it to an "archive"
table).
David J.