Hi,
We have a system built on PG 8.1 that runs constantly, adding data to a number of tables, and the database can grow by over a gigabyte a day. We deliver this as a packaged solution on a server that runs unattended, so we need to delete some of the data periodically (we have autovacuum running to minimize the number of obsolete tuples). What I am working on is a way to delete just enough historical data at any given time to keep the free disk space above a certain threshold. This lets us keep the longest possible span of historical data without running out of disk space.
So, I have built a function that returns an approximation of the space taken up by the non-obsolete tuples (I call this used_size) and of the physical disk space taken up by the data files (I call this disk_size).
The following SQL query is what I use to return these values:
SELECT sum((bytes_per_row + row_overhead_bytes) * tuples) as used_size,
       sum(pages)::float * 8192 as disk_size,
       sum(pages) as total_pages
  from (select c.relname,
               sum(case when a.attlen = -1 then _var_col_bytes
                        else a.attlen
                   end) as bytes_per_row,
               max(case when c.relkind = 'i' then 4 else 32 end) as row_overhead_bytes,
               max(c.reltuples) as tuples,
               max(c.relpages) as pages
          from pg_class c,
               pg_attribute a,
               pg_namespace n
         where c.oid = a.attrelid
           and c.relnamespace = n.oid
           and c.relkind in ('i','r')
           and a.atttypid not in (26,27,28,29)
         group by c.relname) by_table;
A few notes:
1) I have used 32 bytes for the heap tuple header overhead and 4 bytes for the index tuple overhead
2) The attribute types 26, 27, 28, and 29 are oid, tid, xid, and cid, which I believe are already accounted for in the row overhead
3) _var_col_bytes is an input parameter to the function giving the assumed size of any variable-length columns
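For context, the function itself is basically a thin wrapper around that query; a simplified sketch (my real version differs in the details, and the function name here is just illustrative) looks something like this:

-- Simplified sketch only; estimate_sizes is an illustrative name.
CREATE OR REPLACE FUNCTION estimate_sizes(_var_col_bytes integer,
                                          OUT used_size double precision,
                                          OUT disk_size double precision,
                                          OUT total_pages bigint)
AS $$
BEGIN
    SELECT sum((bytes_per_row + row_overhead_bytes) * tuples),
           sum(pages)::float * 8192,
           sum(pages)
      INTO used_size, disk_size, total_pages
      FROM (select c.relname,
                   sum(case when a.attlen = -1 then _var_col_bytes
                            else a.attlen
                       end) as bytes_per_row,
                   max(case when c.relkind = 'i' then 4 else 32 end) as row_overhead_bytes,
                   max(c.reltuples) as tuples,
                   max(c.relpages) as pages
              from pg_class c, pg_attribute a, pg_namespace n
             where c.oid = a.attrelid
               and c.relnamespace = n.oid
               and c.relkind in ('i','r')
               and a.atttypid not in (26,27,28,29)
             group by c.relname) by_table;
END;
$$ LANGUAGE plpgsql;

which I can then call as, e.g., SELECT * FROM estimate_sizes(60);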
I then use the function to actively manage used_size (after statistics have been refreshed via ANALYZE, of course). A separate process tracks the daily growth of used_size, and that tells me how many days of data I need to remove to stay within my limits. disk_size is not actively managed; it just represents the high-water mark of used_size.
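To make the pruning step concrete (the table and column names below are placeholders rather than my actual schema, and 30 stands in for the computed number of days to keep):

-- Placeholder pruning step: drop everything older than the computed
-- retention window; 'history_data' and 'recorded_at' are illustrative names.
DELETE FROM history_data
 WHERE recorded_at < now() - interval '30 days';

The number of days to keep is recomputed from the tracked daily growth each time the job runs.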
Questions:
1) I have found the 32-byte overhead mentioned in a few places, but I have not seen any specific reference to the per-row byte overhead of an index entry. Does anyone know the best number to assume here?
2) Are there any other holes in my logic/query?
3) Has anyone solved this entire problem in another fashion (e.g. auto-pruning: only delete what's necessary to stay within limits)?
Any feedback is greatly appreciated,
Mark