Re: Schedule pg_repack job with pg_cron

On Aug 7, 2024, at 3:39 PM, Ron Johnson <ronljohnsonjr@xxxxxxxxx> wrote:

On Wed, Aug 7, 2024 at 3:29 PM Rui DeSousa <rui.desousa@xxxxxxxxxx> wrote:
On Aug 7, 2024, at 9:24 AM, jacob ndinkwa <jndinkwa@xxxxxxxxx> wrote:

To schedule a pg_repack job using pg_cron on an Amazon RDS for PostgreSQL instance, you need to follow a few steps. However, it’s important to note that pg_cron is only supported on certain versions of Amazon RDS, and pg_repack is also a separate extension that must be installed and enabled.
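
For reference, a rough sketch of what that setup could look like on RDS (the extension steps are standard, but the object names, host, and schedules below are made-up examples, not a tested recipe):

  -- Enable pg_cron via the RDS parameter group (shared_preload_libraries,
  -- cron.database_name), then in that database:
  CREATE EXTENSION IF NOT EXISTS pg_cron;
  CREATE EXTENSION IF NOT EXISTS pg_repack;

  -- pg_cron can only run SQL; the repack itself is driven by the pg_repack
  -- client binary, e.g. from an EC2 host's crontab:
  --   pg_repack -h mydb.xxxxxx.rds.amazonaws.com -U repack_user -d mydb \
  --             --table public.big_table --no-superuser-check

  -- What pg_cron itself can schedule directly is plain SQL maintenance:
  SELECT cron.schedule('weekly-analyze', '0 3 * * 0',
                       $$VACUUM ANALYZE public.big_table$$);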


Isn't scheduling pg_repack just a bad idea that introduces more bloat? Why not tune autovacuum instead?

80/20 rule… most schemas are going to have their large/hot tables, etc., and data has a natural life cycle.  If you have a healthy application then bloat is not an issue, as free space is reused by new tuples.  Each database has a data flow to it depending on the maturity and nature of the application/database.  Exiting tuples make room for new tuples, etc.

If you have to VACUUM FULL / pg_repack your tables on a scheduled basis, then I think there is something very wrong with your application.

pg_repack will do more harm in the long run.  i.e., the entire time pg_repack is running, its transaction holds back xmin, thus creating more bloat everywhere else!
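
You can watch that effect while a repack runs by looking for long-running backends holding back the xmin horizon (just an illustrative query against pg_stat_activity):

  SELECT pid, state, xact_start, backend_xmin,
         age(backend_xmin) AS xmin_age,
         left(query, 60)   AS current_query
  FROM pg_stat_activity
  WHERE backend_xmin IS NOT NULL
  ORDER BY age(backend_xmin) DESC
  LIMIT 10;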

Bloat is overrated, especially in a transactional system where all your data access patterns should be well defined and should not be doing full table scans.  Just focus on periodically identifying bloated indexes and rebuilding those.  There should be no need to VACUUM FULL tables under normal circumstances.
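
Something along these lines covers the index side (assuming the pgstattuple extension is available; the index name is just an example):

  CREATE EXTENSION IF NOT EXISTS pgstattuple;

  -- Check how dense a suspect index actually is:
  SELECT avg_leaf_density, leaf_fragmentation
  FROM pgstatindex('public.orders_created_at_idx');

  -- Rebuild it without blocking writes (PostgreSQL 12+):
  REINDEX INDEX CONCURRENTLY public.orders_created_at_idx;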

Part of a properly maintained system is regular archiving/purging (whether that be dropping date-based partitions, or deleting old data from unpartitioned tables or from tables partitioned by something other than a date).

For example, I gave a list of tables (all intertwined via FK constraints) to the application support people, and they returned the list stating how many weeks or months of data to retain in each table.  Every Saturday night a cron job goes through and deletes the old data from them, and then "manually" vacuum-analyzes them.

No bloat...
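
Purely as an illustration, that Saturday-night job could look something like this with pg_cron (the table name and retention window are invented; an OS crontab running psql works just as well):

  SELECT cron.schedule('purge-orders', '0 23 * * 6', $$
      DELETE FROM public.orders
      WHERE created_at < now() - interval '13 weeks'
  $$);

  SELECT cron.schedule('vacuum-orders', '30 23 * * 6',
                       $$VACUUM ANALYZE public.orders$$);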

--
Death to America, and butter sauce.
Iraq lobster!


It really comes down to the life cycle of the given data, and I don’t think I would say regularly archiving/purging data is a requirement.  I have worked on all these types of systems;

i.e.:

1. Regulatory applications where the transactions needed to be kept in the system for at least 7 to 10 years. 
2. An application where legally the data needed to be removed from the system within 3 months (licensed data); new data was entering the system daily just as fast as it was exiting (total of 2TB of data).
3. A highly transactional system where the data was purged within 30 days; however, it was streamed in realtime to a datamart where it was kept forever and no purge cycle was defined.

Having a cron job to run vacuum-analyze manually is a security blanket.  It is not needed if autovacuum is tuned; however, it doesn’t hurt and it is a good CYA plan.  
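
For what it’s worth, tuning along these lines usually means per-table autovacuum storage parameters; a sketch (the table name and numbers are arbitrary examples):

  -- Make autovacuum visit a large, hot table sooner and let it do more work:
  ALTER TABLE public.orders SET (
      autovacuum_vacuum_scale_factor  = 0.02,
      autovacuum_analyze_scale_factor = 0.01,
      autovacuum_vacuum_cost_limit    = 1000
  );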

Needing to run VACUUM FULL on a regular basis is a symptom of a larger underlying issue with the system, and that’s my entire point. 
