Are vacuum and analyze happening regularly on the live system?
Yes. There's a nightly cron job which vacuums those it thinks needs it (though it's INSERT-heavy), and ditto ANALYZE.
Specifically, I ran ANALYZE on the prod table just before running the query.
Also, the sampling rate on all tables in both prod and test is 60000 rows.
What's an example query that uses indexes on test and does not on live?
SELECT COUNT(*) FROM sep_info_report_extract;
On prod, there's a list of "Parallel Seq Scan on xxxx_partname" records in the EXPLAIN output, while the test system has a list of "Parallel Index Only Scan using ..._idx" records.
(Yes, this is a simple COUNT(*) but it's a simple reproducer, which demonstrates the kind of problems we're having on much more complex queries.)
Does the live system show poor estimates when executing 'explain analyze select...' and the test system show semi-accurate row estimates?
They're within 2% of each other.
50 million seems to be a fairly low row count to be partitioned. What version is this on?
As explained in the OP, each record has a (sometimes large) XML record; months can have up to 240GB. Besides, partitioning makes for easy archiving.
--
Angular momentum makes the world go 'round.
Angular momentum makes the world go 'round.