On 2015-06-05 16:56:18 -0400, Tom Lane wrote:
> Andres Freund <andres@xxxxxxxxxxx> writes:
> > On June 5, 2015 10:02:37 PM GMT+02:00, Robert Haas <robertmhaas@xxxxxxxxx> wrote:
> >> I think we would be foolish to rush that part into the tree. We
> >> probably got here in the first place by rushing the last round of
> >> fixes too much; let's try not to double down on that mistake.
>
> > My problem with that approach is that I think the code has gotten
> > significantly more complex in the last few weeks. I have very little
> > trust that the interactions between vacuum, the deferred truncations
> > in the checkpointer, the state management in shared memory and
> > recovery are correct. There are just too many non-local subtleties
> > here.
>
> > I don't know what the right thing to do here is.
>
> My gut feeling is that rushing to make a release date is the wrong thing.
>
> If we have confidence that we can ship something on Monday that is
> materially more trustworthy than the current releases, then let's aim to
> do that; but let's ship only patches we are confident in. We can do
> another set of releases later that incorporate additional fixes. (As some
> wise man once said, there's always another bug.)

I've tortured hardware a fair bit with HEAD. So far it looks much better
than 9.4.2+ et al. I've noticed a bunch of, to me at least, new issues:

1) The autovacuum trigger logic isn't perfect yet. In particular, with
   autovacuum=off you can get into situations where emergency vacuums
   aren't started when necessary. This is particularly likely to happen
   if either very large multixacts are used, or if the server has been
   shut down while emergency autovacuums were happening. No corruption
   ensues, but it's not easy to get out of.

2) I've managed to corrupt a cluster when a standby performed
   restartpoints less frequently than the master performed checkpoints.
   Because truncations happen in the checkpointer it's not that hard to
   end up with entirely full multixact SLRUs.
   This is a problem on several fronts. We can IIUC end up truncating
   away the wrong data, and we can be in a bad state upon promotion.
   None of that is new.

3) It's really confusing that truncation (and thus the update of the
   limits in shared memory) happens in checkpoints. If you hit a limit
   and manually do all the necessary vacuums you'll see a "good" limit in
   pg_database.datminmxid, but you'll still run into the error. You have
   to manually force a checkpoint for the truncation to actually happen.
   That's particularly problematic because larger installations, where I
   presume wraparound issues are more likely, often have a large
   checkpoint_timeout setting.

Since none of these are really new, I don't think they should prevent us
from doing a back branch release. While I'm still not convinced we're
better off with 9.4.4 than with 9.4.1, we're certainly better off than
with 9.4.[23] et al.

If we want to go ahead with the release I plan to do a bit more testing
today and tomorrow. If not, I'm first going to continue working on fixing
the above.

I have a "good" fix for 1). I'm not 100% sure I'll feel confident about
pushing it if we wrap today. I am wondering if we shouldn't at least apply
the portion that unconditionally sends a signal in the ERROR case. That's
still an improvement.

One more thing: Our testing infrastructure sucks. Without writing C code
it's basically impossible to test wraparounds and such. Even if not
particularly useful for non-devs, I really think we should have functions
for burning xids/multixacts in core. Or at least in some extension.

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)