Yes, we do see statements being canceled from time to time, just not when we experience a lag situation with WALs not being applied. Is it entirely possible that there is simply too much work for one of the slaves to do (hence the degraded query throughput we see), so that it is unable to apply WALs? In other words, no conflicts occur, but WALs cannot be applied in a timely manner? There's a noticeable pattern that in times of lag, the overall number of slow queries dramatically increases (i.e. from a few hundred to tens of thousands).
1. If SQL statements are getting cancelled from time to time, then the max_standby_*_delay parameters are working as expected and there is no issue with them.
2. You need to look at the lag issue, which can occur for a number of other reasons.
How did you come to the conclusion that there is lag between the master and the slave? Did you check the last xlog position on the master and slave databases?
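One way to compare the positions, as a rough sketch: on 9.x releases you can run SELECT pg_current_xlog_location() on the master and SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location() on the slave, then compare the values. The small Python helper below is only an illustration of the arithmetic; the function names lsn_to_int and lag_bytes and the sample values are made up, and it assumes the "X/Y" hex format maps to a plain 64-bit byte position (true for 9.3 and later; on older releases the result is only approximate).

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an 'X/Y' hex xlog location into an absolute byte position.

    Assumption: X is the high 32 bits and Y the low 32 bits of a 64-bit
    position (PostgreSQL 9.3+ behaviour).
    """
    high, low = lsn.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(master_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby is behind the master (0 means caught up)."""
    return lsn_to_int(master_lsn) - lsn_to_int(standby_lsn)


if __name__ == "__main__":
    # Hypothetical sample values, not taken from any real system.
    master = "16/B374D848"   # e.g. pg_current_xlog_location() on the master
    replayed = "16/B374D000"  # e.g. pg_last_xlog_replay_location() on the slave
    print("replay lag in bytes:", lag_bytes(master, replayed))
```

If the receive location keeps up with the master but the replay location falls behind, the slave is getting the WAL but is slow to apply it; if the receive location itself falls behind, the problem is more likely in shipping the WAL.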
Lag can also occur when the load is very high. Do you see high IO wait on the master or the slave?
I did not understand what you meant by "from a few hundred to tens of thousands". Are you referring to the number of WALs or of SQL statements?
Do you see any other messages related to lag in the standby log file (like "WAL archive file not found.. cannot be restored..." etc.)?
Is the situation the same across all the slaves?
Venkata Balaji N
Sr. Database Administrator
Fujitsu Australia