On Fri, Aug 31, 2012 at 1:22 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> Jeff Janes <jeff.janes@xxxxxxxxx> writes:
>> I wonder if we should be trying to drop duplicates at all.  I think
>> that doing that made a lot more sense before payloads existed.
>
> Perhaps, but we have a lot of history to be backwards-compatible with.
>
>> The docs said that the system "can" drop duplicates, so making it no
>> longer do so would be backwards compatible.
>
> Maybe compatible from a language-lawyerly point of view, but the
> performance characteristics would be hugely different - and since this
> complaint is entirely about performance, I don't think it's fair to
> ignore that.  We'd be screwing people who've depended on the historical
> behavior to accommodate people who expect something that never worked
> well before to start working well.
>
> The case that I'm specifically worried about is rules and triggers that
> issue NOTIFY without worrying about generating lots of duplicates when
> many rows are updated in one command.

Would those generally have an empty payload?  I would think they do,
but I have not yet used NOTIFY in anger.

>> Maybe drop duplicates where the payload was the empty string, but keep
>> them otherwise?
>
> Maybe, but that seems pretty weird/unpredictable.  (In particular, if
> you have a mixed workload with some of both types of notify, you lose
> twice: some of the inserts will need to scan the list, so that cost
> is still quadratic, but you still have a huge event list to dump into
> the queue when the time comes.)

But only empty-payload notifies would need to do searches, and you
would only need to search the set of channels that have already seen an
empty payload in that same subtransaction (tracked in an auxiliary data
structure), so the cost would be quadratic only if you used a huge
number of channels.

Perhaps an adaptive heuristic could be used: if we have sent 1000
notices in this subtransaction and the resulting queue is longer than,
say, 900, stop looking for any more duplicates, because there don't
seem to be many.  Even then, still check the most recent entry in the
queue, just not the whole list.  I think it would practically take an
act of malice for a workload to start out sending all-distinct messages
and then switch to sending mostly duplicates, with the duplicates
interleaved.  (A sketch of the auxiliary-structure idea appears below,
after the timing numbers.)

> The other thing we'd need to find out is whether that's the only problem
> for generating bazillions of notify events per transaction.  It won't
> help to hack AsyncExistsPendingNotify if dropping the events into the
> queue is still too expensive.  I am worried about the overall processing
> cost here, consumers and producers both.

AsyncExistsPendingNotify is head and shoulders above anything else, as
long as we can assume someone can do something meaningful with the
messages in constant and reasonable time.

The time in milliseconds to send x notifies in one subtransaction is
about (valid only for large x):

    t = 1.9e-05 * x^2

So sending 10,000,000 would take about 22 days just on the sending side
(1.9e-05 * (1e7)^2 ms = 1.9e9 ms, or roughly 22 days).

If I knock out AsyncExistsPendingNotify with the attached patch, I can
send 10,000,000 notifies in 50.1 seconds: about 10 seconds on the
notifier side pre-commit, and the rest post-commit for the notifier and
on the listen side.  (According to "top", most of the time seems to be
CPU time of psql serving as the listener.  According to opreport, it is
not.)
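To make the auxiliary-structure idea concrete, here is a minimal
standalone sketch.  This is just my illustration, not backend code and
not what the attached patch does; every name in it (should_queue_notify,
seen_empty, and so on) is made up.  Only empty-payload notifies are
checked for duplicates, against a hash set of channels that have already
seen an empty payload in the current subtransaction, so each check costs
O(1) rather than a scan of the whole pending list:

/*
 * Standalone sketch (compile with: cc sketch.c).  All names are
 * hypothetical; this is not PostgreSQL code.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 256

/* One channel already notified with an empty payload. */
typedef struct SeenChannel
{
    char   *channel;
    struct SeenChannel *next;
} SeenChannel;

/* Hash set of such channels; would be reset at subtransaction start. */
static SeenChannel *seen_empty[NBUCKETS];

static unsigned
chan_hash(const char *s)
{
    unsigned    h = 5381;

    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h % NBUCKETS;
}

/* Decide whether to append this notify to the pending list. */
static bool
should_queue_notify(const char *channel, const char *payload)
{
    unsigned    h;
    SeenChannel *sc;

    /* Notifies with a payload are never de-duplicated: no search at all. */
    if (payload != NULL && payload[0] != '\0')
        return true;

    /* Empty payload: O(1) membership test instead of scanning the list. */
    h = chan_hash(channel);
    for (sc = seen_empty[h]; sc != NULL; sc = sc->next)
    {
        if (strcmp(sc->channel, channel) == 0)
            return false;       /* duplicate empty notify: drop it */
    }

    /* First empty notify on this channel in this subtransaction. */
    sc = malloc(sizeof(SeenChannel));
    sc->channel = strdup(channel);
    sc->next = seen_empty[h];
    seen_empty[h] = sc;
    return true;
}

int
main(void)
{
    printf("%d\n", should_queue_notify("foo", ""));     /* 1: queued */
    printf("%d\n", should_queue_notify("foo", ""));     /* 0: dropped */
    printf("%d\n", should_queue_notify("foo", "x"));    /* 1: has payload */
    return 0;
}

With something like this, the per-notify cost no longer depends on how
long the pending list has grown; only the number of distinct channels
matters, which is the point of the argument above.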
For the benchmark numbers above, I used this as the listener:

    perl -le 'print "listen foo;"; print "select pg_sleep(1), now();" foreach 1..10000000' | psql > foo

And this as the notifier:

    select now();
    select count(pg_notify('foo', gen::text))
        from generate_series(1,10000000) as gen;

Time was measured from when the sender started sending to when the
listener finished writing out the async messages and hit the next
now().

Cheers,

Jeff
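P.S. I have not reproduced the patch inline, but "knocking out" the
check amounts to short-circuiting the duplicate scan in
src/backend/commands/async.c, roughly along these lines (the attached
file, not this sketch, is the authoritative version):

static bool
AsyncExistsPendingNotify(const char *channel, const char *payload)
{
    /*
     * Benchmarking hack: skip the O(N) scan of pendingNotifies, so no
     * duplicates are ever dropped and every notify is simply queued.
     */
    return false;
}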
Attachment: notice_dup.patch