On 2020-02-18 19:07:44 +0000, Tom Mercha wrote: > On 23/12/2016 13:41, Peter J. Holzer wrote: > > On 2016-12-09 16:52:05 +0800, Qiu Xiafei wrote: > >> I'm new to PG and want to implement my domain-specific system based on PG. I > >> wish to arrange my data as several tables in database and translate my DSL into > >> SQL statements for query. Since one DSL statement may be mapped to several SQL > >> statements, it's better to push the DSL server as close to the PG server as > >> possible. I found PG's backgroud worker meet my needs. I can setup a background > >> worker bounded to PG server and listen to a port for network requests. > >> > >> But I encounter a problem that the Server Programing Interfaces are not THREAD > >> SAFE. There are some global variables defined like: SPI_processed, > >> SPI_tuptable, etc. This limit to my DSL server to work in single thread mode > >> which is quite inefficient. > > > > I had a similar requirement. I solved it by moving the application logic > > out of the stored procedures. All the stored procedure does is an RPC > > call (I use ØMQ for that) to a server process and send the result back > > to the client. The server process converts the request into multiple SQL > > queries which can be processed in parallel. > > > > The downside is of course that the communication overhead is much > > higher (A minimum of 4 network messages per request). That's not a > > problem in my case, but you mileage may vary. > > > > The advantages in my opinion are: > > > > * A standalone server process is easier to test and debug than a bunch > > of stored procedures. > > * I can easily scale out if necessary: Currently my database and server > > process run on the same machine, but I could distribute them over > > several machines with (almost) no change in logic. > > Sorry to revive such an old topic. I am facing a similar requirement > where I am running multiple queries concurrently. Like Qiu Xiafei, I am > looking at SPI, and dynamic background workers. In particular, I am > using SPI_execq(...) on each dynamic background workers I spawn. What I > am experiencing is that I am not seeing a speedup, and I am beginning to > wonder if I have done something wrong, if the overheads are too big, or > if there are some limitations I am not aware of. > > As I see that none of the comments here make much of a reference to > performance/speedup, would you be so kind as to tell me how satisfied > you were with performance? Any insights would be greatly appreciated. The speedup depends very much on how evenly you can divide up your workload between worker processes. Let's assume that we have a query which takes 10 seconds and 8 worker processes. If you can't split that query, there is no speedup. 1 worker will be busy for 10 seconds, the others will be idle. If you can split it into 20 queries which will run for 0.5 seconds each, we will be finished in 1.5 seconds (run 8 queries in the first 0.5 seconds, another 8 in the second, and then the last 4). If you can split the query, but the runtimes of the subqueries will be very different (e.g. they will run for 10, 5, 2, 1, 0.5 and 15 times 0.1 seconds), then the total run time will be close to the longest-running subquery - in this case 10 seconds. All the workers will start busy but at some point they run out of work and have to wait for that single slow subquery. So the speedup really depends a lot on how smartly you can divide up your workload. And this is unfortunately something which varies a lot: Firstly, our data is skewed. We have a lot more data about Austria than Afghanistan, so simply partitioning by country doesn't give us equal cost queries. Secondly, even with SSDs, access to permanent storage is still a lot slower than access to RAM, so your access times may change unpredictably depending on whether the data you access is hot or cold. Which brings me to another blessing and/or curse of this approach: Having a central server process outside of postgresql makes it easy to cache (partial) results. So very often we already have the data in our application cache and can generate a response in milliseconds. Which is good. However, when we don't have the data and have to ask the database, it is quite likely that it isn't in the cache of the database or even the OS either, so we have to hit the disks - which is slow. So in conclusion: We are seeing a substantial speedup. But it is definitely not linear in the number of worker processes, and the effect of caching is much greater than that of parallelization. And we pay for that by a large variation in query times (between milliseconds and tens of seconds). hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@xxxxxx | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Attachment:
signature.asc
Description: PGP signature