On 6 January 2012 20:38, Samuel Gendler <sgendler@xxxxxxxxxxxxxxxx> wrote:
> On Fri, Jan 6, 2012 at 12:22 PM, Marc Eberhard <eberhardma@xxxxxxxxxxxxxx>
> wrote:
>> On 6 January 2012 20:02, Samuel Gendler <sgendler@xxxxxxxxxxxxxxxx> wrote:
>> > Have you considered doing the insert by doing a bulk insert into a
>> > temp table and then pulling rows that don't exist across to the final
>> > table in one query and updating rows that do exist in another query?
>> > I did a very brief scan of the SO thread and didn't see it suggested.
>> > Something like this:
>> >
>> > update stats_5mn set count = count + t.count
>> > from temp_table t
>> > where stats_5mn.t_value = t.t_value and stats_5mn.t_record = t.t_record
>> > and stats_5mn.output_id = t.output_id;
>> >
>> > insert into stats_5mn
>> > select * from temp_table t
>> > where not exists (
>> > select 1 from stats_5mn s
>> > where s.t_value = t.t_value and s.t_record = t.t_record
>> > and s.output_id = t.output_id
>> > );
>> >
>> > drop table temp_table;
>>
>> Am I right to assume that the update/insert needs to be placed into a
>> begin / end transaction block if such batch uploads might happen
>> concurrently? Doesn't seem to be the case for this question here, but
>> I like the solution and wonder if it works under more general
>> circumstances.
>
> Yes, assuming you are concerned about making the insertion atomic.
> Obviously, a failure in the second query after success in the first query
> would be problematic outside of a transaction, since any attempt to
> repeat the entire operation would result in repeated updates.

True, but I was more concerned about concurrency, where a second upsert inserts an element between the update and the insert of the first. That element would then be skipped by the first upsert, as it is neither updated (it doesn't exist at that point in time) nor inserted (it does exist at that later point). Or would that be impossible anyway? (A rough sketch of the transaction-wrapped version I have in mind is further below.)

>> What's the overhead of creating and dropping a temporary table? Is it
>> only worth doing this for a large number of inserted/updated elements?
>> What if the number of inserts/updates is only a dozen at a time for a
>> large table (>10M entries)?
>
> Pretty minimal, but enough that doing a handful of rows at a time probably
> wouldn't be worth it. You'd surely get index usage on a plain insert in
> such a case, so I'd probably just use an upsert stored proc for doing
> small numbers of rows - unless you are doing large numbers of inserts,
> just a few at a time. In that case, I'd try to accumulate them and then
> do them in bulk. Those are tough questions to answer without a specific
> context. My real answer is 'try it and see.' You'll always get an answer
> that is specific to your exact circumstance that way.

It's a fairly tricky problem. I have a number of sensors producing energy data about every 5 minutes, but at random intervals between 1 and 15 minutes. I can't change that, as that's the way the hardware of the sensors works. These feed into another unit, which accumulates them and forwards them in batches over the Internet to my PostgreSQL database server every few minutes (again at random times outside my control and with random batch sizes). To make things worse, if the Internet connection between the unit and the database server fails, it will send the latest data first to provide a quick update to the current values, and then send the backlog of stored values. Thus, data do not always arrive in correct time order.
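Coming back to the transaction question from above, here is roughly what I had in mind (just an untested sketch, reusing the table and column names from your example; the explicit lock is one possible way to serialise competing batches, though I'm not sure it is actually needed):

BEGIN;

-- Optional: stop a concurrent batch from inserting rows between the
-- UPDATE and the INSERT below. SHARE ROW EXCLUSIVE conflicts with itself,
-- so two sessions taking this lock run one after the other, while plain
-- readers are not blocked.
LOCK TABLE stats_5mn IN SHARE ROW EXCLUSIVE MODE;

-- Update the rows that already exist in the target table.
UPDATE stats_5mn
   SET count = stats_5mn.count + t.count
  FROM temp_table t
 WHERE stats_5mn.t_value   = t.t_value
   AND stats_5mn.t_record  = t.t_record
   AND stats_5mn.output_id = t.output_id;

-- Insert the rows that do not exist yet.
INSERT INTO stats_5mn
SELECT t.*
  FROM temp_table t
 WHERE NOT EXISTS (SELECT 1
                     FROM stats_5mn s
                    WHERE s.t_value   = t.t_value
                      AND s.t_record  = t.t_record
                      AND s.output_id = t.output_id);

COMMIT;

DROP TABLE temp_table;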
At the moment I only look at the latest data for each sensor, and these should be as close to real time as possible. Thus, collecting data for some time to get a larger batch size isn't preferable.

What I want to do, and this is where the upsert problem starts, is to build a table with energy values at fixed times. These should be calculated as a linear interpolation between the nearest reported values from the sensors. Please note that each sensor reports a measured energy value (not instantaneous power), which always increases monotonically with time. To compare the performance of the different devices being measured, I need to have the energy values at the same times, not at the random times when the sensors report. This also allows the calculation of average power for the devices by taking the difference of the energy values over longer periods, like 30 minutes.

What I simply haven't got my head around is how to do this in an efficient way. When new values arrive, the table of interpolated values needs to be updated. For some times there will already be values in the table, but for other times there won't. Thus, the upsert.

If there was a communication failure, the reported sensor times will go backwards, as the most recent is transmitted first until the backlog is cleared. In that case the interpolation table will first be populated with intermediate values derived from the insert with the latest timestamp, and these values will then be refined by the backlog data as they trickle in. Under normal circumstances, reported timestamps will be monotonically increasing and the interpolation table will simply extend to later times.

There are more reads from the interpolation table than updates, as there are many clients watching the data live via a COMET web frontend (or rather there will be, once I get this working).

I could try to code all of this in the application code (Tomcat servlets in my case), but I'd much rather find an elegant way to let the database server populate the interpolation table from the inserted sensor values. I can find the nearest relevant entries in the interpolation table to be upserted by using date_trunc() on the timestamp from the sensor value. But I then also need to find the closest sensor values in the database with an earlier and possibly later timestamp around the fixed times in the interpolation table. Sometimes a new value will result in an update and sometimes not; sometimes a new value needs to be added to the interpolation table and sometimes not. (A rough sketch of the interpolation query I have in mind is at the end of this mail.)

I know I'm pushing SQL a bit hard with this type of problem, but doing it in the application logic would result in quite a few round trips between the database server and the application code. It's sort of an intellectual challenge for me to see how much I can offload onto the database server. Thus, my interest in batch upserts. Another reason is that I don't want to hold any state or intermediate data in the application code. I want this in the database, as it is much better at storing things persistently than my own code could ever be. It was designed to do that properly, after all!

> By the way, there is definitely a difference between creating a temp
> table and creating a table temporarily. See the postgres docs about temp
> tables.

Yes, I'm aware of that and meant a temporary/temp table, but being old fashioned I prefer the long form, which is also valid syntax. From the docs (v9.1.2):

CREATE ... { TEMPORARY | TEMP } ... TABLE
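To make the interpolation part a bit more concrete, this is the rough shape of the query I'm thinking of (again only an untested sketch; the readings(sensor_id, ts, energy) table and the :sensor_id / :grid_ts placeholders are made-up names for illustration):

WITH prev_r AS (
    -- closest reported value at or before the fixed grid time
    SELECT ts, energy
      FROM readings
     WHERE sensor_id = :sensor_id
       AND ts <= :grid_ts
     ORDER BY ts DESC
     LIMIT 1
), next_r AS (
    -- closest reported value after the fixed grid time
    SELECT ts, energy
      FROM readings
     WHERE sensor_id = :sensor_id
       AND ts > :grid_ts
     ORDER BY ts
     LIMIT 1
)
-- Linear interpolation between the two readings bracketing the grid time;
-- returns no row if the grid time is not bracketed yet.
SELECT p.energy
       + (n.energy - p.energy)
         * EXTRACT(EPOCH FROM (:grid_ts - p.ts))
         / EXTRACT(EPOCH FROM (n.ts - p.ts)) AS energy_at_grid_ts
  FROM prev_r p, next_r n;

The result of something like this would then feed the upsert into the interpolation table, either through the temp table approach from above or through a small stored procedure, but as I said, I haven't worked out the efficient way to do that yet.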
Thanks,
Marc