On 5-Jan-07, at 9:51 PM, Guy Rouillier wrote:
I've regained access to my test system. I ran another test with the same
input data set. This time I put pg_xlog on a different RAID volume (the
unused one that I suspect is a software RAID), and I set fsync=off in
postgresql.conf. I left the rest of the configuration alone (all foreign
keys removed, etc.). Unfortunately, this only dropped elapsed time down to
about 28000 seconds (from 30000), still significantly more than BigDBMS.
Additional info inline below.
Shoaib Mir wrote:
Here are a few recommendations that might help you:
- You will need to do table partitioning
(http://www.postgresql.org/docs/current/static/ddl-partitioning.html)
as you are storing quite a lot of data in one table per day.
I'm focusing on the detailed perspective for now. The 144 files
I'm processing represent not even two hours of data, so that surely
wouldn't be split up.
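(Still, since production starts new tables each month anyway, here is a minimal sketch of the inheritance-based partitioning the linked 8.2 docs describe. Table and column names are made up, since the real schema isn't shown in this thread, and routing inserts to the right child would still need a rule or trigger:)

CREATE TABLE stats_normalized (
    device_id   integer   NOT NULL,
    sample_time timestamp NOT NULL,
    bytes_in    bigint    DEFAULT 0,
    bytes_out   bigint    DEFAULT 0
);

-- one child table per month, mirroring the "new tables each month" scheme
CREATE TABLE stats_normalized_2007_01 (
    CHECK (sample_time >= '2007-01-01' AND sample_time < '2007-02-01')
) INHERITS (stats_normalized);

-- with constraint_exclusion = on, a query constrained on sample_time only
-- touches the child tables whose CHECK constraints can match
SELECT device_id, sum(bytes_in)
  FROM stats_normalized
 WHERE sample_time >= '2007-01-05' AND sample_time < '2007-01-06'
 GROUP BY device_id;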
- You are using a RAID5 setup, which can also affect performance, so
switching to RAID1 might help you there; but then again, you have a RAID5
with 12 disks, so that shouldn't be that much of a problem.
Agreed.
- Have you done the tuning of the postgresql.conf parameters? If not, you
really need to do this for things like checkpoint_segments,
random_page_cost, shared_buffers, effective_cache_size, max_fsm_pages,
vacuum_cost_delay, work_mem, the bgwriter settings, etc. You can get good
advice for tuning these parameters at http://www.powerpostgresql.com/PerfList/
The box has 3 GB of memory. I would think that BigDBMS would be
hurt by this more than PG. Here are the settings I've modified in
postgresql.conf:
As I said, you need to set shared_buffers to at least 750MB; that is the
starting point, and it can actually go higher. Additionally,
effective_cache_size should be set to about 2.25 GB. Turning fsync off is
not a real-world situation. Additional tuning of the file systems can
provide some gain; however, as Craig pointed out, some queries may need to
be tweaked.
autovacuum=on
stats_row_level = on
max_connections = 10
listen_addresses = 'db01,localhost'
shared_buffers = 128MB
work_mem = 16MB
maintenance_work_mem = 64MB
temp_buffers = 32MB
max_fsm_pages = 204800
checkpoint_segments = 30
redirect_stderr = on
log_line_prefix = '%t %d'
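(For illustration only, roughly what Dave's suggestions would look like
next to the settings above; 750MB and 2.25 GB are his starting points for
this 3 GB box, not measured optima:)

shared_buffers = 750MB          # suggested starting point; can go higher
effective_cache_size = 2250MB   # roughly 2.25 GB, most of the 3 GB box
fsync = on                      # fsync=off is not a real-world configuration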
- For autovacuuming you need to tune the thresholds properly, so that
vacuum and analyze run at the right time without affecting database server
performance. (You can find help for this at
http://www.postgresql.org/docs/current/static/routine-vacuuming.html under
"22.1.4. The auto-vacuum daemon".)
The real-life load on this database would be fairly constant
throughout the day. Stats from network devices are received every
15 minutes from each device, but they are staggered. As a result,
the database is almost constantly being updated, so there is no
dead time to do vacuums.
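(Purely as an illustration of the knobs involved when there is no idle
window for manual vacuums: these are the global autovacuum settings in 8.2,
with example values that are not tuned for this workload; per-table
thresholds can also be stored in the pg_autovacuum catalog in 8.2. The
cost-delay setting is what keeps a vacuum from swamping a constantly
updated database:)

autovacuum = on
autovacuum_naptime = 60               # seconds between checks of a database
autovacuum_vacuum_scale_factor = 0.2  # fraction of a table changed before vacuum
autovacuum_analyze_scale_factor = 0.1 # fraction changed before analyze
autovacuum_vacuum_cost_delay = 20     # ms sleep per cost batch, throttles vacuum I/O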
- You will need to separate your transaction logs, i.e. the pg_xlog folder,
onto a different drive than your database drive. This can be done by
creating a symlink for the pg_xlog folder.
Done, see opening remarks. Unfortunately minor impact.
- I hope you are doing proper connection pool management, because good use
of database connections can really affect overall performance; connections
can be expensive to create, and they consume memory if they are not
properly closed.
I probably should have mentioned this originally but was afraid of
information overload. The application runs on JBoss and uses JBoss
connection pools. So connections are pooled, but I don't know how
they would compare to native PG connection pools. Essentially,
JBoss gets native JDBC connections, and the pools simply allow them
to be re-used without opening and closing each time. So if the
native PG connection pools provide any pooling optimizations beyond
that, those advantages are not being realized.
The PG connection pools will not help; they do not currently provide any
extra optimization.
Dave
Hope that helps your tests...
Thanks to everyone for providing suggestions, and I apologize for
my delay in responding to each of them.
----------------
Shoaib Mir
EnterpriseDB (www.enterprisedb.com)
On 12/28/06, Guy Rouillier <guyr-ml1@xxxxxxxxxxxxx> wrote:
I don't want to violate any license agreement by discussing performance, so
I'll refer to a large, commercial PostgreSQL-compatible DBMS only as
BigDBMS here.
I'm trying to convince my employer to replace BigDBMS with PostgreSQL for
at least some of our Java applications. As a proof of concept, I started
with a high-volume (but conceptually simple) network data collection
application. This application collects files of 5-minute usage statistics
from our network devices, and stores a raw form of these stats into one
table and a normalized form into a second table.

We are currently storing about 12 million rows a day in the normalized
table, and each month we start new tables. For the normalized data, the app
inserts rows initialized to zero for the entire current day first thing in
the morning, then, throughout the day as stats are received, executes
updates against the existing rows. So the app has very high update
activity.
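To make that pattern concrete, a rough sketch with hypothetical table and
column names (the actual schema isn't shown here): one zeroed row per
device per 5-minute slot is created up front, and each arriving stats file
then turns into updates against those rows.

-- first thing in the morning: pre-create zeroed rows for the whole day
INSERT INTO stats_normalized (device_id, sample_time, bytes_in, bytes_out)
SELECT d.device_id, t.sample_time, 0, 0
  FROM devices d
 CROSS JOIN todays_5min_slots t;

-- throughout the day, each arriving stats record becomes an update
UPDATE stats_normalized
   SET bytes_in = 123456, bytes_out = 654321
 WHERE device_id = 42
   AND sample_time = '2006-12-28 10:05:00';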
In my test environment, I have a dual-x86 Linux platform running the
application, and an old 4-CPU Sun Enterprise 4500 running BigDBMS and
PostgreSQL 8.2.0 (only one at a time). The Sun box has 4 disk arrays
attached, each with 12 SCSI hard disks (a D1000 and 3 A1000, for those
familiar with these devices). The arrays are set up with RAID5. So I'm
working with a consistent hardware platform for this comparison. I'm only
processing a small subset of files (144).
BigDBMS processed this set of data in 20000 seconds, with all foreign keys
in place. With all foreign keys in place, PG took 54000 seconds to complete
the same job. I've tried various approaches to autovacuum (none, 30
seconds) and it doesn't seem to make much difference. What does seem to
make a difference is eliminating all the foreign keys; in that
configuration, PG takes about 30000 seconds. Better, but BigDBMS still has
it beat significantly.
I've got PG configured so that the system database is on disk array 2, as
are the transaction log files. The default tablespace for the test database
is disk array 3. I've got all the reference tables (the tables to which the
foreign keys in the stats tables refer) on this array. I also store the
stats tables on this array. Finally, I put the indexes for the stats tables
on disk array 4. I don't use disk array 1 because I believe it is a
software array.
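That layout corresponds to something like the following (paths, the
database name, and the table name are hypothetical; the actual mount points
aren't given here):

CREATE TABLESPACE array3 LOCATION '/array3/pgdata';
CREATE TABLESPACE array4 LOCATION '/array4/pgdata';

-- reference and stats tables default to array 3 for the test database
ALTER DATABASE statsdb SET default_tablespace = 'array3';

-- stats-table indexes go to array 4
CREATE INDEX stats_normalized_time_idx
    ON stats_normalized (device_id, sample_time)
    TABLESPACE array4;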
I'm out of ideas how to improve this picture any further. I'd
appreciate some suggestions. Thanks.
--
Guy Rouillier
---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings