Re: Database design - best practice

Willem Leenen <willem_leenen@xxxxxxxxxxx> · Wed, 28 Nov 2012 13:27:18 +0000

 In relation to question 2, I read somewhere in the documentation that 
because of MVCC, the whole row has to be rewritten even though I just 
update one single column in that row. Hence if the table is wider (has 
more columns), the update will be slower. Does this match your 
understanding?

No. I don't count the number of rows, but number of blocks (pages) that are modified, which are 8K each. 

My advice would be to first establish a solutiondirection via diagnosing the problem. My experience is that most solutions are not obscure at all.

Subject: Re:  Database design - best practice
From: nielskristian@xxxxxxxxxxxxx
Date: Wed, 28 Nov 2012 14:20:27 +0100
CC: pgsql-performance@xxxxxxxxxxxxxx
To: willem_leenen@xxxxxxxxxxx

Thanks for the advice.
Currently I see a lot of I/O related to update/inserts, so I'm trying to track down these guys at first. In relation to question 2, I read somewhere in the documentation that because of MVCC, the whole row has to be rewritten even though I just update one single column in that row. Hence if the table is wider (has more columns), the update will be slower. Does this match your understanding?

Den 28/11/2012 kl. 14.10 skrev Willem Leenen <willem_leenen@xxxxxxxxxxx>:

Niels,

" I can't see why it would make sense to put that into a separate table and join in the values " 
You don't normalize for performance. People DEnormalize for performance.

Questions: (AFAIK)

1) This is a way to disaster. Get yourself a book on RDBMS from for example Celko. Do NOT go against the flow of the RDBMS rules, as here in rule #1 atomic values of a column. 

2) This is not the big fish you are after. First benchmark your setup and compare the results with your desired performance level. First quantify your problem, if there is any, before using tricks.

3) A row will need more memory when it is wider, this may be amplified during hash joins. 

4) People DEnormalize for performance. 

5) " Is it significantly faster to select * from a table with 20 columns, than selecting the same 20 in a table with 150 columns?" 

I know the answer, but i encourage you to simply test this. I have seen lot's of urban legends about performance ( including the dropping of the referential integrity be cause that would make a difference.... ). 
Of course , when it's a full table scan, and it are ALL disk reads, (or ALL memory reads_) you can simply calculate it too. But just get into the habit of  testing for learning.

My advice:
- know what performance you need.
- test if you have this, varying tablecontent and systemload
- do not tamper with the RDBMS rules, this will haunt you.
- if you have the latest postgres version, you can use covering indexes: tables aren't accessed at all, bypassing most of your questions. Check with peers if you've got the indexes right.

Regards,
Willem

> From: nielskristian@xxxxxxxxxxxxx
> Subject:  Database design - best practice
> Date: Wed, 28 Nov 2012 13:41:14 +0100
> To: pgsql-performance@xxxxxxxxxxxxxx
> 
> Hi,
> 
> I'm on the hunt for some solid knowledge on a theoretical level about the performance of postgresql. My question is regarding best practices, and how architectural decisions might influence the performance. First a little background:
> 
> The setup:
> I have a database which holds informations on used cars. The database has mainly 3 tables of interest for this case:
> A cars table, an adverts table and a sellers table. One car has many adverts and one seller has many adverts. One advert belongs to one car and one seller.
> The database is powering a website for searching used cars. When searching for used cars, the cars table is mainly used, and a lot of the columns should be directly available for searching e.g. color, milage, price, has_automatic_transmission etc.
> 
> So my main concern is actually about the cars table, since this one currently has a lot of columns (151 - I expect thats quite a lot?), and a lot of data (4 mil. rows, and growing). Now you might start by thinking, this could sound like a regular need for some normalization, but wait a second and let me explain :-)
> The columns in this table is for the most very short stings, integers, decimals or booleans. So take for an example has_automatic_transmission (boolean) I can't see why it would make sense to put that into a separate table and join in the values. Or the milage or the price as another example. The cars table used for search is indexed quite a lot.
> 
> The questions:
> Having the above setup in mind, what impact on performance, in terms of read performance and write performance, does it have, whether I do the following:
> 1) In general would the read and/or the write on the database be faster, if I serialized some of the not searched columns in the table into a single text columns instead of let's say 20 booleans?
> 2) Lets say I'm updating a timestamp in a single one of the 151 columns in the cars table. The update statement is using the id to find the car. Would the write performance of that UPDATE be affected, if the table had fewer columns?
> 3) When adding a new column to the table i know that it becomes slower the more rows is in the table, but what about the "width" of the table does that affect the performance when adding new columns?
> 4) In general what performance downsides do you get when adding a lot of columns to one table instead of having them in separate tables?
> 5) Is it significantly faster to select * from a table with 20 columns, than selecting the same 20 in a table with 150 columns?
> 
> Hope there is some good answers out there :-)
> 
> -- 
> Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance