Re: PG 9.3 materialized view VS Views, indexes, shared memory

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



(sorry for top-posting, gmail does not help.)

Thanks to your advice Jim, I have done an other test :
No partial indexes, just a partial index on boolean columns does the job.  (I get same perfs as MV)
CREATE INDEX ..ON (BoolColumnX) WHERE BoolColumnX IS TRUE

Then VIEW = 
SELECT colA....colZ 
FROM huge_table
WHERE BoolColumnX IS TRUE

Then this only index is used 800times (for each bool col)  and saves place as it does'nt indexes NULL values, and does no replicate. subsets. Moreover the huge indexes are allways loaded in cache memory.



Nicolas PARIS

2015-03-06 21:26 GMT+01:00 Nicolas Paris <niparisco@xxxxxxxxx>:
NULL values do not take place if only one other column are null for that row.
Boolean takes 1 byte wheras smallint 2bytes.
Then the space problem is not anymore a problem with boolean columns 95% empty

One thing that is really great with postgresql is transaction for drop table cascade, that allow te restore all stuf index, views on a rollback if problem in loading appears.
I hope using one transaction to drop/load many table is not a performance issue ?

Nicolas PARIS

2015-03-06 11:25 GMT+01:00 Nicolas Paris <niparisco@xxxxxxxxx>:
Thanks Jim,

My suggestion is to test using nothing but plain views and plain indexes on the base table
Actualy the way I choose subset rows for views is complicated in terms of query. Then using simple views without partial indexes is terrible in terms of performance (I have tested that).

You mean the materialized views, right?
​Well I have tested matviews, views without partial indexes, views with hashjoin on a key, ..
 
I think you'll ultimately be unhappy trying to go down this route, for the reasons you mention, plus the very large amount of extra space you'll be using. 800 booleans is 800 extra bytes for every row in your fact table. That's a lot. Even if you used a bitmap instead (which means you have to mess around with tracking which bit means what and probably other problems as well) you're still looking at 100 bytes per row. That's nothing to sneeze at.

​Since each subset is about 5% (this number is decreasing when number of views increase) of the fact table, most boolean rows are null. This means 5% of 800 extra bytes, right ? I have choosen smallint, because of bulk load decrease csv size (true VS 1).

For the views that are too slow, look at what the expensive part of the view and materialize *only that*.
It would be great if I could, but all will be automatic, then It will be difficult to apply such rules that demands human analyse, and manual database modification, for one subset


Hope I have well anderstand you 
 

Nicolas PARIS

2015-03-06 10:32 GMT+01:00 Jim Nasby <Jim.Nasby@xxxxxxxxxxxxxx>:
On 3/6/15 2:16 AM, Nicolas Paris wrote:
    If you have that many different views I doubt you want that many
    indexes anyway.

​It's a datawarehouse, then each view is used by many user for each query.
Those views must be subset of the huge material table. All indexes are
needed

Yes, but they don't have to be partial.

    ​Have you tried just hitting the base table and indexes directly,
    either through plain views or just direct SQL?

​ I have tried each. The performances are worst querying on a subset
(the views) than querying on whole huge table when using the huge indexes

You mean the materialized views, right? If so, that makes sense: Instead of having all your users hitting one common set of data (your fact table) you had them hitting a bunch of other data (the mat views). But you still had other stuff hitting the fact table. So now you were dealing with a lot more data than if you just stuck to the single fact table.

=> this is the solution I am implementing. (800 is not true, but in 10
years it maybe will be)

In 10 years we'll all be using quantum computers anyway... ;P

​ Actually, I have added a boolean column on the huge table for each
views​. This is the way each view is a subset of huge table (Create View
as Select  * FROM hugeTable WHERE columnX is true --etc 800 times). Then
I create 800partials indexes on that column(create index...WHERE columnX
is TRUE), for each view.
This works great as the query planer chooses the partials indexes when
querying the little subset of the terrific table (potential 20bilion rows)

This is better than material views for some reasons :
- saves places on hard drive (columnX is boolean +same indexes - data
for MatViews)
- saves time generating materialised views

But this isn't better than the mat views because of a bunch of booleans; it's better because it means less stain on the disk cache.

This is quite more complicated because in the project, the number of
view is increasing, and dynamic then :
- then adding new mat views is simple
- adding new views => adding new column on the huge table. It can take
long time to update boolean for each tuple. Then I need to truncate/bulk
load all data each time I add a new View. Other problem is dynamic
number column table was a bit tricky to implement in an ETL soft such
Talend, but the benefits are I hope great.

I think you'll ultimately be unhappy trying to go down this route, for the reasons you mention, plus the very large amount of extra space you'll be using. 800 booleans is 800 extra bytes for every row in your fact table. That's a lot. Even if you used a bitmap instead (which means you have to mess around with tracking which bit means what and probably other problems as well) you're still looking at 100 bytes per row. That's nothing to sneeze at.

My suggestion is to test using nothing but plain views and plain indexes on the base table. I expect that some of those views will not perform adequately, but many (or most) of them will be fine. For the views that are too slow, look at what the expensive part of the view and materialize *only that*. I suspect you'll find that when you do that you'll discover that several views are slow because of the same thing, so if you materialize that one thing one time you can then use it to speed up several views.

Using that approach means you'll have a lot less data that you have to read.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux