Re: BRIN index worse than sequential scan for large search set

Mickael van der Beek <mickael.van.der.beek@xxxxxxxxx> · Fri, 24 Feb 2023 18:51:00 +0100

Hello Justin,
Thanks for the quick response!

The table may be dense, but the tuples aren't.  You're asking to return
1/1000th of the tuples, across the entire table.  Suppose there are ~100
tuples per page, and you need to read about every 10th page.  It makes
sense that it's slow to read a large amount of data nonsequentially.

Ah, of course, you're right!
I forgot that the BRIN indexes store ranges that are not fully covered by the row values and that PostgreSQL has to double-check (bitmap heap scan) ...
Would you thus advise to only use BRIN indexes for columns who's values are (1) monotonically increasing but also (2) close to each other?

I guess that in my use case, something like a roaring bitmap would have been perfect but I do not believe that those exist in PostgreSQL by default.
Btrees work well performance wise but are simply too large (> 400MB for 20M rows) even when the index fill factor is 100% (+/- 380 MB) for my use case as I need to index around 6B rows partitioned in roughly 3K buckets which would result in ~120GB of  Btree indexes which seems a bit much for simple existence checks.

Because it's necessary to check if the tuple is visible to the current
transaction.  It might be from an uncommited/aborted transaction.

Interesting, that's something I did not consider indeed.
I'm guessing that this is a cost brought by MVCC that you can't get around no matter the isolation level?

Mickael

On Fri, Feb 24, 2023 at 6:19 PM Justin Pryzby <pryzby@xxxxxxxxxxxxx> wrote:
On Fri, Feb 24, 2023 at 05:40:55PM +0100, Mickael van der Beek wrote:

> Hello everyone,

> 

> I'm playing around with BRIN indexes so as to get a feel for the feature.

> During my tests, I was unable to make BRIN indexes perform better than a

> sequential scan for queries searching for large value sets (20K values in

> the example down below).

> And now let's query 20K random rows from our 20M total rows:

I didn't try your test, but I think *random* is the problem/explanation.

> By default, this query will not use the BRIN index and simply run a 1.5s

> long sequential scan (hitting 700 MB) and a 2.47s hash join for a total

> 8.7s query time:

> https://explain.dalibo.com/plan/46c3191g8a6c1bc7

> If we force the use of the BRIN index using (`SET LOCAL enable_seqscan =

> OFF;`) the same query will now take 50s with 2.5s spent on the bitmap index

> scan (hitting 470 MB of data) and a whopping 42s on the bitmap heap scan

> (hitting 20 GB of data!):

> https://explain.dalibo.com/plan/7f73bg9172a8b226

That means the planner's cost model correctly preferred a seq scan.

> So I had the following two questions:

> 

>    1. Why is the BRIN index systematically worse than a sequential scan,

>    even when the table is x1000 larger than the search set, physically

>    pre-sorted, dense (fillfactor at 100%) and the search rows are themselves

>    sorted?

The table may be dense, but the tuples aren't.  You're asking to return

1/1000th of the tuples, across the entire table.  Suppose there are ~100

tuples per page, and you need to read about every 10th page.  It makes

sense that it's slow to read a large amount of data nonsequentially.

That's why random_page_cost is several times higher than seq_page_cost.

I would expect brin to win if the pages to be accessed were dense rather

than distributed across the whole table.

>    2. Since we only select the "idx" column, why does the BRIN index not

>    simply return the searched value if included in one of it's ranges?

>    Hitting the actual row data stored in the table seems to be unnessary no?

Because it's necessary to check if the tuple is visible to the current

transaction.  It might be from an uncommited/aborted transaction.

-- 

Justin

-- 
Mickael van der Beek
Web developer & Security analyst
mickael.van.der.beek@xxxxxxxxx