Re: How to get good performance for very large lists/sets?

Jeff Janes <jeff.janes@xxxxxxxxx> · Mon, 6 Oct 2014 12:17:36 -0700

On Mon, Oct 6, 2014 at 1:02 AM, Richard Frith-Macdonald <richard.frith-macdonald@xxxxxxxxxxxxxxxx> wrote:
I'm wondering if anyone can help with advice on how to manage large lists/sets of items in a postgresql database.

I have a database which uses multiple  lists of items roughly like this:

CREATE TABLE List (

  ID SERIAL,

  Name VARCHAR ....

);

and a table containing individual entries in the lists:

CREATE TABLE ListEntry (

  ListID INT, /* Reference the List table */

  ItemID INT /* References an Item table */

) ;

CREATE UNIQUE INDEX ListEntryIDX ON ListEntry(ListID, ItemID);

Now, there are thousands of lists, many with millions of entries, and items are added to and removed from lists in an unpredictable way (in response to our customer's actions, not under our control).  Lists are also created by customer actions.

How is concurrency handled?  If one person adds something to a list at the same time as someone else removes something from the same list, is there any interaction or do they just pass through each other cleanly?  Are lists ever rewritten in their entirety, or just incrementally have items added and removed one at a time?

What is the distribution of the number of lists any given entry is in?

Finding whether a particular item is in a particular list is reasonably fast, but when we need to do things like find all the items in list A but not list B things can get very slow (particularly when both lists contain millions of common items).

One possibility would be to have an Entry table where each one has an array of lists which contain it, with a gin index.

Then "in A but not B" could use the gin index to visit each Item in A, and then could immediately see if it was also in B without needing to do additional index look ups.  However, maintaining the gin index could easily become a substantial bottleneck.

I think that server won't use index-only scans because, even in cases where a particular list has not had any recent changes, the ListEntry table will almost always have had some change (for one of the other lists) since its last vacuum.

Perhaps creating multiple ListEntry tables (one for each list) would allow better performance; but that would be thousands (possibly tens of thousands) of tables, and allowing new tables to be created by our clients might conflict with things like nightly backups.

Reasoning about this situation would be a lot easier if you gave example query, with the explain (analyze, buffers), and a description of what plan you want it to use instead.  It sounds like you might want a single index-only scan to supply both branches of a hash semi join simultaneously, which I don't think is implemented even if the visibility map is in good shape.

I don't think PostgreSQL will ever demote an index-only scan to a plain index scan just because it thinks the index-only part won't be useful enough.  But it might switch to an entirely different plan which is not eligible to use an index-only scan in the first place because, given the reduced usefulness of the index only scan, the other one looks cheaper.  

So what you could do is vacuum a quiescent test database (do it several times just for good measure as sometimes on some versions the first pass isn't enough to get the vm all set), and then run the query.  If you now get a index only scan, that means you might need to vacuum your production database more aggressively.  

Cheers,

Jeff