Re: Perfomance of IN-clause with many elements and possible solutions

Jeff Janes <jeff.janes@xxxxxxxxx> · Tue, 1 Aug 2017 09:13:33 -0700

On Mon, Jul 31, 2017 at 12:29 PM, Dmitry Lazurkin <dilaz03@xxxxxxxxx> wrote:

    On 31.07.2017 19:42, Jeff Janes wrote:

        I think it is simply because no one has
          gotten around to implementing it that way.  When you can just
          write it as a values list instead, the incentive to make the
          regular in-list work better is not all that strong.

        Cheers,

        Jeff

    I see from explain that IN-clause uses just array with function ANY.
    I think for efficient implementation of this task I should implement
    new datatype "hashset". Am I wrong?

I think that HashSet is a Java-specific term.  It is just a hash table in which there is no data to store, just the key itself (and probably a cash of the hashcode of that key), correct?  PostgreSQL already has a template for in-memory hash tables, src/include/lib/simplehash.h (and also one for possibly-shared in-memory tables, src/backend/utils/hash/dynahash.c) , and you should be able to specialize it for the case there there is no data associated with the key.  I think the harder part would be to get the planner to use the hash table you implement.  You would also have to include code to fall back onto the array scanning for data types which do not have a hash method defined.

I think a more general solution would be to get the planner and executor to run the in-list query using the Hash Join, the same way it runs the in-VALUES one.

I was impressed at how well the JSON and hstore worked, you might want to look at how they do it.  It is must be using an internal hash table of some sort.  But those only support strings as keys, while the in-list has to support every data type, including user-defined-ones, so they have more opportunities for optimization.

Cheers,

Jeff