Re: n_distinct off by a factor of 1000

Ron <ronljohnsonjr@xxxxxxxxx> · Tue, 23 Jun 2020 07:51:23 -0500



    Maybe I missed it, but did you run "ANALYZE VERBOSE bigtable;"?

    
    On 6/23/20 7:42 AM, Klaudie Willis
      wrote:

    
      Friends,

      
      I run PostgreSQL 12.3 on Windows. I have just discovered a
        pretty significant problem with PostgreSQL and my data.  I have
        a large table, 500M rows, 50 columns. It is split into 3
        partitions by year.  In addition to the primary key, one of the
        columns is indexed, and I do lookups on this.

      
      Select * from bigtable b where b.instrument_ref in (x,y,z,...)
        limit 1000

      
      It responded well, in under a second, and it used the
        index on the column.  However, when I changed it to:

      
      Select * from bigtable b where b.instrument_ref in (x,y,z,...)
        limit 10000 -- (notice 10K now)

      
      The planner decided to do a full table scan on the entire
        500M row table! And that did not work very well.  First I had no
        clue as to why it did so, and when I disabled sequential scan
        the query immediately returned.  But I should not have to do so.
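
One way to reproduce that comparison without changing the setting for other sessions is to scope it to a single transaction. This is only a minimal sketch: the literals 1, 2, 3 are placeholders for the elided instrument references, and it assumes instrument_ref is a numeric column.

```sql
BEGIN;

-- Plan chosen by default (EXPLAIN without ANALYZE does not run the query)
EXPLAIN
SELECT * FROM bigtable b
WHERE b.instrument_ref IN (1, 2, 3)   -- placeholder values
LIMIT 10000;

-- Discourage sequential scans for this transaction only
SET LOCAL enable_seqscan = off;

-- Run the query for real and show actual row counts and timings
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM bigtable b
WHERE b.instrument_ref IN (1, 2, 3)   -- placeholder values
LIMIT 10000;

ROLLBACK;
```

Note that enable_seqscan = off does not forbid sequential scans; it only makes the planner treat them as very expensive, which is why it is better suited to diagnosis than to production use.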

      
      I got my first hint of why this problem occurs when I looked
        at the statistics (default_statistics_target=500, and ANALYZE
        has been run).  For the column in question, "instrument_ref",
        the statistics claimed the number of distinct values to be:

      
      select * from pg_stats where attname like 'instr%_ref'; --
        Result: n_distinct = 40 000

      
      select count(distinct instrumentid_ref) from bigtable --
        Result: 33 385 922 (!!)

      
      That is an astonishing difference of almost 1000X.
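
The same comparison can be narrowed to the columns that matter. This is a sketch that assumes the column is named instrument_ref, as in the queries above (the count query in the post spells it instrumentid_ref, so adjust as needed).

```sql
-- Planner's estimate: one row per table or partition that has statistics.
-- inherited = true marks statistics that cover a parent plus its children.
SELECT schemaname, tablename, attname, inherited, n_distinct
FROM pg_stats
WHERE attname = 'instrument_ref';

-- Ground truth (expensive on a table of this size)
SELECT count(DISTINCT instrument_ref) AS actual_distinct
FROM bigtable;
```

When reading n_distinct, a positive value is an absolute count of distinct values, while a negative value is minus the number of distinct values divided by the number of rows.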

      
      When the planner thinks there are only 40K different values,
        then it makes sense to switch to a table scan in order to fill the
        limit of 10K.  But it is wrong, very wrong, and the query returns
        in 100s of seconds instead of a few.
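
The likely reasoning can be put in rough numbers. This is back-of-the-envelope arithmetic only: it assumes rows are spread evenly over the distinct values and ignores most-common-value statistics, which the real planner also consults.

```sql
SELECT
    round(500000000 / 40000.0)    AS rows_per_value_estimated,  -- 12500
    round(500000000 / 33385922.0) AS rows_per_value_actual;     -- about 15
```

With roughly 12,500 rows expected per value, a handful of values in the IN list appears to match far more than 10K rows, so a sequential scan looks likely to satisfy the limit early. At about 15 rows per value, it would instead have to read most of the table.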

      
      I have tried to increase the statistics target to 5000, and
        it helps, but it only reduces the error to 100X.  Still crazy high.
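
For reference, a per-column version of that change might look like the sketch below, assuming the column name instrument_ref. It avoids raising default_statistics_target for every column in the database.

```sql
-- Raise the target for this one column (the maximum is 10000)
ALTER TABLE bigtable ALTER COLUMN instrument_ref SET STATISTICS 5000;

-- Re-gather statistics so the new target takes effect
ANALYZE VERBOSE bigtable;
```

ANALYZE samples about 300 rows per unit of statistics target, so a target of 5000 reads around 1.5M rows, still well under 1% of a 500M-row table. That may be part of why the estimate improves but stays far from the true value.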

      
      I understand that this is a known problem.  I have read
        previous posts about it, but I have never seen anyone reach
        such a high difference factor.

      
      I have considered these fixes:

      
      - hardcode the statistics to a particular ratio of the total
        number of rows

      
      - randomize the rows more, so that it does not suffer from
        page clustering.  However, this probably has other implications
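
The first option can be sketched with a per-column override. The ratio below is derived from the figures in this post (33 385 922 distinct values over 500M rows, roughly 0.067), and bigtable_2020 is a hypothetical partition name; the right ratio per partition may differ.

```sql
-- A negative n_distinct is read as minus the fraction of rows that are distinct.
-- n_distinct_inherited applies to statistics for the parent plus its children.
ALTER TABLE bigtable
    ALTER COLUMN instrument_ref SET (n_distinct_inherited = -0.067);

-- Hypothetical partition name; repeat for each partition
ALTER TABLE bigtable_2020
    ALTER COLUMN instrument_ref SET (n_distinct = -0.067);

-- The override is only picked up by the next ANALYZE
ANALYZE bigtable;
```

Expressing the override as a ratio rather than an absolute count lets the estimate scale as the table grows.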
      

      Feel free to comment :)
      

        K
      
      
    -- 

      Angular momentum makes the world go 'round.