Re: Extremely slow intarray index creation and inserts.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Tom Lane wrote:
> Ron Mayer <rm_pg@xxxxxxxxxxxxxxxxxxxxxxx> writes:
>> vm=# create index "gist70000" on tmp_intarray_test using GIST (my_int_array gist__int_ops);
>> CREATE INDEX
>> Time: 2069836.856 ms
> 
>> Is that expected, or does it sound like a bug to take over
>> half an hour to index 70000 rows of mostly 5 and 6-element
>> integer arrays?
> 
> I poked at this example with oprofile.  It's entirely CPU-bound AFAICT,

Oleg pointed out to me (off-list I now see) that it's not totally
unexpected behavior and I should have been using gist__intbig_ops,
since the "big" refers to the cardinality of the entire set (which
was large, in my case) and not the length of the arrays.

Oleg Bartunov wrote:
OB:> it's not about short or long arrays, it's about small or big
OB:> cardinality of the whole set (the number of unique elements)

I'm re-reading the docs and still wasn't obvious to me.   A
potential docs patch is attached below.

> and the CPU utilization is approximately
> 
> 	55%	g_int_compress
> 	35%	memmove/memcpy (difficult to distinguish these)
> 	 1%	pg_qsort
> 	<1%	anything else
> 
> Probably need to look at reducing the number of calls to g_int_compress
> ... it must be getting called a whole lot more than once per new index
> entry, and I wonder why that should need to be.

Perhaps that's a separate issue, but we're working
fine with gist__intbig_ops for the time being.



Here's a proposed docs patch that makes this more obvious.

*** a/doc/src/sgml/intarray.sgml
--- b/doc/src/sgml/intarray.sgml
***************
*** 239,245 ****
     <literal>gist__int_ops</> (used by default) is suitable for
     small and medium-size arrays, while
     <literal>gist__intbig_ops</> uses a larger signature and is more
!    suitable for indexing large arrays.
    </para>

    <para>
--- 239,247 ----
     <literal>gist__int_ops</> (used by default) is suitable for
     small and medium-size arrays, while
     <literal>gist__intbig_ops</> uses a larger signature and is more
!    suitable for indexing high-cardinality data sets - where there
!    are a large number of unique elements across all rows being
!    indexed.
    </para>

    <para>


-
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux