On Sat, Apr 30, 2011 at 3:29 PM, Joel Reymont <joelr1@xxxxxxxxx> wrote:
>
> On Apr 30, 2011, at 11:11 PM, Jeff Janes wrote:
>
>> But what exactly are you inserting?  The queries you reported below
>> are not the same as the ones you originally described.
>
> I posted the wrong query initially. The only difference is in the
> table that holds the probability array.
>
> I'm inserting document id and ad id pairs to show that this ad is not
> linked to this document. The mapping table has a primary key on the
> serial document id.

Having the (doc_id, ad_id) pair be missing from the table is a far more
efficient way to show that the ad is not linked to the document (i.e.
that it is below the threshold). Provided that you are careful that
there are no other reasons the pair could be missing--but if you are
not careful about that, then I don't see how storing the full matrix
will save you anyway.

>> In particular, they do not seem to use the "threshold" parameter that
>> the original ones did, whose job is presumably to cut the 2 million
>> down to a much smaller number that meet the threshold.  But how much
>> smaller is that number?
>
> The 5 minutes is with a threshold large enough to be irrelevant. I
> would like to optimize the process before I apply the threshold to
> cut down the number of rows.
>
>> It looks like "WHERE tab.distance <= 50.0;" is not accomplishing
>> anything.  Are you sure the parameter shouldn't be <= 0.50 instead?
>
> No, ignore the threshold for now.

OK, but it seems to me that you are starting out by ruling out the one
optimization that is most likely to work.

>> Also, you previously said you didn't mind if this process took a
>> couple of minutes, as long as it didn't interfere with other things
>> going on in the database.  So you probably need to describe what
>> those other things going on in the database are.
>
> Those other things are ad serving, which boils down to a lookup of ad
> ids linked to the document.
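To illustrate the absence-as-negative scheme described above (the table
and column names here are hypothetical sketches, not taken from Joel's
actual schema), the mapping table and the two kinds of lookup might
look like:

```sql
-- Hypothetical mapping table: one row per (doc, ad) pair that IS
-- linked, i.e. whose distance met the threshold at insert time.
CREATE TABLE doc_ad_map (
    doc_id  integer NOT NULL,
    ad_id   integer NOT NULL,
    PRIMARY KEY (doc_id, ad_id)
);

-- Ad-serving lookup: all ads linked to a given document.
SELECT ad_id FROM doc_ad_map WHERE doc_id = 42;

-- "Is this ad below the threshold for this doc?" -- the absence of the
-- row answers it, provided a missing row can mean nothing other than
-- "below threshold".
SELECT NOT EXISTS (
    SELECT 1 FROM doc_ad_map
    WHERE doc_id = 42 AND ad_id = 7
) AS below_threshold;
```

The point of the caveat above is that under this scheme you never pay
to store or insert the (presumably vast) majority of pairs that fall
below the threshold.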
>
> This is a lookup from the mapping table using the primary key that
> goes on at the same time as a large number of <doc,ad> mappings are
> being inserted into the same table.

What numbers do you get for lookups per second when inserts are also
going on, versus when they are not going on?

The way I would approach this is by making two independent tasks: one
that inserts records at your anticipated rate, "insert into foo select
generate_series from generate_series(1,100000);" in a loop, and
another that generates select load against a separate table (like
pgbench -S), and then see how the two interact with each other by
competing for CPU and IO.

You could throttle the insert process by adding pg_sleep(<some
fraction of a second>) as a column in one of your selects, so that it
pauses at every row. But due to the granularity of pg_sleep, you might
have to put it in a CASE expression so it is invoked on only a random
subset of the rows rather than on each row. But once throttled, will
it be able to keep up with the flow of new docs and ads?

>
> I don't think I can do the matching when serving an ad, though, as I
> will still need to scan millions of probability vectors (one per doc)
> to calculate the distance between the current document and existing
> ads.

GiST indexes are designed to make this type of thing fast, by using
techniques to rule out most of those comparisons without actually
performing them. I don't know enough about the guts of either your
distance function or GiST indexes to know whether you can do it this
way, but if you can, it would certainly be the way to go.

Cheers,

Jeff

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
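
P.S. A rough sketch of the benchmark and the throttling trick from
above, in case it helps. The table name foo, the 1% probability, and
the 0.1 s sleep are illustrative values, not recommendations:

```sql
-- Scratch table for the insert-load task.
CREATE TABLE foo (id integer);

-- Task 1: steady insert load; run this statement in a loop from the
-- client. (Task 2 is read load against a separate table, e.g.
-- pgbench -S, run concurrently.)
INSERT INTO foo SELECT generate_series FROM generate_series(1,100000);

-- Throttled variant of the generating select: pg_sleep() as a column,
-- wrapped in CASE so it fires on only a random ~1% of rows, since
-- sleeping on every row is too coarse-grained.
SELECT g,
       CASE WHEN random() < 0.01 THEN pg_sleep(0.1) END
FROM generate_series(1,100000) AS g;
```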