Re: where in (select array)

Richard Huxton wrote:
Marcus Engene wrote:
Hi List,

I have the might_like table that contains products a user might like,
given that he likes the present one (item).

CREATE TABLE might_like
(
item                       INTEGER NOT NULL
,created_at                 TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
,child                      INTEGER NOT NULL
)
WITHOUT OIDS;

CREATE INDEX might_like_x1 ON might_like(item);
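
The lookup this serves is along these lines (a sketch; the $1 item-id
placeholder is illustrative, not from the application code):

SELECT child
FROM might_like
WHERE item = $1;  -- can be satisfied via might_like_x1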

Since there are (will be) hundreds of thousands of items, each with 20+
might-like items, I thought it would be nice to reduce the number of
rows to 1/20th by using a vector.

CREATE TABLE might_like_vector
(
item                       INTEGER NOT NULL
,created_at                 TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
,child_arr                  INTEGER[]
)
WITHOUT OIDS;
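
The idea is to populate it from the normalized table with a grouped
insert along these lines (a sketch; assumes PostgreSQL 8.4+ for
array_agg()):

INSERT INTO might_like_vector (item, child_arr)
SELECT item, array_agg(child)
FROM might_like
GROUP BY item;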

You haven't reduced the set at all; you've just turned part of it
sideways. You might gain something on your search, but I'm guessing
you haven't tested it.

Hmm - the attached script generates 100,000 items and 10 liked ones for
each (well, for the first 99,990 it says you like the next 10 items).
They're all given different timestamps at day intervals, which means
you'll end up with six or seven matches for your sample query.
Sorry, I was a bit unclear. This runs on a web server. The might-like rows for an item are generated when the corresponding page is requested and no rows exist for it yet. The one-row-per-might-like-item table has been live since yesterday and, as of this writing, holds half a million rows. The caching is primarily being triggered by a Google indexer agent.

Generating the might-like data is expensive, so caching it in the db is a must; the web server's CPU usage halved once this caching went live and the most commonly requested items were cached.

When the might-like data is read from the database, the generated presentation HTML is put in memcached with a timeout of 1h (presently). Memcached is probably overkill here, but using it in situations like this makes it easier to scale out by adding web servers, and postpones the problem of clustering pg.

So with memcached in place I care less about saving a few ms of SELECT latency and more about postponing other approaching problems, like keeping the database dump manageable. Right now it's a 100 MB gzipped dump, which is very manageable, so where possible I'd like to keep the data compact. I imagine the array approach is cheaper in disk and dump size for this and some similar usages, and therefore it would be nice to have a non-ugly usage pattern for it. Also, we're going to collect usage statistics to further optimize the behavior of the site, and I'm really worried about those millions of rows.

If this is a bad approach to the problem, I'm very interested to hear it. Regardless of the best approach, a "proper" way to write the IN against the array subquery would be super appreciated too :)
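
To make the question concrete (the items table and the literal 123
here are just illustrations, and unnest() needs PostgreSQL 8.4+): a
plain IN (subquery) compares the outer value against each row the
subquery returns, and each row here is an integer[] rather than an
integer, which is what breaks. The two rewrites I can think of are
sketched below:

-- (a) Expand the array into rows and use a normal IN:
SELECT i.*
FROM items i
WHERE i.id IN (SELECT unnest(child_arr)
               FROM might_like_vector
               WHERE item = 123);

-- (b) Compare against the array value itself; the extra parentheses
-- make the subquery a scalar expression yielding an integer[], which
-- = ANY then treats as an array:
SELECT i.*
FROM items i
WHERE i.id = ANY ((SELECT child_arr
                   FROM might_like_vector
                   WHERE item = 123));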

Thanks for your answer!

Best regards,
Marcus

