Re: Working with huge amount of data.

Erik Jones <erik@xxxxxxxxxx> · Mon, 11 Feb 2008 10:07:24 -0600

On Feb 11, 2008, at 9:37 AM, Mario Lopez wrote:

Hi guys :-), I am working on a personal project in which I am  
trying to make sense on a huge (at least for me) amount of data. I  
have approximately 150 million rows of unique words (they are not  
exactly words it is just for explaining the situation).

The table I am inserting this is a quite simple table, something  
like this:

CREATE TABLE "public"."names" (
"id" SERIAL,
"name" VARCHAR(255)
) WITHOUT OIDS;

It is a requirement that I can make searches on the varchar with  
queries that look the following way:

SELECT * FROM names WHERE name LIKE ‘keyword%’
Or
SELECT * FROM names WHERE name LIKE ‘%keyword%’

I optimized the first type of queries making partitions with every  
letter that a name can begin with:

CREATE TABLE "public"."names_a" (
CONSTRAINT "names_a_check" CHECK (("name")::text ~~ 'a%'::text)
) INHERITS ("public"."names")
WITHOUT OIDS;

The problem arises with the second type of queries, where there are  
no possible partitions and that the search keywords are not known,  
I have tried making indexes on the letter it ends with, or indexes  
that specify that it contains the letter specified but none of them  
work the planifier only make sequential scans over the table.

For the moment the quickest scan I have being able to make is using  
grep!!, surprisingly enough grep searches on an average of 20  
seconds a whole plain text file of 2 GB one name per line and  
PostgreSQL on the fist type of queries takes like 50 seconds while  
the second type of queries con take up to two minutes which is  
completely unacceptable for an online search engine that has to  
attend a user querying this information.

How does this big search engines let’s say Google make this up? I  
am amazed of the quickness on searching this amount of information  
in so little time. Any approach I could take? I am open minded so  
anything is acceptable not necessarily only PostgreSQL based  
solutions (although I would prefer it). By the way Textual Search  
in PostgreSQL is discarded because what I am looking at are not  
names that can be decomposed on lexems, let's say that this varchar  
is composed of random garbage.

Actually, a friend of mine actually did exactly what you've tried:   
grep.  He had a cron job that would update the txt file from the  
table's data every five minutes and then his app would shell out to  
run those kinds of queries.  Of course, with a setup like that your  
results can be a little out of date (the period between runs of the  
cron job) but, if you can deal with that, that's actually a pretty  
simple solution that doesn't take too much setup.

Erik Jones

DBA | Emma®
erik@xxxxxxxxxx
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to majordomo@xxxxxxxxxxxxxx so that your
      message can get through to the mailing list cleanly