I am trying to understand how to use the full-text search parser for URLs and hostnames to filter results from a text field containing URLs by domain, and also how to index text columns for fast lookup/matching by domain.

I have a PostgreSQL database containing documents and links downloaded by a web crawler, with the following tables:

```
pages
----------
id: Integer (primary key)
url: String (unique)
title: String
text: String
html: String
last_visit: DateTime
word_pos: TSVECTOR

links
----------
id: Integer (primary key)
source: String
target: String
link_text: String
UNIQUE(source, target)

crawls
---------
id: Integer (primary key)
query: String

crawl_results
-------------
id: Integer (primary key)
score: Integer (constraint 0 <= score <= 1)
crawl_id: Integer (foreign key, crawls.id)
page_id: Integer (foreign key, pages.id)
```
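To make the goal concrete, this is the kind of domain filter I am after (`www.example.com` is just a placeholder; tokid 6 is the `host` token type in the default parser):

```sql
-- Hypothetical filter: keep only links whose target is on a given host,
-- extracting the host inline with ts_parse (tokid 6 = host).
SELECT l.target
FROM links l
WHERE (SELECT token
       FROM ts_parse('default', l.target)
       WHERE tokid = 6
       LIMIT 1) = 'www.example.com';
```

As written this calls `ts_parse` once per row, which is why I also want to know how to index for it.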
So far I use this query to find link targets from the top-scoring pages of a crawl that have not yet been downloaded:

```sql
WITH top_results AS (
    SELECT page_id, score
    FROM crawl_results
    WHERE crawl_id = $1
    ORDER BY score DESC
    LIMIT 100
)
SELECT top_results.score, l.target
FROM top_results
JOIN pages p ON top_results.page_id = p.id
JOIN links l ON l.source = p.url
WHERE NOT EXISTS (
    SELECT pp.id FROM pages pp WHERE l.target = pp.url
);
```
I have found that the default parser can extract the host from a URL:

```sql
SELECT token FROM ts_parse('default', 'http://www.foo.com') WHERE tokid = 6;

    token
-------------
 www.foo.com
(1 row)
```
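For the indexing half of the question, one approach I am considering (an untested sketch; the helper name `url_host` is my own invention) is wrapping the host extraction in a function and building an expression index on it:

```sql
-- Hypothetical helper: extract the host token (tokid 6) from a URL.
-- Marking it IMMUTABLE is an assumption I am unsure about, since
-- ts_parse output depends on the installed parser; an expression
-- index requires the function to be IMMUTABLE.
CREATE FUNCTION url_host(url text) RETURNS text AS $$
  SELECT token FROM ts_parse('default', url) WHERE tokid = 6 LIMIT 1
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX pages_url_host_idx ON pages (url_host(url));

-- Domain filtering then becomes an indexable equality test:
SELECT id, url FROM pages WHERE url_host(url) = 'www.foo.com';
```

Is this a reasonable way to do it, or is there a more idiomatic one?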