>> I am trying to understand how to use the full-text search parser for >> URLS and hostnames to filter results from a text field containing URLS >> based on domain, and also how to index text columns for fast >> lookup/matching based on domain. >> >> I have a PostgreSQL database containing documents and links downloaded >> by a web crawler, with the following tables: >> >> pages >> >> ---------- >> >> id: Integer (primary key) >> >> url: String (unique) >> >> title: String >> >> text: String >> >> html: String >> >> last_visit: DateTime >> >> word_pos: TSVECTOR >> >> >> >> links >> >> ---------- >> >> id Integer (primary key) >> >> source: String >> >> target: String >> >> link_text: String >> >> UNIQUE(source,target) >> >> >> >> crawls >> >> --------- >> >> id: Integer (primary key) >> >> query: String >> >> >> >> crawl_results >> >> ------------- >> >> id: Integer (primary key) >> >> score: Integer (constraint 0<=score<=1) >> >> crawl_id: Integer (foreign key, crawls.id) >> >> page_id: Integer (foreign key, pages.id) >> >> >> The `source` and `target` fields in the `links` table contain URLs. I am >> running the following query to extract scored links from the top-ranking >> search results, for pages that haven't been fetched yet: >> >> WITH top_results AS >> >> (SELECT page_id, score FROM crawl_results >> >> WHERE crawl_id=$1 >> >> ORDER BY score LIMIT 100) >> >> SELECT top_results.score, l.target >> >> FROM top_results >> >> JOIN pages p ON top_results.page_id=p.id >> >> JOIN links l on p.url=l.source >> >> WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url) >> >> >> However, *I would like to filter these results so that only one row is >> returned for a given domain (the one with the lowest score)*. So for >> instance, if I get (0.3, 'http://www.foo.com/bar') and (0.8, >> 'http://www.foo.com/zor'), I only want the first because it has same >> domain `foo.com` and has the lower score. >> >> I was able to find documentation for the builtin full text search >> parsers <https://www.postgresql.org/docs/11/textsearch-parsers.html>, >> which can parse URLS and extract the hostname. For instance, I can >> extract the hostname from a URL as follows: >> Hi, I have no real idea about solving the complete problem, and would probably try something with a temp table first. For extracting the hostname from a url you could use select regex_replace('https?://(.*=)/.*', '\\1', url) instead of the fulltext parser Best regards Wolfgang