Re: How to use full-text search URL parser to filter query results by domain name?

hamann.w@xxxxxxxxxxx · 7 Apr 2019 06:42:06 -0000

>> I am trying to understand how to use the full-text search parser for
>> URLS and hostnames to filter results from a text field containing URLS
>> based on domain, and also how to index text columns for fast
>> lookup/matching based on domain.
>> >> I have a PostgreSQL database containing documents and links downloaded
>> by a web crawler, with the following tables:
>> >>         pages
>> >>         ----------
>> >>         id:          Integer (primary key)
>> >>         url:         String  (unique)
>> >>         title:       String
>> >>         text:        String
>> >>         html:        String
>> >>         last_visit:  DateTime
>> >>         word_pos:    TSVECTOR
>> >>         >> >>         links
>> >>         ----------
>> >>         id         Integer (primary key)
>> >>         source:    String
>> >>         target:    String  >> >>         link_text: String
>> >>         UNIQUE(source,target)
>> >>         >> >>         crawls
>> >>         ---------
>> >>         id:         Integer (primary key)
>> >>         query:      String
>> >>         >> >>         crawl_results
>> >>         -------------
>> >>         id:       Integer (primary key)
>> >>         score:    Integer (constraint 0<=score<=1)
>> >>         crawl_id: Integer (foreign key, crawls.id)
>> >>         page_id:  Integer (foreign key, pages.id)
>> >> >> The `source` and `target` fields in the `links` table contain URLs. I am
>> running the following query to extract scored links from the top-ranking
>> search results, for pages that haven't been fetched yet:
>> >>         WITH top_results AS >> >>             (SELECT page_id, score FROM crawl_results >> >>             WHERE crawl_id=$1 >> >>             ORDER BY score LIMIT 100)
>> >>         SELECT top_results.score, l.target
>> >>         FROM top_results >> >>             JOIN pages p ON top_results.page_id=p.id
>> >>             JOIN links l on p.url=l.source >> >>         WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url)
>> >> >> However, *I would like to filter these results so that only one row is
>> returned for a given domain (the one with the lowest score)*. So for
>> instance, if I get (0.3, 'http://www.foo.com/bar') and (0.8,
>> 'http://www.foo.com/zor'), I only want the first because it has same
>> domain `foo.com` and has the lower score.
>> >> I was able to find documentation for the builtin full text search
>> parsers <https://www.postgresql.org/docs/11/textsearch-parsers.html>,
>> which can parse URLS and extract the hostname. For instance, I can
>> extract the hostname from a URL as follows:
>> 
Hi,

I have no real idea about solving the complete problem, and would probably try
something with a temp table first.
For extracting the hostname from a url you could use

select regex_replace('https?://(.*=)/.*', '\\1', url)

instead of the fulltext parser

Best regards
Wolfgang