How to use full-text search URL parser to filter query results by domain name?

Jess Wren <jess.wren@xxxxxxxxxxxxxxx> · Sat, 6 Apr 2019 21:06:26 -0700



    I am trying to understand how to use the full-text search parser
      for URLS and hostnames to filter results from a text field
      containing URLS based on domain, and also how to index text
      columns for fast lookup/matching based on domain.

    
    I have a PostgreSQL database containing documents and links
      downloaded by a web crawler, with the following tables:

      
            pages
            ----------
            id:          Integer (primary key)
            url:         String  (unique)
            title:       String
            text:        String
            html:        String
            last_visit:  DateTime
            word_pos:    TSVECTOR
            
            links
            ----------
            id         Integer (primary key)
            source:    String
            target:    String  
            link_text: String
            UNIQUE(source,target)
            
            crawls
            ---------
            id:         Integer (primary key)
            query:      String
            
            crawl_results
            -------------
            id:       Integer (primary key)
            score:    Integer (constraint 0<=score<=1)
            crawl_id: Integer (foreign key, crawls.id)
            page_id:  Integer (foreign key, pages.id)
    

      The `source` and `target` fields in the `links` table contain
      URLs. I am running the following query to extract scored links
      from the top-ranking search results, for pages that haven't been
      fetched yet:

      
            WITH top_results AS 
                (SELECT page_id, score FROM crawl_results 
                WHERE crawl_id=$1 
                ORDER BY score LIMIT 100)
            SELECT top_results.score, l.target
            FROM top_results 
                JOIN pages p ON top_results.page_id=p.id
                JOIN links l on p.url="" 
            WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url)
    

      However, I would like to filter these results so that only one
        row is returned for a given domain (the one with the lowest
        score). So for instance, if I get (0.3,
      'http://www.foo.com/bar') and (0.8, 'http://www.foo.com/zor'), I
      only want the first because it has same domain `foo.com` and has
      the lower score. 

      
      I was able to find documentation
        for the builtin full text search parsers, which can parse
      URLS and extract the hostname. For instance, I can extract the
      hostname from a URL as follows:

      
            SELECT token FROM ts_parse('default', 'http://www.foo.com') WHERE tokid = 6;
            
                token    
            -------------
            www.foo.com
            (1 row)
    

      However, I can't figure out how I would integrate this into the
      above query to filter out duplicate domains from the results. And
      because this is the docs for "testing
        and debugging text search", I don't know if this use of
      `ts_parse()` is even related to how the URL parser is intended to
      be used in practice.

      
      How would I use the "host" parser in my query above to return one
      row per domain? Also, how would I appropriately index the "links"
      table for "host" and "url" token lookup?

      
      Thanks!