Search Postgresql Archives

Re: How to use full-text search URL parser to filter query results by domain name?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 4/8/19 4:50 AM, Arthur Zakirov wrote:
I think it is normal to use ts_parse(). And I suppose you might use windows functions.

For example, you have table links:

=# create table links (score int, link text);
=# insert into links values
  (1, 'http://www.foo.com/bar'),
  (2, 'http://www.foo.com/foo'),
  (2, 'http://www.bar.com/foo'),
  (1, 'http://www.bar.com/bar');

You can use the following query:

=# with l as (
  select score, token, link,
    rank() over (partition by token order by score) as rank
  from links,
    lateral ts_parse('default', link)
  where tokid = 6)
select score, token, link from l where rank = 1;
 score |    token    |          link
-------+-------------+------------------------
     1 | www.bar.com | http://www.bar.com/bar
     1 | www.foo.com | http://www.foo.com/bar


Thank you very much Arthur. Your suggestion led me to a query that is at least returning correct result set. I could not figure out how to get your rank() function to work with my query, but building on your answer (and others from IRC etc), I ended up with the following solution:

First I created the following views:
        CREATE VIEW scored_pages AS (
            SELECT crawl_results.crawl_id, crawl_results.score, 
                crawl_results.page_id, pages.url
            FROM crawl_results 
            JOIN pages ON crawl_results.page_id = pages.id 
        );


        CREATE VIEW scored_links AS (
            SELECT scored_pages.score, links.source, 
                links.target, links.link_text
            FROM links
            JOIN scored_pages ON scored_pages.url = ""
        );

Then, using these views, I did the following query to extract the links from the lowest scored pages in the results:

SELECT score, host, target 
FROM (
        SELECT DISTINCT ON (token) token AS host, score, target 
        FROM scored_links,
        LATERAL ts_parse('default', target) WHERE tokid = 6
        ORDER BY token, score
) as x
WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE target=pp.url)
ORDER BY score;

Does this seem like a reasonable approach? When running EXPLAIN on this query, I get the following:

                                                    QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1252927.46..1252927.47 rows=1 width=100)
   Sort Key: crawl_results.score
   ->  Hash Anti Join  (cost=1248297.18..1252927.45 rows=1 width=100)
         Hash Cond: ((links.target)::text = (pp.url)::text)
         ->  Unique  (cost=1247961.08..1252591.28 rows=5 width=100)
               ->  Sort  (cost=1247961.08..1250276.18 rows=926040 width=100)
                     Sort Key: ts_parse.token, crawl_results.score
                     ->  Gather  (cost=1449.79..1054897.20 rows=926040 width=100)
                           Workers Planned: 2
                           ->  Hash Join  (cost=449.79..961293.20 rows=385850 width=100)
                                 Hash Cond: ((links.source)::text = (pages.url)::text)
                                 ->  Nested Loop  (cost=0.00..955091.41 rows=378702 width=144)
                                       ->  Parallel Seq Scan on links  (cost=0.00..4554.40 rows=75740 width=112)
                                       ->  Function Scan on ts_parse  (cost=0.00..12.50 rows=5 width=32)
                                             Filter: (tokid = 6)
                                 ->  Hash  (cost=404.67..404.67 rows=3609 width=63)
                                       ->  Hash Join  (cost=336.10..404.67 rows=3609 width=63)
                                             Hash Cond: (crawl_results.page_id = pages.id)
                                             ->  Seq Scan on crawl_results  (cost=0.00..59.09 rows=3609 width=12)
                                             ->  Hash  (cost=291.60..291.60 rows=3560 width=59)
                                                   ->  Seq Scan on pages  (cost=0.00..291.60 rows=3560 width=59)
         ->  Hash  (cost=291.60..291.60 rows=3560 width=55)
               ->  Seq Scan on pages pp  (cost=0.00..291.60 rows=3560 width=55)
(23 rows)


I am wondering if there is a more efficient way to do things? Some people on IRC mentioned that it might be better to declare a scalar function to return the host from ts_parse instead of the LATERAL query ... but I couldn't figure out how to do that, or if it was even preferable to the above from a performance standpoint ... any ideas on how I could improve the above.



[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux