Le 27/01/2022 à 18:35, Merlin Moncure a écrit :
On Thu, Jan 27, 2022 at 11:09 AM Rob Sargent <robjsargent@xxxxxxxxx> wrote:On 1/27/22 10:03, Merlin Moncure wrote: On Wed, Jan 26, 2022 at 5:23 PM Merlin Moncure <mmoncure@xxxxxxxxx> wrote: with s as (select 'Hello World Hello World' as sentence) select phrase, array_upper(string_to_array((select sentence from s), phrase), 1) - 1 as occurrances from ( select array_to_string(x, ' ') as phrase from ( select distinct v[a:b] x from regexp_split_to_array((select sentence from s), ' ') v cross join lateral generate_series(1, array_upper(v, 1)) a cross join lateral generate_series(a + 1, array_upper(v, 1)) b ) q ) q; Simplified to: select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences from regexp_split_to_array('Hello World Hello World', ' ') v cross join lateral generate_series(1, array_upper(v, 1)) a cross join lateral generate_series(a + 1, array_upper(v, 1)) b group by 1; phrase │ occurances ─────────────────────────┼──────────── World Hello │ 1 Hello World Hello │ 1 Hello World │ 2 Hello World Hello World │ 1 World Hello World │ 1 merlin And since we're looking for repeated phrases maybe add having count(*) > 1thanks. also, testing on actual data, I noticed that a couple other things are mandatory, mainly doing a bit of massaging before tokenizing: select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences from ( select array_agg(t) v from ( select trim(replace(unnest(v), E'\n', '')) t from regexp_split_to_array(<sentence>, ' ') v ) q where length(t) > 1 ) q cross join lateral generate_series(1, array_upper(v, 1)) a cross join lateral generate_series(a + 1, array_upper(v, 1)) b group by 1 having count(*) > 1; We are definitely in N^2 space here, so look for things to start breaking down for sentences > 1000 words. merlin
(for better complexity) you may search about "Ukkonen suffix tree"Similar problem as yours : https://www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring/?ref=lbp
Attachment:
OpenPGP_signature
Description: OpenPGP digital signature