On Thu, Jan 27, 2022 at 11:09 AM Rob Sargent <robjsargent@xxxxxxxxx> wrote: > > On 1/27/22 10:03, Merlin Moncure wrote: > > On Wed, Jan 26, 2022 at 5:23 PM Merlin Moncure <mmoncure@xxxxxxxxx> wrote: > > with s as (select 'Hello World Hello World' as sentence) > select > phrase, > array_upper(string_to_array((select sentence from s), phrase), 1) - > 1 as occurrances > from > ( > select array_to_string(x, ' ') as phrase > from > ( > select distinct v[a:b] x > from regexp_split_to_array((select sentence from s), ' ') v > cross join lateral generate_series(1, array_upper(v, 1)) a > cross join lateral generate_series(a + 1, array_upper(v, 1)) b > ) q > ) q; > > Simplified to: > select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences > from regexp_split_to_array('Hello World Hello World', ' ') v > cross join lateral generate_series(1, array_upper(v, 1)) a > cross join lateral generate_series(a + 1, array_upper(v, 1)) b > group by 1; > > phrase │ occurances > ─────────────────────────┼──────────── > World Hello │ 1 > Hello World Hello │ 1 > Hello World │ 2 > Hello World Hello World │ 1 > World Hello World │ 1 > > merlin > > > And since we're looking for repeated phrases maybe add > > having count(*) > 1 thanks. also, testing on actual data, I noticed that a couple other things are mandatory, mainly doing a bit of massaging before tokenizing: select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences from ( select array_agg(t) v from ( select trim(replace(unnest(v), E'\n', '')) t from regexp_split_to_array(<sentence>, ' ') v ) q where length(t) > 1 ) q cross join lateral generate_series(1, array_upper(v, 1)) a cross join lateral generate_series(a + 1, array_upper(v, 1)) b group by 1 having count(*) > 1; We are definitely in N^2 space here, so look for things to start breaking down for sentences > 1000 words. merlin