Re: Given a set of daterange, finding the continuous range that includes a particular date (aggregates)

mariusz <marius@xxxxxxx> · Tue, 27 Feb 2018 10:03:02 +0100

On Fri, 2018-02-23 at 18:11 -0800, Ken Tanzer wrote:
>         On Fri, Feb 23, 2018 at 6:10 AM,
>         mariusz <marius@xxxxxxx> wrote: 
>         
>         
>         i guess, you can easily get max continuous range for each row
>         with
>         something like this:
>         
>         CREATE OR REPLACE FUNCTION
>                 append_daterange(d1 daterange, d2 daterange)
>                 RETURNS daterange
>                 LANGUAGE sql
>                 AS
>         $$
>                 SELECT CASE WHEN d1 && d2 OR d1 -|- d2 THEN d1 + d2
>         ELSE d2 END;
>         $$;
>         
>         CREATE AGGREGATE agg_daterange (daterange) (
>                 sfunc = append_daterange,
>                 stype = daterange
>         );
>         
>         SELECT dr,
>                lower(agg_daterange(dr) OVER (ORDER BY dr ASC)),
>                upper(agg_daterange(dr) OVER (ORDER BY dr DESC))
>           FROM ...
>         
>         above example is simplified to selecting only daterange column
>         "dr" for
>         readability, which in your case should be something like
>         
>         daterange(staff_assign_date,staff_assign_date_end,'[)')
>         
>         please note that daterange would be normalized to [) format so
>         upper()
>         above returns exactly your max "staff_assign_date_end" for
>         each
>         continuous range when dateranges are created with '[)' format.
>         
>         the key point is ... ELSE d2 in append_daterange() which
>         starts with new
>         value each time that new value is discontinuous with agg's
>         state value
>         and order in which rows are processed (ASC for lower of
>         daterange, DESC
>         for upper of daterange).
>         
>         unfortunately this involves reading all rows for "client_id"
>         and
>         additional sorting for each window.
>         i recall reading that you already pay the price of reading all
>         rows for
>         client_id anyway, so the only question is the cost of two
>         additional
>         sorts (maybe extracting dateranges to subset on which to do
>         windows and
>         rejoining result of continuous ranges to original set would
>         help to
>         lower the cost).
>         
> 
> Thank you, and I wanted to follow up on this.  I couldn't quite get
> your example working as described, but I also ended up trying
> something very similar that got me very close but not quite there.
> Basically, I can see that it is correctly calculating the ranges (in
> the notices), but it is only returning the last range for each client.
> (Because I said PARTITION BY client_id).
> 
> 
sorry for late replay, i was offline from sat to mon inclusive.
i may have previously added some confusion, so i'll try to explain what
i had in my mind. see below what my suggested query is.

> So I'm not sure if I should be calling this differently, or if the
> function needs to work differently, or if this just isn't possible.
> Do I need to partition by something else, and if so what?  I don't see
> what I could specify that would indicate a new episode.
> 
definitely you want to partition by client_id if you are calculating
this for multiple client_ids, but what matters here for each client_id
is order of dateranges in over()

> 
> Also, it's not clear to me how an aggregate might define/return
> different values within a partition.  Although this must be possible,
> since functions like rank() and row_number() seem to do it.
> 
that is because function is defined like agg, but what happens is that
its current state value (return value of agg function) is returned for
every row within given window without grouping the resultset
> 
> Hoping there is something easy that can be tweaked here.  See below
> for copy/pastable test stuff.  It includes output from both functions.
> Both look to be returning the same results, which makes me wonder if
> my passing in a start date was a waste of time, though it seems to me
> it would be necessary.
> 
> Cheers,
> Ken
> 
> 
> BEGIN;
> CREATE TEMP TABLE sample_data (
> client_id INTEGER,
> start_date DATE,
> end_date DATE,
> episode INTEGER -- Just a label, for clarity
> );
> 
> INSERT INTO sample_data VALUES
> (1,'1990-01-01','1990-12-31',1),
> (1,'1991-01-01','1991-12-31',1),
> 
> (1,'2000-01-01','2000-12-31',2),
> (1,'2001-01-01','2001-12-31',2),
> (1,'2002-01-01','2002-12-31',2),
> (1,'2003-01-01','2003-12-31',2),
> (1,'2004-01-01','2004-12-31',2),
> (1,'2005-01-01','2005-12-31',2),
> (1,'2006-01-01','2006-12-31',2),
> 
> (1,'2014-01-01','2014-12-31',3),
> (1,'2015-01-01','2015-12-31',3),
> (1,'2017-06-30','2017-12-31',4),
> (1,'2018-01-01',NULL,4),
> 
> (2,'2014-02-01','2015-01-31',1),
> (2,'2015-02-01','2015-12-31',1),
> (2,'2017-09-30','2018-01-31',2),
> (2,'2018-02-01','2018-02-14',2)
> ;
> 
> CREATE OR REPLACE FUNCTION
>         append_daterange(d1 daterange, d2 daterange)
>         RETURNS daterange
>         LANGUAGE sql
>         AS
> $$
>         SELECT CASE WHEN d1 && d2 OR d1 -|- d2 THEN d1 + d2 ELSE d2
> END;
> $$;
> 
> CREATE AGGREGATE agg_daterange (daterange) (
>         sfunc = append_daterange,
>         stype = daterange
> );
> 

based on your sample data see the query:

select client_id, d,
       daterange(lower(test_agg_daterange(d) 
                          over (partition by client_id
                                order by d)),               
                 upper(test_agg_daterange(d)
                         over (partition by client_id
                               order by d desc)),
                 '[)'
                ) as cont_range
  from (select *, daterange(start_date,end_date,'[]') as d
        from sample_data
       ) q
order by 1, 2;

 client_id |            d            |       cont_range        
-----------+-------------------------+-------------------------
         1 | [1990-01-01,1991-01-01) | [1990-01-01,1992-01-01)
         1 | [1991-01-01,1992-01-01) | [1990-01-01,1992-01-01)
         1 | [2000-01-01,2001-01-01) | [2000-01-01,2007-01-01)
         1 | [2001-01-01,2002-01-01) | [2000-01-01,2007-01-01)
         1 | [2002-01-01,2003-01-01) | [2000-01-01,2007-01-01)
         1 | [2003-01-01,2004-01-01) | [2000-01-01,2007-01-01)
         1 | [2004-01-01,2005-01-01) | [2000-01-01,2007-01-01)
         1 | [2005-01-01,2006-01-01) | [2000-01-01,2007-01-01)
         1 | [2006-01-01,2007-01-01) | [2000-01-01,2007-01-01)
         1 | [2014-01-01,2015-01-01) | [2014-01-01,2016-01-01)
         1 | [2015-01-01,2016-01-01) | [2014-01-01,2016-01-01)
         1 | [2017-06-30,2018-01-01) | [2017-06-30,)
         1 | [2018-01-01,)           | [2017-06-30,)
         2 | [2014-02-01,2015-02-01) | [2014-02-01,2016-01-01)
         2 | [2015-02-01,2016-01-01) | [2014-02-01,2016-01-01)
         2 | [2017-09-30,2018-02-01) | [2017-09-30,2018-02-15)
         2 | [2018-02-01,2018-02-15) | [2017-09-30,2018-02-15)
(17 rows)

and time to explain confusion i added previously regarding daterange
input format (i'm working with something similar, but with dateranges in
db, not add hoc generated from start and end dates).
you need to generate daterange(start_date,end_date,'[]') for -|-
operator to work correctly, but output of daterange will be normalized
to '[)' format so it's up to you to extract end_date of continuous range
(if you prefer dates over dateranges) which gonna be something like
(upper(cont_range)-'1 day'::interval)::date

above example is for showing how it works, returns max continuous range
containing given row,
you probably want:
select distinct on (client_id,cont_range)
or something like that to extract continuous ranges

first window (within lower() func) extends its initial (earliest) range
to right for every next row continuous with current range (and jumps to
new start on discontinuity), thus over(order by d ASC)
second window (within upper() func) extends its initial (latest) range
to left for every next row continuous with current range (and jumps to
new end on discontinuity), thus over(order by d DESC)
partition by client_id within each over() is to not mix client_ids'
dateranges

i guess that is more or less what you wanted, at least as i understood
you originally, if not than sorry for even more confusion or
unnecessarily explaining simple and obvious things you might already
know

regards,
mariusz jadczak

> CREATE OR REPLACE FUNCTION range_continuous_merge( daterange,
> daterange, date ) RETURNS daterange AS $$
> 
> DECLARE
>     res daterange;
> 
> BEGIN
> res:=  CASE
>     WHEN $1 IS NULL AND NOT $2 @> $3 THEN NULL
>     WHEN $1 IS NULL AND $2 @> $3 THEN $2
>     WHEN ($1 && $2) OR ($1 -|- $2) THEN $1 + $2
>     WHEN NOT $1 @> $3 THEN $2
>     ELSE $1
> END;
> RAISE NOTICE 'Inputs: %,%,%.  Returning %',$1::text,$2::text,
> $3::text,res;
> RETURN res;
> END;
> $$ LANGUAGE plpgsql STABLE;
> 
> CREATE AGGREGATE range_continuous( daterange, date ) (
> 
>     sfunc =  range_continuous_merge,
>     stype = daterange
> --  initcond = '{0,0,0}'
> 
> );
> 
> SELECT
> client_id,episode,start_date,end_date,range_continuous(daterange(start_date,end_date,'[]'),start_date) OVER (PARTITION by client_id) FROM sample_data ;
> 
> SELECT
> client_id,episode,start_date,end_date,agg_daterange(daterange(start_date,end_date,'[]')) OVER (PARTITION by client_id) FROM sample_data ;
> 
> 
> 
> -- RANGE_CONTINUOUS_MERGE
> 
> NOTICE:  Inputs: <NULL>,[1990-01-01,1991-01-01),1990-01-01.  Returning
> [1990-01-01,1991-01-01)
> NOTICE:  Inputs:
> [1990-01-01,1991-01-01),[1991-01-01,1992-01-01),1991-01-01.  Returning
> [1990-01-01,1992-01-01)
> NOTICE:  Inputs:
> [1990-01-01,1992-01-01),[2000-01-01,2001-01-01),2000-01-01.  Returning
> [2000-01-01,2001-01-01)
> NOTICE:  Inputs:
> [2000-01-01,2001-01-01),[2001-01-01,2002-01-01),2001-01-01.  Returning
> [2000-01-01,2002-01-01)
> NOTICE:  Inputs:
> [2000-01-01,2002-01-01),[2002-01-01,2003-01-01),2002-01-01.  Returning
> [2000-01-01,2003-01-01)
> NOTICE:  Inputs:
> [2000-01-01,2003-01-01),[2003-01-01,2004-01-01),2003-01-01.  Returning
> [2000-01-01,2004-01-01)
> NOTICE:  Inputs:
> [2000-01-01,2004-01-01),[2004-01-01,2005-01-01),2004-01-01.  Returning
> [2000-01-01,2005-01-01)
> NOTICE:  Inputs:
> [2000-01-01,2005-01-01),[2005-01-01,2006-01-01),2005-01-01.  Returning
> [2000-01-01,2006-01-01)
> NOTICE:  Inputs:
> [2000-01-01,2006-01-01),[2006-01-01,2007-01-01),2006-01-01.  Returning
> [2000-01-01,2007-01-01)
> NOTICE:  Inputs:
> [2000-01-01,2007-01-01),[2014-01-01,2015-01-01),2014-01-01.  Returning
> [2014-01-01,2015-01-01)
> NOTICE:  Inputs:
> [2014-01-01,2015-01-01),[2015-01-01,2016-01-01),2015-01-01.  Returning
> [2014-01-01,2016-01-01)
> NOTICE:  Inputs:
> [2014-01-01,2016-01-01),[2017-06-30,2018-01-01),2017-06-30.  Returning
> [2017-06-30,2018-01-01)
> NOTICE:  Inputs: [2017-06-30,2018-01-01),[2018-01-01,),2018-01-01.
> Returning [2017-06-30,)
> NOTICE:  Inputs: <NULL>,[2014-02-01,2015-02-01),2014-02-01.  Returning
> [2014-02-01,2015-02-01)
> NOTICE:  Inputs:
> [2014-02-01,2015-02-01),[2015-02-01,2016-01-01),2015-02-01.  Returning
> [2014-02-01,2016-01-01)
> NOTICE:  Inputs:
> [2014-02-01,2016-01-01),[2017-09-30,2018-02-01),2017-09-30.  Returning
> [2017-09-30,2018-02-01)
> NOTICE:  Inputs:
> [2017-09-30,2018-02-01),[2018-02-01,2018-02-15),2018-02-01.  Returning
> [2017-09-30,2018-02-15)
> 
>  client_id | episode | start_date |  end_date  |    range_continuous
> -----------+---------+------------+------------+-------------------------
>          1 |       1 | 1990-01-01 | 1990-12-31 | [2017-06-30,)
>          1 |       1 | 1991-01-01 | 1991-12-31 | [2017-06-30,)
>          1 |       2 | 2000-01-01 | 2000-12-31 | [2017-06-30,)
>          1 |       2 | 2001-01-01 | 2001-12-31 | [2017-06-30,)
>          1 |       2 | 2002-01-01 | 2002-12-31 | [2017-06-30,)
>          1 |       2 | 2003-01-01 | 2003-12-31 | [2017-06-30,)
>          1 |       2 | 2004-01-01 | 2004-12-31 | [2017-06-30,)
>          1 |       2 | 2005-01-01 | 2005-12-31 | [2017-06-30,)
>          1 |       2 | 2006-01-01 | 2006-12-31 | [2017-06-30,)
>          1 |       3 | 2014-01-01 | 2014-12-31 | [2017-06-30,)
>          1 |       3 | 2015-01-01 | 2015-12-31 | [2017-06-30,)
>          1 |       4 | 2017-06-30 | 2017-12-31 | [2017-06-30,)
>          1 |       4 | 2018-01-01 |            | [2017-06-30,)
>          2 |       1 | 2014-02-01 | 2015-01-31 |
> [2017-09-30,2018-02-15)
>          2 |       1 | 2015-02-01 | 2015-12-31 |
> [2017-09-30,2018-02-15)
>          2 |       2 | 2017-09-30 | 2018-01-31 |
> [2017-09-30,2018-02-15)
>          2 |       2 | 2018-02-01 | 2018-02-14 |
> [2017-09-30,2018-02-15)
> 
> (17 rows)
> 
> 
> -- AGG_DATERANGE
> 
>  client_id | episode | start_date |  end_date  |      agg_daterange
>    
> -----------+---------+------------+------------+-------------------------
>          1 |       1 | 1990-01-01 | 1990-12-31 | [2017-06-30,)
>          1 |       1 | 1991-01-01 | 1991-12-31 | [2017-06-30,)
>          1 |       2 | 2000-01-01 | 2000-12-31 | [2017-06-30,)
>          1 |       2 | 2001-01-01 | 2001-12-31 | [2017-06-30,)
>          1 |       2 | 2002-01-01 | 2002-12-31 | [2017-06-30,)
>          1 |       2 | 2003-01-01 | 2003-12-31 | [2017-06-30,)
>          1 |       2 | 2004-01-01 | 2004-12-31 | [2017-06-30,)
>          1 |       2 | 2005-01-01 | 2005-12-31 | [2017-06-30,)
>          1 |       2 | 2006-01-01 | 2006-12-31 | [2017-06-30,)
>          1 |       3 | 2014-01-01 | 2014-12-31 | [2017-06-30,)
>          1 |       3 | 2015-01-01 | 2015-12-31 | [2017-06-30,)
>          1 |       4 | 2017-06-30 | 2017-12-31 | [2017-06-30,)
>          1 |       4 | 2018-01-01 |            | [2017-06-30,)
>          2 |       1 | 2014-02-01 | 2015-01-31 |
> [2017-09-30,2018-02-15)
>          2 |       1 | 2015-02-01 | 2015-12-31 |
> [2017-09-30,2018-02-15)
>          2 |       2 | 2017-09-30 | 2018-01-31 |
> [2017-09-30,2018-02-15)
>          2 |       2 | 2018-02-01 | 2018-02-14 |
> [2017-09-30,2018-02-15)
> (17 rows)
> 
> 
> 
> 
> 
> 
> -- 
> 
> AGENCY Software  
> A Free Software data system
> By and for non-profits
> http://agency-software.org/
> https://demo.agency-software.org/client
> 
> ken.tanzer@xxxxxxxxxxxxxxxxxxx
> (253) 245-3801
> 
> 
> Subscribe to the mailing list to
> learn more about AGENCY or
> follow the discussion.