Search Postgresql Archives

Re: Select ranges based on sequential breaks

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Joel,

Window functions appear to be the best solution for this style of problem, and I'm looking forward to their applications. However, I'm sticking with 8.3 for at least a year, so I'm not able to explore this solution yet. For now, I can only post-process the output in a non-SQL environment. I also need to do other fun stuff, like cumulative sums, which is also challenging with SQL, but much easier and intuitive with R.

An excellent book that I recently stumbled on is /Joe Celko's SQL for Smarties/ (recommended by someone on this list), which is a heavy read, but has an amazing depth to ANSI SQL. Of particular interest is "Chapter 24: Regions, Runs, Gaps, Sequences, and Series". I'm slowly digesting this.

-Mike

Joel Nothman wrote:
Hi Mike,

I happened upon your query, which is related to some stuff I've been
playing with.

Firstly, David's solution below doesn't work. I haven't yet tried to work
out why.

Secondly, I was hoping to be able to solve your problem nicely with
Postgres 8.4's window functions [1,2], which can provide access to
data in sequentially-related rows.

Given the following setup:
CREATE TABLE foo (dt date, bin varchar(4));
INSERT INTO foo VALUES ('2009-01-01', 'red'), ('2009-01-02', 'red'),
('2009-01-03', 'blue'), ('2009-01-04', 'blue'), ('2009-01-05', 'blue'),
('2009-01-06', 'red'), ('2009-01-07', 'blue'), ('2009-01-08', 'blue'),
('2009-01-09', 'red'), ('2009-01-10', 'red');

I had hoped the following would suffice:
SELECT first_value(dt) OVER w, last_value(dt) OVER w, bin FROM foo WINDOW
w AS (ORDER BY dt PARTITION BY bin);

Apparently, this is bad syntax. ORDER BY cannot precede PARTITION BY in a
WINDOW declaration, and yet I wanted a grouping of date-consecutive bins,
which (PARTITION BY bin ORDER BY dt) would not give me.

I was able to produce the required result with:

SELECT MIN(dt) AS first, MAX(dt) AS last, MAX(bin) AS bin
      FROM (SELECT dt, bin,
                   CASE WHEN newbin IS NULL
                   THEN 0
                   ELSE SUM(newbin) OVER (ORDER BY dt)
                   END AS binno
              FROM (SELECT *,
                           (bin !=3D lag(bin, 1) OVER (ORDER BY dt))::int A=
S
newbin
                      FROM foo
                   ) AS newbins
              ) AS binnos
GROUP BY binno
ORDER BY first;

This relies on a middle step in which I create an enumeration of the bins
in sequence:
SELECT dt, bin, CASE WHEN newbin IS NULL THEN 0 ELSE sum(newbin) OVER
(ORDER BY dt) END AS binno FROM (SELECT *, (bin !=3D lag(bin, 1) OVER (ORDE=
R
BY dt))::int AS newbin FROM foo) AS newbins;
         dt     | bin  | binno
------------+------+-------
     2009-01-01 | red  |     0
     2009-01-02 | red  |     0
     2009-01-03 | blue |     1
     2009-01-04 | blue |     1
     2009-01-05 | blue |     1
     2009-01-06 | red  |     2
     2009-01-07 | blue |     3
     2009-01-08 | blue |     3
     2009-01-09 | red  |     4
     2009-01-10 | red  |     4

I would hope there is a neater way to do this with window functions.

The best way to solve your problem may be with PL/SQL, which is also good
at dealing with sequences (it's not as complicated as it looks!):

CREATE TYPE bindates AS (first date, last date, bin varchar(5));
CREATE OR REPLACE FUNCTION bindates()
RETURNS SETOF bindates
AS $$
DECLARE
      row record;
      res bindates;
BEGIN
      FOR row IN SELECT * FROM foo ORDER BY dt
      LOOP
        IF res.bin IS NULL OR res.bin !=3D row.bin THEN
          IF res.bin IS NOT NULL THEN
            RETURN NEXT res;
          END IF;
          res.bin :=3D row.bin;
          res.first :=3D row.dt;
        END IF;
        res.last :=3D row.dt;
      END LOOP;
      IF res.bin IS NOT NULL THEN
        RETURN NEXT res;
      END IF;
END;
$$ LANGUAGE plpgsql;


Finally, I'll try to sort out David's solution. Once we correct his typo
(t1.order -> t1.date) and add an 'ORDER BY first' to the end, we get:
       first    |    last    | bin
------------+------------+------
     2009-01-03 | 2009-01-05 | blue
     2009-01-06 | 2009-01-06 | red
     2009-01-07 | 2009-01-08 | blue
     2009-01-09 |            | red

This includes correct output, but it fails on both edge cases. The
non-appearance of the first row is due to the WHERE clause on the main
SELECT statement:
WHERE (SELECT bin
           FROM foo t2
           WHERE t2.dt < t1.dt
           ORDER BY dt DESC LIMIT 1) <> t1.bin

If we drop this WHERE clause, we get:
       first    |    last    | bin
------------+------------+------
     2009-01-01 | 2009-01-02 | red
     2009-01-02 | 2009-01-02 | red
     2009-01-03 | 2009-01-05 | blue
     2009-01-04 | 2009-01-05 | blue
     2009-01-05 | 2009-01-05 | blue
     2009-01-06 | 2009-01-06 | red
     2009-01-07 | 2009-01-08 | blue
     2009-01-08 | 2009-01-08 | blue
     2009-01-09 |            | red
     2009-01-10 |            | red

We can therefore get the result including the first row by selecting from
this table with 'GROUP BY last, bin'.

And we can hack in a value for those final NULLs as a special case. The
following statement works:

SELECT MIN(first),
           CASE WHEN last IS NULL THEN (SELECT MAX(dt) FROM foo) ELSE last
END,
           bin
    FROM (
      SELECT dt AS first,
             (SELECT dt
              FROM foo t3
              WHERE t3.dt < (
                  SELECT dt
                  FROM foo t5
                  WHERE t5.dt > t1.dt
                    AND t5.bin <> t1.bin
                  ORDER BY dt ASC
                  LIMIT 1)
              ORDER BY dt DESC
              LIMIT 1) AS last,
              bin
      FROM foo t1
) t0
GROUP BY last, bin
ORDER BY last;


Finally, what's efficient? With 1,000,000 random rows, we get:
Enumeration: 13s
PL/SQL: 12s
Modified David: minutes.

[I used the following to create test data:
CREATE OR REPLACE FUNCTION make_random(n int) RETURNS SETOF foo AS $$
import random
for i in xrange(n):
      m =3D random.randint(1,12)
      d =3D random.randint(1,28)
      b =3D random.choice(('red', 'blue'))
      yield '2009-%d-%d' % (m, d), b
$$ LANGUAGE plpythonu;
DELETE FROM foo; INSERT INTO foo (SELECT * FROM make_random(1000000));]

I hope that helps you in considering various approaches to the problem.

- Joel


[1] http://developer.postgresql.org/pgdocs/postgres/tutorial-window.html
[2] http://developer.postgresql.org/pgdocs/postgres/functions-window.html


On Tue, 16 Jun 2009 06:06:16 +1000, David Wilson
<david.t.wilson@xxxxxxxxx> wrote:



--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux