
Re: Could postgres12 support millions of sequences? (like 10 million)


 



On 3/20/20 9:59 AM, Adrian Klaver wrote:
On 3/19/20 10:31 PM, pabloa98 wrote:
I see.

Any suggestion? It should behave like a sequence in the sense that concurrent transactions will get different numbers from this alternative, sequence-like solution.

In our case, we will need to call nextval('some_seq') (or similar) from different processes no more than twice every minute.
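
With plain sequences that would just be the following, repeated once per (group, element) pair, which is where the millions of sequences come from (a sketch; the name matches the one above):

-- One dedicated sequence per (group, element) pair:
CREATE SEQUENCE some_seq;

-- Any process can then draw the next number for that pair,
-- at most a couple of times per minute in our case:
SELECT nextval('some_seq');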


It would be nice to have a sequence data type. Correct me if I am wrong, but it seems to me that a sequence data type would cost the same as or less than the current sequence implementation.

The schema would be more clear too. We could have a table like:

CREATE TABLE pair(
group INT NOT NULL,
element INT NOT NULL,
seq SEQUENCE INCREMENT 1
               START 1
               CACHE 1
               MINVALUE 1
               MAXVALUE 99999999
         NOT NULL,
PRIMARY KEY (group, element)
);

And then:

INSERT INTO event(group, element, code)
VALUES (
          1,
          1,
          ( SELECT seq.nextval('code_1_1_seq') FROM pair p WHERE p.group=1 and p.element=1 )
);

Or perhaps storing all the sequences as rows in the same table would have the same behavior.
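
For example (just a sketch; the table and column names here are made up), all the per-pair counters could live in one table and be bumped with UPDATE ... RETURNING:

-- Hypothetical: one row per (group, element) pair instead of one sequence each.
CREATE TABLE pair_counter (
    group_id INT NOT NULL,
    element  INT NOT NULL,
    last_val BIGINT NOT NULL DEFAULT 0,
    PRIMARY KEY (group_id, element)
);

-- Concurrent transactions queue on the row lock, so each committed transaction
-- sees a distinct value; unlike nextval(), the increment is undone on rollback.
UPDATE pair_counter
   SET last_val = last_val + 1
 WHERE group_id = 1 AND element = 1
RETURNING last_val;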

If code is just something to show the order of insertion for (group, element) combinations, then maybe something like below:

CREATE TABLE event(
group_id INT NOT NULL, --changed as group is reserved word
element INT NOT NULL,
insert_ts timestamptz NOT NULL DEFAULT clock_timestamp(),
PRIMARY KEY(group_id, element, insert_ts)
);


insert into event(group_id, element) VALUES
(1, 1),
(1, 1),
(1, 1),
(2, 1),
(1, 1),
(1, 3),
(1, 1),
(1, 3),
(2, 1),
(2, 1);


select * from event ;
  group_id | element |           insert_ts
----------+---------+--------------------------------
         1 |       1 | 03/20/2020 09:51:12.675926 PDT
         1 |       1 | 03/20/2020 09:51:12.675985 PDT
         1 |       1 | 03/20/2020 09:51:12.675991 PDT
         2 |       1 | 03/20/2020 09:51:12.675996 PDT
         1 |       1 | 03/20/2020 09:51:12.676 PDT
         1 |       3 | 03/20/2020 09:51:12.676004 PDT
         1 |       1 | 03/20/2020 09:51:12.676008 PDT
         1 |       3 | 03/20/2020 09:51:12.676012 PDT
         2 |       1 | 03/20/2020 09:51:12.676016 PDT
         2 |       1 | 03/20/2020 09:51:12.67602 PDT
(10 rows)


select group_id, element, row_number() OVER (partition by (group_id, element) order by (group_id, element)) AS code from event;
  group_id | element | code
----------+---------+------
         1 |       1 |    1
         1 |       1 |    2
         1 |       1 |    3
         1 |       1 |    4
         1 |       1 |    5
         1 |       3 |    1
         1 |       3 |    2
         2 |       1 |    1
         2 |       1 |    2
         2 |       1 |    3
(10 rows)

Oops, the above actually returned:

select group_id, element, row_number() OVER (partition by (group_id, element) order by (group_id, element)) AS code, insert_ts from event;
 group_id | element | code |           insert_ts
----------+---------+------+--------------------------------
        1 |       1 |    1 | 03/20/2020 09:51:12.675926 PDT
        1 |       1 |    2 | 03/20/2020 09:51:12.675985 PDT
        1 |       1 |    3 | 03/20/2020 09:51:12.675991 PDT
        1 |       1 |    4 | 03/20/2020 09:51:12.676008 PDT
        1 |       1 |    5 | 03/20/2020 09:51:12.676 PDT
        1 |       3 |    1 | 03/20/2020 09:51:12.676004 PDT
        1 |       3 |    2 | 03/20/2020 09:51:12.676012 PDT
        2 |       1 |    1 | 03/20/2020 09:51:12.67602 PDT
        2 |       1 |    2 | 03/20/2020 09:51:12.676016 PDT
        2 |       1 |    3 | 03/20/2020 09:51:12.675996 PDT
(10 rows)


Needs to be:

select group_id, element, row_number() OVER (partition by (group_id, element) order by (group_id, element, insert_ts)) AS code, insert_ts from event;
 group_id | element | code |           insert_ts
----------+---------+------+--------------------------------
        1 |       1 |    1 | 03/20/2020 09:51:12.675926 PDT
        1 |       1 |    2 | 03/20/2020 09:51:12.675985 PDT
        1 |       1 |    3 | 03/20/2020 09:51:12.675991 PDT
        1 |       1 |    4 | 03/20/2020 09:51:12.676 PDT
        1 |       1 |    5 | 03/20/2020 09:51:12.676008 PDT
        1 |       3 |    1 | 03/20/2020 09:51:12.676004 PDT
        1 |       3 |    2 | 03/20/2020 09:51:12.676012 PDT
        2 |       1 |    1 | 03/20/2020 09:51:12.675996 PDT
        2 |       1 |    2 | 03/20/2020 09:51:12.676016 PDT
        2 |       1 |    3 | 03/20/2020 09:51:12.67602 PDT
(10 rows)



Pablo


On Thu, Mar 19, 2020 at 7:56 PM Tom Lane <tgl@xxxxxxxxxxxxx> wrote:

    Michael Lewis <mlewis@xxxxxxxxxxx> writes:
     > On Thu, Mar 19, 2020, 5:48 PM David G. Johnston <david.g.johnston@xxxxxxxxx>
     > wrote:
      >> However, one other consideration with sequences: do you care that
      >> PostgreSQL will cache/pin (i.e., no release) every single sequence you
      >> touch for the lifetime of the session? (I do not think DISCARD matters here
      >> but I'm just guessing)

     > Would you expand on this point or is there someplace specific in the
     > documentation on this?

    I think what David is worried about is that a sequence object is a
    one-row table in PG's implementation.  Thus

    (1) each sequence requires a dozen or two rows in assorted system
    catalogs (not sure exactly how many offhand).

    (2) each sequence uses up 8KB on disk for its table file.

    (3) each sequence you actually access within a particular session
    results in creation of relcache and catcache entries in that
    session's local memory.  I'm not sure offhand how big those are
    either, but a few KB per sequence would be a reasonable guess.

    (4) each sequence competes for space in the shared-buffer arena,
    since its 8K block has to be swapped into there whenever you try
    to access/increment the sequence's value.

    This is all mighty inefficient of course, and there's been talk
    of trying to reduce the per-sequence overhead; but I don't know
    of anyone actively working on that.  As things stand, I think
    having millions of sequences would be quite painful performance-
    wise, especially if your workload were actively hitting a lot
    of them concurrently.  It would work, for some value of "work",
    but it wouldn't perform very well.

    Also, as multiple people mentioned already, this still wouldn't
    guarantee gap-free sequences of ID values.

                             regards, tom lane
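
For anyone who wants to see part of that per-sequence footprint on a test database, a rough sketch (the sequence name is a throwaway):

CREATE SEQUENCE some_seq;

-- The sequence's backing one-row table: a single 8KB block on disk.
SELECT pg_relation_size('some_seq');

-- Some of the catalog baggage: the pg_depend rows that reference it.
SELECT count(*)
FROM pg_depend
WHERE (classid = 'pg_class'::regclass AND objid = 'some_seq'::regclass)
   OR (refclassid = 'pg_class'::regclass AND refobjid = 'some_seq'::regclass);

DROP SEQUENCE some_seq;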





--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx




