Re: Multimaster

On 18 April 2016 at 16:28, Konstantin Knizhnik <k.knizhnik@xxxxxxxxxxxxxx> wrote:
 
I intend to make the same split in pglogical itself - a receiver and apply worker split. Though my intent is to have them communicate via a shared memory segment until/unless the apply worker gets too far behind and spills to disk.


In the case of multimaster, the "too far behind" scenario can never happen.

I disagree. In the case of tightly coupled synchronous multi-master it can't happen, sure. But that's hardly the only case of multi-master out there.

I expect you'll want the ability to weaken synchronous guarantees for some commits anyway, like we have with physical replication's synchronous_commit = remote_write, synchronous_commit = local, etc. In that case lag becomes relevant again.

You might also want to be able to spool a big tx to temporary storage even as you apply it, if you're running over a WAN or something. That way if you crash during apply you don't have to transfer the data over the WAN again. Like we do with physical replication, where we write the WAL to disk then replay from disk.

I agree that spilling to disk isn't needed for the simplest cases of synchronous logical MM. But it's far from useless.
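
To illustrate the kind of receiver-side behaviour being discussed, here's a rough sketch using the shm_mq API; the queue handle, spill file and helper function are invented names for illustration, not pglogical's actual code.

/* Rough sketch (hypothetical names): the receiver hands each decoded change
 * to the apply worker over a shared memory queue and starts spilling to a
 * file once the apply worker is too far behind to keep up. */
#include "postgres.h"
#include "storage/shm_mq.h"

static shm_mq_handle *apply_mqh;    /* queue to the apply worker (hypothetical) */
static FILE          *spill_file;   /* non-NULL once we have started spilling */

static void
forward_change(const char *data, Size len)
{
    if (spill_file == NULL)
    {
        /* Try to hand the change over without blocking. */
        shm_mq_result res = shm_mq_send(apply_mqh, len, data, true);

        if (res == SHM_MQ_SUCCESS)
            return;
        if (res == SHM_MQ_DETACHED)
            ereport(ERROR, (errmsg("apply worker exited unexpectedly")));

        /* SHM_MQ_WOULD_BLOCK: apply worker is too far behind, spill to disk.
         * Real code would use a BufFile under pgsql_tmp, not a bare fopen(). */
        spill_file = fopen("pglogical_spill.tmp", "wb");
    }

    /* Append length-prefixed change data; the apply worker replays it later. */
    fwrite(&len, sizeof(len), 1, spill_file);
    fwrite(data, len, 1, spill_file);
}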
 
It seems to me that the pglogical plugin is now becoming too universal, trying to address a lot of different issues and play different roles.

I'm not convinced. They're all closely related, overlapping, and require much of the same functionality. While some use cases don't need certain pieces of functionality, they can still be _useful_. Asynchronous MM replication doesn't need table mapping and transforms, for example ... except that in reality lots of the flexibility offered by replication sets, table mapping, etc is actually really handy in MM too.

We may well want to move much of that into core and have much thinner plugins, but the direction Andres, Robert etc are talking about seems to be more along the lines of a fully in-core logical replication subsystem. It'll need to (eventually) meet all these sorts of needs.

Before you start cutting, or assuming you need something very separate, I suggest taking a closer look at why each piece is there, whether there's truly any significant performance impact, and whether it can be avoided without just cutting out the functionality entirely.

1. Asynchronous replication (including georeplication) - this is actually BDR.

Well, BDR is asynchronous MM. There's also the single-master case and related ones for non-overlapping multimaster where any given set of tables are only written on one node.
 
2. Logical backup: transfer data to a different database (including a new version of Postgres)

I think that's more HA than logical backup. Needs to be able to be synchronous or asynchronous, much like our current phys.rep.

Closely related but not quite the same is logical read replicas/standbys.
 
3. Change notification: there are many different subscribers which can be interested in receiving notifications about database changes.

Yep. I suspect we'll want a json output plugin for this, separate to pglogical etc, but we'll need to move a bunch of functionality from pglogical into core so it can be shared rather than duplicated.
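
For anyone following along, such a plugin is just the usual logical decoding output plugin plumbing. A minimal, hypothetical skeleton (the actual JSON formatting is elided) might look like:

#include "postgres.h"
#include "fmgr.h"
#include "lib/stringinfo.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"
#include "utils/rel.h"

PG_MODULE_MAGIC;

extern void _PG_output_plugin_init(OutputPluginCallbacks *cb);

/* Emit one small JSON-ish fragment per decoded event. */
static void
notify_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
    OutputPluginPrepareWrite(ctx, true);
    appendStringInfo(ctx->out, "{\"action\":\"begin\",\"xid\":%u}", txn->xid);
    OutputPluginWrite(ctx, true);
}

static void
notify_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
              Relation relation, ReorderBufferChange *change)
{
    OutputPluginPrepareWrite(ctx, true);
    appendStringInfo(ctx->out, "{\"action\":\"change\",\"table\":\"%s\"}",
                     RelationGetRelationName(relation));
    OutputPluginWrite(ctx, true);
}

static void
notify_commit(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
              XLogRecPtr commit_lsn)
{
    OutputPluginPrepareWrite(ctx, true);
    appendStringInfoString(ctx->out, "{\"action\":\"commit\"}");
    OutputPluginWrite(ctx, true);
}

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
    cb->begin_cb = notify_begin;
    cb->change_cb = notify_change;
    cb->commit_cb = notify_commit;
}

The shared functionality (row formatting, type output, filtering) is exactly the part that would need to move into core so it isn't duplicated across plugins like this.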
 
4. Synchronous replication: multimaster

"Synchronous multimaster". Not all multimastrer is synchronous, not all synchronous replication is multimaster. 

We are not enforcing the order of commits as Galera does. Consistency is enforced by the DTM, which ensures that transactions at all nodes are given consistent snapshots and are assigned the same CSNs. We also have a global deadlock detection algorithm which builds a global lock graph (but false positives are still possible, because this graph is built incrementally and so doesn't correspond to a single global snapshot).

OK, so you're relying on a GTM to determine safe, conflict-free apply orderings.

I'm ... curious ... about how you do that. Do you have a global lock manager too? How do you determine ordering for things that in a single-master case are addressed via unique b-tree indexes, not (just) heavyweight locking?
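
To sketch the general technique (purely illustrative, not the DTM code under discussion): each node reports its local wait-for edges, the coordinator unions them, and a cycle in the union indicates a candidate distributed deadlock. Because edges arrive incrementally, some may already be stale by the time a cycle is found, hence the false positives mentioned above.

/* Toy global wait-for graph over transaction ids; illustrative only. */
#include <stdbool.h>

#define MAX_TX 64

static bool waits_for[MAX_TX][MAX_TX];  /* union of edges reported by all nodes */

/* Each node periodically reports "xid waits for other_xid". */
void
report_edge(int xid, int other_xid)
{
    waits_for[xid][other_xid] = true;
}

static bool
dfs(int tx, bool *on_stack, bool *visited)
{
    visited[tx] = on_stack[tx] = true;
    for (int next = 0; next < MAX_TX; next++)
    {
        if (!waits_for[tx][next])
            continue;
        if (on_stack[next])
            return true;                /* cycle: candidate deadlock */
        if (!visited[next] && dfs(next, on_stack, visited))
            return true;
    }
    on_stack[tx] = false;
    return false;
}

/* A cycle found here is only a candidate: the edges were collected at
 * different times on different nodes, so some may no longer exist. */
bool
deadlock_suspected(void)
{
    bool on_stack[MAX_TX] = {false};
    bool visited[MAX_TX] = {false};

    for (int tx = 0; tx < MAX_TX; tx++)
        if (!visited[tx] && dfs(tx, on_stack, visited))
            return true;
    return false;
}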

Multimaster is just a particular (and the simplest) case of distributed transactions. What is specific to multimaster is that the same transaction has to be applied at all nodes and that selects can be executed at any node.

That's the specification of your symmetric, synchronous, tightly-coupled multimaster design, yes. Which sounds like it's intended to be transparent or near-transparent multi-master clustering.
 

The only exception is recovery of a multimaster node. In this case we have to apply transactions in exactly the same order as they were applied at the original node. This is done by the pglogical_receiver itself applying changes in recovery mode.

I'm not sure I understand what you are saying here.

Sorry for being unclear.
I just said that normally transactions are applied concurrently by multiple workers, and the DTM is used to enforce consistency.
But in the case of recovery (when some node has crashed and then reconnects to the cluster), we perform recovery of this node sequentially, with a single worker. In this case the DTM is not used (because the other nodes are far ahead), and to restore the same state of the node we need to apply changes in exactly the same order as at the source node. In this case the content of the target (recovered) node should be the same as that of the source node.

OK, that makes perfect sense.

Presumably in this case you could save a local snapshot of the DTM's knowledge of the correct apply ordering of those tx's as you apply, so when you crash you can consult that saved ordering information to still parallelize apply later.
 

 
We are now replicating DDL in a way similar to the one used in BDR: DDL statements are inserted into a special table and are replayed at the destination node as part of the transaction.
We also have an alternative implementation done by Artur Zakirov <a.zakirov@xxxxxxxxxxxxxx>.
A patch for custom WAL records was committed in 9.6, so we are going to switch to this approach.

How does that really improve anything over using a table?

It is a more straightforward approach, isn't it? You can either try to restore the DDL from the low-level sequence of updates to the system catalog.
But that is difficult and not always possible.

Understatement of the century ;) 
 
Or you need to somehow add the original DDL statements to the log.

Actually you need to be able to add normalized statements to the xlog. The original DDL text isn't quite good enough due to issues with search_path among other things. Hence DDL deparse.
 
I agree that custom WAL records add no performance or functionality advantages over using a table.
This is why we still haven't switched to them. But IMHO the approach of inserting DDL (or any other user-defined information) into a special table looks like a hack.

Yeah, it is a hack. Logical WAL messages do provide a cleaner way to do it, though with the minor downside that they're opaque to the user, who can't see what DDL is being done / due to be done anymore. I'd rather do it with generic logical WAL messages in future, now that they're in core. 
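
For reference, a sketch of what that looks like with the 9.6 facilities; the "my_ddl" prefix and the idea of shipping the deparsed DDL text in the payload are illustrative assumptions, not an existing implementation:

#include "postgres.h"
#include "lib/stringinfo.h"
#include "replication/logical.h"
#include "replication/message.h"
#include "replication/output_plugin.h"

/* Upstream: write the (deparsed) DDL into WAL as a transactional logical
 * message instead of inserting it into a queue table. */
static void
queue_ddl_via_wal(const char *ddl_text)
{
    LogLogicalMessage("my_ddl", ddl_text, strlen(ddl_text) + 1, true);
}

/* Downstream: an output plugin receives it through the message callback
 * added in 9.6 (hooked up as cb->message_cb in _PG_output_plugin_init)
 * and forwards it to the subscriber, which executes the DDL on commit. */
static void
ddl_message_cb(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
               XLogRecPtr lsn, bool transactional,
               const char *prefix, Size sz, const char *message)
{
    if (strcmp(prefix, "my_ddl") != 0)
        return;                         /* not ours, ignore */

    OutputPluginPrepareWrite(ctx, true);
    appendBinaryStringInfo(ctx->out, message, sz);
    OutputPluginWrite(ctx, true);
}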
 
Also, the pglogical plugin now contains a lot of code which performs mapping between source and target database schemas, so it is assumed that they may be different.
But this is not true in the case of multimaster, and I do not want to pay an extra cost for functionality we do not need.

All it's really doing is mapping upstream to downstream tables by name, since the oids will be different.

Really?
Why then do you send all the table metadata (information about attributes) and handle invalidation messages?

Right, you meant columns, not tables.

See DESIGN.md.

We can't just use attno since column drops on one node will cause attno to differ even if the user-visible table schema is the same.

BDR solves this (now) by either initializing nodes from a physical pg_basebackup of another node, including dropped cols etc, or using pg_dump's binary upgrade mode to preserve dropped columns when bringing a node up from a logical copy.

That's not viable for general purpose logical replication like pglogical, so we send a table attribute mapping.
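
Roughly speaking, the downstream side of that mapping boils down to resolving the upstream names to local OIDs and attnos; a hypothetical sketch (names invented, not pglogical's actual structures):

#include "postgres.h"
#include "catalog/namespace.h"
#include "nodes/makefuncs.h"
#include "storage/lockdefs.h"
#include "utils/lsyscache.h"

/* What the upstream sends: schema, relation and column names. */
typedef struct UpstreamRelMeta
{
    const char  *nspname;
    const char  *relname;
    int          natts;
    const char **attnames;      /* column names in upstream attno order */
} UpstreamRelMeta;

/* Resolve to the local relation OID and fill attmap[i] with the local
 * attno for upstream column i (InvalidAttrNumber if it doesn't exist),
 * since neither OIDs nor attnos can be assumed to match across nodes. */
static Oid
map_upstream_relation(const UpstreamRelMeta *meta, AttrNumber *attmap)
{
    RangeVar   *rv = makeRangeVar((char *) meta->nspname,
                                  (char *) meta->relname, -1);
    Oid         relid = RangeVarGetRelid(rv, AccessShareLock, false);

    for (int i = 0; i < meta->natts; i++)
        attmap[i] = get_attnum(relid, meta->attnames[i]);

    return relid;
}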

I agree that this can be avoided if the system can guarantee that the upstream and downstream tables have exactly the same structure, including dropped columns. It can only guarantee that when it has DDL replication and all DDL is either replicated or blocked from being run. That's the approach BDR tries to take, and it works, with problems. One of those problems you won't have, because it's caused by the need to sync up the otherwise asynchronous cluster so there are no outstanding committed-but-not-replayed changes for the old table structure on any node before we change the structure on all nodes. But others you will have: coverage of DDL replication, problems with full table rewrites, etc.

I think it would be reasonable for pglogical to offer the option of sending a minimal table metadata message that simply says that it expects the downstream to deal with the upstream attnos exactly as-is, either by having them exactly the same or managing its own translations. In this case column mapping etc can be omitted. Feel free to send a patch.
 
Multimaster really only needs to map local to remote OIDs. We do not need to provide any attribute mapping or handle catalog invalidations.

For synchronous tightly-coupled multi-master with a GTM and GLM that doesn't allow non-replicated DDL, yes, I agree.



--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
