Keep in mind that using UUIDs as an ID on large data sets will use double the memory and index size of a bigint key (16 bytes vs. 8). Obviously, if that is set in stone, you probably can't readily change it, but I'd think twice before using UUIDs as the join key for billions of rows.
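For a rough sense of the per-value difference (pg_column_size is built in; gen_random_uuid() is built in as of Postgres 13, or available via pgcrypto before that):

    SELECT pg_column_size(gen_random_uuid()) AS uuid_bytes,   -- 16
           pg_column_size(1::bigint)         AS bigint_bytes; -- 8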
On Dec 11, 2021, at 2:38 PM, Wells Oliver <wells.oliver@xxxxxxxxx> wrote:
Yeah, I can take a look at that. The other issue is that simply doing SELECT * FROM vw_joints WHERE game_id = 123 is slow: joints joins to plays on play_uuid, and the index is on game_id in plays, but the query is still slower than I want it to be.
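One hypothetical rewrite that skips the view, so the planner can resolve game_id against plays before touching the big joints table (column names taken from the schema described below):

    SELECT j.*
    FROM joints j
    WHERE j.play_uuid IN (SELECT play_uuid
                          FROM plays
                          WHERE game_id = 123);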
On 12/11/21 13:32, Wells Oliver wrote:
We have hundreds of millions of joint locations across hundreds of thousands of plays across thousands of games. We also have a few hundred games with plays in them, but no joint measurements. The joint data is a measurement of body joints through time (up to 300x a second).

So my idea was that the plays table contains the game identifier (there is no root game table), and the joints table refers to the plays table by a play UUID.

As an example: one game contains 231 plays, and there are 396,144 rows of joint data.

I partitioned the joints data by month because of the volume.

The end result is that using the joints view (which joins to plays) is just dead, grinding slow, and I'm trying to think of alternatives that stay within Postgres. Maybe BigQuery and/or Redshift is more of an appropriate solution, but I don't want to go there quite yet.

Just trying to optimize this design and open to ideas. Thanks.
Hi,

Using a view that joins two tables to access data that lives in only one of them is not a good idea: you pay for the join even though it is not necessary. So if you just want the distinct game_id values and you don't care whether each play's play_uuid actually appears in joints, you can speed this up by querying the plays table directly.
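A minimal sketch of that direct query, assuming the indexed game_id column on plays described below:

    SELECT DISTINCT game_id FROM plays;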
Michel SALAIS
This feels very 101, but I feel like it should be much faster:

A table "joints" with a PK of play_uuid, target_id, joint_seq, joint_timestamp.

"joints" is partitioned using RANGE on joint_timestamp into monthly partitions: 1/1 - 2/1, 2/1 - 3/1, etc.

"joints" has an FK where play_uuid refers to table "plays", column "play_uuid", where "play_uuid" is the PK.

"plays" additionally has an indexed column game_id.

"joints" has 1,133,932,391 rows across 12 monthly partitions for 2021, and "plays" has 585,627 rows. We made a view called "vw_joints" which just does:

SELECT * FROM joints JOIN plays USING (play_uuid);
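For reference, a minimal sketch of that layout; the column types here are assumptions (the post only names the columns), and the foreign key on a partitioned table assumes Postgres 12+:

    CREATE TABLE plays (
        play_uuid uuid PRIMARY KEY,
        game_id   integer NOT NULL
    );
    CREATE INDEX ON plays (game_id);

    CREATE TABLE joints (
        play_uuid       uuid NOT NULL REFERENCES plays (play_uuid),
        target_id       integer NOT NULL,
        joint_seq       integer NOT NULL,
        joint_timestamp timestamptz NOT NULL,
        -- on a partitioned table, the PK must include the partition key
        PRIMARY KEY (play_uuid, target_id, joint_seq, joint_timestamp)
    ) PARTITION BY RANGE (joint_timestamp);

    CREATE TABLE joints_2021_01 PARTITION OF joints
        FOR VALUES FROM ('2021-01-01') TO ('2021-02-01');
    -- ...and so on, one partition per month

    CREATE VIEW vw_joints AS
        SELECT * FROM joints JOIN plays USING (play_uuid);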
SELECT DISTINCT game_id FROM vw_joints;

Takes 35-45 minutes, which seems nuts. We do this kind of design in a few different places to normalize things, but it comes at the cost of these agonizingly slow (and seemingly dead simple) queries.

Is there any optimization to do here beyond flattening the table and de-normalizing the data? Is the partitioning causing the slowness here? I feel like partitioning is creating some difficulty...
Well, you can create a trigger that fires whenever a row is inserted into or deleted from joints and maintains the game_id and row count in a separate summary table. That is the usual solution for the problem you described. Also, if there is a small number of distinct games, you can use a hash index instead of a B-tree index.
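A minimal sketch of that summary-table approach; the table, function, and trigger names here are hypothetical, and it assumes the schema sketched above:

    -- hypothetical summary table: one row per game with a joint-row count
    CREATE TABLE game_joint_counts (
        game_id     integer PRIMARY KEY,
        joint_count bigint NOT NULL DEFAULT 0
    );

    CREATE OR REPLACE FUNCTION maintain_game_joint_counts() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            -- look up the game for the new joint row and bump its count
            INSERT INTO game_joint_counts (game_id, joint_count)
            SELECT p.game_id, 1 FROM plays p WHERE p.play_uuid = NEW.play_uuid
            ON CONFLICT (game_id) DO UPDATE
                SET joint_count = game_joint_counts.joint_count + 1;
            RETURN NEW;
        ELSIF TG_OP = 'DELETE' THEN
            UPDATE game_joint_counts g
            SET joint_count = g.joint_count - 1
            FROM plays p
            WHERE p.play_uuid = OLD.play_uuid AND g.game_id = p.game_id;
            RETURN OLD;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER trg_joints_counts
        AFTER INSERT OR DELETE ON joints
        FOR EACH ROW EXECUTE FUNCTION maintain_game_joint_counts();

    -- "distinct game_id with joints" then becomes a trivial lookup:
    SELECT game_id FROM game_joint_counts WHERE joint_count > 0;

    -- and the hash-index variant mentioned above:
    CREATE INDEX plays_game_id_hash ON plays USING hash (game_id);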
--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com