On 4/20/2019 16:01, Tomas Vondra wrote:
> For me, this did the trick:

YES! For me too. My EXPLAIN ANALYZE actually succeeded:

 Hash Right Join  (cost=11009552.27..11377073.28 rows=11 width=4271) (actual time=511580.110..1058354.140 rows=113478386 loops=1)
   Hash Cond: (((q.documentinternalid)::text = (r.documentinternalid)::text) AND ((q.actinternalid)::text = (r.actinternalid)::text))
   ->  Seq Scan on tmp_q q  (cost=0.00..210021.00 rows=21000000 width=3417) (actual time=1.148..1.387 rows=236 loops=1)
   ->  Hash  (cost=11009552.11..11009552.11 rows=11 width=928) (actual time=511577.002..511577.002 rows=113478127 loops=1)
         Buckets: 16384 (originally 1024)  Batches: 131072 (originally 1)  Memory Usage: 679961kB
         ->  Seq Scan on tmp_r r  (cost=0.00..11009552.11 rows=11 width=928) (actual time=4.077..344558.954 rows=113478127 loops=1)
 Planning Time: 0.725 ms
 Execution Time: 1064128.721 ms

But it used a lot of resident memory, and now it seems like I actually have a leak! Because after the command returned as shown above, the memory is still allocated:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 7100 postgres  20   0 2164012   1.1g 251364 S   0.0 14.5  23:27.23 postgres: postgres integrator [local] idle

and let's do the memory map dump:

2019-04-20 22:09:52.522 UTC [7100] LOG:  duration: 1064132.171 ms  statement: explain analyze SELECT * FROM tmp_q q RIGHT OUTER JOIN tmp_r r USING(documentInternalId, actInternalId);

TopMemoryContext: 153312 total in 8 blocks; 48168 free (70 chunks); 105144 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 8192 total in 1 blocks; 6896 free (1 chunks); 1296 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7936 free (1 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1154080 total in 20 blocks; 151784 free (1 chunks); 1002296 used
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_class_tblspc_relfilenode_index
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx
    ...
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 4992 free (6 chunks); 3200 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (3 chunks); 256 used
Grand total: 2082624 bytes in 240 blocks; 382760 free (175 chunks); 1699864 used

Strange: it shows no leak here.
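(For reference, a memory context dump like this can be taken from a live backend at any time by attaching gdb and calling MemoryContextStats; the output goes to the backend's stderr, which typically ends up in the server log. Something like:)

  $ gdb -p 7100
  (gdb) call MemoryContextStats(TopMemoryContext)
  (gdb) detach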
Now I run this test again, to see whether the memory grows further in top. This time I also leave the DISTINCT step in the query; I am trying to hit the out-of-memory situation. Well, I clearly saw memory growing now:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 7100 postgres  20   0 2601900   1.5g 251976 R  97.7 19.9  36:32.23 postgres: postgres integrator [local] EXPLAIN

TopMemoryContext: 2250520 total in 9 blocks; 45384 free (56 chunks); 2205136 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1456 free (0 chunks); 6736 used
  TopTransactionContext: 8192 total in 1 blocks; 7528 free (1 chunks); 664 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 65536 total in 4 blocks; 28664 free (9 chunks); 36872 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used
    PortalContext: 147456 total in 21 blocks; 10400 free (7 chunks); 137056 used:
      ExecutorState: 489605432 total in 57794 blocks; 5522192 free (129776 chunks); 484083240 used
        HashTableContext: 2162800 total in 6 blocks; 64848 free (35 chunks); 2097952 used
          HashBatchContext: 706576176 total in 21503 blocks; 7936 free (0 chunks); 706568240 used
        TupleSort main: 452880 total in 8 blocks; 125880 free (29 chunks); 327000 used
          Caller tuples: 4194304 total in 10 blocks; 452280 free (20 chunks); 3742024 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used
  CacheMemoryContext: 1154080 total in 20 blocks; 149992 free (1 chunks); 1004088 used
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_class_tblspc_relfilenode_index
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    ...
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 4992 free (6 chunks); 3200 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
Grand total: 1207458200 bytes in 79595 blocks; 6639272 free (130033 chunks); 1200818928 used

But Executor state is only 489 MB, so it is in the area of slow but massive growth:
ExecutorState: 489605432 total in 57794 blocks; 5522192 free (129776 chunks); 484083240 used
  HashTableContext: 2162800 total in 6 blocks; 64848 free (35 chunks); 2097952 used
    HashBatchContext: 706576176 total in 21503 blocks; 7936 free (0 chunks); 706568240 used
  TupleSort main: 452880 total in 8 blocks; 125880 free (29 chunks); 327000 used
    Caller tuples: 4194304 total in 10 blocks; 452280 free (20 chunks); 3742024 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used

Now I see if I can run it to completion anyway, and whether there will then be a new memory floor. Now the newly allocated memory seems to have been released, so we are back at the 1.1 GB baseline we started out with:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 7100 postgres  20   0 2164012   1.1g 251976 D   6.3 14.5  55:26.82 postgres: postgres integrator [local] EXPLAIN

On the other hand, the sort step is not yet finished. Also, while we might have focused in on a peculiar planning situation, where a very unfortunate plan is chosen that stresses the memory situation, I think the real reason for the final out-of-memory situation has not yet been determined. Remember, I have seen three stages in my original query.
It might well be that my initial query would have succeeded just
fine despite the unfortunate plan with the big memory consumption
on the oddly planned hash join, were it not for that third phase
of explosive growth. And we haven't been able to catch this,
because it happens too quickly.

It seems to me that some better memory tracing would be necessary. Looking at src/backend/utils/memdebug.c, I see it mentions Valgrind. But to me, Valgrind is a huge toolbox to bring in just to watch this one thing. I wonder if we could not make a much simpler memory-leak debugging tool: one that is fast, yet doesn't produce so much output that it overwhelms the log destination's file system (and wastes too much time). There are already Debug macros there which, if enabled, just create an absolutely crazy amount of undecipherable log file content, because ALL backend processes would spit out this blabber. So I already stopped that by adding a variable that must be set to 1 (using the debugger, on exactly one process, for exactly the time desired):

int _alloc_info = 0;
#ifdef HAVE_ALLOCINFO
#define AllocFreeInfo(_cxt, _chunk) \
    if (_alloc_info) \
        fprintf(stderr, "AllocFree: %s: %p, %zu\n", \
                (_cxt)->header.name, (_chunk), (_chunk)->size)
#define AllocAllocInfo(_cxt, _chunk) \
    if (_alloc_info) \
        fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \
                (_cxt)->header.name, (_chunk), (_chunk)->size)
#else
#define AllocFreeInfo(_cxt, _chunk)
#define AllocAllocInfo(_cxt, _chunk)
#endif
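For example, flipping that switch on and off for a single backend, for just the window of interest, can be done from gdb (a rough sketch; the variable is the one from my change above):

  $ gdb -p <pid of the backend to trace>
  (gdb) set var _alloc_info = 1
  (gdb) continue
  ... let it run while the interesting allocations happen, then hit Ctrl-C ...
  (gdb) set var _alloc_info = 0
  (gdb) detach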
But now I am thinking that this should be the hook for a limited cache in which we can cancel out AllocSetAlloc calls against the AllocSetFree calls that follow them relatively quickly, which is apparently the majority of the log chatter created.

The memory debugger would allocate a single fixed memory chunk, say 8 or 16 kB, as a cache per each memory context that is actually traced. In it we would record each allocation in the shortest possible way, with everything compressed: instead of a pointer, a reference to the memory would be stored as some chunk index, and a very brief backtrace would be stored in compressed form, as the instruction pointer ($rip) of the caller followed by variable-length deltas to the $rip of the next caller up. These could even be stored in an index of the 100 most common caller chains to compress the data some more, while keeping the search cost low. Each allocated chunk would be kept in this table, and when the table fills up, the oldest allocated chunk is evicted and written to the log file. When a chunk is freed before being evicted from the cache, it is silently removed. When a chunk is freed that is no longer in the cache, the free event is recorded in the log. That way only those chunks that have endured beyond the capacity of the cache ever get written to the log files, and hopefully those are the chunks most likely involved in the memory leak. Once the log has been created, it can be loaded into a PostgreSQL table and analyzed to find the chunks that never get freed, and from the compressed backtraces one can figure out where they were allocated.
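To make that a bit more concrete, here is a rough standalone sketch of the cache part (an illustration of the idea, not a patch: all names are made up, the per-context hookup into AllocSetAlloc/AllocSetFree is left out, and the compressed $rip-chain backtrace is reduced to a single caller address captured with __builtin_return_address):

/* leakcache.c -- standalone sketch, not PostgreSQL code; build with: cc -o leakcache leakcache.c */
#include <stdio.h>
#include <stdlib.h>

#define LEAKCACHE_SLOTS 512     /* 512 slots * 32 bytes = 16 kB per traced context */

typedef struct
{
    void       *ptr;            /* address of the allocated chunk */
    size_t      size;           /* requested size */
    void       *caller;         /* stand-in for the compressed $rip chain */
    unsigned long seq;          /* allocation sequence number, drives eviction */
} leak_entry;

typedef struct
{
    const char *ctx_name;       /* name of the traced memory context */
    unsigned long next_seq;
    leak_entry  slot[LEAKCACHE_SLOTS];
} leak_cache;

/* Only survivors and unmatched frees ever reach the log. */
static void
leakcache_log(leak_cache *c, const char *what, void *ptr, size_t size, void *caller)
{
    fprintf(stderr, "%s: %s: ptr=%p size=%zu caller=%p\n",
            c->ctx_name, what, ptr, size, caller);
}

/* Record an allocation; if the cache is full, evict and log the oldest entry. */
static void
leakcache_alloc_hook(leak_cache *c, void *ptr, size_t size, void *caller)
{
    leak_entry *victim = NULL;

    for (int i = 0; i < LEAKCACHE_SLOTS; i++)
    {
        if (c->slot[i].ptr == NULL)
        {
            victim = &c->slot[i];       /* empty slot: take it */
            break;
        }
        if (victim == NULL || c->slot[i].seq < victim->seq)
            victim = &c->slot[i];       /* remember the oldest occupied slot */
    }
    if (victim->ptr != NULL)            /* cache full: this chunk has endured */
        leakcache_log(c, "long-lived alloc", victim->ptr, victim->size, victim->caller);
    victim->ptr = ptr;
    victim->size = size;
    victim->caller = caller;
    victim->seq = c->next_seq++;
}

/* A free that hits the cache cancels its alloc silently; otherwise log it. */
static void
leakcache_free_hook(leak_cache *c, void *ptr)
{
    for (int i = 0; i < LEAKCACHE_SLOTS; i++)
    {
        if (c->slot[i].ptr == ptr)
        {
            c->slot[i].ptr = NULL;      /* alloc/free pair cancels out, no log line */
            return;
        }
    }
    leakcache_log(c, "free of evicted chunk", ptr, 0, NULL);
}

/* Tiny demonstration driver. */
int
main(void)
{
    static leak_cache cache;            /* zero-initialized */

    cache.ctx_name = "ExecutorState";

    /* A short-lived allocation never shows up in the log. */
    void *shortlived = malloc(100);
    leakcache_alloc_hook(&cache, shortlived, 100, __builtin_return_address(0));
    leakcache_free_hook(&cache, shortlived);
    free(shortlived);

    /* A chunk that stays allocated while many others pile up gets evicted and
     * logged once the cache overflows; its late free is then logged as well. */
    void *leaky = malloc(50);
    leakcache_alloc_hook(&cache, leaky, 50, __builtin_return_address(0));
    for (int i = 0; i < LEAKCACHE_SLOTS; i++)
        leakcache_alloc_hook(&cache, malloc(10), 10, __builtin_return_address(0));
    leakcache_free_hook(&cache, leaky);
    free(leaky);

    return 0;                           /* the filler chunks leak on purpose */
}

With 512 slots of 32 bytes each, the table is the 16 kB per traced context mentioned above, and only chunks that are still unfreed by the time the cache overflows ever produce a log line.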
-Gunther