On Tue, Nov 28, 2017 at 10:05 AM, Jakub Glapa <jakub.glapa@xxxxxxxxx> wrote:
> As for the crash. I dug up the initial log and it looks like a segmentation
> fault...
>
> 2017-11-23 07:26:53 CET:192.168.10.83(35238):user@db:[30003]: ERROR: too
> many dynamic shared memory segments

Hmm.  Well, this error can only occur in dsm_create() when it is called
without DSM_CREATE_NULL_IF_MAXSEGMENTS.  parallel.c calls it with that flag
and dsa.c doesn't (perhaps it should, not sure, but that would just change
the error message), so that means the error arose from dsa.c trying to get
more segments.  That would be when Parallel Bitmap Heap Scan tried to
allocate memory.
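For anyone following along, the difference between those two call sites
boils down to roughly this.  It's a hand-written sketch of the two calling
patterns, not the actual parallel.c/dsa.c code, and the wrapper function
names are invented:

#include "postgres.h"
#include "storage/dsm.h"

/* parallel.c-style call: tolerate running out of DSM slots. */
static dsm_segment *
create_segment_with_fallback(Size size)
{
    /*
     * With DSM_CREATE_NULL_IF_MAXSEGMENTS, dsm_create() returns NULL when
     * every DSM slot is already in use, so the caller can fall back to
     * some other strategy (for parallel query: run without workers).
     */
    dsm_segment *seg = dsm_create(size, DSM_CREATE_NULL_IF_MAXSEGMENTS);

    if (seg == NULL)
    {
        /* out of slots -- caller must cope without a segment */
    }
    return seg;
}

/* dsa.c-style call: no flag, so slot exhaustion is an error. */
static dsm_segment *
create_segment_or_error(Size size)
{
    /*
     * Without the flag, running out of slots raises
     * ERROR:  too many dynamic shared memory segments
     * which is the message in the log above.
     */
    return dsm_create(size, 0);
}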
I hacked my copy of PostgreSQL so that it allows only 5 DSM slots and
managed to reproduce a segv crash by trying to run concurrent Parallel
Bitmap Heap Scans.  The stack looks like this:

  * frame #0: 0x00000001083ace29 postgres`alloc_object(area=0x0000000000000000, size_class=10) + 25 at dsa.c:1433
    frame #1: 0x00000001083acd14 postgres`dsa_allocate_extended(area=0x0000000000000000, size=72, flags=4) + 1076 at dsa.c:785
    frame #2: 0x0000000108059c33 postgres`tbm_prepare_shared_iterate(tbm=0x00007f9743027660) + 67 at tidbitmap.c:780
    frame #3: 0x0000000108000d57 postgres`BitmapHeapNext(node=0x00007f9743019c88) + 503 at nodeBitmapHeapscan.c:156
    frame #4: 0x0000000107fefc5b postgres`ExecScanFetch(node=0x00007f9743019c88, accessMtd=(postgres`BitmapHeapNext at nodeBitmapHeapscan.c:77), recheckMtd=(postgres`BitmapHeapRecheck at nodeBitmapHeapscan.c:710)) + 459 at execScan.c:95
    frame #5: 0x0000000107fef983 postgres`ExecScan(node=0x00007f9743019c88, accessMtd=(postgres`BitmapHeapNext at nodeBitmapHeapscan.c:77), recheckMtd=(postgres`BitmapHeapRecheck at nodeBitmapHeapscan.c:710)) + 147 at execScan.c:162
    frame #6: 0x00000001080008d1 postgres`ExecBitmapHeapScan(pstate=0x00007f9743019c88) + 49 at nodeBitmapHeapscan.c:735

(lldb) f 3
frame #3: 0x0000000108000d57 postgres`BitmapHeapNext(node=0x00007f9743019c88) + 503 at nodeBitmapHeapscan.c:156
   153          * dsa_pointer of the iterator state which will be used by
   154          * multiple processes to iterate jointly.
   155          */
-> 156         pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
   157     #ifdef USE_PREFETCH
   158         if (node->prefetch_maximum > 0)
   159
(lldb) print tbm->dsa
(dsa_area *) $3 = 0x0000000000000000
(lldb) print node->ss.ps.state->es_query_dsa
(dsa_area *) $5 = 0x0000000000000000
(lldb) f 17
frame #17: 0x000000010800363b postgres`ExecGather(pstate=0x00007f9743019320) + 635 at nodeGather.c:220
   217      * Get next tuple, either from one of our workers, or by running the plan
   218      * ourselves.
   219      */
-> 220     slot = gather_getnext(node);
   221     if (TupIsNull(slot))
   222         return NULL;
   223
(lldb) print *node->pei
(ParallelExecutorInfo) $8 = {
  planstate = 0x00007f9743019640
  pcxt = 0x00007f97450001b8
  buffer_usage = 0x0000000108b7e218
  instrumentation = 0x0000000108b7da38
  area = 0x0000000000000000
  param_exec = 0
  finished = '\0'
  tqueue = 0x0000000000000000
  reader = 0x0000000000000000
}
(lldb) print *node->pei->pcxt
warning: could not load any Objective-C class information. This will significantly reduce the quality of type information available.
(ParallelContext) $9 = {
  node = {
    prev = 0x000000010855fb60
    next = 0x000000010855fb60
  }
  subid = 1
  nworkers = 0
  nworkers_launched = 0
  library_name = 0x00007f9745000248 "postgres"
  function_name = 0x00007f9745000268 "ParallelQueryMain"
  error_context_stack = 0x0000000000000000
  estimator = (space_for_chunks = 180352, number_of_keys = 19)
  seg = 0x0000000000000000
  private_memory = 0x0000000108b53038
  toc = 0x0000000108b53038
  worker = 0x0000000000000000
}

I think there are two failure modes here: one of your sessions showed the
"too many ..." error (that's the good case: it ran out of slots, said so,
and our error machinery worked as it should), and another crashed with a
segfault because it tried to use a NULL "area" pointer (that's the bad
case).  I think this is a degenerate case where we completely failed to
launch the parallel query, but we ran the parallel query plan anyway and
this code thinks that the DSA is available.  Oops.

--
Thomas Munro
http://www.enterprisedb.com
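To make that degenerate case concrete, here is a simplified sketch of the
failing branch in BitmapHeapNext() and of one conceivable band-aid.  This
is illustration only, not a patch, and it simplifies the real
nodeBitmapHeapscan.c structure, though pstate, es_query_dsa,
tbm_begin_iterate() and tbm_prepare_shared_iterate() are the real names:

/*
 * Sketch only.  In the crash above, the plan is parallel (pstate != NULL)
 * but no DSM segment or DSA area was ever created (nworkers_launched == 0,
 * es_query_dsa == NULL, tbm->dsa == NULL), so the shared-iterator path
 * passes a NULL dsa_area down into dsa_allocate() and segfaults.
 */
if (pstate == NULL)
{
    /* Plain non-parallel scan: private iterator, no DSA needed. */
    tbmiterator = tbm_begin_iterate(tbm);
}
else if (node->ss.ps.state->es_query_dsa == NULL)
{
    /*
     * Degenerate "parallel" case: no DSA area is available.  One
     * conceivable band-aid (not necessarily the right fix -- arguably
     * this belongs wherever we decide to run the parallel plan at all)
     * is to fall back to a private iterator here.
     */
    tbmiterator = tbm_begin_iterate(tbm);
}
else
{
    /* Real parallel case: build the iterator in shared memory. */
    pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
}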