Re: Out of Memory errors are frustrating as heck!

Gunther <raj@xxxxxxxx> · Fri, 23 Aug 2019 10:19:51 -0400



    OK, I went back through that old thread, and I noticed an early
      opinion by a certain Peter <pmc at citylink> who said that I
      should provision some swap space. Since I had plenty of disk and
      no other option I tried that. And it did some magic. Here this is
      a steady state now:
    top - 14:07:32 up 103 days,  9:57,  5 users,  load average: 1.33, 1.05, 0.54
Tasks: 329 total,   2 running, 117 sleeping,   0 stopped,   0 zombie
%Cpu(s): 31.0 us, 11.4 sy,  0.0 ni, 35.3 id, 22.3 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  7910376 total,   120524 free,  2174940 used,  5614912 buff/cache
KiB Swap: 16777212 total, 16777212 free,        0 used.  3239724 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5068 postgres  20   0 4352496   4.0g   2.0g R  76.4 52.6   3:01.39 postgres: postgres integrator [local] INSERT
  435 root      20   0       0      0      0 S   4.0  0.0  10:52.38 [kswapd0]

    and the nice thing is, the backend server process appears to be
      bounded at 4GB, so there isn't really a "memory leak". And also,
      the swap space isn't really being used. This may have to do with
      these vm. sysctl settings, overcommit, etc.
    
      vm.overcommit_memory = 2 -- values are
        
          0 -- estimate free memory
          1 -- always assume there is enough memory
          2 -- no over-commit allocate only inside the following two
            parameters
        
      
      vm.overcommit_kbytes = 0 -- how many kB above swap can be
        over-committed, EITHER this OR
      vm.overcommit_ratio = 50 -- percent of main memory that can be
        committed over swap,
        
          with 0 swap, that percent can be committed
          i.e., this of 8 GB, 4 GB are reserved for buffer cache
          not a good idea probably
          at least we should allow 75% committed, i.e., 6 GB of 8
            GB, leaving
            
              2 GB of buffer cache
              2 GB of shared buffers
              4 GB of all other memory
            
          
    I have vm.overcommit_memory = 2, _kbytes = 0, _ratio = 50. So
      this means with _ratio = 50 I can commit 50% of memory, 4GB and
      this is exactly what the server process wants. So with little
      impact on the available buffer cache I am in a fairly good
      position now. The swap (that in my case I set at 2 x main memory =
      16G) serves as a buffer to smooth out this peak usage without ever
      actually paging.
    I suppose even without swap I could have set vm.overcommit_ratio
      = 75, and I notice now that I already commented this much (the
      above bullet points are my own notes.) 

    
    Anyway, for now, I am good. Thank you very much.
    regards,

      -Gunther

    
    On 8/23/2019 9:17, Gunther wrote:

    
      Hi all, I am connecting to a discussion back from April this
        year. My data has grown and now I am running into new out of
        memory situations. Meanwhile the world turned from 11.2 to 11.5
        which I just installed only to find the same out of memory
        error.
      Have any of the things discussed and proposed, especially this
        last one by Tomas Vondra, been applied to the 11 releases?
        Should I try these older patches from April? 

      
      regards,

        -Gunther

      
      For what it is worth, this is what I am getting:

      
      TopMemoryContext: 67424 total in 5 blocks; 7184 free (7 chunks); 60240 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used
  TopTransactionContext: 8192 total in 1 blocks; 7720 free (1 chunks); 472 used
  Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 8388608 total in 11 blocks; 3094872 free (4 chunks); 5293736 used
    JoinRelHashTable: 16384 total in 2 blocks; 5576 free (1 chunks); 10808 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 32768 total in 3 blocks; 12720 free (8 chunks); 20048 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalContext: 1024 total in 1 blocks; 624 free (0 chunks); 400 used:
      ExecutorState: 202528536 total in 19 blocks; 433464 free (12 chunks); 202095072 used
        HashTableContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
          HashBatchContext: 10615104 total in 261 blocks; 7936 free (0 chunks); 10607168 used
        HashTableContext: 8192 total in 1 blocks; 7688 free (1 chunks); 504 used
          HashBatchContext: 13079304 total in 336 blocks; 7936 free (0 chunks); 13071368 used
        TupleSort main: 49208 total in 3 blocks; 8552 free (7 chunks); 40656 used
          Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used
        Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used
        Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used
        Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used
        Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used
        Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7360 free (0 chunks); 832 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        ExprContext: 1107296256 total in 142 blocks; 6328 free (101 chunks); 1107289928 used
        ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        1 more child contexts containing 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Relcache by OID: 16384 total in 2 blocks; 2472 free (2 chunks); 13912 used
  CacheMemoryContext: 1113488 total in 14 blocks; 16776 free (0 chunks); 1096712 used
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: docsubjh_sjrcode_ndx
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: docsubjh_sjrclass_ndx
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: docsubjh_scopeiid_ndx
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: docsubjh_dociid_ndx
    index info: 4096 total in 3 blocks; 2064 free (2 chunks); 2032 used: role_telecom_idx
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: role_addr_fkidx
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: role_id_fkidx
    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: role_id_idx
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: role_name_fkidx
    index info: 4096 total in 3 blocks; 2064 free (2 chunks); 2032 used: entity_telecom_idx
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx
    index info: 2048 total in 2 blocks; 696 free (1 chunks); 1352 used: entity_id_idx
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: entity_det_code_idx
    index info: 4096 total in 3 blocks; 2016 free (2 chunks); 2080 used: entity_code_nodash_idx
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_pkey
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: connect_rule_pkey
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: role_context_idx
    index info: 2048 total in 2 blocks; 640 free (2 chunks); 1408 used: role_partitions
    index info: 2048 total in 2 blocks; 640 free (2 chunks); 1408 used: role_scoper_idx
    index info: 2048 total in 2 blocks; 640 free (2 chunks); 1408 used: role_player_idx
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: role__pkey
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: pg_constraint_conrelid_contypid_conname_index
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: participation_act_idx
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: participation_role_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: participation_pkey
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_relid_index
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: doc_ndx_internaiddoctype
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2618_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indrelid_index
    relation rules: 827392 total in 104 blocks; 2400 free (1 chunks); 824992 used: v_documentsubjecthistory
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: pg_db_role_setting_databaseid_rol_index
    index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_opclass_am_name_nsp_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_enum_oid_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_class_relname_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_pubname_index
    index info: 2048 total in 2 blocks; 592 free (3 chunks); 1456 used: pg_statistic_relid_att_inh_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_cast_source_target_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_transform_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_collation_oid_index
    index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amop_fam_strat_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_index_indexrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_template_tmplname_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_ts_config_map_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_opclass_oid_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_foreign_data_wrapper_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_evtname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_statistic_ext_name_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_dict_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_event_trigger_oid_index
    index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used: pg_conversion_default_index
    index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_operator_oprname_l_r_n_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_trigger_tgrelid_tgname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_enum_typid_label_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_config_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_user_mapping_oid_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_opfamily_am_name_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_table_relid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_type_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_aggregate_fnoid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_constraint_oid_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_rewrite_rel_rulename_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_parser_prsname_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_config_cfgname_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_parser_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_publication_rel_prrelid_prpubid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_operator_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_namespace_nspname_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_template_oid_index
    index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_amop_opr_fam_index
    index info: 2048 total in 2 blocks; 672 free (3 chunks); 1376 used: pg_default_acl_role_nsp_obj_index
    index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used: pg_collation_name_enc_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_publication_rel_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_range_rngtypid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_ts_dict_dictname_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_type_typname_nsp_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_opfamily_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_statistic_ext_oid_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index
    index info: 2048 total in 2 blocks; 624 free (2 chunks); 1424 used: pg_proc_proname_args_nsp_index
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_partitioned_table_partrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_transform_type_lang_index
    index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used: pg_attribute_relid_attnum_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_proc_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_language_oid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_namespace_oid_index
    index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used: pg_amproc_fam_proc_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_foreign_server_name_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_attribute_relid_attnam_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_conversion_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_user_mapping_user_server_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_subscription_rel_srrelid_srsubid_index
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_sequence_seqrelid_index
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_conversion_name_nsp_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_oid_index
    index info: 2048 total in 2 blocks; 728 free (1 chunks); 1320 used: pg_auth_members_member_role_index
    10 more child contexts containing 17408 total in 17 blocks; 6080 free (10 chunks); 11328 used
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 8192 total in 1 blocks; 6408 free (0 chunks); 1784 used
  LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 1345345736 bytes in 1209 blocks; 4529600 free (270 chunks); 1340816136 used


      On 4/28/2019 10:19, Tomas Vondra
        wrote:

      
      On Wed,
        Apr 24, 2019 at 02:36:33AM +0200, Tomas Vondra wrote: 

         
          ... 

          
          I still think the idea with an "overflow batch" is worth
          considering, 

          because it'd allow us to keep the memory usage within
          work_mem. And 

          after getting familiar with the hash join code again (haven't
          messed 

          with it since 9.5 or so) I think it should not be all that
          difficult. 

          I'll give it a try over the weekend if I get bored for a
          while. 

          
        OK, so I took a stab at this, and overall it seems to be
        workable. The 

        patches I have are nowhere near committable, but I think the
        approach 

        works fairly well - the memory is kept in check, and the
        performance is 

        comparable to the "ballancing" approach tested before. 

        
        To explain it a bit, the idea is that we can compute how many
        BufFile 

        structures we can keep in memory - we can't use more than
        work_mem/2 for 

        that, because then we'd mostly eliminate space for the actual
        data. For 

        example with 4MB, we know we can keep 128 batches - we need 128
        for 

        outer and inner side, so 256 in total, and 256*8kB = 2MB. 

        
        And then, we just increase the number of batches but instead of
        adding 

        the BufFile entries, we split batches into slices that we can
        keep in 

        memory (say, the 128 batches). And we keep BufFiles for the
        current one 

        and an "overflow file" for the other slices. After processing a
        slice, 

        we simply switch to the next one, and use the overflow file as a
        temp 

        file for the first batch - we redistribute it into the other
        batches in 

        the slice and another overflow file. 

        
        That's what the v3 patch (named 'single overflow file') does. I
        does 

        work, but unfortunately it significantly inflates the amount of
        data 

        written to temporary files. Assume we need e.g. 1024 batches,
        but only 

        128 fit into memory. That means we'll need 8 slices, and during
        the 

        first pass we'll handle 1/8 of the data and write 7/8 to the
        overflow 

        file.  Then after processing the slice and switching to the next
        one, we 

        repeat this dance - 1/8 gets processed, 6/8 written to another
        overflow 

        file. So essentially we "forward" about 

        
           7/8 + 6/8 + 5/8 + ... + 1/8 = 28/8 = 3.5 

        
        of data between slices, and we need to re-shuffle data in each
        slice, 

        which amounts to additional 1x data. That's pretty significant
        overhead, 

        as will be clear from the measurements I'll present shortly. 

        
        But luckily, there's a simple solution to this - instead of
        writing the 

        data into a single overflow file, we can create one overflow
        file for 

        each slice. That will leave us with the ~1x of additional writes
        when 

        distributing data into batches in the current slice, but it
        eliminates 

        the main source of write amplification - awalanche-like
        forwarding of 

        data between slices. 

        
        This relaxes the memory limit a bit again, because we can't
        really keep 

        the number of overflow files constrained by work_mem, but we
        should only 

        need few of them (much less than when adding one file per batch
        right 

        away). For example with 128 in-memory batches, this reduces the
        amount 

        of necessary memory 128x. 

        
        And this is what v4 (per-slice overflow file) does, pretty much.
        

        Two more comments, regarding memory accounting in previous
        patches. It 

        was a bit broken, because we actually need 2x the number of
        BufFiles. We 

        needed nbatch files for outer side and nbatch files for inner
        side, but 

        we only considered one of those - both when deciding when to
        increase 

        the number of batches / increase spaceAllowed, and when
        reporting the 

        memory usage. So with large number of batches the reported
        amount of 

        used memory was roughly 1/2 of the actual value :-/ 

        
        The memory accounting was a bit bogus for another reason -
        spaceUsed 

        simply tracks the amount of memory for hash table contents. But
        at the 

        end we were simply adding the current space for BufFile stuff,
        ignoring 

        the fact that that's likely much larger than when the spacePeak
        value 

        got stored. For example we might have kept early spaceUsed when
        it was 

        almost work_mem, and then added the final large BufFile
        allocation. 

        
        I've fixed both issues in the patches attached to this message.
        It does 

        not make a huge difference in practice, but it makes it easier
        to 

        compare values between patches. 

        
        Now, some test results - I've repeated the simple test with
        uniform data 

        set, which is pretty much ideal for hash joins (no unexlectedly
        large 

        batches that can't be split, etc.). I've done this with 1M, 5M,
        10M, 25M 

        and 50M rows in the large table (which gets picked for the
        "hash" side), 

        and measured how much memory gets used, how many batches, how
        long it 

        takes and how much data gets written to temp files. 

        
        See the hashjoin-test.sh script for more details. 

        
        So, here are the results with work_mem = 4MB (so the number of
        in-memory 

        batches for the last two entries is 128). The columns are: 

        
        * nbatch - the final number of batches 

        * memory - memory usage, as reported by explain analyze 

        * time - duration of the query (without explain analyze) in
        seconds 

        * size - size of the large table 

        * temp - amount of data written to temp files 

        * amplif - write amplification (temp / size) 

        
         1M rows 

 =================================================================== 

                         nbatch  memory   time  size (MB)  temp (MB) 
        amplif 

 ------------------------------------------------------------------- 

         master             256    7681    3.3        730        899   
        1.23 

         rebalance          256    7711    3.3        730        884   
        1.21 

         single file       1024    4161    7.2        730       3168   
        4.34 

         per-slice file    1024    4161    4.7        730       1653   
        2.26 

        
         5M rows 

 =================================================================== 

                         nbatch  memory   time  size (MB)  temp (MB) 
        amplif 

 ------------------------------------------------------------------- 

         master            2048   36353     22       3652       5276   
        1.44 

         rebalance          512   16515     18       3652       4169   
        1.14 

         single file       4096    4353    156       3652      53897  
        14.76 

         per-slice file    4096    4353     28       3652       8106   
        2.21 

        
         10M rows 

 =================================================================== 

                         nbatch  memory   time  size (MB)  temp (MB) 
        amplif 

 ------------------------------------------------------------------- 

         master            4096   69121     61       7303      10556   
        1.45 

         rebalance          512   24326     46       7303       7405   
        1.01 

         single file       8192    4636    762       7303     211234  
        28.92 

         per-slice file    8192    4636     65       7303      16278   
        2.23 

        
         25M rows 

 =================================================================== 

                         nbatch  memory   time  size (MB)  temp (MB) 
        amplif 

 ------------------------------------------------------------------- 

         master            8192  134657    190       7303      24279   
        1.33 

         rebalance         1024   36611    158       7303      20024   
        1.10 

         single file      16384    6011   4054       7303    1046174  
        57.32 

         per-slice file   16384    6011    207       7303      39073   
        2.14 

        
         50M rows 

 =================================================================== 

                         nbatch  memory   time  size (MB)  temp (MB) 
        amplif 

 ------------------------------------------------------------------- 

         master           16384  265729    531      36500      48519  
        1.33 

         rebalance         2048   53241    447      36500      48077  
        1.32 

         single file          -       -      -      36500         
        -      - 

         per-slice file   32768    8125    451      36500      78662  
        2.16 

        
        From those numbers it's pretty clear that per-slice overflow
        file does 

        by far the best job in enforcing work_mem and minimizing the
        amount of 

        data spilled to temp files. It does write a bit more data than
        both 

        master and the simple rebalancing, but that's the cost for
        enforcing 

        work_mem more strictly. It's generally a bit slower than those
        two 

        approaches, although on the largest scale it's actually a bit
        faster 

        than master. I think that's pretty acceptable, considering this
        is meant 

        to address extreme underestimates where we currently just eat
        memory. 

        
        The case with single overflow file performs rather poorly - I
        haven't 

        even collected data from the largest scale, but considering it
        spilled 

        1TB of temp files with a dataset half the size, that's not an
        issue. 

        (Note that this does not mean it needs 1TB of temp space, those
        writes 

        are spread over time and the files are created/closed as we go.
        The 

        system only has ~100GB of free disk space.) 

        
        Gunther, could you try the v2 and v4 patches on your data set?
        That 

        would be an interesting data point, I think. 

        
        regards