On 8/11/21 9:09 AM, Pierre Rogier wrote:

Hi,
I suspect that Kees discovered a new bug ...
Looking at the code, we are mixing the use of the same monotonic time for the interval (correct) and the age of an entry (not correct).
As the retro changelog is persistent across reboots, it is normal that its timestamps are absolute, so IMHO there is no way relative time should be used/seen in the trimming computation.
(In other words, cur_time is wrong.)
My suspicion is that we got a regression when the slapi_eq timers were rewritten to use relative time instead of absolute time. (cur_time is provided by the slapi_eq API and is now relative; it should rather be initialized to time(0).)
The funny thing is that nobody has seen that retrocl trimming was broken for so long ...

Regards,
Pierre

That's because it silently just builds up and no one notices :-(

I opened a ticket for this: https://github.com/389ds/389-ds-base/issues/4869

Mark
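For illustration only, a minimal sketch of the kind of change Pierre is describing in retrocl_trim.c, assuming the trimming check should compare the (absolute) timestamp of the oldest entry against wall-clock time rather than the relative time handed in by the slapi_eq callback. Variable names follow the snippet Kees quotes further down; this is a sketch, not the committed fix:

    /* Sketch only: compare the absolute timestamp of the oldest changelog
     * entry against absolute wall-clock time, not against the relative
     * (uptime-based) cur_time passed in by the slapi_eq scheduler. */
    time_t now = time(NULL); /* what Pierre refers to as time(0) */

    if (LDAP_SUCCESS == ldrc && first_time > (time_t)0L &&
        first_time + ts.ts_c_max_age < now) {
        must_trim = 1;
    }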
On Wed, Aug 11, 2021 at 11:50 AM Kees Bakker <keesb@xxxxxxx> wrote:
On 09-08-2021 20:23, Mark Reynolds wrote:
> On 8/9/21 1:47 PM, Kees Bakker wrote:
>> On 09-08-2021 19:25, Mark Reynolds wrote:
>>> On 8/9/21 1:16 PM, Kees Bakker wrote:
>>>> On 09-08-2021 18:43, Mark Reynolds wrote:
>>>>> On 8/9/21 11:20 AM, Kees Bakker wrote:
>>>>>> On 09-08-2021 16:00, Mark Reynolds wrote:
>>>>>>> On 8/9/21 8:09 AM, Kees Bakker wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> When my dirsrv was trying to compact the databases I was getting
>>>>>>>> this
>>>>>>>> error
>>>>>>>>
>>>>>>>> [07/Aug/2021:23:59:02.715984489 +0200] - NOTICE - bdb_compact -
>>>>>>>> Compacting databases ...
>>>>>>>> [07/Aug/2021:23:59:02.765932397 +0200] - NOTICE - bdb_compact -
>>>>>>>> Compacting DB start: userRoot
>>>>>>>> [07/Aug/2021:23:59:03.518175414 +0200] - NOTICE - bdb_compact -
>>>>>>>> compactdb: compact userRoot - 417 pages freed
>>>>>>>> [07/Aug/2021:23:59:03.576427786 +0200] - NOTICE - bdb_compact -
>>>>>>>> Compacting DB start: ipaca
>>>>>>>> [07/Aug/2021:23:59:03.659941533 +0200] - NOTICE - bdb_compact -
>>>>>>>> compactdb: compact ipaca - 419 pages freed
>>>>>>>> [07/Aug/2021:23:59:03.718445310 +0200] - NOTICE - bdb_compact -
>>>>>>>> Compacting DB start: changelog
>>>>>>>> [08/Aug/2021:00:00:40.807571334 +0200] - NOTICE -
>>>>>>>> NSMMReplicationPlugin - changelog program - cl5CompactDBs -
>>>>>>>> compacting
>>>>>>>> replication changelogs...
>>>>>>>> [08/Aug/2021:00:00:54.309357211 +0200] - ERR - libdb - BDB2055 Lock
>>>>>>>> table is out of available lock entries
>>>>>>>> [08/Aug/2021:00:00:54.726504736 +0200] - ERR - bdb_compact -
>>>>>>>> compactdb: failed to compact changelog; db error - 12 Cannot
>>>>>>>> allocate
>>>>>>>> memory
>>>>>>>> [08/Aug/2021:00:00:54.801571421 +0200] - ERR - libdb - BDB2055 Lock
>>>>>>>> table is out of available lock entries
>>>>>>>> [08/Aug/2021:00:00:54.876618702 +0200] - ERR -
>>>>>>>> NSMMReplicationPlugin -
>>>>>>>> changelog program - cl5CompactDBs - Failed to compact
>>>>>>>> a797bb0b-be1d11eb-88c0b677-613aa2ad; db error - 12 Cannot allocate
>>>>>>>> memory
>>>>>>>> [08/Aug/2021:00:00:57.253006449 +0200] - NOTICE - bdb_compact -
>>>>>>>> Compacting databases finished.
>>>>>>>>
>>>>>>>> There are about 402k entries in cn=changelog.
>>>>>>>>
>>>>>>>> I have a few questions
>>>>>>>> 1) is it normal to have so many entries in cn=changelog? On another
>>>>>>>> replica I have almost 3M entries. Isn't this cleaned up?
>>>>>>>> 2) the number of locks is 50000 (there are two config items).
>>>>>>>> Should I
>>>>>>>> increase that number? If so, increase to what?
>>>>>>>> 3) is there maybe something else going on, causing the
>>>>>>>> exhaustion of
>>>>>>>> the locks?
>>>>>>>
>>>>>>> Ok, so by default there is no changelog trimming enabled. So the
>>>>>>> changelog will grow without bounds, which is bad.
>>>>>>
>>>>>> How much of this [1] applies? Indeed it says "By default the
>>>>>> Replication Changelog does not use any trimming by default."
>>>>>> So, how is the trimming actually enabled? I have these entries
>>>>>>
>>>>>> dn: cn=changelog5,cn=config
>>>>>> cn: changelog5
>>>>>> nsslapd-changelogdir: /var/lib/dirsrv/slapd-EXAMPLE-COM/cldb
>>>>>> nsslapd-changelogmaxage: 30d
>>>>>> objectClass: top
>>>>>> objectClass: extensibleobject
>>>>>
>>>>> So trimming is set, but it's set to 30 days (30d), which could/should
>>>>> be shortened to "7d".
>>>>
>>>> Somehow I doubt that this will do anything.
>>>
>>> Ahhh, I thought this was the Replication Changelog, but it is the
>>> "Retro" Changelog. Ok so check this out:
>>>
>>> https://access.redhat.com/documentation/en-us/red_hat_directory_server/11/html/administration_guide/managing_replication-using_the_retro_changelog_plug_in#Using_the_Retro_Changelog_Plug_in-Trimming_the_Retro_Changelog
>>>
>>>
>>> Or update config directly using ldapmodify:
>>>
>>> # ldapmodify -D ...
>>> dn: cn=Retro Changelog Plugin,cn=plugins,cn=config
>>> changetype: modify
>>> replace: nsslapd-changelogmaxage: 7d
>>> nsslapd-changelogmaxage: 7d
>>
>> (( Typo? The replace line should be without the value ))
> yes, sorry...
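For the record, the corrected modify would presumably read:

    dn: cn=Retro Changelog Plugin,cn=plugins,cn=config
    changetype: modify
    replace: nsslapd-changelogmaxage
    nsslapd-changelogmaxage: 7d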
>>
>> Hmm, that maxage was already at 2 days
>
> Ok, perhaps you are running into a known bug where indexes created by
> plugins are not enabled. Or it might be a config issue?
>
> Try this first. Set in the retro changelog entry:
>
> nsslapd-changelog-trim-interval: 300 <--- 5 minutes
(Sorry to be picky, but it is nsslapd-changelogtrim-interval. Its default is 300.)
>
> Restart the server to make sure it was applied. You can enable "plugin"
> logging to see what it does.
>
> nsslapd-errorlog-level: 65536
>
> wait over 5 minutes, then grep the errors log for "DSRetroclPlugin".
> This will tell us what the plugin is doing when it goes to trim the
> database.
>
> IMPORTANT, disable plugin logging immediately after you get your data.
> Set the errorlog level to 0 to go back to default logging.
>
> If the logging shows it's not finding any candidates then it might be
> the indexing issue I mentioned above. In that case you just need to
> reindex the retro changelog:
Before changing anything, I enabled plug-in debugging, which revealed something interesting:
[root@linge ~]# grep trim /var/log/dirsrv/slapd-GHS-NL/errors-plugin-debug-20210811-1108
[11/Aug/2021:11:01:49.792450548 +0200] - DEBUG - DSRetroclPlugin - cltrim: ldrc=0, first_time=1622032141, cur_time=3702600
[11/Aug/2021:11:01:49.842339467 +0200] - DEBUG - DSRetroclPlugin - retrocl_housekeeping - changelog does not need to be trimmed
[11/Aug/2021:11:06:49.953401407 +0200] - DEBUG - DSRetroclPlugin - cltrim: ldrc=0, first_time=1622032141, cur_time=3702900
[11/Aug/2021:11:06:50.012249703 +0200] - DEBUG - DSRetroclPlugin - retrocl_housekeeping - changelog does not need to be trimmed
This shows that the interval is 300 (even though nsslapd-changelogtrim-interval is not explicitly set in cn=changelog5,cn=config).
Look at cur_time. That is the "uptime" of the system, not UTC.
[root@linge ~]# uptime
11:21:42 up 42 days, 20:49, 2 users, load average: 0.15, 0.14, 0.16
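(That fits: 3702600 seconds is roughly 42.9 days, matching the 42 days, 20:49 of uptime, so cur_time really is the time since boot, not an absolute timestamp.)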
Next, assuming this code from retrocl_trim.c is making the decision to trim
    first_time = retrocl_getchangetime(SLAPI_SEQ_FIRST, &ldrc);
    slapi_log_err(SLAPI_LOG_PLUGIN, RETROCL_PLUGIN_NAME,
                  "cltrim: ldrc=%d, first_time=%ld, cur_time=%ld\n",
                  ldrc, first_time, cur_time);
    if (LDAP_SUCCESS == ldrc && first_time > (time_t)0L &&
        first_time + ts.ts_c_max_age < cur_time) {
        must_trim = 1;
    }
There is a mixup of absolute (first_time) and relative (cur_time) time. Which of the two is wrong?
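Assuming ts.ts_c_max_age here holds the configured nsslapd-changelogmaxage of 2 days (172800 seconds), the check becomes 1622032141 + 172800 = 1622204941 < 3702900, which is never true, so must_trim stays 0 and nothing ever gets trimmed.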
>
> # dsctl slapd-YOUR_INSTANCE stop
>
> # dsctl slapd-YOUR_INSTANCE db2index changelog
>
> # dsctl slapd-YOUR_INSTANCE start
>
> Then wait 5 minutes and see if anything was trimmed.
>
> Mark
>
>>
>> dn: cn=Retro Changelog Plugin,cn=plugins,cn=config
>> cn: Retro Changelog Plugin
>> nsslapd-attribute: nsuniqueid:targetUniqueId
>> nsslapd-changelogmaxage: 2d
>> nsslapd-include-suffix: cn=dns,dc=example,dc=com
>> nsslapd-plugin-depends-on-named: Class of Service
>> nsslapd-plugin-depends-on-type: database
>> nsslapd-pluginDescription: Retrocl Plugin
>> nsslapd-pluginEnabled: on
>> nsslapd-pluginId: retrocl
>> nsslapd-pluginInitfunc: retrocl_plugin_init
>> nsslapd-pluginPath: libretrocl-plugin
>> nsslapd-pluginType: object
>> nsslapd-pluginVendor: 389 Project
>> nsslapd-pluginVersion: 1.4.3.231
>> nsslapd-pluginbetxn: on
>> nsslapd-pluginprecedence: 25
>> objectClass: top
>> objectClass: nsSlapdPlugin
>> objectClass: extensibleObject
>>
>>>
>>>
>>> HTH,
>>>
>>> Mark
>>>
>>>>
>>>> The first entry in cn=changelog is more than two months old. (May 26th
>>>> is about
>>>> the time I installed this server as a FreeIPA replica.)
>>>>
>>>> dn: changenumber=1,cn=changelog
>>>> objectClass: top
>>>> objectClass: changelogentry
>>>> objectClass: extensibleObject
>>>> targetuniqueid: 1e89ebcb-e20111e8-8820f96e-fc0c60a4
>>>> changeNumber: 1
>>>> targetDn:
>>>> idnsname=_ldap._tcp,idnsname=example.com.,cn=dns,dc=example,dc=com
>>>> changeTime: 20210526122901Z
>>>> changeType: modify
>>>> changes:: YWRkOiBzUlZSZWNvc...gplbnRyeXVzbjogMTEyCi0KAA==
>>>>
>>>> Is there a way to see trimming in action? I'm afraid it is not doing
>>>> anything.
>>
>> What diagnostic can I enable to see the trimming in action?
>>
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I recommend setting up the changelog max age to 7 days:
>>>>>>>
>>>>>>> https://access.redhat.com/documentation/en-us/red_hat_directory_server/11/html/administration_guide/trimming_the_replication_changelog
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Once the trimming cleans up the changelog, the database compaction
>>>>>>> should work without issue. You can also increase the database locks
>>>>>>> to 1 million (that does require a restart of the server to take
>>>>>>> effect).
>>>>>>
>>>>>> Let's hope that 1 million is enough :-)
>>>>>
>>>>> It might not be; you can keep bumping it up if needed. There is no
>>>>> limit, or negative impact on db performance.
>>>>>
>>>>> Mark
>>>>>
>>>>>>
>>>>>>>
>>>>>>> HTH,
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>> [1]
>>>>>> https://directory.fedoraproject.org/docs/389ds/FAQ/changelog-trimming.html
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Directory Server Development Team
>>>>>
>>>
>>> --
>>> Directory Server Development Team
>>>
>>
> --
> Directory Server Development Team
>
--
389 Directory Server Development Team
_______________________________________________
389-users mailing list -- 389-users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to 389-users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-users@xxxxxxxxxxxxxxxxxxxxxxx
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure