Re: 389 directory server crash

Rich Megginson <rmeggins@xxxxxxxxxx> · Tue, 16 Jul 2013 08:49:30 -0600

    On 07/16/2013 01:23 AM, Mitja Mihelič
      wrote:

      On 07/15/2013 05:28 PM, Rich
        Megginson wrote:

        On 07/15/2013 02:57 AM, Mitja
          Mihelič wrote:

          On 07/12/2013 05:55 PM, Rich
            Megginson wrote:

            On 07/12/2013 08:22 AM, Mitja
              Mihelič wrote:

              On 07/09/2013 03:34 PM, Rich
                Megginson wrote:

                On 07/09/2013 06:43 AM,
                  Mitja Mihelič wrote:

                  Hi!

                  We are having problems with some of our 389-DS instances.
                  They crash after receiving an update from the
                  provider.

                After looking at the stack trace, I think this is https://fedorahosted.org/389/ticket/47391

          Yes, it looks like it might be it. When CONSUMER_ONE crashed
          for the first time, the last thing replicated was a password
          change.

          Do you perhaps know where I could get a 389DS version for
          CentOS 6 that has the patch? The ticket says it was pushed to
          1.2.11, but it would seem that our 1.2.11.15-14 is still an
          unpatched one and the repositories do not have any newer
          versions.

        Is that the 389-ds-base that is included with CentOS6?

      Yes, the 389-ds-base-1.2.11.15-14.el6_4.x86_64 and
      389-ds-base-libs-1.2.11.15-14.el6_4.x86_64 are from the official
      CentOS 6 updates repository.

      389-ds-base-debuginfo is from http://debuginfo.centos.org/6/

      The rest are from epel.

    Looking at the stack trace you sent earlier - there is only 1
    thread?  Did you run

    gdb -ex 'set confirm off' -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'quit' /usr/sbin/ns-slapd `pidof ns-slapd` > stacktrace.`date +%s`.txt 2>&1

    to produce it?  If so, I have no idea what's going on - I've never
    seen the server deadlock itself with only 1 thread . . .
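
A quick way to check the thread count in a saved trace is to count the per-thread header lines that gdb writes. The sketch below is only an illustration: count_threads is a hypothetical helper name, and it assumes the trace was produced with 'thread apply all bt full' as in the command above, so that each thread's section starts with a line like "Thread 3 (Thread 0x... (LWP ...)):".

```shell
#!/bin/sh
# count_threads FILE
# Print how many threads appear in a saved gdb backtrace.
# Assumption: the file holds "thread apply all bt" output, where every
# thread's section begins with "Thread N (Thread 0x... (LWP ...)):".
count_threads() {
    grep -c '^Thread [0-9][0-9]* (' "$1"
}
```

For example, running count_threads on server_two-db2ldif_hang-stacktrace.1373877200.txt and getting 1 back would agree with the single-thread reading of the trace.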

                  The crash happened twice after about a week of
                  running without problems. The crashes happened on two
                  consumer servers but not at the same time.

                  The servers are running CentOS 6.x with the following
                  389DS packages installed:

                  389-ds-console-doc-1.2.6-1.el6.noarch

                  389-console-1.1.7-1.el6.noarch

                  389-adminutil-1.1.15-1.el6.x86_64

                  389-dsgw-1.1.10-1.el6.x86_64

                  389-ds-base-debuginfo-1.2.11.15-14.el6_4.x86_64

                  389-admin-1.1.29-1.el6.x86_64

                  389-ds-console-1.2.6-1.el6.noarch

                  389-admin-console-doc-1.1.8-1.el6.noarch

                  389-ds-1.2.2-1.el6.noarch

                  389-ds-base-1.2.11.15-14.el6_4.x86_64

                  389-ds-base-libs-1.2.11.15-14.el6_4.x86_64

                  389-admin-console-1.1.8-1.el6.noarch

                  We are in the process of replacing the CentOS 5.x-based
                  consumer+provider setup with a CentOS 6.x-based one. For
                  the time being, the CentOS 6 machines are acting as
                  consumers for the old server. They run for a while and
                  then the replicated instances crash, though not at the
                  same time.

                  One of the servers did not want to start after the
                  crash,

                Can you provide the error messages from the errors log?

              I have attached error logs from the provider
              (2013-06-27-provider_error) and the consumer
              (2013-06-27-server_two_error) in question.

                  so I have run db2index on its database. It's been
                  running for four days and it has still not finished.

                Try exporting using db2ldif, then importing using
                ldif2db.
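
The suggested export and re-import could look roughly like the sequence below. This is a sketch under assumptions not stated in the thread: the per-instance scripts are assumed to live in /usr/lib64/dirsrv/slapd-INSTANCE, the backend is assumed to be named userRoot, and plan_reimport is a hypothetical helper that only prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# plan_reimport INSTANCE BACKEND LDIF
# Print (do not run) an offline export followed by a re-import.
# Assumptions: instance scripts are under /usr/lib64/dirsrv/slapd-INSTANCE
# and the server is managed through the "dirsrv" init script.
plan_reimport() {
    inst=$1
    backend=$2
    ldif=$3
    dir=/usr/lib64/dirsrv/slapd-$inst
    echo "service dirsrv stop $inst"           # db2ldif/ldif2db work offline
    echo "$dir/db2ldif -n $backend -a $ldif"   # export the backend to LDIF
    echo "$dir/ldif2db -n $backend -i $ldif"   # rebuild the backend from LDIF
    echo "service dirsrv start $inst"
}
```

Calling plan_reimport server_two userRoot /tmp/export.ldif prints four lines to inspect; the instance name and output path here are placeholders, not values taken from the thread.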

              The export process hangs. After an hour strace still
              shows:

              futex(0x7f5822670ed4, FUTEX_WAIT, 1, NULL

              The error log for this is attached as
              2013-07-10-server_two-ldif_import_hangs.

            Are you using db2ldif or db2ldif.pl?  If you are using
            db2ldif, is the server running?  If not, please try first
            shutting down the server and use db2ldif.

            If db2ldif still hangs, then please follow the instructions
            at http://port389.org/wiki/FAQ#Debugging_Hangs
            to get a stack trace of the hung process.

          I was using db2ldif with the server shut down. I tried it
          again and it hung. The LDIF file was created but its size was
          zero. The produced stack trace is attached as
          server_two-db2ldif_hang-stacktrace.1373877200.txt.

                  All I get from db2index now are these outputs:

                  [09/Jul/2013:13:29:11 +0200] - reindex db: Processed
                  65095 entries (pass 1104) -- average rate
                  53686277.5/sec, recent rate 0.0/sec, hit ratio 0%
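
The progress line above carries the figure that suggests the reindex has stalled: the recent rate is 0.0/sec. A minimal sketch for pulling that figure out of such lines follows. recent_rate and is_stalled are hypothetical helper names, and the sketch assumes each progress message sits on a single line in the log, in the format quoted above.

```shell
#!/bin/sh
# recent_rate
# Read db2index progress lines on stdin and print the "recent rate"
# value from each line that has one.
recent_rate() {
    sed -n 's|.*recent rate \([0-9.]*\)/sec.*|\1|p'
}

# is_stalled
# Succeed (exit 0) if the last progress line on stdin reports a recent
# rate of zero; fail if it is non-zero or no progress line was found.
is_stalled() {
    last=$(recent_rate | tail -n 1)
    [ -n "$last" ] && awk -v r="$last" 'BEGIN { exit !((r + 0) == 0) }'
}
```

Piping the saved db2index output through is_stalled then gives a yes/no answer without reading the log by eye.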

                How many entries do you have in your database?

              The number hovers around 65400. It varies by perhaps 2
              user del/add operations a month and 20 attribute changes
              per week, if that.

                  The other instance did start up, but the replication
                  process did not work anymore. I disabled the
                  replication to this host and set it up again. I chose
                  "Initialize consumer now" and the consumer crashed
                  every time.

                Can you provide a stack trace of the core when the
                server crashes?  This may be different than the stack
                trace below.

              The last provided stack trace was produced at the last
              server crash. I will provide another stack trace when
              CONSUMER_ONE crashes again. Currently it refuses to crash
              at initialization time and keeps running.

                  I have enabled full error logging and could find
                  nothing.

                  I have read a few threads (not all, I admit) on this
                  list and

                  http://directory.fedoraproject.org/wiki/FAQ#Debugging_Crashes
                  and tried to troubleshoot.

                  The crash produced the attached core dump, and I could
                  use your help with understanding it, as well as any
                  help with the crash itself. If more info is needed I
                  will gladly provide it.

                  Regards, Mitja

                  --
389 users mailing list
389-users@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/389-users
