Re: 389 directory server crash

Rich Megginson <rmeggins@xxxxxxxxxx> · Wed, 17 Jul 2013 09:36:21 -0600

    On 07/17/2013 01:52 AM, Mitja Mihelič wrote:

      On 07/16/2013 04:49 PM, Rich Megginson wrote:

        On 07/16/2013 01:23 AM, Mitja Mihelič wrote:

          On 07/15/2013 05:28 PM, Rich Megginson wrote:

            On 07/15/2013 02:57 AM, Mitja Mihelič wrote:

              On 07/12/2013 05:55 PM, Rich Megginson wrote:

                On 07/12/2013 08:22 AM, Mitja Mihelič wrote:

                  On 07/09/2013 03:34 PM, Rich Megginson wrote:

                    On 07/09/2013 06:43 AM, Mitja Mihelič wrote:

                      Hi!

                      We are having problems with some of our 389-DS
                      instances. They crash after receiving an update
                      from the provider.

                    After looking at the stack trace, I think this is https://fedorahosted.org/389/ticket/47391

              Yes, it looks like it might be it. When CONSUMER_ONE
              crashed for the first time, the last thing replicated was
              a password change.

              Do you perhaps know where I could get a 389DS version for
              CentOS 6 that has the patch? The ticket says it was pushed
              to 1.2.11, but it would seem that our 1.2.11.15-14 is still
              unpatched, and the repositories do not have any newer
              versions.

            Is that the 389-ds-base that is included with CentOS6?

          Yes, the 389-ds-base-1.2.11.15-14.el6_4.x86_64 and
          389-ds-base-libs-1.2.11.15-14.el6_4.x86_64 are from the
          official CentOS 6 updates repository.

          389-ds-base-debuginfo is from http://debuginfo.centos.org/6/

          The rest are from epel.

        Looking at the stack trace you sent earlier - there is only 1
        thread?  You ran

        gdb -ex 'set confirm off' -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'quit' /usr/sbin/ns-slapd `pidof ns-slapd` > stacktrace.`date +%s`.txt 2>&1

        ?  If so, I have no idea what's going on - I've never seen the
        server deadlock itself with only 1 thread . . .

      I ran

      gdb -ex 'set confirm off' -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'quit' /usr/sbin/ns-slapd `pidof -o 49171 ns-slapd` > stacktrace.`date +%s`.txt 2>&1

      The "-o 49171" is to exclude the pid of the config server
      instance, so only the problematic pid was looked at.
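The exclusion step described above can also be done by filtering the PID list explicitly instead of relying on `pidof -o`. This is only a minimal sketch, not the procedure used in this thread: the helper name `filter_pids` is hypothetical, and 49171 is simply the config-instance PID quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print every PID from the space-separated list in $2
# except the one given in $1.
filter_pids() {
    exclude="$1"
    out=""
    for pid in $2; do
        if [ "$pid" != "$exclude" ]; then
            # Append with a single space separator, no leading space.
            out="${out:+$out }$pid"
        fi
    done
    printf '%s\n' "$out"
}

# Usage (commented out so the sketch has no side effects when sourced):
# target=$(filter_pids 49171 "$(pidof ns-slapd)")
# gdb -ex 'set confirm off' -ex 'set pagination off' \
#     -ex 'thread apply all bt full' -ex 'quit' \
#     /usr/sbin/ns-slapd "$target" > "stacktrace.$(date +%s).txt" 2>&1
```

If more than one PID remains after filtering, gdb would need to be run once per PID, since it attaches to a single process at a time.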

      If you get any more information regarding this crash it would be
      very much appreciated.

      It may be best if I remove all 389DS-related data from both of
      the consumer servers and start fresh. If they crash again I will
      send the relevant stack traces.

    Yes, that sounds good.

                     The crash happened twice after about
                      a week of running without problems. The crashes
                      happened on two consumer servers but not at the
                      same time.

                      The servers are running CentOS 6x with the
                      following 389DS packages installed:

                      389-ds-console-doc-1.2.6-1.el6.noarch

                      389-console-1.1.7-1.el6.noarch

                      389-adminutil-1.1.15-1.el6.x86_64

                      389-dsgw-1.1.10-1.el6.x86_64

                      389-ds-base-debuginfo-1.2.11.15-14.el6_4.x86_64

                      389-admin-1.1.29-1.el6.x86_64

                      389-ds-console-1.2.6-1.el6.noarch

                      389-admin-console-doc-1.1.8-1.el6.noarch

                      389-ds-1.2.2-1.el6.noarch

                      389-ds-base-1.2.11.15-14.el6_4.x86_64

                      389-ds-base-libs-1.2.11.15-14.el6_4.x86_64

                      389-admin-console-1.1.8-1.el6.noarch

                      We are in the process of replacing the CentOS 5x-based
                      consumer+provider setup with a CentOS 6x-based
                      one. For the time being, the CentOS 6 machines are
                      acting as consumers for the old server. They run
                      for a while and then the replicated instances
                      crash, though not at the same time.

                      One of the servers did not want to start after the
                      crash,

                    Can you provide the error messages from the errors
                    log?

                  I have attached error logs from the provider
                  (2013-06-27-provider_error) and the consumer
                  (2013-06-27-server_two_error) in question.

                    so I have run db2index on its
                      database. It's been running for four days and it
                      has still not finished. 

                    Try exporting using db2ldif, then importing using
                    ldif2db.
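The export-then-import suggestion above might look roughly like the sketch below. It only prints the two commands rather than running them. The instance directory `/usr/lib64/dirsrv/slapd-EXAMPLE`, the backend name `userRoot`, and the output path are assumptions about a typical 64-bit EL6 install, not details taken from this thread, and `print_reload_commands` is a hypothetical helper. Both scripts are meant to be run with the server stopped.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the offline export and import
# commands for one backend.
#   $1 = instance script directory (assumed layout, e.g. /usr/lib64/dirsrv/slapd-EXAMPLE)
#   $2 = backend name (assumed, e.g. userRoot)
#   $3 = LDIF file to export to and then import from
print_reload_commands() {
    inst_dir="$1"
    backend="$2"
    ldif="$3"
    # db2ldif: -n selects the backend, -a names the output file.
    printf '%s/db2ldif -n %s -a %s\n' "$inst_dir" "$backend" "$ldif"
    # ldif2db: -n selects the backend, -i names the input file.
    printf '%s/ldif2db -n %s -i %s\n' "$inst_dir" "$backend" "$ldif"
}

# Usage (prints the commands for review before running them by hand):
# print_reload_commands /usr/lib64/dirsrv/slapd-EXAMPLE userRoot /tmp/userRoot.ldif
```

Printing the commands first makes it easy to check the instance path and backend name before anything touches the database.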

                  The export process hangs. After an hour strace still
                  shows:

                  futex(0x7f5822670ed4, FUTEX_WAIT, 1, NULL

                  The error log for this is attached as
                  2013-07-10-server_two-ldif_import_hangs.

                Are you using db2ldif or db2ldif.pl?  If you are using
                db2ldif, is the server running?  If not, please try
                first shutting down the server and use db2ldif.

                If db2ldif still hangs, then please follow the
                instructions at http://port389.org/wiki/FAQ#Debugging_Hangs
                to get a stack trace of the hung process.

              I was using db2ldif with the server shut down. I tried it
              again and it hung. The LDIF file was created but its size
              was zero. The produced stack trace is attached as
              server_two-db2ldif_hang-stacktrace.1373877200.txt.

                    All I get from db2index now are these
                      outputs:

                      [09/Jul/2013:13:29:11 +0200] - reindex db:
                      Processed 65095 entries (pass 1104) -- average
                      rate 53686277.5/sec, recent rate 0.0/sec, hit
                      ratio 0%
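A stalled reindex like the one above can be spotted by pulling the "recent rate" figure out of each progress line. The following is a small sketch under the assumption that the progress message appears on a single line in the errors log, as it would before mail wrapping; `recent_rate` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: extract the "recent rate" value (entries/sec)
# from one db2index progress line passed as $1.
# Prints nothing if the line does not contain a recent rate.
recent_rate() {
    printf '%s\n' "$1" | sed -n 's/.*recent rate \([0-9.]*\)\/sec.*/\1/p'
}

# Usage with the progress line quoted above, rejoined onto one line:
# recent_rate "[09/Jul/2013:13:29:11 +0200] - reindex db: Processed 65095 entries (pass 1104) -- average rate 53686277.5/sec, recent rate 0.0/sec, hit ratio 0%"
```

A recent rate that stays at 0.0 across many passes suggests the process is no longer making progress, which matches the behaviour described here.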

                    How many entries do you have in your database?

                  The number hovers around 65400. It varies by perhaps 2
                  user del/add operations a month and 20 attribute
                  changes per week, if that.

                      The other instance did start up, but the
                      replication process did not work anymore. I
                      disabled the replication to this host and set it
                      up again. I chose "Initialize consumer now" and
                      the consumer crashed every time.

                    Can you provide a stack trace of the core when the
                    server crashes?  This may be different than the
                    stack trace below.

                  The last provided stack trace was produced at the last
                  server crash. I will provide another stack trace when
                  CONSUMER_ONE crashes again. Currently it refuses to
                  crash at initialization time and keeps running.

                    I have enabled full error logging and
                      could find nothing.

                      I have read a few threads (not all, I admit) on
                      this list and

                      http://directory.fedoraproject.org/wiki/FAQ#Debugging_Crashes
                      and tried to troubleshoot.

                      The crash produced the attached core dump, and I
                      could use your help with understanding it, as well
                      as any help with the crash itself. If more info is
                      needed I will gladly provide it.

                      Regards, Mitja

                      --
389 users mailing list
389-users@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/389-users
