Re: [389-users] importing large subtree crashes ns-slapd

Christopher Wood wrote:
> On Wed, Mar 03, 2010 at 08:30:19PM -0700, Rich Megginson wrote:
>   
>> Christopher Wood wrote:
>>     
>>> I'm just getting started with 389 Directory Server (at work), and I've run into an issue that I'm not certain how to troubleshoot. I would greatly appreciate any assistance or tips you could offer, especially on where to look to see what's failing.
>>>
>>> Also, I apologize in advance for changing strings related to my employer's directory names and such, as I'm not comfortable leaking that level of information to a public list.
>>>   
>>>       
>> As well you should be - you should always obscure sensitive information 
>> like this.
>>     
>>> Overview:
>>>
>>> Initializing a large subtree from NDS 6.2 crashes ns-slapd, but other subtrees are fine.
>>>
>>>
>>> Top-Level Questions:
>>>
>>> 1) How do I stop ns-slapd from crashing?
>>>   
>>>       
>> Good question.
>>     
>>> 2) How do I figure out what precisely is causing the crash? (With various levels of debug logging I get the same log entry.)
>>>   
>>>       
>> You've already used the TRACE level (1) for logging - that's as verbose 
>> as it gets for this particular operation.  Next step would be to try to 
>> get a core file.
>>     
>>> 3) Is it possible to simply import my initialization ldif without duplication checks?
>>>   
>>>       
>> No.
>>     
>>> Background:
>>>
>>> At work we have NDS 6.2 (single master on a physical server, virtual machine slaves), and would like to move our directories intact to a 389 2.6 installation via replication.
>>>   
>>>       
>> What platform/OS?  32-bit or 64-bit?  By NDS 6.2 I'm assuming you mean 
>> Netscape Directory Server - by 2.6 I'm assuming you mean 1.2.6.a1 (a2 
>> should be hitting the mirrors tomorrow).
>>     
>
> 32 bit
>
>   
>>> I already have replicated several of our NDS 6.2 subtrees to 389 2.6 with no difficulties.
>>>
>>> I compiled our 389 installation from the source packages downloaded from http://directory.fedoraproject.org/wiki/Source.
>>>       
>> Did you grab 389-ds-base 1.2.6.a1 or 1.2.6.a2?
>>     
>
> I used 1.2.6.a1 to compile originally and produce core files to answer your questions. Next I'll try this with 1.2.6.a2, but I'd rather keep the same version when trying to initially reproduce something.
>
>   
>> What compiler flags did you use?
>>     
>
> The makefile that came out of ./configure had these:
>
> CCASFLAGS = -g -O2
> CFLAGS = -g -O2
> CXXFLAGS = -g -O2
>
> For the plain debug build I edited that to insert these and rebuilt with make, make install:
>
> CCASFLAGS = -g
> CFLAGS = -g
> CXXFLAGS = -g
>
> (Fair warning that I'm not a programmer, so I'm not entirely sure doing that was right.)
>
>   
That looks right.  Note that you don't have to edit the Makefile: you 
can run make distclean, then run configure like this:
 > CFLAGS="-g" /path/to/configure ...
 > make

If you do change the flags by editing the Makefile, you must run make 
clean first to remove the old object files, so that everything is 
rebuilt with the new flags.
>> Do you have a core file?  If so, try using gdb
>> gdb /path/to/ns-slapd /path/to/core.pid
>> once in gdb, type the "where" command
>> (gdb) where
>>     
>
> The original crash didn't produce a core file, but I could get one by attaching gdb later, to both the original build and a debug build.
>
>   
>>> The underlying platform is:
>>>
>>> $ uname -a
>>> Linux cwlab-02.mycompany.com 2.6.18-164.el5 #1 SMP Thu Sep 3 03:33:56 EDT 2009 i686 i686 i386 GNU/Linux
>>> $ cat /etc/redhat-release 
>>> CentOS release 5.4 (Final)
>>>
>>> $ free
>>>              total       used       free     shared    buffers     cached
>>> Mem:       3894000    1336012    2557988          0     144944    1004716
>>> -/+ buffers/cache:     186352    3707648
>>> Swap:      2031608          0    2031608
>>>
>>>
>>> Procedure To Crash 389's ns-slapd:
>>>
>>> a) In the NDS 6.2 admin console, create a new replication agreement for the "o=This Big Net" subtree, and choose to "Create consumer initialization file".
>>>
>>> b) Copy the file to the 389 server.
>>>
>>> c) In the 389 2.6 admin console for the Directory Server, in the Configuration tab (Data -> o=This Big Net -> dbRoot), right-click and choose "Initialize Database". Use the ldif file copied over.
>>>
>>> The ns-slapd process crashes, and I always get this in /opt/dirsrv/var/log/dirsrv/slapd-cwlab-02/errors as the last two lines:
>>>
>>> [03/Mar/2010:12:50:04 -0500] - import ldapAuthRoot: Processing file "/home/cwood/tbn.ldif"
>>> [03/Mar/2010:12:50:04 -0500] - => str2entry_dupcheck
>>>
>>>
>>> Other Details:
>>>
>>>
>>> I found two bugs with the str2entry_dupcheck string in it, but they don't seem pertinent:
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=548115
>>> https://bugzilla.redhat.com/show_bug.cgi?id=243488
>>>
>>>
>>> This says that str2entry_dupcheck could be about two things:
>>>
>>> http://docs.sun.com/source/816-6699-10/ax_errcd.html
>>>
>>> "While attempting to convert a string entry to an LDAP entry, the server found that the entry has no DN."
>>>
>>> "The server failed to add a value to the value tree."
>>>
>>> (But this is an exported database from NDS 6.2, and I'm fairly sure, without reading them all, that every entry will have a DN.)
>>>   
>>>       
>> The log message
>> [03/Mar/2010:12:50:04 -0500] - => str2entry_dupcheck
>>
>> is just trace information, not a report of a problem or error.
>>
>> Does the crash happen almost immediately?  Or does it take a while?  If 
>> the problem happens quickly, it would be worthwhile to scan the first 
>> couple of dozen entries looking for things like - entries without a DN - 
>> attributes without a value
>>     
>
> I checked, and I couldn't see any data errors of this type.
>  
>   
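For a quick mechanical version of that check, here is a minimal Python sketch that scans the first couple of dozen entries of an LDIF file for entries that do not start with a DN and for attributes with no value. The function name scan_ldif and the sample data are made up for illustration, and the parsing is deliberately simplified (it unfolds continuation lines and skips comments, nothing more). An empty value can be legal LDIF, so treat anything it flags as something to look at by hand rather than as a definite error.

```python
import io

def scan_ldif(lines, max_entries=24):
    """Check the first max_entries LDIF entries; return (entry_no, problem) tuples."""
    problems = []
    entry = []            # unfolded lines of the entry being read
    count = 0
    in_comment = False    # True while skipping a (possibly folded) comment

    def check(entry_lines, number):
        if not entry_lines[0].lower().startswith("dn:"):
            problems.append((number, "entry does not start with a DN"))
        for line in entry_lines:
            name, sep, value = line.partition(":")
            if not sep:
                problems.append((number, "malformed line: %r" % line))
            elif value.lstrip(":<").strip() == "":
                problems.append((number, "attribute without a value: %r" % name))

    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(" "):
            # Continuation of the previous logical line (or of a comment).
            if entry and not in_comment:
                entry[-1] += line[1:]
            continue
        in_comment = line.startswith("#")
        if in_comment:
            continue
        if line == "":
            # A blank line ends the current entry, if there is one.
            if entry:
                count += 1
                check(entry, count)
                entry = []
                if count >= max_entries:
                    break
            continue
        if not entry and line.lower().startswith("version:"):
            continue      # optional "version: 1" header
        entry.append(line)
    else:
        # End of input reached without hitting max_entries.
        if entry:
            count += 1
            check(entry, count)
    return problems

# Hypothetical sample data, not taken from the real export.
SAMPLE = """version: 1

dn: o=Example
objectClass: top
objectClass: organization
o: Example

objectClass: person
cn: No DN Here

dn: cn=bob,o=Example
objectClass: person
cn: bob
description:
"""

for number, problem in scan_ldif(io.StringIO(SAMPLE)):
    print("entry %d: %s" % (number, problem))
```

To run it against a real file, pass an open file object instead of the sample, for example scan_ldif(open("/home/cwood/tbn.ldif")).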
>>> If 389 is trying to check for duplicate entries, perhaps there are simply too many DNs?
>>>
>>> $ grep '^dn:' tbn.ldif | wc -l
>>> 636985
>>> $ ls -lh tbn.ldif 
>>> -rw-r--r-- 1 cwood cwood 755M Mar  3 11:24 tbn.ldif
>>>   
>>>       
>> No.  The server should be able to handle this much data easily.  And it 
>> must check for duplicate entries.
>>     
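If you want to rule out duplicate DNs in the file yourself before importing, a rough sketch along these lines may help. It does the same count as the grep above and also reports DNs that appear more than once. The function name dn_counts and the sample data are made up for illustration, and the normalization (lowercasing and trimming spaces around commas) is only an approximation of what the server does, so this is a sanity check and not a substitute for the server's own duplicate check.

```python
import base64
import io
from collections import Counter

def dn_counts(lines):
    """Count roughly-normalized DNs in an LDIF stream."""
    counts = Counter()
    current = None  # the dn line being assembled (it may be folded)

    def flush():
        nonlocal current
        if current is None:
            return
        if current[:4].lower() == "dn::":
            # Base64-encoded DN
            dn = base64.b64decode(current[4:].strip()).decode("utf-8")
        else:
            dn = current[3:].strip()
        # Approximate normalization: lowercase, trim spaces around commas.
        key = ",".join(part.strip() for part in dn.lower().split(","))
        counts[key] += 1
        current = None

    for raw in lines:
        line = raw.rstrip("\r\n")
        if current is not None and line.startswith(" "):
            current += line[1:]      # unfold a continued dn line
            continue
        flush()
        if line.lower().startswith("dn:"):
            current = line
    flush()
    return counts

# Hypothetical sample data, not taken from the real export.
SAMPLE = """dn: o=Example
o: Example

dn: cn=bob, o=Example
cn: bob

dn: CN=Bob,o=example
cn: Bob
"""

counts = dn_counts(io.StringIO(SAMPLE))
print("total DNs:", sum(counts.values()))
for dn, n in counts.items():
    if n > 1:
        print("duplicate (%d times): %s" % (n, dn))
```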
>>> Per the instructions here:
>>>
>>> http://directory.fedoraproject.org/wiki/FAQ#Troubleshooting
>>>
>>> I set my debug logging first to 24579:
>>>
>>> 1 	 Trace function calls 
>>> 2 	 Debug packet handling 
>>> 8192 	 Replication debugging 
>>> 16384 	 Critical messages
>>>
>>> Then for the next try at reading logs I set it to 90115, the above plus:
>>>
>>> 65536 	 Plug-in debugging
>>>
>>> However, every time the log ended with the same set of lines noted above.
>>>   
>>>       
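As a cross-check on the numbers above: the error log level is a sum of bit flags, so 24579 and 90115 can be rebuilt from the values in the FAQ table. The short Python sketch below does the arithmetic in both directions. The dictionary keys are just labels chosen for this example, not names the server knows about; only the numeric values come from the table quoted above.

```python
# Flag values as listed in the FAQ table quoted above.
# The key names are labels made up for this example.
LOG_LEVELS = {
    "trace": 1,            # Trace function calls
    "packets": 2,          # Debug packet handling
    "replication": 8192,   # Replication debugging
    "critical": 16384,     # Critical messages
    "plugin": 65536,       # Plug-in debugging
}

def combine(*names):
    """Return the numeric log level for the given flag labels."""
    level = 0
    for name in names:
        level |= LOG_LEVELS[name]
    return level

def decode(level):
    """Return the flag labels that are set in a numeric log level."""
    return [name for name, bit in LOG_LEVELS.items() if level & bit]

first_try = combine("trace", "packets", "replication", "critical")
second_try = combine("trace", "packets", "replication", "critical", "plugin")
print(first_try, second_try)
print(decode(second_try))
```

The resulting number is what goes into the error log level setting (in current releases, to my knowledge, the nsslapd-errorlog-level attribute of cn=config).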
>> Trace (1) is the most useful level for this particular problem, but as 
>> you have found, it does not tell you much here.
>>
>> I think the next step would be to build the server with full debugging 
>> information (use -g and omit -O2 or any other -Ox) and get a stack trace 
>> with full debug information.
>>     
>
> After recompiling I got different results. It still crashed, but after a different sequence of actions. This time I didn't create the other subtrees; I went straight for the "TBN Net" subtree. I couldn't reproduce the immediate crash, but it did crash when I tried to stop the import (rather than keep watching the errors noted below).
>
> I went back to my original compile (-g -O2), and also couldn't reproduce the instant crash under gdb. Here's the "where" output for that:
>
> (gdb) where
> #0  0x002318bd in pthread_mutex_lock () from /lib/libpthread.so.0
> #1  0x00d91536 in pthread_mutex_lock () from /lib/libc.so.6
> #2  0x001f8782 in PR_Lock () from /usr/lib/libnspr4.so
> #3  0x00b302fb in cache_clear (cache=0xab7042bc, type=0) at ldap/servers/slapd/back-ldbm/cache.c:585
> #4  0x00b46d9d in import_main_offline (arg=0xab704268) at ldap/servers/slapd/back-ldbm/import.c:1300
> #5  0x00b47e4d in import_main (arg=0xab704268) at ldap/servers/slapd/back-ldbm/import.c:1386
> #6  0x001fe6ed in ?? () from /usr/lib/libnspr4.so
> #7  0x0022f5ab in start_thread () from /lib/libpthread.so.0
> #8  0x00d84cfe in clone () from /lib/libc.so.6
>
> Everything below is for my new compile, with -g but no -O2.
>
> I used this to attach gdb to the processes:
>
> gdb /opt/dirsrv/sbin/ns-slapd 12799
>
> I got a large number of these in the error log. Possibly this is one source of my problem, needing to move the schema from Netscape Directory Server 6.2 before I move the subtrees.
>
> [04/Mar/2010:11:45:56 -0500] - Entry "ldapAuthLogin=bob, ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=TBN Net" has unknown object class "ldapauthvirtuseroc"
> [04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: WARNING: skipping entry "ldapAuthLogin=bob, ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=TBN Net" which violates schema, ending line 2182899 of file "/home/cwood/tbn.ldif"
>   
This is definitely one source of your problems.  You must update the 
schema before you can import the data.  Even so, the server should not 
crash.
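To find out which schema definitions need to be carried over, it can help to boil the error log down to a list of the object classes the server complained about. The following is a minimal Python sketch that does this, assuming the message format shown in the log lines quoted above. The function name is made up for illustration, and the sample input is just those two log lines.

```python
import io
import re
from collections import Counter

# Matches the warning format seen in the errors log quoted above.
UNKNOWN_OC = re.compile(r'has unknown object class "([^"]+)"')

def unknown_object_classes(log_lines):
    """Count how often each unknown object class is reported."""
    counts = Counter()
    for line in log_lines:
        match = UNKNOWN_OC.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts

# The two log lines quoted above, used here as sample input.
SAMPLE_LOG = (
    '[04/Mar/2010:11:45:56 -0500] - Entry "ldapAuthLogin=bob, '
    'ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=TBN Net" '
    'has unknown object class "ldapauthvirtuseroc"\n'
    '[04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: WARNING: skipping '
    'entry "ldapAuthLogin=bob, ldapAuthRealmName=mycompany.com, '
    'ou=UsersByRealm, o=TBN Net" which violates schema, ending line 2182899 '
    'of file "/home/cwood/tbn.ldif"\n'
)

for name, count in unknown_object_classes(io.StringIO(SAMPLE_LOG)).most_common():
    print("%s: %d" % (name, count))
```

Run against the real errors file (pass open("/opt/dirsrv/var/log/dirsrv/slapd-cwlab-02/errors") instead of the sample), this should give a short list of object classes to look up in the old server's schema.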

If you run the import job in gdb, and it crashes, gdb should report a 
SIGSEGV.  Since the import is multi-threaded, the crash may not be in 
the current thread.  In this case, you should dump the stack of all 
threads like this:
(gdb) thread apply all bt
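If you already have a core file, the same dump can be collected without an interactive session. Here is a small Python sketch that builds and runs the gdb command in batch mode, using gdb's -batch and -ex options. The function names are made up for this example, and the core file path in the usage note is a placeholder you would replace with your own.

```python
import subprocess

def gdb_all_threads_cmd(binary, core):
    """Build a gdb command line that prints the stack of every thread."""
    return ["gdb", "-batch", "-ex", "thread apply all bt", binary, core]

def dump_all_threads(binary, core):
    """Run gdb non-interactively and return its output as text."""
    result = subprocess.run(
        gdb_all_threads_cmd(binary, core),
        capture_output=True,   # needs Python 3.7 or later
        text=True,
        check=False,           # return whatever gdb printed, even on error
    )
    return result.stdout

# Show the command that would be run, without actually running gdb.
print(" ".join(gdb_all_threads_cmd("/opt/dirsrv/sbin/ns-slapd", "/path/to/core.pid")))
```

Calling dump_all_threads("/opt/dirsrv/sbin/ns-slapd", "/path/to/core.pid") and saving the returned text to a file gives you something you can attach to a bug report or mail to the list.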
> When I tried to cancel my import the log ended here, and ns-slapd crashed.
>
> [04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: Aborting all import threads...
>
> It took me a bit to figure out how to get a core file in gdb, but I eventually got one.
>
> This is the result of "back" when I didn't have a core file:
>
> (gdb) back
> #0  0x00e34655 in slapi_task_log_notice (task=0x32317472, format=0xae3be6 "%s") at ldap/servers/slapd/task.c:231
> #1  0x00a9bdd6 in import_log_notice (job=0x8a334f0, format=0xae3cff "Import threads aborted.") at ldap/servers/slapd/back-ldbm/import.c:193
> #2  0x00a9dad7 in import_main_offline (arg=0x8a334f0) at ldap/servers/slapd/back-ldbm/import.c:1196
> #3  0x00a9e01f in import_main (arg=0x8a334f0) at ldap/servers/slapd/back-ldbm/import.c:1386
> #4  0x001fe6ed in ?? () from /usr/lib/libnspr4.so
> #5  0x002165ab in start_thread () from /lib/libpthread.so.0
> #6  0x0081dcfe in clone () from /lib/libc.so.6
>
> This is what gdb printed to the terminal when the SIGSEGVs came in:
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0xb26fdb90 (LWP 13323)]
> 0x009b8655 in slapi_task_log_notice (task=0x7972636e, format=0xf1fbe6 "%s") at ldap/servers/slapd/task.c:231
> 231         len = 2 + strlen(buffer) + (task->task_log ? strlen(task->task_log) : 0);
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0xb26fdb90 (LWP 14197)]
> 0x00ca5655 in slapi_task_log_notice (task=0x7865646e, format=0xe90be6 "%s") at ldap/servers/slapd/task.c:231
> 231         len = 2 + strlen(buffer) + (task->task_log ? strlen(task->task_log) : 0);
>
> When I figured out getting a core dump with my debug build, this is "where":
>
> (gdb) where
> #0  0x00d2309b in strlen () from /lib/libc.so.6
> #1  0x00d22e05 in strdup () from /lib/libc.so.6
> #2  0x0013e051 in slapi_ch_strdup (s1=0x61 <Address 0x61 out of bounds>) at ldap/servers/slapd/ch_malloc.c:277
> #3  0x0014509c in slapi_sdn_get_ndn (sdn=0xb29cbc4c) at ldap/servers/slapd/dn.c:1229
> #4  0x00177ba7 in op_shared_modify (pb=0xb29cdd6c, pw_change=0, old_pw=0x0) at ldap/servers/slapd/modify.c:582
> #5  0x001779d0 in modify_internal_pb (pb=0xb29cdd6c) at ldap/servers/slapd/modify.c:526
> #6  0x00177684 in slapi_modify_internal_pb (pb=0xb29cdd6c) at ldap/servers/slapd/modify.c:416
> #7  0x001aa772 in modify_internal_entry (dn=0x61 <Address 0x61 out of bounds>, mods=0xb29cdf38) at ldap/servers/slapd/task.c:626
> #8  0x001a9d58 in slapi_task_status_changed (task=0xb000f928) at ldap/servers/slapd/task.c:299
> #9  0x001a9883 in slapi_task_log_notice (task=0xb000f928, format=0xbc8be6 "%s") at ldap/servers/slapd/task.c:264
> #10 0x00b80dd6 in import_log_notice (job=0xb0009318, format=0xbc8cff "Import threads aborted.") at ldap/servers/slapd/back-ldbm/import.c:193
> #11 0x00b82ad7 in import_main_offline (arg=0xb0009318) at ldap/servers/slapd/back-ldbm/import.c:1196
> #12 0x00b8301f in import_main (arg=0xb0009318) at ldap/servers/slapd/back-ldbm/import.c:1386
> #13 0x002516ed in ?? () from /usr/lib/libnspr4.so
> #14 0x001e65ab in start_thread () from /lib/libpthread.so.0
> #15 0x00d84cfe in clone () from /lib/libc.so.6
>   
This is a known problem:
Bug 515805 <https://bugzilla.redhat.com/show_bug.cgi?id=515805> - Stop 
"initialize Database" crashes the server
> --
> 389 users mailing list
> 389-users@xxxxxxxxxxxxxxxxxxxxxxx
> https://admin.fedoraproject.org/mailman/listinfo/389-users
>   
