On Mon, Mar 15, 2010 at 12:57:08PM -0600, Rich Megginson wrote: > Christopher Wood wrote: > > On Thu, Mar 04, 2010 at 12:06:31PM -0700, Rich Megginson wrote: > > > >> Christopher Wood wrote: > >> > >>> On Wed, Mar 03, 2010 at 08:30:19PM -0700, Rich Megginson wrote: > >>> > >>> > >>>> Christopher Wood wrote: > >>>> > >>>> > >>>>> I'm just getting started with 389 Directory Server (at work), and I've run into an issue that I'm not certain how to troubleshoot. I would greatly appreciate any assistance or tips you could offer, especially on where to look to see what's failing. > >>>>> > >>>>> Also, I apologize in advance for changing strings related to my employer's directory names and such, as I'm not comfortable with leaking that level information to a public list. > >>>>> > >>>>> > >>>>> > >>>> As well you should be - you should always obscure sensitive information > >>>> like this. > >>>> > >>>> > >>>>> Overview: > >>>>> > >>>>> Initializing a large subtree from NDS 6.2 crashes ns-slapd, but other subtrees are fine. > >>>>> > >>>>> > >>>>> Top-Level Questions: > >>>>> > >>>>> 1) How do I stop ns-slapd from crashing? > >>>>> > >>>>> > >>>>> > >>>> Good question. > >>>> > >>>> > >>>>> 2) How do I figure out what precisely is causing the crash? (With various levels of debug logging I get the same log entry.) > >>>>> > >>>>> > >>>>> > >>>> You've already used the TRACE level (1) for logging - that's as verbose > >>>> as it gets for this particular operation. Next step would be to try to > >>>> get a core file. > >>>> > >>>> > >>>>> 3) Is it possible to simply import my initialization ldif without duplication checks? > >>>>> > >>>>> > >>>>> > >>>> No. > >>>> > >>>> > >>>>> Background: > >>>>> > >>>>> At work we have NDS 6.2 (single master on a physical server, virtual machine slaves), and would like to move our directories intact to a 389 2.6 installation via replication. > >>>>> > >>>>> > >>>>> > >>>> What platform/OS? 32-bit or 64-bit? By NDS 6.2 I'm assuming you mean > >>>> Netscape Directory Server - by 2.6 I'm assuming you mean 1.2.6.a1 (a2 > >>>> should be hitting the mirrors tomorrow). > >>>> > >>>> > >>> 32 bit > >>> > >>> > >>> > >>>>> I already have replicated several of our NDS 6.2 subtrees to 389 2.6 with no difficulties. > >>>>> > >>>>> I compiled our 389 installation from the source packages downloaded from http://directory.fedoraproject.org/wiki/Source. > >>>>> > >>>>> > >>>> Did you grab 389-ds-base 1.2.6.a1 or 1.2.6.a2? > >>>> > >>>> > >>> I used 1.2.6.a1 to compile originally and produce core files to answer your questions. Next I'll try this with 1.2.6.a2, but I'd rather keep the same version when trying to initially reproduce something. > >>> > >>> > >>> > >>>> What compiler flags did you use? > >>>> > >>>> > >>> The makefile that came out of ./configure had these: > >>> > >>> CCASFLAGS = -g -O2 > >>> CFLAGS = -g -O2 > >>> CXXFLAGS = -g -O2 > >>> > >>> For the plain debug build I edited that to insert these and rebuilt with make, make install: > >>> > >>> CCASFLAGS = -g > >>> CFLAGS = -g > >>> CXXFLAGS = -g > >>> > >>> (Fair warning that I'm not a programmer, so I'm not entirely sure doing that was right.) > >>> > >>> > >>> > >> Note that you don't have to edit the Makefile - you can do a make > >> distclean, then run configure like this: > >> > CFLAGS="-g" /path/to/configure ... > >> > make > >> > >> But that looks right, anyway. Note that if you change the flags like > >> this by editing the makefile, you will have to do a make clean to remove > >> the old object files, so that they will be rebuilt with the new flags. > >> > > > > I did a make clean before rebuilding in this case. For my later debug builds I've used your step to re-run configure. > > > > > >>>> Do you have a core file? If so, try using gdb > >>>> gdb /path/to/ns-slapd /path/to/core.pid > >>>> once in gdb, type the "where" command > >>>> (gdb) where > >>>> > >>>> > >>> The original crash didn't produce a core file, but I could get one by attaching gdb later, to both the original build and a debug build. > >>> > >>> > >>> > >>>>> The underlying platform is: > >>>>> > >>>>> $ uname -a > >>>>> Linux cwlab-02.mycompany.com 2.6.18-164.el5 #1 SMP Thu Sep 3 03:33:56 EDT 2009 i686 i686 i386 GNU/Linux > >>>>> $ cat /etc/redhat-release > >>>>> CentOS release 5.4 (Final) > >>>>> > >>>>> $ free > >>>>> total used free shared buffers cached > >>>>> Mem: 3894000 1336012 2557988 0 144944 1004716 > >>>>> -/+ buffers/cache: 186352 3707648 > >>>>> Swap: 2031608 0 2031608 > >>>>> > >>>>> > >>>>> Procedure To Crash 389's ns-slapd: > >>>>> > >>>>> a) In the NDS 6.2 admin console, create a new replication agreement for the "o=This Big Net" subtree, and choose to "Create consumer initialization file". > >>>>> > >>>>> b) Copy the file to the 389 server. > >>>>> > >>>>> c) In the 389 2.6 admin console for the Directory Server, in the Configuration tab (Data -> o=This Big Net -> dbRoot), right-click and choose "Initialize Database". Use the ldif file copied over. > >>>>> > >>>>> The ns-slapd process crashes, and I always get this in /opt/dirsrv/var/log/dirsrv/slapd-cwlab-02/errors as the last two lines: > >>>>> > >>>>> [03/Mar/2010:12:50:04 -0500] - import ldapAuthRoot: Processing file "/home/cwood/tbn.ldif" > >>>>> [03/Mar/2010:12:50:04 -0500] - => str2entry_dupcheck > >>>>> > >>>>> > >>>>> Other Details: > >>>>> > >>>>> > >>>>> I found two bugs with the str2entry_dupcheck string in it, but they don't seem pertinent: > >>>>> > >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=548115 > >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=243488 > >>>>> > >>>>> > >>>>> This says that str2entry_dupcheck could be about two things: > >>>>> > >>>>> http://docs.sun.com/source/816-6699-10/ax_errcd.html > >>>>> > >>>>> "While attempting to convert a string entry to an LDAP entry, the server found that the entry has no DN." > >>>>> > >>>>> "The server failed to add a value to the value tree." > >>>>> > >>>>> (But this is an exported database from NDS 6.2, and I'm fairly sure, without reading them all, that every entry will have a DN.) > >>>>> > >>>>> > >>>>> > >>>> The log message > >>>> [03/Mar/2010:12:50:04 -0500] - => str2entry_dupcheck > >>>> > >>>> is just trace information, not a report of a problem or error. > >>>> > >>>> Does the crash happen almost immediately? Or does it take a while? If > >>>> the problem happens quickly, it would be worthwhile to scan the first > >>>> couple of dozen entries looking for things like - entries without a DN - > >>>> attributes without a value > >>>> > >>>> > >>> I checked, and I couldn't see any data errors of this type. > >>> > >>> > >>> > >>>>> If 389 is trying to check for duplicate entries, perhaps there are simply too many DNs? > >>>>> > >>>>> $ grep '^dn:' tbn.ldif | wc -l > >>>>> 636985 > >>>>> $ ls -lh acc.ldif > >>>>> -rw-r--r-- 1 cwood cwood 755M Mar 3 11:24 tbn.ldif > >>>>> > >>>>> > >>>>> > >>>> No. The server should be able to handle this much data easily. And it > >>>> must check for duplicate entries. > >>>> > >>>> > >>>>> Per the instructions here: > >>>>> > >>>>> http://directory.fedoraproject.org/wiki/FAQ#Troubleshooting > >>>>> > >>>>> I set my debug logging first to 24579: > >>>>> > >>>>> 1 Trace function calls > >>>>> 2 Debug packet handling > >>>>> 8192 Replication debugging > >>>>> 16384 Critical messages > >>>>> > >>>>> Then for the next try at reading logs I set it to 90115, the above plus: > >>>>> > >>>>> 65536 Plug-in debugging > >>>>> > >>>>> However, every time the log ended with the same set of lines noted above. > >>>>> > >>>>> > >>>>> > >>>> 1 Trace is really the best for this particular problem, and as you have > >>>> found it is limited for this particular problem. > >>>> > >>>> I think the next step would be to build the server with full debugging > >>>> information (use -g and omit -O2 or any other -Ox) and get a stack trace > >>>> with full debug information. > >>>> > >>>> > >>> After recompiling I got different results. Still a crash, but after a different sequence of actions. This time I didn't create the other subtrees, I went straight for the "TBN Net" subtree. I couldn't reproduce the immediate crash, but it did crash after I tried to stop the import (rather than watch the errors noted below). > >>> > >>> I went back to my original compile (-g -O2), and also couldn't reproduce the instant crash under gdb. Here's the "where" output for that: > >>> > >>> (gdb) where > >>> #0 0x002318bd in pthread_mutex_lock () from /lib/libpthread.so.0 > >>> #1 0x00d91536 in pthread_mutex_lock () from /lib/libc.so.6 > >>> #2 0x001f8782 in PR_Lock () from /usr/lib/libnspr4.so > >>> #3 0x00b302fb in cache_clear (cache=0xab7042bc, type=0) at ldap/servers/slapd/back-ldbm/cache.c:585 > >>> #4 0x00b46d9d in import_main_offline (arg=0xab704268) at ldap/servers/slapd/back-ldbm/import.c:1300 > >>> #5 0x00b47e4d in import_main (arg=0xab704268) at ldap/servers/slapd/back-ldbm/import.c:1386 > >>> #6 0x001fe6ed in ?? () from /usr/lib/libnspr4.so > >>> #7 0x0022f5ab in start_thread () from /lib/libpthread.so.0 > >>> #8 0x00d84cfe in clone () from /lib/libc.so.6 > >>> > >>> Everything below is for my new compile, with -g but no -O2. > >>> > >>> I used this to attach gdb to the processes: > >>> > >>> gdb /opt/dirsrv/sbin/ns-slapd 12799 > >>> > >>> I got a large number of these in the error log. Possibly this is one source of my problem, needing to move the schema from Netscape Directory Server 6.2 before I move the subtrees. > >>> > >>> [04/Mar/2010:11:45:56 -0500] - Entry "ldapAuthLogin=bob, ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=TBN Net" has unknown object class "ldapauthvirtuseroc" > >>> [04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: WARNING: skipping entry "ldapAuthLogin=bob, ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=TBN Net" which violates schema, ending line 2182899 of file "/home/cwood/tbn.ldif" > >>> > >>> > >> This is definitely a source of your problems. You must update the > >> schema before you can import the data. Although the server should not > >> crash either. > >> > > > > I was having some trouble updating just the schema, so I grossly hacked up migrate-ds.pl, DSMigration::fixup99user and DSMigration::migrateSchema. With that I managed to get the schema from the old Netscape Directory Server 6.2 99user.ldif to a new 389 1.2.5 98mycompany.ldif. The new schema showed through in the admin gui, so I take it that I was successful in getting my schema updated. > > > > I've been tearing my hair out trying to reproduce the crash and/or get my data imported, even with the schema added. (I eyeballed the schema and a failing directory entry, and they seemed to match to the naked eye.) > Even though the server should not crash, I believe that if we could get > the syntax problem taken care of, you could import your data without a > crash (according to the stack trace below). > > The message appears to be entry \"ldapAuthLogin=username, > ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=This Realm\" which > violates attribute syntax, ending line 4701511 of file \"/home/cwood/dump.ld > > Would it be possible for you to post excerpts from that file at or > around that line/entry, along with the definition of the attributes used? > > Just when I gave up on trying to get it to crash, it did so once more (and not since then). However, I did get this, not sure if it's of any use: > > > > Program received signal SIGABRT, Aborted. > > [Switching to Thread 0x9bdd9b90 (LWP 10174)] > > 0x003cc402 in __kernel_vsyscall () > > (gdb) where > > #0 0x003cc402 in __kernel_vsyscall () > > #1 0x001a3df0 in raise () from /lib/libc.so.6 > > #2 0x001a5701 in abort () from /lib/libc.so.6 > > #3 0x001dc28b in __libc_message () from /lib/libc.so.6 > > #4 0x001e4595 in _int_free () from /lib/libc.so.6 > > #5 0x001e66ac in _int_realloc () from /lib/libc.so.6 > > #6 0x001e7276 in realloc () from /lib/libc.so.6 > > #7 0x00da781e in slapi_ch_realloc ( > > block=0x8b886170 "p\003??p\003??ha\210\213ha\210\213g entry \"ldapAuthLogin=username, ldapAuthRealmName=mycompany.com, ou=UsersByRealm, o=This Realm\" which violates attribute syntax, ending line 4701511 of file \"/home/cwood/dump.ldi"..., > > size=6752) at ldap/servers/slapd/ch_malloc.c:199 > > #8 0x00e1184e in slapi_task_log_notice (task=0x8f97460, format=0xc14b2e "%s") > > at ldap/servers/slapd/task.c:254 > > #9 0x00bd7cce in import_log_notice (job=0xb3a28420, > > format=0xc15934 "WARNING: skipping entry \"%s\" which violates attribute syntax, ending line %d of file \"%s\"") at ldap/servers/slapd/back-ldbm/import.c:193 > > #10 0x00bdc7b9 in import_producer (param=0x8f977d0) > > at ldap/servers/slapd/back-ldbm/import-threads.c:601 > > #11 0x001686ed in ?? () from /usr/lib/libnspr4.so > > #12 0x002c55ab in start_thread () from /lib/libpthread.so.0 > > #13 0x0024ccfe in clone () from /lib/libc.so.6 > > > This is definitely a bug - please file a bug at > https://bugzilla.redhat.com/enter_bug.cgi?product=389 and please add the > above stack trace and any other information. https://bugzilla.redhat.com/show_bug.cgi?id=573896 Stack trace, gdb where, and the error log all included. > This looks like a double free, or a free/realloc of un-malloc'd memory. > Would it be possible for you to run the server using valgrind? > I have a script here http://github.com/richm/scripts called ns-slapd.sh > that you can use to replace ns-slapd (moving it to ns-slapd.orig) and > run the server/import/whatever using valgrind. > > > > > >> If you run the import job in gdb, and it crashes, gdb should report a > >> SIGSEGV. Since the import is multi-threaded, the crash may not be in > >> the current thread. In this case, you should dump the stack of all > >> threads like this: > >> (gdb) thread apply all bt > >> > > > > Sadly I only remembered gdb "where", but if it crashes again I'll get this part right. > > > > > >>> When I tried to cancel my import the log ended here, and ns-slapd crashed. > >>> > >>> [04/Mar/2010:11:45:56 -0500] - import ldapAuthRoot: Aborting all import threads... > >>> > >>> It took me a bit to figure out how to get a core file in gdb, but I eventually got one. > >>> > >>> This is the result of "back" when I didn't have a core file: > >>> > >>> (gdb) back > >>> #0 0x00e34655 in slapi_task_log_notice (task=0x32317472, format=0xae3be6 "%s") at ldap/servers/slapd/task.c:231 > >>> #1 0x00a9bdd6 in import_log_notice (job=0x8a334f0, format=0xae3cff "Import threads aborted.") at ldap/servers/slapd/back-ldbm/import.c:193 > >>> #2 0x00a9dad7 in import_main_offline (arg=0x8a334f0) at ldap/servers/slapd/back-ldbm/import.c:1196 > >>> #3 0x00a9e01f in import_main (arg=0x8a334f0) at ldap/servers/slapd/back-ldbm/import.c:1386 > >>> #4 0x001fe6ed in ?? () from /usr/lib/libnspr4.so > >>> #5 0x002165ab in start_thread () from /lib/libpthread.so.0 > >>> #6 0x0081dcfe in clone () from /lib/libc.so.6 > >>> > >>> This is what gdb printed to the terminal when the SIGSEGVs came in: > >>> > >>> Program received signal SIGSEGV, Segmentation fault. > >>> [Switching to Thread 0xb26fdb90 (LWP 13323)] > >>> 0x009b8655 in slapi_task_log_notice (task=0x7972636e, format=0xf1fbe6 "%s") at ldap/servers/slapd/task.c:231 > >>> 231 len = 2 + strlen(buffer) + (task->task_log ? strlen(task->task_log) : 0); > >>> > >>> Program received signal SIGSEGV, Segmentation fault. > >>> [Switching to Thread 0xb26fdb90 (LWP 14197)] > >>> 0x00ca5655 in slapi_task_log_notice (task=0x7865646e, format=0xe90be6 "%s") at ldap/servers/slapd/task.c:231 > >>> 231 len = 2 + strlen(buffer) + (task->task_log ? strlen(task->task_log) : 0); > >>> > >>> When I figured out getting a core dump with my debug build, this is "where": > >>> > >>> (gdb) where > >>> #0 0x00d2309b in strlen () from /lib/libc.so.6 > >>> #1 0x00d22e05 in strdup () from /lib/libc.so.6 > >>> #2 0x0013e051 in slapi_ch_strdup (s1=0x61 <Address 0x61 out of bounds>) at ldap/servers/slapd/ch_malloc.c:277 > >>> #3 0x0014509c in slapi_sdn_get_ndn (sdn=0xb29cbc4c) at ldap/servers/slapd/dn.c:1229 > >>> #4 0x00177ba7 in op_shared_modify (pb=0xb29cdd6c, pw_change=0, old_pw=0x0) at ldap/servers/slapd/modify.c:582 > >>> #5 0x001779d0 in modify_internal_pb (pb=0xb29cdd6c) at ldap/servers/slapd/modify.c:526 > >>> #6 0x00177684 in slapi_modify_internal_pb (pb=0xb29cdd6c) at ldap/servers/slapd/modify.c:416 > >>> #7 0x001aa772 in modify_internal_entry (dn=0x61 <Address 0x61 out of bounds>, mods=0xb29cdf38) at ldap/servers/slapd/task.c:626 > >>> #8 0x001a9d58 in slapi_task_status_changed (task=0xb000f928) at ldap/servers/slapd/task.c:299 > >>> #9 0x001a9883 in slapi_task_log_notice (task=0xb000f928, format=0xbc8be6 "%s") at ldap/servers/slapd/task.c:264 > >>> #10 0x00b80dd6 in import_log_notice (job=0xb0009318, format=0xbc8cff "Import threads aborted.") at ldap/servers/slapd/back-ldbm/import.c:193 > >>> #11 0x00b82ad7 in import_main_offline (arg=0xb0009318) at ldap/servers/slapd/back-ldbm/import.c:1196 > >>> #12 0x00b8301f in import_main (arg=0xb0009318) at ldap/servers/slapd/back-ldbm/import.c:1386 > >>> #13 0x002516ed in ?? () from /usr/lib/libnspr4.so > >>> #14 0x001e65ab in start_thread () from /lib/libpthread.so.0 > >>> #15 0x00d84cfe in clone () from /lib/libc.so.6 > >>> > >>> > >> This is a known problem > >> *Bug 515805* <https://bugzilla.redhat.com/show_bug.cgi?id=515805> - Stop > >> "initialize Database" crashes the server > >> > >>> -- > >>> 389 users mailing list > >>> 389-users at lists.fedoraproject.org > >>> https://admin.fedoraproject.org/mailman/listinfo/389-users > >>> > >>> > >> -- > >> 389 users mailing list > >> 389-users at lists.fedoraproject.org > >> https://admin.fedoraproject.org/mailman/listinfo/389-users > >> > > -- > > 389 users mailing list > > 389-users at lists.fedoraproject.org > > https://admin.fedoraproject.org/mailman/listinfo/389-users > > -- > 389 users mailing list > 389-users at lists.fedoraproject.org > https://admin.fedoraproject.org/mailman/listinfo/389-users