On Tue, 11 Mar 2014 16:38:51 -0600
Rich Megginson <rmeggins@xxxxxxxxxx> wrote:

> On 03/11/2014 04:09 PM, Timothy Pollard wrote:
> > On Tue, 11 Mar 2014 07:17:25 -0600
> > Rich Megginson <rmeggins@xxxxxxxxxx> wrote:
> >> On 03/10/2014 09:17 PM, Timothy Pollard wrote:
> >>> On Mon, 10 Mar 2014 20:56:08 -0600
> >>> Rich Megginson <rmeggins@xxxxxxxxxx> wrote:
> >>>> On 03/10/2014 08:42 PM, Timothy Pollard wrote:
> >>>>> A small update; we're now
> >>>> Now as opposed to some time in the past? At what point did you begin
> >>>> seeing these messages, and what changed?
> >>> It looks like it started after I manually "fixed" the entry.
> >> What exactly did you do to fix the entry?
> > I edited it and filled it what looked like the missing values (which I
> > copied from an old LDIF file):
> >
> > dNSClass: IN
> > zoneName: cvsdude.com
> > relativeDomainName: testingstatus
> > objectClass: top
> > objectClass: dNSZone
> > dNSTTL: 100
>
> Did you use ldapdelete to delete old one and ldapmodify/ldapadd to add this
> fixed one?

I actually used ldapvi, which backs onto ldapmodify, so it won't have deleted
it, just modified it.

> >
> >>> As I said it is a
> >>> test entry, so I'm happy to delete it entirely and recreate it if you
> >>> think this will fix the issue,
> >> I don't think it will fix the issue, but it may help reproduce it more
> >> easily.
> >>
> >>> but I can hold off on that if you'd like me to find
> >>> out more information.
> >> If you are not experiencing the "non-contiguous" problem now, there's not
> >> much information to get.
> >>
> > We're not seeing the non-contiguous problem any more, but we are seeing
> > repeated DB crashes:
> >
> > [11/Mar/2014:21:57:14 +0000] - libdb: dnsRoot/id2entry.db4 page 36132 is on
> > free list with type 5 [11/Mar/2014:21:57:14 +0000] - libdb: PANIC: Invalid
> > argument [11/Mar/2014:21:57:14 +0000] - libdb: PANIC: fatal region error
> > detected; run recovery [11/Mar/2014:21:57:14 +0000] - Serious
> > Error---Failed in dblayer_txn_abort, err=-30974 (DB_RUNRECOVERY: Fatal
> > error, run database recovery) [11/Mar/2014:21:57:14 +0000] - libdb: PANIC:
> > fatal region error detected; run recovery [11/Mar/2014:21:57:14 +0000] -
> > FATAL ERROR at idl_new.c (1); server stopping as database recovery needed.
>
> I don't suppose you are running out of disk space? Any other disk errors?
> Is this a VM with a virtual disk image holding the db?

We have plenty of disk space, haven't seen any other disk issues, and can't
find any obvious entries in dmesg or /var/log/messages.

> >
> > This happens within a few minutes after every restart of the daemon. I'm not
> > sure if this is related though. It (the new DB error) first occurred after
> > ns-slapd was killed by the oom-killer. Could that cause database corruption?
>
> It is not supposed to, but it is a possibility.
>
> >
> > It also looks like we might need to do some memory tuning on 389, is there
> > some suggested documentation on that, or should I just google it?
> https://access.redhat.com/site/documentation/en-US/Red_Hat_Directory_Server/9.0/html/Performance_Tuning_Guide/index.html
> is a good place to start

OK, thanks.

> >
> > At the moment we've switched to our other master (we use a multi-master
> > replication setup), so we'll probably just rebuild the problem server from
> > there, but is there anything that I should look at to diagnose the problem
> > first?
>
> I'm not sure. Looks like we are now working on several different problems in
> various states of knowledge/severity . . .
>

Yeah, that's the problem with our system: we can't really tell if it's one
problem with many symptoms, or multiple different problems. I think we might
need to get in an LDAP consultant.

Thanks for your help; anything else you can point me at to try would be much
appreciated.

--
TimP
[http://blog.timp.com.au] [http://resume.timp.com.au]
--
389 users mailing list
389-users@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/389-users