On Wed, 2018-08-15 at 11:03 -0600, Rich Megginson wrote:
> On 08/15/2018 10:56 AM, David Boreham wrote:
> > On 8/15/2018 10:36 AM, Rich Megginson wrote:
> > > Updating the csn generator and the uuid generator will cause a
> > > lot of churn in dse.ldif. There are other housekeeping tasks
> > > which will write dse.ldif
> >
> > But if those things were being done so frequently that the
> > resulting filesystem I/O showed up on the radar as a potential
> > system-wide performance issue, that would mean something was
> > wrong somewhere, right?
>
> I would think so. Then I suppose the first step would be to
> measure the dse.ldif churn on a "normal" system to get a baseline.

We do have some poor locking strategies in some parts of the codebase that sadly I just never had time to finish fixing. Access logging comes to mind as a culprit for bottlenecking the server over locks; it needs a complete rewrite ... However, he says the access log is off. I'm not sure that means the locking around it is disabled, though.

Other areas are operation struct reuse (which takes locks and has unbounded growth with high operation numbers), the cn=config locking on many variables (even for reads), which are frequently locked/unlocked (some are at least atomics now, but they still cause stalls on CPU syncs), and locking in some plugins. These could all be parts of the issue.

I also had plans to add profiling into the access logs so we could really narrow down the time hogs in searches/writes, but given the current access log design it's hard to do. We need more visibility into the server state when these queries come in, and today we just don't have it :(

Finally, it could come down to simpler things like indexes or db locking for certain queries ...

I think we need the access log enabled, with highres etime enabled, to get a better idea. I would not be surprised to see a pattern like:

100 fast searches
1 slow search
100 fast searches
1 slow search

That would indicate issues in the logging lock.

To me, this entire issue indicates that we need better profiling information in our logs, because tools like strace just don't explain what's going on. We need to know *why* and *where* threads are getting stuck, how long each plugin is taking, and where the stalls are. Without investment in our server's internal state, these issues will always remain elusive to our users and to us as a team.

----- long description of the current access log issues -----

The current access log is protected by a single mutex and backed by an in-memory buffer. During their operations, threads write into this buffer while contending on that mutex. Normally this is "reasonably fast", until the buffer fills and needs to be flushed. The next search thread that finds the buffer approaching its limit *does the flushing itself* while holding the log mutex. At that point, one search is busy writing out the whole access log buffer, while every other thread piles up behind it, waiting to write into the buffer itself. That (poor, unlucky) operation thread is stuck doing disk IO for however long the flush takes, while everyone else waits; it may not even have begun to send results to its client! Finally, once the flush is done, it unlocks and can complete its operation. This is commonly what causes the "burst" behaviour of the server.
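To make that concrete, here is a minimal sketch of the pattern just described, assuming a POSIX threads environment. It is not the actual slapd logging code; the buffer size, names, and log line are invented purely for illustration.

/*
 * Minimal sketch of the access log pattern described above. NOT the real
 * slapd code: buffer size, names, and the log line are made up, only the
 * shape of the contention is the point.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define LOG_BUF_SIZE (64 * 1024)   /* hypothetical in-memory buffer size */
#define NUM_WORKERS  8

static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
static char log_buf[LOG_BUF_SIZE];
static size_t log_used = 0;
static FILE *log_fp;

/* Every operation thread calls this to record one access log line. */
static void
log_append(const char *line)
{
    size_t len = strlen(line);

    if (len > LOG_BUF_SIZE) {
        return;  /* ignore absurdly long lines in this sketch */
    }

    pthread_mutex_lock(&log_lock);

    /*
     * If the line does not fit, *this* operation thread flushes the whole
     * buffer to disk while still holding the lock. Every other operation
     * now queues up on log_lock behind its disk IO.
     */
    if (log_used + len > LOG_BUF_SIZE) {
        fwrite(log_buf, 1, log_used, log_fp);
        fflush(log_fp);
        log_used = 0;
    }

    /* Fast path: just copy the line into the in-memory buffer. */
    memcpy(log_buf + log_used, line, len);
    log_used += len;

    pthread_mutex_unlock(&log_lock);
}

static void *
worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        log_append("conn=1 op=2 SRCH base=\"dc=example,dc=com\" "
                   "scope=2 filter=\"(uid=someuser)\"\n");
    }
    return NULL;
}

int
main(void)
{
    pthread_t tids[NUM_WORKERS];

    log_fp = fopen("access-sketch.log", "w");
    if (log_fp == NULL) {
        return 1;
    }
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_create(&tids[i], NULL, worker, NULL);
    }
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(tids[i], NULL);
    }
    fclose(log_fp);
    return 0;
}

With the file on slow storage you can watch the shape of the problem: one worker sits in fwrite() under the lock, the rest stack up in pthread_mutex_lock(), and then everything surges forward at once.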
Temporary fixes are to lower the buffer size (so it is written in smaller, more frequent amounts), but really, the only options today are to put the log on a ramdisk or SSD so that the write "finishes faster", or to use the syslog interface with async logging enabled.

--
Sincerely,
William