Re: Future of nunc-stans

> On 9 Oct 2019, at 09:18, Rich Megginson <rmeggins@xxxxxxxxxx> wrote:
> 
> On 10/8/19 4:55 PM, William Brown wrote:
>> Hi everyone,
>> In our previous catch-up (about 4-5 weeks ago when I was visiting Matus/Simon), we talked about nunc-stans and getting it at least cleaned up and into the code base.
>> I've been looking at it again, and really thinking about it and reflecting on it and I have a lot of questions and ideas now.
>> The main question is *why* do we want it merged?
>> Is it performance? Recently I provided a patch that yielded approximately a 30% speed-up in overall server throughput just by changing our existing connection code.
>> Is it features? What features are we wanting from this? We have no complaints about our current threading model and thread allocations.
>> Is it maximum number of connections? We can always change the conntable to a better data structure that would help scale this number higher (which would also yield a performance gain).
> 
> It is mostly about the c10k problem, trying to figure out a way to use epoll, via an event framework like libevent, libev, or libtevent, but in a multi-threaded way (at the time none of those were really thread safe, or suitable for use in the way we do multi-threading in 389).
> 
> It wasn't about performance, although I hoped that using lock-free data structures might solve some of the performance issues around thread contention, and perhaps using a "proper" event framework might give us some performance boost e.g. the idle thread processing using libevent timeouts.  I think that using poll() is never going to scale as well as epoll() in some cases e.g. lots of concurrent connections, no matter what sort of data structure you use for the conntable.

The conntable was bottlenecking because when you had:

| conn | conn | freeslot | ....

it would attempt to lock each conn to see whether it was free or not. This meant that if a conn was in the middle of an I/O, we would block waiting for it to finish before we could move on and check the next conn. After changing to pthread, we can now do a trylock: if the trylock fails, the conn slot must be in use, so we skip it. This is how we got the recent 30% speed-up (my laptop went from about 4200 conn/sec to over 6000).
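
As a rough illustration (a simplified sketch, not the actual conntable code; the struct and field names here are hypothetical), the trylock scan looks something like this:

    #include <pthread.h>
    #include <stddef.h>

    typedef struct conn_slot {
        pthread_mutex_t lock;
        int in_use;                /* hypothetical "slot occupied" flag */
    } conn_slot_t;

    /* Return the index of a free slot, or -1 if none was found on this pass. */
    int
    find_free_slot(conn_slot_t *table, size_t size)
    {
        for (size_t i = 0; i < size; i++) {
            if (pthread_mutex_trylock(&table[i].lock) != 0) {
                /* Lock is held, so the slot must be servicing a connection: skip it. */
                continue;
            }
            if (!table[i].in_use) {
                /* Caller keeps the lock until the slot is initialised. */
                return (int)i;
            }
            pthread_mutex_unlock(&table[i].lock);
        }
        return -1;
    }

The point is that a busy slot now costs one failed trylock instead of a blocking wait behind someone else's I/O.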

However, the conntable exists to allow the conn struct to be reused over and over, and there are multiple ways we could solve this. On conn free, we could append the slot to a freelist so we never have to iterate the list to find one (sketched below). We could use a btreemap and just alloc/dealloc conns as needed, avoiding the need to walk and find conns at all (this would also help sanitizers catch memory issues). We could bring libevent directly into the main server and have it replace poll too.
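
The freelist idea could look something like this (again just a sketch with hypothetical types, not the real conntable API): freed slot indices go onto a stack guarded by one mutex, so picking up a slot is O(1) instead of a scan.

    #include <pthread.h>
    #include <stddef.h>

    typedef struct conn_freelist {
        pthread_mutex_t lock;
        size_t *slots;             /* stack of free slot indices */
        size_t count;              /* number of entries currently on the stack */
    } conn_freelist_t;

    /* Called when a connection slot is released. */
    void
    freelist_push(conn_freelist_t *fl, size_t slot)
    {
        pthread_mutex_lock(&fl->lock);
        fl->slots[fl->count++] = slot;
        pthread_mutex_unlock(&fl->lock);
    }

    /* Called when a new connection needs a slot; returns -1 when empty. */
    long
    freelist_pop(conn_freelist_t *fl)
    {
        long slot = -1;
        pthread_mutex_lock(&fl->lock);
        if (fl->count > 0) {
            slot = (long)fl->slots[--fl->count];
        }
        pthread_mutex_unlock(&fl->lock);
        return slot;
    }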

And even then, I think we should be using flamegraphs to work out where our real bottlenecks are.

> 
> As far as features goes - it would be nice to give plugins the ability to inject event requests, get timeout events, using the same framework as the main server engine.
> 
>> The more I have looked at the code, I guess with time and experience, the more hesitant I am to actually commit to merging it. It was designed by people who did not understand low-level concurrency issues and memory architectures of systems,
> 
> I resemble that remark.  I suppose you could "turn off" the lock-free code and use mutexes.

It wasn't meant as an insult, by the way - when I first looked I didn't understand either, so I resemble that remark too. There are plenty of my own mistakes all over that code.

We have since changed to mutexes, but even with those, the job model (how jobs queue, dequeue, and work along with libevent) still has issues, because you need to protect the jobs while they are in certain states. Having a job self-rearm was a nightmare problem. Because of the callback-driven nature of the code, we also had to add a monitor on each ns_job_t for it to work correctly.
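
Roughly, the monitor-per-job idea is something like this (a simplified sketch, not the actual nunc-stans state machine; the states and fields here are hypothetical):

    #include <pthread.h>

    typedef enum { JOB_WAITING, JOB_RUNNING, JOB_NEEDS_DELETE } job_state_t;

    typedef struct job {
        pthread_mutex_t monitor;   /* per-job monitor protecting state changes */
        job_state_t state;
        void (*callback)(struct job *);
    } job_t;

    void
    job_execute(job_t *j)
    {
        pthread_mutex_lock(&j->monitor);
        j->state = JOB_RUNNING;
        pthread_mutex_unlock(&j->monitor);

        /* The callback may ask to rearm or delete the job; it only flags the
         * intent, and the state machine below decides what actually happens. */
        j->callback(j);

        pthread_mutex_lock(&j->monitor);
        if (j->state == JOB_NEEDS_DELETE) {
            pthread_mutex_unlock(&j->monitor);
            /* ... tear the job down here, once no thread can still touch it ... */
            return;
        }
        j->state = JOB_WAITING;    /* rearm: wait for the next event */
        pthread_mutex_unlock(&j->monitor);
    }

Without something like that, a callback rearming its own job races with the event thread deciding whether the job is finished.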

When it was initially added there was a job per I/O event, but now it's a job per connection that lives for the life of the connection. This has led to the current situation where there is a 1:1 pairing of slapi_conn_t and ns_job_t, each with its own lock, and that has led to deadlocks ...

> 
>> so it's had a huge number of (difficult and subtle) unsafety issues. And while most of those are fixed, what it does is duplicating the connection structure from core 389,
> 
> It was supposed to eventually replace the connection code.

There is too much in the slapi_conn_t structure that can't be removed, though, and too much locking of that struct throughout the codebase. It's probably better to try to bring new ideas into slapi_conn_t than to try to replace it.

> 
>> leading to weird solutions like lock sharing and having to use monitors and more. We've tried a few times to push forward with this, but each time we end up with a lot of complexity and fragility.
>> So I'm currently thinking a better idea is to step back, re-evaluate what the problem is we are trying to solve for, then to solve *that*.
>> The question now is "what is the concern that ns would solve". From knowing that, then we can make a plan and approach it more constructively I think.
> 
> I agree.  There are probably better ways to solve the problems now.

Maybe even just smaller steps rather than one big-bang replacement :) 

> 
>> At the end of the day, I'm questioning if we should just rm -r src/nunc-stans and rethink this whole approach - there are just too many architectural flaws and limitations in ns that are causing us headaches.
>> Ideas and thoughts?
>> --
>> Sincerely,
>> William

--
Sincerely,

William Brown

Senior Software Engineer, 389 Directory Server
SUSE Labs
_______________________________________________
389-devel mailing list -- 389-devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to 389-devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-devel@xxxxxxxxxxxxxxxxxxxxxxx



