Hi, I'm reposting this to bring attention to these patches again. The complete set of these patches has survive a nunc-stans stress test, and the 24hour connection reliability test in directory server. https://fedorahosted.org/nunc-stans/ticket/56 https://fedorahosted.org/nunc-stans/attachment/ticket/56/0002-Ticket-56-undefined-reference-to-rpl_malloc.patch On fedora rawhide this causes the build to fail. We aren't changing the malloc impl, and we are allowing the over-ride to malloc via the thread pool malloc callbacks, so this check is not so important. https://fedorahosted.org/nunc-stans/ticket/51 https://fedorahosted.org/nunc-stans/attachment/ticket/51/0001-Ticket-51-Job-rearm-should-ignore-if-ns_persist-is-s.patch When a job is marked persistent, this means that after the event has occurred, and we dispatch work, the job itself stays armed. Armed meaning "accepting new work". In this case it is still registered in the event framework, and will "just work" to accept new work. However, if you rearm this job, it may be placed into the event_q again, and the event framework may or may not accept the insertion of the duplicated fd for events. At best, the event framework ignores it. At moderate, we trigger two events for one input. At worst, we explode. We should ignore rearm on persistent jobs as it is not needed! they stay armed. https://fedorahosted.org/nunc-stans/ticket/52 https://fedorahosted.org/nunc-stans/attachment/ticket/52/0004-Ticket-52-ns_job_modify-should-not-rearm.patch When we call ns_job_modify, IE to change a READ job to a WRITE job, it automatically rearms the job. Given that a job may have other alterations performed before or after the ns_job_modify, this causes issues. When the job is armed, it's moved to the queue through a set of lock free steps. Once queued, if you change the job, there is NO GUARANTEE about the atomicity of the change you make, whether it will see another thread or not. Worst, it will only partially be seen. It's certainly possible to cause an issue where a job_modify is run, then another thread dequeues the work, and begins to work on it, while the original thread is still changing it! Subtle race conditions. This behaviour is surprising. The function should do one thing, and one thing well: modifying the type. https://fedorahosted.org/nunc-stans/ticket/57 https://fedorahosted.org/nunc-stans/attachment/ticket/57/0001-Ticket-57-Implement-a-strict-state-machine-for-nunc-.2.patch Given the discussed issues above, to prevent and detect issues we must enforce that jobs are in certain states so that we can act upon them with reasonable behaviours. A job is created as WAITING. In the WAITING state, the job is NOT in a queue, and can be modified freely. When the job is rearmed, it moves to ARMED. If you attempt to modify the job, you will be denied. Only when the job is triggered IE dequeued, it moves to WAITING and the callback trigged. The job may now be modified and rearmed. A job that is ARMED cannot be re-armed. A job that is persistent, never leaves the ARMED state, IE cannot be modified. A job that is ARMED cannot be deleted unless the server is shutting down. Imagine if you free-ed a job as it was being dequeued and callback triggering. Segfault risk. A job that is WAITING can me sent to done, where it will be marked as NEEDS_DELETE. After wards, it will be moved to DELETED and freed. This allows us to prevent double frees (marking a job done twice), use after free (arming a done job), race conditions (modifying armed jobs), and many more. It gives us a solid base to assert and reason about the correctness of jobs. In the future, I will need to find a way to make it so that ARMED jobs *can* be deleted, so that we can delete persistent jobs during run time. The issue here is the way that lfds works with regards to barries of the memory, so I will need to study this further. I hope this makes the review process clearer, and the intent of these changes better. -- Sincerely, William Brown Software Engineer Red Hat, Brisbane
Attachment:
signature.asc
Description: This is a digitally signed message part
-- 389-devel mailing list 389-devel@xxxxxxxxxxxxxxxxxxxxxxx https://lists.fedoraproject.org/admin/lists/389-devel@xxxxxxxxxxxxxxxxxxxxxxx