On 07/06/2017 08:13 PM, William Brown wrote:
Historically speaking, a long time ago we used to see high CPU usage when the RI plugin was engaged. Setting the delay to 1 second and letting the log thread do the work improved performance. Of course, this is now obsolete with the betxn plugin model and other improvements, but I wanted to share why the feature exists at all.

On Thu, 2017-07-06 at 14:33 -0400, Mark Reynolds wrote:

On 07/06/2017 01:07 PM, Ilias Stamatis wrote:

Hello,

A desire had been expressed to get rid of the referint plugin's logfile: https://pagure.io/389-ds-base/issue/49202

It turns out that this file is used for more than real logging. The referint plugin currently works like this: when the update delay is set to more than 0, a new thread is created that executes the referential integrity code every x seconds (x being the update delay). When a delete or modrdn operation happens, the plugin writes it to its logfile. So every x seconds the plugin checks the logfile, sees what happened, and applies the changes. Finally, it deletes the file, clearing the state for the next time it reads from it.

After discussing this with William, he suggested it would be better to replace the file with a queue, since the file involves excess fsync / sync and has all kinds of potential state/race issues. Using a queue will be much faster as well.

William went even further and suggested that we could get rid of the async referint update completely. This probably wouldn't happen soon though, since customers are likely using it. For now we could provide a warning such as "we recommend you set delay to 0".

Finally, the referint-logchanges attribute does absolutely nothing. It seems to be completely ignored by the plugin, so we could remove it as well.

I'll start working on these changes soon. Any thoughts or objections on the above would be welcome.

The only problem with going to a queue is if the server goes down unexpectedly. In such a case those RI updates would be lost.

We already have this issue because there is a delay between the change to the object and the log being sync()'d to disk. So we can already lose changes here. TBH the only fix is to remove the async model.

I actually question why we still need async/delay processing of the refint plugin ...

This also brings up a different point... the RI plugin is a backend txn plugin. If we write changes to a log, and those changes end up failing for some reason, then there is no way to roll back the original transaction --> breaking the backend txn plugin model. Perhaps the log/delay should just be removed? Or ignore the log/delay settings if the plugin is set as a backend txn plugin?

Completely agree. Because of the delay, if we roll back the txn we still do the refint check. I would be fully in support of removing the delay option and going betxn for the plugin only.

This delay behaviour is the reason we advise you to run refint on only one master in a topology, whereas if we remove it and go betxn, we can run on all masters correctly. I think we would need to make the plugin ignore replicated ops then too.

My only concern would be what version to have this change land in - as much as I'm excited to make the change, we should be careful.

Perhaps we remove the delay processing, and have the "delay" process flag act as a switch to check incoming repl ops? Because today if you have delay > 0, you likely have refint on one master, so we need to refint incoming repl ops. If you have delay 0, you ignore repl ops because you assume all masters have refint?

No matter what, it's not a smooth upgrade process here, but I think long term it's nicer to just have it on "all masters".
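
For illustration only, the queue idea from the thread could look roughly like the sketch below. It is written in Python rather than the server's C, and every name in it (`DelayedReferintQueue`, `record`, `drain_once`, `apply_update`) is hypothetical and not taken from 389-ds-base. It assumes delete/modrdn operations are pushed onto an in-memory queue instead of being appended to the logfile, and that a worker thread drains the queue every `delay_seconds`.

```python
import queue
import threading


class DelayedReferintQueue:
    """Hypothetical in-memory stand-in for the referint logfile."""

    def __init__(self, delay_seconds, apply_update):
        self.delay_seconds = delay_seconds
        self.apply_update = apply_update  # callback that fixes up references
        self._pending = queue.Queue()
        self._stop = threading.Event()
        self._worker = None

    def record(self, op_type, dn, new_dn=None):
        """Called on delete/modrdn instead of writing to the logfile."""
        if self.delay_seconds == 0:
            # Delay 0: apply immediately, nothing is queued.
            self.apply_update(op_type, dn, new_dn)
        else:
            self._pending.put((op_type, dn, new_dn))

    def drain_once(self):
        """Apply everything queued so far; return the number applied."""
        applied = 0
        while True:
            try:
                op = self._pending.get_nowait()
            except queue.Empty:
                return applied
            self.apply_update(*op)
            applied += 1

    def start(self):
        """Start the periodic worker (only meaningful when delay > 0)."""
        if self.delay_seconds > 0 and self._worker is None:
            self._worker = threading.Thread(target=self._run, daemon=True)
            self._worker.start()

    def _run(self):
        # Event.wait() returns False on timeout and True once stop() is called.
        while not self._stop.wait(self.delay_seconds):
            self.drain_once()

    def stop(self):
        """Stop the worker and flush whatever is still queued."""
        self._stop.set()
        if self._worker is not None:
            self._worker.join()
            self._worker = None
        self.drain_once()
```

As noted in the thread, anything still sitting in such a queue is lost if the server goes down unexpectedly, and updates applied after the delay cannot roll back the original transaction. The sketch only shows the mechanics of replacing the file, not a fix for either problem.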
_______________________________________________ 389-devel mailing list -- 389-devel@xxxxxxxxxxxxxxxxxxxxxxx To unsubscribe send an email to 389-devel-leave@xxxxxxxxxxxxxxxxxxxxxxx