On Mon, Jan 27, 2025 at 02:35:26PM +0800, Yafang Shao wrote:
> The atomic replace livepatch mechanism was introduced to handle scenarios
> where we want to unload a specific livepatch without unloading others.
> However, its current implementation has significant shortcomings, making
> it less than ideal in practice. Below are the key downsides:
>
> - It is expensive
>
> During testing with frequent replacements of an old livepatch, random RCU
> warnings were observed:
>
> [19578271.779605] rcu_tasks_wait_gp: rcu_tasks grace period 642409 is 10024 jiffies old.
> [19578390.073790] rcu_tasks_wait_gp: rcu_tasks grace period 642417 is 10185 jiffies old.
> [19578423.034065] rcu_tasks_wait_gp: rcu_tasks grace period 642421 is 10150 jiffies old.
> [19578564.144591] rcu_tasks_wait_gp: rcu_tasks grace period 642449 is 10174 jiffies old.
> [19578601.064614] rcu_tasks_wait_gp: rcu_tasks grace period 642453 is 10168 jiffies old.
> [19578663.920123] rcu_tasks_wait_gp: rcu_tasks grace period 642469 is 10167 jiffies old.
> [19578872.990496] rcu_tasks_wait_gp: rcu_tasks grace period 642529 is 10215 jiffies old.
> [19578903.190292] rcu_tasks_wait_gp: rcu_tasks grace period 642529 is 40415 jiffies old.
> [19579017.965500] rcu_tasks_wait_gp: rcu_tasks grace period 642577 is 10174 jiffies old.
> [19579033.981425] rcu_tasks_wait_gp: rcu_tasks grace period 642581 is 10143 jiffies old.
> [19579153.092599] rcu_tasks_wait_gp: rcu_tasks grace period 642625 is 10188 jiffies old.
>
> This indicates that atomic replacement can cause performance issues,
> particularly with RCU synchronization under frequent use.

Why does this happen?

> - Potential Risks During Replacement
>
> One known issue involves replacing livepatched versions of critical
> functions such as do_exit(). During the replacement process, a panic
> might occur, as highlighted in [0]. Other potential risks may also arise
> due to inconsistencies or race conditions during transitions.

That needs to be fixed.

> - Temporary Loss of Patching
>
> During the replacement process, the old patch is set to a NOP (no-operation)
> before the new patch is fully applied. This creates a window where the
> function temporarily reverts to its original, unpatched state. If the old
> patch fixed a critical issue (e.g., one that prevented a system panic), the
> system could become vulnerable to that issue during the transition.

Are you saying that atomic replace is not atomic? If so, this sounds like
another bug.

--
Josh
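
For readers following the thread, here is a minimal sketch of what an
atomic-replace (cumulative) livepatch module looks like, modeled on the
kernel's samples/livepatch/livepatch-sample.c; the patched function and
replacement body are illustrative only. The key detail is the .replace
flag, which is what triggers the transition discussed above:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/livepatch.h>
#include <linux/seq_file.h>

/* Replacement body for the patched function (illustrative). */
static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
{
	seq_printf(m, "%s\n", "this has been live patched");
	return 0;
}

static struct klp_func funcs[] = {
	{
		.old_name = "cmdline_proc_show",
		.new_func = livepatch_cmdline_proc_show,
	}, { }
};

static struct klp_object objs[] = {
	{
		/* A NULL name means the patched object is vmlinux. */
		.funcs = funcs,
	}, { }
};

static struct klp_patch patch = {
	.mod = THIS_MODULE,
	.objs = objs,
	/*
	 * Atomic replace: enabling this patch disables and supersedes
	 * every previously applied livepatch in a single transition.
	 */
	.replace = true,
};

static int livepatch_init(void)
{
	return klp_enable_patch(&patch);
}

static void livepatch_exit(void)
{
}

module_init(livepatch_init);
module_exit(livepatch_exit);
MODULE_LICENSE("GPL");
MODULE_INFO(livepatch, "Y");

As I understand the transition, functions covered by the old patch but
absent from the new one are handled via dynamically allocated nop
entries while tasks migrate, which is the window the "Temporary Loss of
Patching" point above is questioning.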