Re: [RFC PATCH 2/2] livepatch: implement 'sticky' patches

Nicolai Stange <nstange@xxxxxxx> · Wed, 13 Feb 2019 08:46:02 +0100

Joe Lawrence <joe.lawrence@xxxxxxxxxx> writes:

> On 2/12/19 11:08 AM, Petr Mladek wrote:
>> On Tue 2019-02-12 10:02:19, Joe Lawrence wrote:
>>> On 2/12/19 6:39 AM, Petr Mladek wrote:
>>>> On Mon 2019-02-11 11:05:17, Joe Lawrence wrote:
>
> [ ... snip ... ]
>
>>>>> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
>>>>> index cc120b5018e1..50bf39ec2a9d 100644
>>>>> --- a/kernel/livepatch/core.c
>>>>> +++ b/kernel/livepatch/core.c
>>>>> @@ -380,7 +380,7 @@ static ssize_t enabled_store(struct kobject *kobj, struct kobj_attribute *attr,
>>>>>    	 */
>>>>>    	if (patch == klp_transition_patch)
>>>>>    		klp_reverse_transition();
>>>>> -	else if (!enabled)
>>>>> +	else if (!enabled && !patch->sticky)
>>>>>    		ret = __klp_disable_patch(patch);
>>>>
>>>> Hmm, I could imagine that even the revert would not be safe.
>>>> A problematic operations might be done already by pre-patch
>>>> callbacks or by already patched threads.
>>>
>>> This part of the policy was an incidental side effect of where I decided to
>>> place the final update.  I would agree that reversing the transition may not
>>> be safe... however, I was afraid of nailing down a livepatch that wasn't
>>> disable-able but never finishes its transition.  Wouldn't we be dead in the
>>> water with respect to the state machine?  ie, how to escape that condition?
>>
>> I see the point.
>>
>
> [ ... snip ... ]
>
>>>> If the transition is safe then the problematic
>>>> operation need to be done by a callback anyway.
>>>
>>> Not sure I follow.  Are you saying that if the transition can be safely
>>> reversed, problematic operations will be handled by unpatch callback(s)?
>>
>> I mean something else. If a migrated thread starts doing anything that
>> could not get simply reverted then klp_reverse_transition() is not
>> safe.
>
> Agreed.
>
>> If it is safe to revert transition before klp_transition_patch
>> is cleared in klp_complete_transition() then the hard-to-revert
>> operation must be done by some other patch-specific operations.
>> And the only patch-specific operation called from
>> klp_complete_transition() is the post-patch callback.
>
> Ah I see.
>
>> By other words, there are two situations:
>>
>>     + transition is problematic => the patch is sticky once
>> 	the transition starts and revert must be disabled
>>
>>     + post_patch callback is problematic => sticky bit could
>> 	get set later by the post_patch callback.
>>
>> Does this make any sense?
>
> Yes, I got it now.  That is an interesting distinction (pre-patch
> callback may be safe, but post-patch callback not).
>
> If it wasn't for the stuck-in-transition corner case I mentioned at
> the top, I would say we shouldn't allow any transition reversal for
> these patches.  But admittedly, I haven't thought through all the
> cases.  My gut says that dangerous pre-patch callbacks are less likely
> than post-patch ones.  What do you think?

To amend the discussion, let me outline the scheme used in the real world
example of L1TF again:

- the post patch callback would set some flag internal to the livepatch,
  say "do_incompatible_stuff"

- and the patched swap PTE writers would have

  if (!do_incompatible_stuff)
    /* original behaviour */
  else
    /* incompatible behaviour */

So as long as the post patch handler has not been run, reversing the
transition would be safe in this case.

For Petr's concern about cases where the transition itself might be
problematic: perhaps the live patch author can always "convert" those
into the second case by introducing some kind of artificial boolean flag
enabling the incompatible behaviour? This would not work in situations
where the task-based consistency model is needed, i.e. when there would
be multiple evaluations of do_incompatible_stuff in different functions
on the same stack and these rely on each other being consistent. But I
think that's an artifical situation (read: due to my lack of
imagination, I can't think of any where this would matter).

To pre-patch: as I see it, it's useful for two things in this context:
- preallocating things, because it is allowed to fail,
  but post-patch is not and
- inspect the "do_incompatible_stuff" state of the previous live patch:
  if we're migrating from a previous L1TF live patch, i.e. the situation
  your patch 1/2 addresses, then the next live patch must use the
  new semantics right from the beginning, not only after its post-patch
  has been called.

So in this example, stickiness is needed from the point when post-patch
starts to run. In particular, reversing the transition would be safe
here...

Thanks,

Nicolai