Re: [PATCH v9] livepatch: Clear relocation targets on a module removal

Song Liu <song@xxxxxxxxxx> · Fri, 20 Jan 2023 11:41:02 -0800

On Fri, Jan 20, 2023 at 11:16 AM Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>
> On Wed, Jan 18, 2023 at 12:47:28PM -0800, Song Liu wrote:
> > From: Miroslav Benes <mbenes@xxxxxxx>
> >
> > Josh reported a bug:
> >
> >   When the object to be patched is a module, and that module is
> >   rmmod'ed and reloaded, it fails to load with:
> >
> >   module: x86/modules: Skipping invalid relocation target, existing value is nonzero for type 2, loc 00000000ba0302e9, val ffffffffa03e293c
> >   livepatch: failed to initialize patch 'livepatch_nfsd' for module 'nfsd' (-8)
> >   livepatch: patch 'livepatch_nfsd' failed for module 'nfsd', refusing to load module 'nfsd'
> >
> >   The livepatch module has a relocation which references a symbol
> >   in the _previous_ loading of nfsd. When apply_relocate_add()
> >   tries to replace the old relocation with a new one, it sees that
> >   the previous one is nonzero and it errors out.
>
> Should we add a selftest to make sure this problem doesn't come back?

IIRC, a selftest for this issue is not easy without Joe's klp-convert work.
At the moment I use kpatch-build for testing.

>
> >   On ppc64le, we have a similar issue:
> >
> >   module_64: livepatch_nfsd: Expected nop after call, got e8410018 at e_show+0x60/0x548 [livepatch_nfsd]
> >   livepatch: failed to initialize patch 'livepatch_nfsd' for module 'nfsd' (-8)
> >   livepatch: patch 'livepatch_nfsd' failed for module 'nfsd', refusing to load module 'nfsd'
>
> Just to clarify my previous comment, the reference to the powerpc issue
> should be removed because it'll be fixed in a separate patch.

Will remove.

>
> > He also proposed three different solutions. We could remove the error
> > check in apply_relocate_add() introduced by commit eda9cec4c9a1
> > ("x86/module: Detect and skip invalid relocations"). However the check
> > is useful for detecting corrupted modules.
> >
> > We could also deny the patched modules to be removed. If it proved to be
> > a major drawback for users, we could still implement a different
> > approach. The solution would also complicate the existing code a lot.
> >
> > We thus decided to reverse the relocation patching (clear all relocation
> > targets on x86_64). The solution is not
> > universal and is too much arch-specific, but it may prove to be simpler
> > in the end.
> >
> > Signed-off-by: Miroslav Benes <mbenes@xxxxxxx>
> > Signed-off-by: Song Liu <song@xxxxxxxxxx>
> > Acked-by: Miroslav Benes <mbenes@xxxxxxx>
> > Co-developed-by: Song Liu <song@xxxxxxxxxx>
> > Reported-by: Josh Poimboeuf <jpoimboe@xxxxxxxxxx>
>
> According to submitting-patches.rst the Co-developed-by is supposed to
> be immediately before your Signed-off-by.
>
> Also, other than the commit log, this patch is almost completely
> unrecognizable compared to Miroslav's original patch.  Does it still
> make sense for him to be listed as the author?
>
> In the -tip tree they sometimes use an Originally-by tag, which might be
> relevant here.

How about:

Signed-off-by: Song Liu <song@xxxxxxxxxx>
Originally-by: Miroslav Benes <mbenes@xxxxxxx>
Acked-by: Miroslav Benes <mbenes@xxxxxxx>
Reported-by: Josh Poimboeuf <jpoimboe@xxxxxxxxxx>

>
> > -static int __apply_relocate_add(Elf64_Shdr *sechdrs,
> > +static int __write_relocate_add(Elf64_Shdr *sechdrs,
> >                  const char *strtab,
> >                  unsigned int symindex,
> >                  unsigned int relsec,
> >                  struct module *me,
> > -                void *(*write)(void *dest, const void *src, size_t len))
> > +                void *(*write)(void *dest, const void *src, size_t len),
> > +                bool apply)
> >  {
> >       unsigned int i;
> >       Elf64_Rela *rel = (void *)sechdrs[relsec].sh_addr;
> >       Elf64_Sym *sym;
> >       void *loc;
> >       u64 val;
> > +     u64 zero = 0ULL;
> >
> > -     DEBUGP("Applying relocate section %u to %u\n",
> > +     DEBUGP("%s relocate section %u to %u\n",
> > +            (apply) ? "Applying" : "Clearing",
>
> "(apply)" has unnecessary parentheses.
>
> >              relsec, sechdrs[relsec].sh_info);
> >       for (i = 0; i < sechdrs[relsec].sh_size / sizeof(*rel); i++) {
> > +             int write_size = 4;
>
> We already know we're writing, I suggest just calling it 'size' to be
> more consistent with kernel style.
>
> Also, it can be an unsigned value like size_t.
>
> Also, instead of initializing it here with a potentially bogus value
> which may need to be overwritten for a 64-bit reloc, it's better to
> explicitly set the size in each individual case below.  That's makes the
> logic clearer and helps prevent future bugs if new 64-bit relocation
> types are added later.

Will fix in v10.

>
> >               case R_X86_64_PC32:
> >               case R_X86_64_PLT32:
> > -                     if (*(u32 *)loc != 0)
> > -                             goto invalid_relocation;
> >                       val -= (u64)loc;
> > -                     write(loc, &val, 4);
> >  #if 0
> > -                     if ((s64)val != *(s32 *)loc)
> > +                     if ((s64)val != *(s32 *)&val)
> >                               goto overflow;
> >  #endif
>
> Why is the compiled-out code getting changed?  Is it actually fixing a
> hypothetical bug?

I just changed them the same way as other cases (assuming we remove
the #if 0, of course).

>
> This code really needs to be removed anyway, it's been dead for at least
> 15 years.

Shall we remove it now? Within the same patch? Or with a preparation
patch?

>
> >                       break;
> >               case R_X86_64_PC64:
> > -                     if (*(u64 *)loc != 0)
> > -                             goto invalid_relocation;
> >                       val -= (u64)loc;
> > -                     write(loc, &val, 8);
> > +                     write_size = 8;
> >                       break;
> >               default:
> >                       pr_err("%s: Unknown rela relocation: %llu\n",
> >                              me->name, ELF64_R_TYPE(rel[i].r_info));
> >                       return -ENOEXEC;
> >               }
> > +
> > +             if (apply) {
> > +                     if (memcmp(loc, &zero, write_size)) {
> > +                             pr_err("x86/modules: Skipping invalid relocation target, existing value is nonzero for type %d, loc %p, val %Lx\n",
>
> It's not just "skipping", it's erroring out completely.  Yes, it's is a
> pre-existing error message but we might as well improve it while
> touching it.
>
> Maybe just remove the "Skipping", i.e. change "Skipping invalid ..." to
> "Invalid ..." ?

Sounds good to me.

>
> > +                                    (int)ELF64_R_TYPE(rel[i].r_info), loc, val);
> > +                             return -ENOEXEC;
> > +                     }
> > +                     write(loc, &val, write_size);
> > +             } else {
> > +                     if (memcmp(loc, &val, write_size)) {
> > +                             pr_warn("x86/modules: Clearing invalid relocation target, existing value does not match expected value for type %d, loc %p, val %Lx\n",
> > +                                     (int)ELF64_R_TYPE(rel[i].r_info), loc, val);
> > +                     }
> > +                     write(loc, &zero, write_size);
>
> If the value doesn't match then something has gone badly wrong.  Why go
> ahead with the clearing in that case?

We can pr_err() then return -ENOEXEC (?). But I guess we need to
handle the error case in:
  klp_cleanup_module_patches_limited()
  klp_module_coming()
  klp_module_going()
and all the functions that call klp_module_going().

This seems a big overkill to me...

Or do you mean we just skip the write()?

>
> > +#ifdef CONFIG_LIVEPATCH
> > +
> > +void clear_relocate_add(Elf64_Shdr *sechdrs,
> > +                     const char *strtab,
> > +                     unsigned int symindex,
> > +                     unsigned int relsec,
> > +                     struct module *me)
> > +{
> > +     write_relocate_add(sechdrs, strtab, symindex, relsec, me, false);
> > +}
> > +#endif
>
> Superflous newline after the '#ifdef CONFIG_LIVEPATCH'.
>
> > +
> >  #endif
> >
> >  int module_finalize(const Elf_Ehdr *hdr,
> > diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> > index 7b4587a19189..0b54ec9856df 100644
> > --- a/include/linux/moduleloader.h
> > +++ b/include/linux/moduleloader.h
> > @@ -75,6 +75,13 @@ int apply_relocate_add(Elf_Shdr *sechdrs,
> >                      unsigned int symindex,
> >                      unsigned int relsec,
> >                      struct module *mod);
> > +#ifdef CONFIG_LIVEPATCH
>
> This could use a comment explaining the purpose of this function:
>
> /*
>  * Some architectures (namely x86_64 and ppc64) perform sanity checks when
>  * applying relocations.  If a patched module gets unloaded and then later
>  * reloaded (and re-patched), klp re-applies relocations to the replacement
>  * function(s).  Any leftover relocations from the previous loading of the
>  * patched module might trigger the sanity checks.
>  *
>  * To prevent that, when unloading a patched module, clear out any relocations
>  * that might trigger arch-specific sanity checks on a future module reload.
>  */

Sounds great. Will add this to the next version.

>
> > +void clear_relocate_add(Elf_Shdr *sechdrs,
> > +                const char *strtab,
> > +                unsigned int symindex,
> > +                unsigned int relsec,
> > +                struct module *me);
> > +#endif
>
>
> --
> Josh