Re: [PATCH] iov_iter: separate direction from flavour

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Sun, 4 Jul 2021 11:31:45 -0700

[ Added Christoph, since the issue technically goes further back than
when the warning appeared - it just used to be silent ]

On Sun, Jul 4, 2021 at 10:29 AM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
>
> This patch results in the following runtime warning on nommu systems.

Ok, good, it actually found something.

> [    8.574191] [<21059cab>] (vfs_read) from [<2105d92b>] (read_code+0x15/0x2e)
> [    8.574329] [<2105d92b>] (read_code) from [<21085a8d>] (load_flat_file+0x341/0x4f0)
> [    8.574481] [<21085a8d>] (load_flat_file) from [<21085e03>] (load_flat_binary+0x47/0x2dc)
> [    8.574639] [<21085e03>] (load_flat_binary) from [<2105d581>] (bprm_execve+0x1fd/0x32c)

Hmm. That actually loads things into user space, so the problem isn't
that it shouldn't use vfs_read() or that iov_iter_init() would be
doing somethign wrong - the problem appears purely to be that we're in
an "uaccess_kernel()" region.

And yes, we're still in the early init code:

> [    8.574797] [<2105d581>] (bprm_execve) from [<2105dbb3>] (kernel_execve+0xa3/0xac)
> [    8.574947] [<2105dbb3>] (kernel_execve) from [<211e7095>] (kernel_init+0x31/0xb0)
> [    8.575099] [<211e7095>] (kernel_init) from [<2100814d>] (ret_from_fork+0x11/0x24)

which presumably runs with KERNEL_DS.

Which is kind of bogus in the new world order.

None of this *matters* for a nommu system, of course, which is why
that code used to work, and why it's now warning.

But for the same reason, it should still continue to work, despite the
warning. Because iov_iter_init() will actually be doing the right
thing and making it all about user pointers.

Can you verify that it otherwise looks ok apart from the new warning?

I *think* we should move to initializing the kernel state to
"set_fs(USER_DS)", and that would be the right model these days.

Of course, that could cause other things to pop up on architectures
that haven't been converted away from CONFIG_SET_FS.

The safer thing might be to move it earlier in kernel_execve(): it
does end up doing that "set_fs(USER_DS)" eventually, but it's done
fairly late in "begin_new_exec()" (and it's done as a
force_uaccess_begin(), not set_fs(), but in a CONFIG_SET_FS
configuration that ends up being what it does.

> The same warning is also observed with m68k and mcf5208evb,
> though the traceback isn't as nice.

Hmm. Either the m68k trace printing is just bad, or maybe it just
doesn't have CONFIG_KALLSYMS (or KALLSYMS_ALL) enabled.

Anyway, does a hacky patch something like this

   diff --git a/fs/exec.c b/fs/exec.c
   index 38f63451b928..26293bd7c502 100644
   --- a/fs/exec.c
   +++ b/fs/exec.c
   @@ -1934,6 +1934,10 @@ int kernel_execve(const char *kernel_filename,
        int fd = AT_FDCWD;
        int retval;

   +    // Make sure CONFIG_SET_FS architectures actually
   +    // do things into user space.
   +    force_uaccess_begin();
   +
        filename = getname_kernel(kernel_filename);
        if (IS_ERR(filename))
                return PTR_ERR(filename);

make the warning go away? I really would like to set USER_DS even
earlier, but this might be the least disruptive place.

Anything that accesses kernel space should *not* depend on KERNEL_DS
at this point, since that would make all the properly converted
architectures already fail.

And any architectures that haven't been converted away from
CONFIG_SET_FS would have been hitting that force_uaccess_begin() later
in  begin_new_exec(), so they can't depend on any KERNEL_DS games
after kernel_execve() either.

So that one-liner is hacky, but *feels* safe to me. Am I perhaps
missing something?

It probably means we could remove the force_uaccess_begin() in
begin_new_exec() entirely, but let's first see if the one-liner at
least makes the warning go away.

           Linus