Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Sat, 4 Jul 2020 15:57:38 +0900

On 2020/07/04 7:25, Eric W. Biederman wrote:
> Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> writes:
> 
>> On 2020/07/02 22:08, Eric W. Biederman wrote:
>>>> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
>>>> that use of flush_delayed_fput() has to be careful. Al, is it safe to call
>>>> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be
>>>> called from both kernel thread and from process context (e.g. init_module()
>>>> syscall by /sbin/insmod )) ?
>>>
>>> And __fput_sync needs to be even more careful.
>>> umd_load_blob is called in these changes without any locks held.
>>
>> But where is the guarantee that a thread which called flush_delayed_fput() waits for
>> the completion of processing _all_ "struct file" linked into delayed_fput_list ?
>> If some other thread or delayed_fput_work (scheduled by fput_many()) called
>> flush_delayed_fput() between blob_to_mnt()'s fput(file) and flush_delayed_fput()
>> sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" which
>> needs to be processed before execve(), can't it?
> 
> As a module the guarantee is we call task_work_run.

No. It is possible that blob_to_mnt() is called by a kernel thread which was
started by init_module() syscall by /sbin/insmod .

> Built into the kernel the guarantee as best I can trace it is that
> kthreadd hasn't started, and as such nothing that is scheduled has run
> yet.

Have you ever checked how early the kthreadd (PID=2) gets started?

----------

--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2306,6 +2306,7 @@ static __latent_entropy struct task_struct *copy_process(
        trace_task_newtask(p, clone_flags);
        uprobe_copy_process(p, clone_flags);

+       printk(KERN_INFO "Created PID: %u Comm: %s\n", p->pid, p->comm);
        return p;

 bad_fork_cancel_cgroup:
----------

----------
[    0.090757][    T0] pid_max: default: 65536 minimum: 512
[    0.090890][    T0] LSM: Security Framework initializing
[    0.090890][    T0] Mount-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    0.090890][    T0] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    0.090890][    T0] Disabled fast string operations
[    0.090890][    T0] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 1024
[    0.090890][    T0] Last level dTLB entries: 4KB 1024, 2MB 1024, 4MB 1024, 1GB 4
[    0.090890][    T0] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[    0.090890][    T0] Spectre V2 : Spectre mitigation: kernel not compiled with retpoline; no mitigation available!
[    0.090890][    T0] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[    0.090890][    T0] SRBDS: Unknown: Dependent on hypervisor status
[    0.090890][    T0] MDS: Mitigation: Clear CPU buffers
[    0.090890][    T0] Freeing SMP alternatives memory: 24K
[    0.090890][    T0] Created PID: 1 Comm: swapper/0
[    0.090890][    T0] Created PID: 2 Comm: swapper/0
[    0.090890][    T1] smpboot: CPU0: Intel(R) Core(TM) i5-4440S CPU @ 2.80GHz (family: 0x6, model: 0x3c, stepping: 0x3)
[    0.091000][    T2] Created PID: 3 Comm: kthreadd
[    0.091995][    T2] Created PID: 4 Comm: kthreadd
[    0.093028][    T2] Created PID: 5 Comm: kthreadd
[    0.093997][    T2] Created PID: 6 Comm: kthreadd
[    0.094995][    T2] Created PID: 7 Comm: kthreadd
[    0.096037][    T2] Created PID: 8 Comm: kthreadd
(...snipped...)
[    0.135716][    T2] Created PID: 13 Comm: kthreadd
[    0.135716][    T1] smp: Bringing up secondary CPUs ...
[    0.135716][    T2] Created PID: 14 Comm: kthreadd
[    0.135716][    T2] Created PID: 15 Comm: kthreadd
[    0.135716][    T2] Created PID: 16 Comm: kthreadd
[    0.135716][    T2] Created PID: 17 Comm: kthreadd
[    0.135716][    T2] Created PID: 18 Comm: kthreadd
[    0.135716][    T1] x86: Booting SMP configuration:
(...snipped...)
[    0.901990][    T1] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
[    0.902145][    T1] pci 0000:00:0f.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[    0.902213][    T1] pci 0000:02:00.0: CLS mismatch (32 != 64), using 64 bytes
[    0.902224][    T1] Trying to unpack rootfs image as initramfs...
[    1.107993][    T1] Freeing initrd memory: 18876K
[    1.109049][    T1] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[    1.111003][    T1] software IO TLB: mapped [mem 0xab000000-0xaf000000] (64MB)
[    1.112136][    T1] check: Scanning for low memory corruption every 60 seconds
[    1.115040][    T2] Created PID: 52 Comm: kthreadd
[    1.116110][    T1] workingset: timestamp_bits=46 max_order=20 bucket_order=0
[    1.120936][    T1] SGI XFS with ACLs, security attributes, verbose warnings, quota, no debug enabled
[    1.129626][    T2] Created PID: 53 Comm: kthreadd
[    1.131403][    T2] Created PID: 54 Comm: kthreadd
----------

kthreadd (PID=2) is created by swapper/0 (PID=0) immediately after init (PID=1) was created by
swapper/0 (PID=0). It is even before secondary CPUs are brought up, and far earlier than unpacking
initramfs.

And how can we prove that blob_to_mnt() is only called by a kernel thread before some kernel
thread that interferes fput() starts running? blob_to_mnt() needs to be prepared for being
called after many processes already started running.

> 
>> Also, I don't know how convoluted the dependency of all "struct file" linked into
>> delayed_fput_list might be, for there can be "struct file" which will not be a
>> simple close of tmpfs file created by blob_to_mnt()'s file_open_root() request.
>>
>> On the other hand, although __fput_sync() cannot be called from !PF_KTHREAD threads,
>> there is a guarantee that __fput_sync() waits for the completion of "struct file"
>> which needs to be flushed before execve(), isn't there?
> 
> There is really not a good helper or helpers, and this code suggests we
> have something better.  Right now I have used the existing helpers to
> the best of my ability.  If you or someone else wants to write a better
> version of flushing so that exec can happen be my guest.
> 
> As far as I can tell what I have is good enough.

Just saying what you think is not a "review". I'm waiting for answer from Al Viro
because I consider that Al will be the most familiar with fput()'s behavior.
At least I consider that

	if (current->flags & PF_KTHREAD) {
		__fput_sync(file);
	} else {
		fput(file);
		task_work_run();
	}

is a candidate for closing the race window. And depending on Al's answer,
removing

	BUG_ON(!(task->flags & PF_KTHREAD));

 from __fput_sync() and unconditionally using

	__fput_sync(file);

 from blob_to_mnt() might be the better choice. Anyway, I consider that
Al's response is important for this "review".

> 
>>> We fundamentally AKA in any correct version of this code need to flush
>>> the file descriptor before we call exec or exec can not open it a
>>> read-only denying all writes from any other opens.
>>>
>>> The use case of flush_delayed_fput is exactly the same as that used
>>> when loading the initramfs.
>>
>> When loading the initramfs, the number of threads is quite few (which
>> means that the possibility of hitting the race window and convoluted
>> dependency is small).
> 
> But the reality is the code run very early, before the initramfs is
> initialized in practice.

Such expectation is not a reality.

> 
>> But like EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s
>> flush_delayed_fput() might be called after many number of threads already
>> started running.
> 
> At which point the code probably won't be runnig from a kernel thread
> but instead will be running in a thread where task_work_run is relevant.

No. It is possible that blob_to_mnt() is called by a kernel thread which was
started by init_module() syscall by /sbin/insmod .

> 
> At worst it is a very small race, where someone else in another thread
> starts flushing the file.  Which means the file could still be
> completely close before exec.   Even that is not necessarily fatal,
> as the usermode driver code has a respawn capability.
> 
> Code that is used enough that it hits that race sounds like a very
> good problem to have from the perspective of the usermode driver code.

In general, unconditionally retrying call_usermodehelper() when it returned
a negative value (e.g. -ENOENT, -ENOMEM, -EBUSY) is bad. I don't know which
code is an implementation of "a respawn capability"; I'd like to check where
that code is and whether that code is checking -ETXTBSY.