Re: [PATCH] Avoid that SCSI device removal through sysfs triggers a deadlock

Bart Van Assche <bart.vanassche@xxxxxxxxxxx> · Tue, 8 Nov 2016 15:33:33 -0800

On 11/08/2016 11:15 AM, Eric W. Biederman wrote:
James Bottomley <jejb@xxxxxxxxxxxxxxxxxx> writes:

On Tue, 2016-11-08 at 08:52 -0800, Bart Van Assche wrote:
On 11/08/2016 07:28 AM, James Bottomley wrote:
On Mon, 2016-11-07 at 16:32 -0800, Bart Van Assche wrote:

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index cf4c636..44ec536 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1410,7 +1410,7 @@ int kernfs_remove_by_name_ns(struct
kernfs_node
*parent, const char *name,
 	mutex_lock(&kernfs_mutex);

 	kn = kernfs_find_ns(parent, name, ns);
-	if (kn)
+	if (kn && !(kn->flags & KERNFS_SUICIDED))

Actually, wrong flag, you need KERNFS_SUICIDAL.  The reason is that
kernfs_mutex is actually dropped half way through __kernfs_remove,
so KERNFS_SUICIDED is not set atomically with this mutex.

Hello James,

Sorry but what you wrote is not correct.

I think you agree it is dropped.  I don't need to add the bit about the
reacquisition because the race is mediated by the first acquisition not
the second one, if you mediate on KERNFS_SUICIDAL, you only need to
worry about this because the mediation is in the first acquisition.  If
you mediate on KERNFS_SUICIDED, you need to explain that the final
thing that means the race can't happen is the unbreak in the sysfs
delete path re-acquiring s_active ... the explanation of what's going
on and why gets about 2x more complex.

Is it really the dropping of the lock that is causing this?
I don't see that when I read those traces.

I am going to put my vote in for doing all of the self removal
sem-asynchronously by using task_work_add, and killing
device_remove_self, sysfs_remove_self, kernfs_remove_self.  Using
task_work_add remains synchronous with userspace so userspace should not
care, and we get the benefit of not having two different variants of the
same code racing with each other.

It might take a little more work but will leave code that is much more
maintainable in the long run.

Hello Eric,

I'm completely in favor of keeping code maintainable. But what's not 
clear to me is whether asynchronous I/O can be submitted to a sysfs 
attribute? If so, on what context will task work queued through 
task_work_add() be executed if an aio write is used to write into a 
sysfs attribute?

Additionally, can a process enter the exiting state after writing into 
the sysfs delete attribute started and before task_work_add() has been 
called? I'm asking this because in that case task_work_add() will fail.

Thanks,

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html