Re: [PATCH] tcmu: fix crash for dereferencing the released udev->mb_addr memory

Xiubo Li <xiubli@xxxxxxxxxx> · Sat, 21 Jul 2018 10:21:46 +0800

On 2018/7/20 23:13, Mike Christie wrote:
On 07/19/2018 07:34 PM, Xiubo Li wrote:
On 2018/7/19 23:49, Mike Christie wrote:
On 07/19/2018 09:30 AM, xiubli@xxxxxxxxxx wrote:
From: Xiubo Li <xiubli@xxxxxxxxxx>

The logs are:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
IP: [<ffffffffc072b9a9>] tcmu_reset_ring_store+0x149/0x240 [target_core_user]
PGD 800000000e254067 PUD e255067 PMD 0
Oops: 0002 [#1] SMP
[...]
CPU: 0 PID: 36077 Comm: tcmu-runner Kdump: loaded Not tainted 3.10.0-924.el7.test.x86_64 #1
It is not clear to me how you hit this. It's not a RHEL only bug is it?
Are you sure you are hitting it during device removal?
Yes, not only the RHEL bug, for the RHEL we have sync the related code
to the upstream. Before as I remembered I have hit one sudden crash by
using the SCSI repo code when testing the tcmu-runner PR#402:
tcmu-runner: add reset netlink support.

The SCSI repo is :
git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git

Since the kdump service was not working for the upstream code in my
setups that time, so could only get very simple logs. This time I was
using Prasanna's script and RHEL code with the netlink and UIO crash
fixing patches to reproduce this. The test script please see the attachment.

If we are calling tcmu_reset_ring_store isn't there a refcount on the
item? So we cannot call tcmu_destroy_device -> tcmu_dev_kref_release
until after that has been released, so they would never run at the same
time. And, if userspace were to try to open/write to that reset file
after the rmdir has been done on the se_device then configfs detects
this and fails the open.
With the tcmu-runner PR#402, when starting tcmu-runner the daemon will do :
1, block netlink
2, reset netlink events
3, block device --> reset ring --> open device --> unblock device
If the device was only created and not yet enabled you will hit the
crash here.

If in your test you kill runner while a enable is in process then this
would happen.

As you mentioned it is possible only created then runner is killed.

4, unblock netlink


Currently there may has one bug in systemd service, because we hit one
odd issue by using the test script:
When executing 'systemctl restart tcmu-runner', this process seems
waiting one signal from somewhere by stucking itself, if we do nothing
It will wait for about 6 ~20 minutes(the time comes from our test
result) to recovery from that. And we can execute 'systemctl restart
gluster-blockd' in another terminal, which could stop the stuck state
above immediately.

[root@localhost ~]# ps -aux | grep -e tcmu -e
target
root      2021  0.0  0.0      0     0 ?        S<   15:09   0:00
[target_completi]
root      2386  0.6  0.5 226232 11096 ?        Ds   15:11   0:00
/usr/bin/python /usr/bin/targetctl clear
root      2400  0.0  0.0 134816  1316 pts/0    S+   15:11   0:00
systemctl restart tcmu-runner                        root      2415
0.0  0.0 112704   988 pts/1    R+   15:11   0:00 grep --color=auto -e
tcmu -e target
[root@localhost ~]# cat
/proc/2400/stack

[<ffffffff9d0548f5>]
poll_schedule_timeout+0x55/0xb0

[<ffffffff9d05607d>]
do_sys_poll+0x48d/0x590

[<ffffffff9d0564f3>]
SyS_ppoll+0x1d3/0x1f0

[<ffffffff9d56cdef>]
system_call_fastpath+0x1c/0x21


And with IOs going on, when killing the tcmu-runner process, the
Are you just doing a test to kill runner at various times or is your
restart of your service doing this while the targetcli clear operation
is running?
There are 3 related services, and here '-->' means will Bindto and depend to
gluster-blockd.service --> gluster-block-target.service --> tcmu-runner 
service

I was running the script as I have attached in last mail and in another 
terminal running the 'systemctl restart gluster-blockd' cmd when this 
crash is hit. Because as I mentioned without the 'systemctl restart 
gluster-blockd' in another terminal, the script will be stuck in 
'systemctl restart tcmu-runner' for a very long time.

When resarting the gluster-blockd, it will start the tcmu-runner and 
gluster-block-target first.

And it is possible that the targetcli clear operation is stuck somewhere 
then the runner is restarted.

If you re killing it at various times, how do you know you are not
killing it during a enable operation and hitting this bug there?
It is very possible.


If you are killing on purpose as part of your restart operation why are
you killing it when the targetcli clear operation needs the backends to
complete their IO as part of the tpg disablement process.

Just after runner service is killed then the gluster-block-target 
service will be killed too, and it will invoke the 'targetctl clear' 
when stopping.

We are testing the netlink reset and the ring reset feature, so the 
script is killing the runner daemon to emulate the D state cases. And 
the script also emulating the crash case of the tcmu-runner daemon for 
some reasons.


'targetctl clear' process will stuck in [1] first. Then we restart the
tcmu-runner, which will reset ring with accessing the udev->mb_addr,
currently we still could access the
/sys/kernel/config/target/core/user_XX/*. And while the ring reset is
still going on, the stucked 'targetctl clear' process will be woken up
and then tries to delete the device which will kfree(udev->mb_addr),
here any accessing to  /sys/kernel/config/target/core/user_XX/* will stuck.
The rtslib clear operation deletes the target first and that disables
the tpg which waits for outstanding commands. It then finishes tearing
down the objects under the target, and then the backend devices are deleted.

If targetcli clear is unhung from the tpg disable operation then
tcmu-runner has restarted and has started to clean up the kernel state.
So runner has reset the ring and failed running IO right (this is how
the tpg disable operation will see all outstanding commands completed
and procceed)? We then proceed and can do a rmdir on the tcmu device. So
I am saying the only way tpg disablement has completed when there is
outstanding IO is if reset ring has completed first.

Are you certain you have not killed a runner in the middle of a
enable/configure event at some point.
No, I am not sure, because the reason I mentioned above, it is very 
possible the tcmu-runner is killed in the middle of a enable/configure 
event.

Then runner has restarted and
killed that nl operation leaving the uncongigured device. It then runs
reset ring which fails outstanding commands and accessed the unsetup
mb_addr and hits the crash here. When the outstanding commands have
completed the rtslib operation is also unhung from the tpg disable
completes, and we proceed to the backend device deletion.

There are two stuck case:
[1], If there has any uncompleted IO or cmd, 'targetctl clear' process
will stuck in iscsit_free_session() -->
iscsit_cause_connection_reinstatement() first, then after all the IOs or
cmds completed, it will stuck in [2].

[2], If there has no uncompleted IOs or cmds, 'targetctl clear' process
will stuck in tcmu_destroy_device() --> tcmu_netlink_event() -->
wait_for_completion().

  

I think we can hit your bug during initialization. We do not setup the
mb_addr until the device is configured, but the configfs files are
exported to userspace at device creation time. So between those times,
userspace could be writing to the reset file and hitting this bug. Is
that maybe what you hit?

If that is the bug, we need to grab the cmdr_lock in
tcmu_configure_device when we are setting up the mb_addr and in the
failure path in that function. In tcmu_reset_ring then I think we also
need a check for a NULL mb_addr.

I checked all the related code and find that all the mb_addr accessing
are protect by the cmdr_lock except in tcmu_dev_kref_release() after the
device initialized.

Today I have reproduced this again.

If you can reproduce it can you debug it a little more to figure out
what is going on exactly? If you cannot get a kdump, then at least stick
in a printk in tcmu_dev_kref_release and tcmu_reset_ring to confirm that
the release is run before the reset. Also stick in a printk to the
tcmu_configure_device function to verify it is always succeeding.

The problem is that if you can hit the bug you describe where configfs
is accessing the device after the device has had a rmdir done on it then
a lot of the configfs code will have a possible bug. For the se_device
tcmu_dev_kref_release is only run when the se_device's configfs item's
refcount has gone to zero (target_core_dev_release -> target_free_device
-> target_free_device -> tcmu_free_device), so the se_device/tcmu_device
could also be freed at any time while configfs is operation is running
if you are right and your patch is not going to help that.
Sure, I will debug this with some printk messages. But not sure could 
reproduce it in time, it may take days or weeks.

Because the first time to hit this is many weeks ago and this week twice.

And if you are agree with current fixing, let's push this first and I 
will post the V2 with some other fixing.
Then I will continue to reproduce this by adding some printk messages to 
make sure there has no problem with the configfs code, if it's really 
has bugs, that means the three times crash may have different root 
causes. Then let's fix it ?

Thanks Mike.
BRs


Thanks，
BRs
Xiubo

diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
index 847707a..8d7274e 100644
--- a/drivers/target/target_core_user.c
+++ b/drivers/target/target_core_user.c
@@ -1587,16 +1587,16 @@ static void tcmu_dev_kref_release(struct kref *kref)
  	bool all_expired = true;
  	int i;
  
-	vfree(udev->mb_addr);
-	udev->mb_addr = NULL;
-
  	spin_lock_bh(&timed_out_udevs_lock);
  	if (!list_empty(&udev->timedout_entry))
  		list_del(&udev->timedout_entry);
  	spin_unlock_bh(&timed_out_udevs_lock);
  
-	/* Upper layer should drain all requests before calling this */
  	mutex_lock(&udev->cmdr_lock);
+	vfree(udev->mb_addr);
+	udev->mb_addr = NULL;
+
+	/* Upper layer should drain all requests before calling this */
  	idr_for_each_entry(&udev->commands, cmd, i) {
  		if (tcmu_check_and_free_pending_cmd(cmd) != 0)
  			all_expired = false;

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html