Re: [PATCH] tcmu: fix crash for dereferencing the released udev->mb_addr memory

Mike Christie <mchristi@xxxxxxxxxx> · Fri, 20 Jul 2018 10:13:53 -0500

On 07/19/2018 07:34 PM, Xiubo Li wrote:
> On 2018/7/19 23:49, Mike Christie wrote:
>> On 07/19/2018 09:30 AM, xiubli@xxxxxxxxxx wrote:
>>> From: Xiubo Li <xiubli@xxxxxxxxxx>
>>>
>>> The logs are:
>>>
>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
>>> IP: [<ffffffffc072b9a9>] tcmu_reset_ring_store+0x149/0x240 [target_core_user]
>>> PGD 800000000e254067 PUD e255067 PMD 0
>>> Oops: 0002 [#1] SMP
>>> [...]
>>> CPU: 0 PID: 36077 Comm: tcmu-runner Kdump: loaded Not tainted 3.10.0-924.el7.test.x86_64 #1
>>
>> It is not clear to me how you hit this. It's not a RHEL only bug is it?
>> Are you sure you are hitting it during device removal?
> Yes, not only the RHEL bug, for the RHEL we have sync the related code
> to the upstream. Before as I remembered I have hit one sudden crash by
> using the SCSI repo code when testing the tcmu-runner PR#402:
> tcmu-runner: add reset netlink support.
> 
> The SCSI repo is :
> git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git
> 
> Since the kdump service was not working for the upstream code in my
> setups that time, so could only get very simple logs. This time I was
> using Prasanna's script and RHEL code with the netlink and UIO crash
> fixing patches to reproduce this. The test script please see the attachment.
> 
>> If we are calling tcmu_reset_ring_store isn't there a refcount on the
>> item? So we cannot call tcmu_destroy_device -> tcmu_dev_kref_release
>> until after that has been released, so they would never run at the same
>> time. And, if userspace were to try to open/write to that reset file
>> after the rmdir has been done on the se_device then configfs detects
>> this and fails the open.
> With the tcmu-runner PR#402, when starting tcmu-runner the daemon will do :
> 1, block netlink
> 2, reset netlink events
> 3, block device --> reset ring --> open device --> unblock device

If the device was only created and not yet enabled you will hit the
crash here.

If in your test you kill runner while a enable is in process then this
would happen.

> 4, unblock netlink
> 
> 
> Currently there may has one bug in systemd service, because we hit one
> odd issue by using the test script:
> When executing 'systemctl restart tcmu-runner', this process seems
> waiting one signal from somewhere by stucking itself, if we do nothing
> It will wait for about 6 ~20 minutes(the time comes from our test
> result) to recovery from that. And we can execute 'systemctl restart
> gluster-blockd' in another terminal, which could stop the stuck state
> above immediately.
> 
> [root@localhost ~]# ps -aux | grep -e tcmu -e
> target                                                                 
> root      2021  0.0  0.0      0     0 ?        S<   15:09   0:00
> [target_completi]                                   
> root      2386  0.6  0.5 226232 11096 ?        Ds   15:11   0:00
> /usr/bin/python /usr/bin/targetctl clear            
> root      2400  0.0  0.0 134816  1316 pts/0    S+   15:11   0:00
> systemctl restart tcmu-runner                        root      2415 
> 0.0  0.0 112704   988 pts/1    R+   15:11   0:00 grep --color=auto -e
> tcmu -e target
> [root@localhost ~]# cat
> /proc/2400/stack                                                                             
> 
> [<ffffffff9d0548f5>]
> poll_schedule_timeout+0x55/0xb0                                                                 
> 
> [<ffffffff9d05607d>]
> do_sys_poll+0x48d/0x590                                                                         
> 
> [<ffffffff9d0564f3>]
> SyS_ppoll+0x1d3/0x1f0                                                                           
> 
> [<ffffffff9d56cdef>]
> system_call_fastpath+0x1c/0x21                                                                  
> 
> 
> And with IOs going on, when killing the tcmu-runner process, the

Are you just doing a test to kill runner at various times or is your
restart of your service doing this while the targetcli clear operation
is running?

If you re killing it at various times, how do you know you are not
killing it during a enable operation and hitting this bug there?

If you are killing on purpose as part of your restart operation why are
you killing it when the targetcli clear operation needs the backends to
complete their IO as part of the tpg disablement process.

> 'targetctl clear' process will stuck in [1] first. Then we restart the
> tcmu-runner, which will reset ring with accessing the udev->mb_addr,
> currently we still could access the 
> /sys/kernel/config/target/core/user_XX/*. And while the ring reset is
> still going on, the stucked 'targetctl clear' process will be woken up
> and then tries to delete the device which will kfree(udev->mb_addr),
> here any accessing to  /sys/kernel/config/target/core/user_XX/* will stuck.

The rtslib clear operation deletes the target first and that disables
the tpg which waits for outstanding commands. It then finishes tearing
down the objects under the target, and then the backend devices are deleted.

If targetcli clear is unhung from the tpg disable operation then
tcmu-runner has restarted and has started to clean up the kernel state.
So runner has reset the ring and failed running IO right (this is how
the tpg disable operation will see all outstanding commands completed
and procceed)? We then proceed and can do a rmdir on the tcmu device. So
I am saying the only way tpg disablement has completed when there is
outstanding IO is if reset ring has completed first.

Are you certain you have not killed a runner in the middle of a
enable/configure event at some point. Then runner has restarted and
killed that nl operation leaving the uncongigured device. It then runs
reset ring which fails outstanding commands and accessed the unsetup
mb_addr and hits the crash here. When the outstanding commands have
completed the rtslib operation is also unhung from the tpg disable
completes, and we proceed to the backend device deletion.

> 
> There are two stuck case:
> [1], If there has any uncompleted IO or cmd, 'targetctl clear' process
> will stuck in iscsit_free_session() -->
> iscsit_cause_connection_reinstatement() first, then after all the IOs or
> cmds completed, it will stuck in [2].
> 
> [2], If there has no uncompleted IOs or cmds, 'targetctl clear' process
> will stuck in tcmu_destroy_device() --> tcmu_netlink_event() -->
> wait_for_completion().
> 
>  
> 
>> I think we can hit your bug during initialization. We do not setup the
>> mb_addr until the device is configured, but the configfs files are
>> exported to userspace at device creation time. So between those times,
>> userspace could be writing to the reset file and hitting this bug. Is
>> that maybe what you hit?
>>
>> If that is the bug, we need to grab the cmdr_lock in
>> tcmu_configure_device when we are setting up the mb_addr and in the
>> failure path in that function. In tcmu_reset_ring then I think we also
>> need a check for a NULL mb_addr.
>>
> I checked all the related code and find that all the mb_addr accessing
> are protect by the cmdr_lock except in tcmu_dev_kref_release() after the
> device initialized.
> 
> Today I have reproduced this again.
> 

If you can reproduce it can you debug it a little more to figure out
what is going on exactly? If you cannot get a kdump, then at least stick
in a printk in tcmu_dev_kref_release and tcmu_reset_ring to confirm that
the release is run before the reset. Also stick in a printk to the
tcmu_configure_device function to verify it is always succeeding.

The problem is that if you can hit the bug you describe where configfs
is accessing the device after the device has had a rmdir done on it then
a lot of the configfs code will have a possible bug. For the se_device
tcmu_dev_kref_release is only run when the se_device's configfs item's
refcount has gone to zero (target_core_dev_release -> target_free_device
-> target_free_device -> tcmu_free_device), so the se_device/tcmu_device
could also be freed at any time while configfs is operation is running
if you are right and your patch is not going to help that.

> Thanks，
> BRs
> Xiubo
> 
>>
>>> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
>>> index 847707a..8d7274e 100644
>>> --- a/drivers/target/target_core_user.c
>>> +++ b/drivers/target/target_core_user.c
>>> @@ -1587,16 +1587,16 @@ static void tcmu_dev_kref_release(struct kref *kref)
>>>  	bool all_expired = true;
>>>  	int i;
>>>  
>>> -	vfree(udev->mb_addr);
>>> -	udev->mb_addr = NULL;
>>> -
>>>  	spin_lock_bh(&timed_out_udevs_lock);
>>>  	if (!list_empty(&udev->timedout_entry))
>>>  		list_del(&udev->timedout_entry);
>>>  	spin_unlock_bh(&timed_out_udevs_lock);
>>>  
>>> -	/* Upper layer should drain all requests before calling this */
>>>  	mutex_lock(&udev->cmdr_lock);
>>> +	vfree(udev->mb_addr);
>>> +	udev->mb_addr = NULL;
>>> +
>>> +	/* Upper layer should drain all requests before calling this */
>>>  	idr_for_each_entry(&udev->commands, cmd, i) {
>>>  		if (tcmu_check_and_free_pending_cmd(cmd) != 0)
>>>  			all_expired = false;
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe target-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
>