Re: 2.6.16-rc1 crash in scsi_target_reap_work

Brian King <brking@xxxxxxxxxx> · Wed, 22 Feb 2006 08:38:03 -0600

Olaf Hering wrote:
>  On Mon, Feb 20, Brian King wrote:
> 
>> Olaf Hering wrote:
>>> 1:mon> d c0000000024cacc8
>>> c0000000024cacc8 00000000dead4ead ffffffff00000000  |......N.........|
>>> c0000000024cacd8 ffffffffffffffff c0000000024cace0  |.............L..|
>>> c0000000024cace8 c0000000024cace0 c000000000614f68  |.....L.......aOh|
>>> c0000000024cacf8 c000000000614f38 0000000000000000  |.....aO8........|
>>> c0000000024cad08 0000000000000000 0000000000000000  |................|
>>> c0000000024cad18 0000000000000000 0000000000000000  |................|
>>> c0000000024cad28 0000000000000000 0000000000000000  |................|
>>> c0000000024cad38 0000000000000000 0000000000000000  |................|
>>> c0000000024cad48 0000000000000000 0000000000000000  |................|
>>> c0000000024cad58 0000000000000000 0000000000000000  |................|
>>> c0000000024cad68 0000000000000000 0000000000000000  |................|
>>> c0000000024cad78 0000000000000000 0000000000000000  |................|
>>> c0000000024cad88 0000000000000000 0000000000000000  |................|
>>> c0000000024cad98 0000000000000000 0000000000000000  |................|
>>> c0000000024cada8 0000000000000000 0000000000000000  |................|
>>> c0000000024cadb8 0000000000000000 0000000000000000  |................|
>>> c0000000024cadc8 0000000000000000 0000000000000000  |................|
>>> c0000000024cadd8 0000000000000000 0000000000000000  |................|
>> I've now seen a couple recreates of this problem on various systems in
>> our labs, and there are always a bunch of zeroes in the struct device
>> in the same place as above. I wonder if perhaps the call to device_add
>> is failing in scsi_alloc_target. Failure of this call is not being handled
>> today. Can you give the attached patch a try? 
> 
> This fixes it, tested with plain rc3. Lots of -EEXIST, I wonder if the real bug is elsewhere.

I would guess that the -EEXIST is coming from:

create_dir
sysfs_create_dir
create_dir
kobject_add
device_add

Looking at the scsi_target reap code, it looks like there is a race condition. The
target is removed from the host's list of targets under the host lock, then the host
lock is released. If another thread tries to add the same target that is being
torn down at this point (before device_del), the device_add will fail with -EEXIST
since the sysfs directory for the device still exists.

Any reason we can't protect the target reaping code from this by grabbing the 
scan_mutex?
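
To make the window concrete, here is a rough userspace model of the interleaving
described above. It is only a sketch: struct toy_host and the toy_* functions are
made-up names, not kernel APIs, the scan_mutex is modeled as a plain flag, and
-EBUSY stands in for "the adder would sleep on the mutex here".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the pieces of host state involved in the race. */
struct toy_host {
	bool target_on_list;   /* target is linked on the host's list of targets */
	bool sysfs_dir_exists; /* the target's sysfs directory is still present */
	bool scan_mutex_held;  /* models the scan_mutex as a simple flag */
};

/* Reap step 1: unlink the target (done under the host lock in the real code). */
static void toy_reap_unlink(struct toy_host *h)
{
	assert(h->target_on_list);
	h->target_on_list = false;
}

/* Reap step 2: models device_del, which removes the sysfs directory. */
static void toy_reap_del(struct toy_host *h)
{
	h->sysfs_dir_exists = false;
}

/* Add path: reuse a listed target, otherwise try to create its directory. */
static int toy_add_target(struct toy_host *h, bool honor_scan_mutex)
{
	if (honor_scan_mutex && h->scan_mutex_held)
		return -EBUSY;  /* real code would sleep until the mutex is free */
	if (h->target_on_list)
		return 0;       /* existing target found, nothing to create */
	if (h->sysfs_dir_exists)
		return -EEXIST; /* models device_add failing on the stale directory */
	h->sysfs_dir_exists = true;
	h->target_on_list = true;
	return 0;
}

/* Replay: an add arrives between the two reap steps. */
static int toy_replay(bool use_scan_mutex)
{
	struct toy_host h = {
		.target_on_list = true,
		.sysfs_dir_exists = true,
		.scan_mutex_held = false,
	};
	int ret;

	h.scan_mutex_held = use_scan_mutex;       /* reaper takes the mutex, if modeled */
	toy_reap_unlink(&h);
	ret = toy_add_target(&h, use_scan_mutex); /* adder shows up in the window */
	toy_reap_del(&h);
	h.scan_mutex_held = false;                /* reaper is done */

	if (ret == -EBUSY)                        /* adder wakes up and retries */
		ret = toy_add_target(&h, use_scan_mutex);
	return ret;
}
```

In this model toy_replay(false) returns -EEXIST, since the add sees the target
gone from the list while its directory is still there, and toy_replay(true)
returns 0, since the add cannot run until both reap steps have finished. Whether
the real reap path can safely take the scan_mutex is exactly the open question.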

Brian

-- 
Brian King
eServer Storage I/O
IBM Linux Technology Center