Re: Hotplug

Jens Axboe <axboe@xxxxxxxxx> · Wed, 11 Apr 2018 10:20:07 -0600

On 4/11/18 10:14 AM, Jan Kara wrote:
> Hello,
> 
> On Wed 11-04-18 08:11:13, Jens Axboe wrote:
>> On 4/11/18 7:58 AM, Jan Kara wrote:
>>> On Tue 10-04-18 11:17:46, Jens Axboe wrote:
>>>> Been running some tests and I keep running into issues with hotplug.
>>>> This looks similar to what Bart posted the other day, but it looks
>>>> more deeply rooted than just having to protect the queue in
>>>> generic_make_request_checks(). The test run is blktests,
>>>> block/001. Current -git doesn't survive it. I've seen at least two
>>>> different oopses, pasted below.
>>>>
>>>> [  102.163442] NULL pointer dereference at 0000000000000010
>>>> [  102.163444] PGD 0 P4D 0 
>>>> [  102.163447] Oops: 0000 [#1] PREEMPT SMP
>>>> [  102.163449] Modules linked in:
>>>> [  102.175540] sr 12:0:0:0: [sr2] scsi-1 drive
>>>> [  102.180112]  scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme nvme_core sb_edac xl
>>>> [  102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2
>>>> [  102.191896]  sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash lzo_compress zlib_defc
>>>> [  102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5
>>>> [  102.203475]  igb ahci libahci i2c_algo_bit libata dca [last unloaded: crc_t10dif]
>>>> [  102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650
>>>> [  102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
>>>> [  102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod]
>>>> [  102.358299] RSP: 0018:ffff883ff357bb58 EFLAGS: 00010292
>>>> [  102.364734] RAX: ffffffffa00b07d0 RBX: ffff883ff3058000 RCX: ffff883ff357bb66
>>>> [  102.373220] RDX: 0000000000000003 RSI: 0000000000007530 RDI: ffff881fea631000
>>>> [  102.381705] RBP: 0000000000000000 R08: ffff881fe4d38400 R09: 0000000000000000
>>>> [  102.390185] R10: 0000000000000000 R11: 00000000000001b6 R12: 000000000800005d
>>>> [  102.398671] R13: 000000000800005d R14: ffff883ffd9b3790 R15: 0000000000000000
>>>> [  102.407156] FS:  00007f7dc8e6d8c0(0000) GS:ffff883fff340000(0000) knlGS:0000000000000000
>>>> [  102.417138] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [  102.424066] CR2: 0000000000000010 CR3: 0000003ffda98005 CR4: 00000000003606e0
>>>> [  102.432545] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> [  102.441024] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> [  102.449502] Call Trace:
>>>> [  102.452744]  ? __invalidate_device+0x48/0x60
>>>> [  102.458022]  check_disk_change+0x4c/0x60
>>>> [  102.462900]  sr_block_open+0x16/0xd0 [sr_mod]
>>>> [  102.468270]  __blkdev_get+0xb9/0x450
>>>> [  102.472774]  ? iget5_locked+0x1c0/0x1e0
>>>> [  102.477568]  blkdev_get+0x11e/0x320
>>>> [  102.481969]  ? bdget+0x11d/0x150
>>>> [  102.486083]  ? _raw_spin_unlock+0xa/0x20
>>>> [  102.490968]  ? bd_acquire+0xc0/0xc0
>>>> [  102.495368]  do_dentry_open+0x1b0/0x320
>>>> [  102.500159]  ? inode_permission+0x24/0xc0
>>>> [  102.505140]  path_openat+0x4e6/0x1420
>>>> [  102.509741]  ? cpumask_any_but+0x1f/0x40
>>>> [  102.514630]  ? flush_tlb_mm_range+0xa0/0x120
>>>> [  102.519903]  do_filp_open+0x8c/0xf0
>>>> [  102.524305]  ? __seccomp_filter+0x28/0x230
>>>> [  102.529389]  ? _raw_spin_unlock+0xa/0x20
>>>> [  102.534283]  ? __handle_mm_fault+0x7d6/0x9b0
>>>> [  102.539559]  ? list_lru_add+0xa8/0xc0
>>>> [  102.544157]  ? _raw_spin_unlock+0xa/0x20
>>>> [  102.549047]  ? __alloc_fd+0xaf/0x160
>>>> [  102.553549]  ? do_sys_open+0x1a6/0x230
>>>> [  102.558244]  do_sys_open+0x1a6/0x230
>>>> [  102.562742]  do_syscall_64+0x5a/0x100
>>>> [  102.567336]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>>
>>> Interesting. Thinking out loud: This is cd->device dereference I guess
>>> which means disk->private_data was NULL. That gets set in sr_probe()
>>> together with disk->fops which are certainly set as they must have led us
>>> to the crashing function sr_block_revalidate_disk(). So likely
>>> disk->private_data got already cleared. That happens in sr_kref_release()
>>> and the fact that that function got called means struct scsi_cd went away -
>>> so sr_remove() must have been called as well. That all seems possible like:
>>>
>>> CPU1		CPU2
>>> sr_probe()
>>> 		__blkdev_get()
>>> 		  disk = bdev_get_gendisk();
>>> <device removed>
>>> sr_remove()
>>>   del_gendisk()
>>>   ...
>>>   kref_put(&cd->kref, sr_kref_release);
>>>     disk->private_data = NULL;
>>>     put_disk(disk);
>>>     kfree(cd);
>>> 		  if (disk->fops->open) {
>>> 		    ret = disk->fops->open(bdev, mode); => sr_block_open
>>> 		      check_disk_change(bdev);
>>> 		        sr_block_revalidate_disk()
>>> 			  CRASH
>>>
>>> And I think the problem is in sr_block_revalidate_disk() itself as the
>>> scsi_cd() call is not guaranteed that the caller holds reference to 'cd'
>>> and thus that 'cd' does not disappear under it. IMHO it needs to use
>>> scsi_cd_get() to get struct scsi_cd from gendisk. Am I missing something?
>>
>> No I think you are correct, from the revalidate path it should grab/release
>> a reference. Looks like sr_block_check_events() needs the same treatment.
>> How about the below?
>  
> Yeah, that looks good to me except one thing:
> 
>> diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
>> index 0cf25d789d05..3f3cb72e0c0c 100644
>> --- a/drivers/scsi/sr.c
>> +++ b/drivers/scsi/sr.c
>>  static int sr_block_revalidate_disk(struct gendisk *disk)
>>  {
>> -	struct scsi_cd *cd = scsi_cd(disk);
>>  	struct scsi_sense_hdr sshdr;
>> +	struct scsi_cd *cd;
>> +
>> +	cd = scsi_cd_get(disk);
>> +	if (!cd)
>> +		return -ENXIO;
> 
> So if this returns -ENXIO I somewhat wonder about the code in
> fs/block_dev.c: revalidate_disk(). Why do we do all the work there and then
> return -ENXIO?

That's a good (but separate) question. I'd have to look deeper, but
it's entirely possible that revalidate could fail for other reasons
and we still want to ensure we reset the size of the device, for
instance.

-- 
Jens Axboe