On Dec 27 Axel Theilmann wrote:
> On 12/25/2011 09:58 PM, Stefan Richter wrote:
[...
>>> On December 15, 2011 at 1:59 AM, Axel Theilmann <theilmann@xxxxxxxxxxxx> wrote:
>>>> two weeks ago Huajun Li posted a patch for a kernel oops, subject
>>>> "[PATCH] SCSI/sd: Fix NULL dereference in sd_revalidate_disk".
>>>>
>>>> The patch was discussed but considered "clearly wrong". The bug shows
>>>> up for us in kernel 3.1.4 quite often when unplugging usb sticks and
>>>> it seems a few other people have the same problem:
>>>>
>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/859199
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=754518
>>>> https://bugzilla.novell.com/show_bug.cgi?id=722350
>>>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=649735
>>>>
>>>> Can anyone of you maybe give me any status update on that bug?
...]
> > as far as I remember, all Linux releases in 2011 have been broken WRT hot
> > removal of block devices; some more severely, some less. Various patches
> > for this went in over the year, but if they fixed anything, they always
> > uncovered the next lingering unplug related bug. The presumed first Linux
>
> so now there are 2 known NULL-pointer problems in the cd-rom code and one in
> the scsi-disk code.

The two CD-ROM related traces which I posted seem to indicate a bug in the
interplay between the block layer's and the SCSI core's lifetime management,
rather than in the cd-rom code in particular. When I get the time, I will
try the "1. open(), 2. remove device, 3. ioctl()" sequence on an sd_mod
device instead of an sr_mod one and see where this goes.

> Would a complete fix for this issue be a question of locating all the
> possible NULL-pointers and fixing them or do you think that the hotplug
> problem has to be fixed on a more "fundamental" level?

I don't know what my two traces tell us about what in particular is broken
and where to attack the problem.
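
For illustration, the "1. open(), 2. remove device, 3. ioctl()" sequence
can be sketched from user space roughly as below. This is only a minimal
sketch, not the program behind the posted traces: the function name
open_remove_ioctl, the wait_for_removal callback, and the choice of
CDROM_DRIVE_STATUS as the ioctl are assumptions made for the example. On an
affected kernel, running it against a real device and then pulling that
device may hang or panic the machine.

```python
import fcntl
import os

# Constants from <linux/cdrom.h>; picking this particular ioctl is an
# assumption for the sketch, any ioctl that reaches the driver would do.
CDROM_DRIVE_STATUS = 0x5326
CDSL_CURRENT = 2**31 - 1  # INT_MAX, meaning "the currently selected slot"


def open_remove_ioctl(path, wait_for_removal,
                      request=CDROM_DRIVE_STATUS, arg=CDSL_CURRENT):
    """Hypothetical helper: open a device, wait, then issue one ioctl.

    Returns (result, None) if the ioctl succeeds, or (None, errno) if the
    kernel rejects it cleanly, which is what a healthy kernel should do
    once the device is gone.
    """
    # Step 1: open(). O_NONBLOCK lets the open succeed without a medium.
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    try:
        # Step 2: the caller removes the device while we hold the fd open.
        wait_for_removal()
        # Step 3: ioctl() on the file descriptor of the removed device.
        try:
            return fcntl.ioctl(fd, request, arg), None
        except OSError as exc:
            return None, exc.errno
    finally:
        os.close(fd)
```

A call such as open_remove_ioctl("/dev/sr0", lambda: input("remove the
device, then press Enter")) would pause between steps 1 and 3; the device
path is likewise only an example.
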
In case of the sd_revalidate_disk oops from the thread which I hijacked
(which refers to "[PATCH] SCSI/sd: Fix NULL dereference in
sd_revalidate_disk", http://thread.gmane.org/gmane.linux.scsi/71174), the
trouble is that nobody came up with an answer to James' question of how it
could happen in the first place that sd_revalidate_disk(disk) is called on
a disk that leads to a NULL scsi_disk. In turn, this presumably means,
among other things, that my earlier question --- what prevents the
scsi_disk from going invalid slightly after that newly added NULL pointer
check --- cannot be answered yet either.

However, I do think that the pitiful state of block device unplugging for
roughly a whole year does indicate a fundamental problem. But I am only
familiar with one of the SCSI transport layer drivers, not with the kernel
layers above, so what do I know.

> Even if there is a more fundamental problem below that has to be fixed, it
> would still be nice to get in fixes for the dereferences that are currently
> known to keep peoples systems from crashing.
>
> We built a kernel with Huajun's patch included and will do some tests to see
> if the problem goes away (and no others show up).

AFAIU it is not clear whether this patch actually prevents dereference of
an invalid sdkp or only makes it considerably less likely. In either case,
since there is apparently an underlying issue that this patch does not
address, it is a judgment call whether such a patch is allowed into a
kernel --- distributor kernel or mainline kernel. If somebody takes it,
then at least a FIXME comment should be put there that sd_revalidate_disk
is supposed to rely on an always valid sdkp.

> > With a little bit of bad luck, udisks-daemon or in older distros hald
> > should hit the bug too. Under kernel 3.1 I typically just got processes
> > hanging in unkillable sleep. With kernel 3.2-rc7 I get an instant kernel
> > panic.
>
> Yes, udisks is what probably triggers the bug for us.
> People removing USB media before udisks is finished initializing the
> medium. With kernel 3.1.4 we get instant kernel panics as well.
>
> tty, axel

Sounds like both "your" and "my" bug occur at the end of the sequence
1. open(), 2. remove device, 3. ioctl() or whatever, though perhaps with
the extra twist in your case that this has to happen before the device
bring-up was entirely finished...? In my CD-ROM related case the bug is
not timing-sensitive at all; it always happens with the above sequence.
-- 
Stefan Richter
-=====-==-== ==-- ==-==
http://arcgraph.de/sr/