Hi Alan, I applied the latest patch, changing the unlinking behavior of EHCI. I tested quite extensively and I could not see the issue (as expected). At a certain point I lost 3 HDDs, but this is something I saw before (it happens very seldom, maybe future debugging) and it was not really locking the EHCI driver. Another thing is that the read performance was around about 10% slower that the usual one. I saw 3700~3800 KB/sec per HDD, while, under normal conditions, I saw 4100~4200 KB/sec per HDD. I think that's all, how do you want to proceed now? Please note that shortly I'll not be anymore with my PC, until January, so we can take it easy. In this respect, I wish you Merry Christmas and Happy New Year. bye, pg On Mon, Dec 17, 2012 at 10:57:48AM -0500, Alan Stern wrote: > On Sun, 16 Dec 2012, Piergiorgio Sartor wrote: > > > Hi Alan, > > > > I applied the modifications you suggested, this is the output: > > > > [ 110.922009] ehci_hcd 0000:00:0b.1: async off > > [ 209.362134] ehci_hcd 0000:00:0b.1: async on > > [ 241.794760] ehci_hcd 0000:00:0b.1: alan start cur time 4294908992 last scan 4294878606 > > [ 241.794774] ehci_hcd 0000:00:0b.1: command 10035 cmd reg 10035 status reg e008 > > [ 241.794779] ehci_hcd 0000:00:0b.1: async list: > > [ 241.794787] ehci_hcd 0000:00:0b.1: qh ffff880149fe7800 hw ffff8800af800000 dma af800000 next ffff880146ce3f80 hw_next af800c62 > > [ 241.794795] ehci_hcd 0000:00:0b.1: qh ffff880146ce3f80 hw ffff8800af800c60 dma af800c60 next ffff880146ce3780 hw_next af800d22 > > [ 241.794801] ehci_hcd 0000:00:0b.1: qh ffff880146ce3780 hw ffff8800af800d20 dma af800d20 next ffff880147776680 hw_next af800ae2 > > [ 241.794808] ehci_hcd 0000:00:0b.1: qh ffff880147776680 hw ffff8800af800ae0 dma af800ae0 next ffff8801478bf380 hw_next af800a22 > > [ 241.794815] ehci_hcd 0000:00:0b.1: qh ffff8801478bf380 hw ffff8800af800a20 dma af800a20 next ffff88014070da00 hw_next af800f02 > > [ 241.794821] ehci_hcd 0000:00:0b.1: qh ffff88014070da00 hw ffff8800af800f00 dma af800f00 next ffff8801407d3800 hw_next af800de2 > > [ 241.794828] ehci_hcd 0000:00:0b.1: qh ffff8801407d3800 hw ffff8800af800de0 dma af800de0 next ffff8801478bfc80 hw_next af800782 > > [ 241.794834] ehci_hcd 0000:00:0b.1: qh ffff8801478bfc80 hw ffff8800af800780 dma af800780 next ffff88014079e480 hw_next af800ba2 > > [ 241.794841] ehci_hcd 0000:00:0b.1: qh ffff88014079e480 hw ffff8800af800ba0 dma af800ba0 next ffff8801478bf780 hw_next af8008a2 > > [ 241.794847] ehci_hcd 0000:00:0b.1: qh ffff8801478bf780 hw ffff8800af8008a0 dma af8008a0 next ffff8801478bfd00 hw_next af800962 > > [ 241.794854] ehci_hcd 0000:00:0b.1: qh ffff8801478bfd00 hw ffff8800af800960 dma af800960 next (null) hw_next af800002 > > [ 241.794861] ehci_hcd 0000:00:0b.1: This qh link time 4294878614 enqueue time 4294878614 td token 1f8c80 ov token 0 > > [ 241.794869] ehci_hcd 0000:00:0b.1: qh ffff8801478bfd00 naf800002 info 4200220e 40000000 qtd af820540 > > [ 241.794876] ehci_hcd 0000:00:0b.1: overlay td ffff8800af800970 naf820d80 00000001 t00000000 p0=af883a9f > > [ 241.794884] ehci_hcd 0000:00:0b.1: dma af820d80 td ffff8800af820d80 naf820960 00000001 t001f8c80 p0=af883a80 > > [ 241.794892] ehci_hcd 0000:00:0b.1: dummy af820960 td ffff8800af820960 n00000001 00000001 t00000040 p0=00000000 > > [ 241.805935] ehci_hcd 0000:00:0b.1: giveback urb ffff8801455fa300 actual 0 > > [ 241.805943] ehci_hcd 0000:00:0b.1: alan end > > This confirms it. The async schedule isn't running, even though it is > turned on and the status register says that it is running. > > > The first "async off" happened, I guess correctly, > > when the hot-plug settled and all the HDDs where > > in place as expected. > > The "async on" happened as soon as I started to > > read 10 HDDs in parallel, I think correctly too. > > > > After few seconds, this time without any "help", > > the problem appeared with the log as reported. > > > > I noticed several "async on/off" in dmesg, it > > seems to me all were consistent. > > This is all what one would expect in normal operation. > > It seems clear that we aren't going to be able to find the cause of > the bug. So let's try another approach: We'll change the commit you > identified as the one where the problems began. > > The patch below will prevent the driver from unlinking any idle QHs > from the async schedule. This is a little overboard, because even > before that commit they did get unlinked, one at a time. Still, let's > see if this prevents the problem. (You should apply this on top of the > other patches.) > > Alan Stern > > > > Index: usb-3.7/drivers/usb/host/ehci-q.c > =================================================================== > --- usb-3.7.orig/drivers/usb/host/ehci-q.c > +++ usb-3.7/drivers/usb/host/ehci-q.c > @@ -1347,7 +1347,6 @@ static void scan_async (struct ehci_hcd > } else if (list_empty(&qh->qtd_list) > && qh->qh_state == QH_STATE_LINKED) { > qh->unlink_cycle = ehci->async_unlink_cycle; > - check_unlinks_later = true; > } else if (temp != 0) > goto rescan; > } -- piergiorgio -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html