On Fri, 2007-01-05 at 10:47 +0100, BERTRAND Joël wrote:
> Hello,
>
> Some trouble again with ESP DMA and the 2.6.19 kernel? With very high
> disk load (raid reconstruction or apt-get dist-upgrade), I can see ESP
> DMA error on a SS20 workstation. This trouble has been fixed in 2.6.18.
> Is there any chance to fix this with the last 2.6.20-rc?
>
> Test config:
> - dual SS-II/75 MHz
> - 448 MB
> - 8MB VSIMM
> - 2 36 GB internal SCSI disks
> - 2.6.19.1 kernel

For what it is worth, I'm not actually able to get esp.ko (Aurora builds
esp as a module) working at all on any sparc32 system (immediate testing
on ss4 and ss20). Several of the Aurora folks have been helping me try to
track down the failure, and here is what we know so far:

2.6.16 works properly on the ss4:
http://beer.tclug.org/jima/text/tmi-2.6.16-1.2241sp7.1-dmesg.txt
We see the disk on the esp controller at target three, and life is good.

2.6.18.1 does not work properly on the ss4:
http://beer.tclug.org/jima/text/tmi-2.6.18-1.2798.al3.3.txt
The esp.ko module loads, but it doesn't see the disk on the esp
controller at target three.

We then tested on an Ultra 2 to see if that hardware (which has multiple
esp controllers in it) would detect devices correctly, and it does:
http://beer.tclug.org/jima/text/badger-2.6.18-1.2798.al3.1smp-dmesg.txt

I also tested 2.6.20-rc1-git5 on the sparc32s, but it also fails to see
any of the attached devices (disk on the ss4, disk and cdrom on the
ss20).

At this point, I built 2.6.16 and 2.6.18.1 kernels with all the DEBUG_
defines enabled, in the hopes of exposing the differences in behavior.

Here is the 2.6.16 debugging output from the ss4:
http://beer.tclug.org/jima/text/tmi-2.6.16-1.2241sp7.2.txt

Here is the 2.6.18.1 debugging output from the ss4:
http://beer.tclug.org/jima/text/tmi-2.6.18-1.2798.al3.4.txt

The differences are rather staggering: the 2.6.16 kernel seems to call
esp_queue multiple times upon discovering the disk on target 3, but the
2.6.18.1 kernel only calls it once.

Comparing the output on the ss4:

On 2.6.16, we see:

<SLCTNORM>I[0:0]( <CLUELESS>esp_do_data: <DATAIN>newphase <DATAIN> hmuch<36> DMA|TI --> do_intr_end )
I[0:0](esp_work_bus: esp_do_data_finale: trans_sz(36), bytes_sent(36), <CLUELESS>!bogus_data, to new phase

At the same point in 2.6.18.1, we see:

<SLCTNORM>I[0:0]( <CLUELESS>esp_do_data: <DATAIN>newphase <DATAIN> hmuch<252> DMA|TI --> do_intr_end )
I[0:0](esp_work_bus: esp_do_data_finale: trans_sz(252), bytes_sent(18), <CLUELESS>!bogus_data, to new phase

Note the difference in trans_sz, bytes_sent and hmuch: 2.6.16 asks for
36 bytes and gets 36, while 2.6.18.1 asks for 252 and gets only 18. I'm
not sure if that is relevant, or just noise.

After this, both kernels output:

<STATUS>esp_do_status: ack msg, got something, got both, status= 0 msg= 0, and was COMMAND_COMPLETE <FREEING>F<03,00>)

But the 2.6.16 kernel then continues on target 3, calling esp_queue
again and finding the disk there:

esp_queue: target=3 lun=0 N<03,00>
esp: Selecting device for first time. target=3 lun=0
<SLCTNORM>I[0:0](<CLUELESS>esp_do_data: <DATAIN>newphase<DATAIN> hmuch<144> DMA|TI --> do_intr_end )I[0:0](esp_work_bus: esp_do_data_finale: trans_sz(144), bytes_sent(144), <CLUELESS>!bogus_data, to new phase
<STATUS>esp_do_status: ack msg, got something, got both, status= 0 msg= 0, and was COMMAND_COMPLETE <FREEING>F<03,00>)<5> Vendor: SEAGATE Model: ST34573WC Rev: 6244
Type: Direct-Access ANSI SCSI revision: 02
.....

The 2.6.18.1 kernel stops and moves on to target 4+:

esp_queue: target=4 lun=0 N<04,00>
esp: Selecting device for first time. target=4 lun=0
<SLCTNORM>I[0:0](esp: selection failure, maybe nobody there?
esp: target 4 lun 0
)esp_queue: target=5 lun=0 N<05,00>
esp: Selecting device for first time. target=5 lun=0
<SLCTNORM>I[0:0](esp: selection failure, maybe nobody there?
esp: target 5 lun 0
)esp_queue: target=6 lun=0 N<06,00>
esp: Selecting device for first time. target=6 lun=0
<SLCTNORM>I[0:0](esp: selection failure, maybe nobody there?
esp: target 6 lun 0
)

I'm not sure why this is failing. Looking at git, the changes to esp to
port it to the new SBUS layer are the most significant differences:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=411aa5540536feace62c97478a8ea5dab7469377

But I'm not sure why this would break sparc32 and not sparc64, and
reverting it really isn't the way to go here.

I've also got the debugging output from esp.ko on 2.6.18.1 on the
Ultra 2:
http://beer.tclug.org/jima/text/badger-2.6.18-1.2798.al3.4smp.txt

It's worth noting that the trans_sz, bytes_sent and hmuch values here
match the values from the working 2.6.16 kernel on the ss4.

All kernels were built with the same toolchain:
gcc version 4.1.1 20061011 (Red Hat 4.1.1-30)

Dave, any assistance you can offer here would be greatly appreciated, as
this is pretty much a showstopper for the next Aurora release (broken
scsi).

~spot
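
For anyone who wants to diff these traces mechanically rather than by eye, here is a minimal sketch that pulls the hmuch, trans_sz and bytes_sent counters out of a saved debug log and flags transfers that came up short. It is not part of the esp driver or of any existing tool: the function names and the regular expression are assumptions, based only on the message format in the excerpts quoted in this mail.

```python
import re

# Assumed format, taken from the excerpts in this mail:
#   ... hmuch<N> ... esp_do_data_finale: trans_sz(N), bytes_sent(N) ...
# "trans_s?z" also tolerates a "trans_z" spelling in a pasted trace.
TRANSFER_RE = re.compile(
    r"hmuch<(\d+)>.*?trans_s?z\((\d+)\),\s*bytes_sent\((\d+)\)",
    re.DOTALL,
)


def extract_transfers(log_text):
    """Return (hmuch, trans_sz, bytes_sent) tuples in order of appearance."""
    return [
        tuple(int(group) for group in match.groups())
        for match in TRANSFER_RE.finditer(log_text)
    ]


def short_transfers(log_text):
    """Return only the transfers where fewer bytes moved than were requested."""
    return [t for t in extract_transfers(log_text) if t[2] < t[1]]
```

Feeding each saved trace to short_transfers() gives a quick list of the data phases where bytes_sent fell below trans_sz. That is the 252-versus-18 pattern seen on 2.6.18.1 above, and it should come back empty for a trace where every transfer completed.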