On Sun, Feb 22, 2009 at 02:53:00PM +0200, Or Gerlitz wrote:
> Dr. Volker Jaenisch wrote:
>> every combination that I've tried when there are multiple
>> simultaneous readers
>> Reproduced that. On a single core, more than one simultaneous
>> thread accessing the LUN over iSER also gives read errors.
> OK, Thanks a lot for doing all this testing / bug hunting work.
>
> I read the Feb 2008 "iser multiple readers" thread and wasn't sure if /
> what was the conclusion.

just to chime in here - I don't think there was any conclusion from 12
months ago... as I was the only one seeing problems at that time, it
(quite rightly) couldn't be ruled out that there was something odd with
our machines/setup.

now that other people are seeing problems too, the chances that the
problem is real and of finding a fix are better.

however, it turns out that we don't need iSER in production any time
soon, so I haven't been spending any time on it. but let me know if you
want me to test a fix and I'll try to find time to break it :-)

cheers,
robin

> OTOH Robin reported that the patch that slows
> down tgt not to send the scsi response before the rdma write is
> completed eliminated the error but OTOH Pete was doing some analysis of
> the errors, @
> http://lists.berlios.de/pipermail/stgt-devel/2008-February/001379.html
> said
>> "The offsets are always positive, which fits in with the theory that
>> future RDMAs are overwriting earlier ones. This goes against the
>> theory in your (my) patch, which guesses that the SCSI response message
>> is sneaking ahead of RDMA operations."
>
> and here starts the talking on possible relations of this error with
> FMRs, where Pete suggested to disable FMRs and see if the problem
> persists, I wasn't sure if you did that.
>
>> My guess is that the AMD hyper-transport may interfere with the fmr.
>> But I am no linux memory management specialist .. so please correct me
>> if I am wrong.
>> Maybe the following happens: Booted with one CPU, all
>> FMR requests go to the 16GB RAM this single CPU directly addresses
>> via its memory controller. In case of more than one active CPU the
>> memory is fetched from both CPUs' memory controllers with preference
>> to local memory. In seldom cases the memory manager fetches memory for
>> the FMR process running on CPU0 from the CPU1 via the hyper-transport
>> channel and something weird happens.
> To make sure we are on the same page (...) here: FMR (Fast Memory
> Registration) is a means to register with the HCA a (say) arbitrary list
> of pages to be used for an I/O. This page SG (scatter-gather) list was
> allocated and provided by the SCSI midlayer to the iSER SCSI LLD
> (low-level-driver) through the queuecommand interface. So I read your
> comment as saying that when using one CPU and/or a system with one
> memory controller all I/O are served with pages from the "same memory"
> where when this doesn't happen, something gets broken.
>
> I wasn't sure to follow on the sentence "In seldom cases the memory
> manager fetches memory for the FMR process running on CPU0 from the CPU1
> via the hyper-transport channel and something weird happens" - can you
> explain a bit what you were referring to?
>
> Or.
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe stgt" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html