ogerlitz@xxxxxxxxxxxx wrote on Sun, 22 Feb 2009 14:53 +0200:

> Dr. Volker Jaenisch wrote:
>> every combination that I've tried when there are multiple
>> simultaneous readers
>>
>> Reproduced that. On a single core, more than one simultaneous
>> thread accessing the LUN over iSER also gives read errors.
>
> OK, thanks a lot for doing all this testing / bug hunting work.
>
> I read the Feb 2008 "iser multiple readers" thread and wasn't sure
> what the conclusion was. OTOH Robin reported that the patch that
> slows tgt down so it does not send the SCSI response before the RDMA
> write is completed eliminated the error, but OTOH Pete was doing some
> analysis of the errors, @
> http://lists.berlios.de/pipermail/stgt-devel/2008-February/001379.html
> said:
>> "The offsets are always positive, which fits in with the theory that
>> future RDMAs are overwriting earlier ones. This goes against the
>> theory in your (my) patch, which guesses that the SCSI response
>> message is sneaking ahead of RDMA operations."
>
> and here starts the talking on possible relations of this error with
> FMRs, where Pete suggested disabling FMRs to see if the problem
> persists; I wasn't sure if you did that.

This idea of the SCSI response message sneaking ahead of the RDMA
operations is pretty well debunked: that "can't happen" according to
the IB spec, and it never even made sense. The workaround patch just
slows things down, in fact, which seems to mask the problem.

That leaves us with the idea of some page mapping problem. I am always
quick to claim this isn't a target page mapping problem, because a
single MR is used to map the entire 96 MB buffer that is the source for
all RDMA operations. As stgt runs, it reads data somewhere into that
buffer and sends it over the network using ib_send() with a single VA
and MR into that space. The mappings in the NIC never change after
stgt first initializes.
Core stgt is strictly serial, so it should have no threading or locking
issues, although there is a separate thread that brings data from disk
into a buffer before the main thread issues the RDMA operation.
Something could conceivably not be getting flushed correctly in there,
but that path involves only copies, not mapping changes.

Hence I ended up at FMR. But other things in the kernel use FMR too,
like SRP, and we don't know of problems there. SRP uses the caching
feature of the FMR pool, while iSER does not, which seems like less
complexity to worry about for iSER.

>> My guess is that the AMD HyperTransport may interfere with the FMR.
>> But I am no Linux memory management specialist, so please correct me
>> if I am wrong. Maybe the following happens: booted with one CPU, all
>> FMR requests go to the 16 GB of RAM this single CPU directly
>> addresses via its memory controller. With more than one active CPU,
>> memory is fetched from both CPUs' memory controllers, with
>> preference for local memory. In seldom cases the memory manager
>> fetches memory for the FMR process running on CPU0 from CPU1 via the
>> HyperTransport channel, and something weird happens.
>
> To make sure we are on the same page (...) here: FMR (Fast Memory
> Registration) is a means to register with the HCA a (say) arbitrary
> list of pages to be used for an I/O. This page SG (scatter-gather)
> list was allocated and provided by the SCSI midlayer to the iSER SCSI
> LLD (low-level driver) through the queuecommand interface. So I read
> your comment as saying that when using one CPU, and/or a system with
> one memory controller, all I/Os are served with pages from the "same
> memory", and when this doesn't happen, something gets broken.
>
> I wasn't sure I followed the sentence "In seldom cases the memory
> manager fetches memory for the FMR process running on CPU0 from CPU1
> via the HyperTransport channel and something weird happens" - can you
> explain a bit what you were referring to?
It could all just be timing issues. Robin could generate the problem at
will on particularly hefty SGI boxes. He also noticed that multiple
readers would trigger the problem more reliably, but it was possible
with a single reader too. I never could manage to see the problem on
little 2-socket AMD 4 GB boxes. A flaw in the PCIe I/O controllers
talking via HT to remote memory seems unlikely.

		-- Pete
--
To unsubscribe from this list: send the line "unsubscribe stgt" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html