On Wed, Nov 08, 2017 at 10:57:23AM -0700, Jens Axboe wrote:
> On 11/08/2017 09:41 AM, Bart Van Assche wrote:
> > On Tue, 2017-11-07 at 20:06 -0700, Jens Axboe wrote:
> >> At this point, I have no idea what Bart's setup looks like. Bart, it
> >> would be REALLY helpful if you could tell us how you are reproducing
> >> your hang. I don't know why this has to be dragged out.
> >
> > Hello Jens,
> >
> > It is a disappointment to me that you have allowed Ming to evaluate other
> > approaches than reverting "blk-mq: don't handle TAG_SHARED in restart". That
> > patch namely replaces an algorithm that is trusted by the community with an
> > algorithm of which even Ming acknowledged that it is racy. A quote from [1]:
> > "IO hang may be caused if all requests are completed just before the current
> > SCSI device is added to shost->starved_list". I don't know of any way to fix
> > that race other than serializing request submission and completion by adding
> > locking around these actions, which is something we don't want. Hence my
> > request to revert that patch.
>
> I was reluctant to revert it, in case we could work out a better way of
> doing it. As I mentioned in the other replies, it's not exactly the
> prettiest or most efficient. However, since we currently don't have
> a good solution for the issue, I'm fine with reverting that patch.
>
> > Regarding the test I run, here is a summary of what I mentioned in previous
> > e-mails:
> > * I modified the SRP initiator such that the SCSI target queue depth is
> >   reduced to one by setting starget->can_queue to 1 from inside
> >   scsi_host_template.target_alloc.
> > * With that modified SRP initiator I run the srp-test software as follows
> >   until something breaks:
> >   while ./run_tests -f xfs -d -e deadline -r 60; do :; done
>
> What kernel options are needed? Where do I download everything I need?
>
> In other words, would it be possible to do a fuller guide for getting
> this setup and running?
>
> I'll run my simple test case as well, since it's currently breaking
> basically everywhere.
>
> > Today a system with at least one InfiniBand HCA is required to run that test.
> > When I have the time I will post the SRP initiator and target patches on the
> > linux-rdma mailing list that make it possible to run that test against the
> > SoftRoCE driver (drivers/infiniband/sw/rxe). The only hardware required to
> > use that driver is an Ethernet adapter.
>
> OK, I guess I can't run it then... I'll have to rely on your testing.

We don't even need to run it; just post the log from 'tags' or
'sched_tags' when this IO hang happens, which should tell us more.

Unfortunately, I still haven't seen such logs so far :-(

--
Ming