----- Original Message -----
> From: "Laurence Oberman" <loberman@xxxxxxxxxx>
> To: "Mike Snitzer" <snitzer@xxxxxxxxxx>
> Cc: "Bart Van Assche" <bart.vanassche@xxxxxxxxxxx>, dm-devel@xxxxxxxxxx, linux-scsi@xxxxxxxxxxxxxxx
> Sent: Tuesday, August 2, 2016 10:55:59 PM
> Subject: Re: dm-mq and end_clone_request()
>
> ----- Original Message -----
> > From: "Laurence Oberman" <loberman@xxxxxxxxxx>
> > To: "Mike Snitzer" <snitzer@xxxxxxxxxx>
> > Cc: "Bart Van Assche" <bart.vanassche@xxxxxxxxxxx>, dm-devel@xxxxxxxxxx, linux-scsi@xxxxxxxxxxxxxxx
> > Sent: Tuesday, August 2, 2016 10:18:30 PM
> > Subject: Re: dm-mq and end_clone_request()
> >
> > ----- Original Message -----
> > > From: "Mike Snitzer" <snitzer@xxxxxxxxxx>
> > > To: "Laurence Oberman" <loberman@xxxxxxxxxx>
> > > Cc: "Bart Van Assche" <bart.vanassche@xxxxxxxxxxx>, dm-devel@xxxxxxxxxx, linux-scsi@xxxxxxxxxxxxxxx
> > > Sent: Tuesday, August 2, 2016 10:10:12 PM
> > > Subject: Re: dm-mq and end_clone_request()
> > >
> > > On Tue, Aug 02 2016 at 9:33pm -0400,
> > > Laurence Oberman <loberman@xxxxxxxxxx> wrote:
> > >
> > > > Hi Bart,
> > > >
> > > > I simplified the test to 2 simple scripts, running against only one
> > > > XFS file system.
> > > > Can you validate these and tell me if it's enough to emulate what you
> > > > are doing?
> > > > Perhaps our test suite is too simple.
> > > >
> > > > Start the test:
> > > >
> > > > # cat run_test.sh
> > > > #!/bin/bash
> > > > logger "Starting Bart's test"
> > > > #for i in `seq 1 10`
> > > > for i in 1
> > > > do
> > > >     fio --verify=md5 -rw=randwrite --size=10M --bs=4K --loops=$((10**6)) \
> > > >         --iodepth=64 --group_reporting --sync=1 --direct=1 --ioengine=libaio \
> > > >         --directory="/data-$i" --name=data-integrity-test --thread --numjobs=16 \
> > > >         --runtime=600 --output=fio-output.txt >/dev/null &
> > > > done
> > > >
> > > > Delete the hosts; I wait 10s in between host deletions.
> > > > But I also tested with 3s and it's still stable with Mike's patches.
> > > >
> > > > #!/bin/bash
> > > > for i in /sys/class/srp_remote_ports/*
> > > > do
> > > >     echo "Deleting host $i, it will re-connect via srp_daemon"
> > > >     echo 1 > $i/delete
> > > >     sleep 10
> > > > done
> > > >
> > > > Check for I/O errors affecting XFS; we now have none with the patches
> > > > Mike provided.
> > > > After recovery I can create files in the XFS mount with no issues.
> > > >
> > > > Can you use my scripts and 1 mount and see if it still fails for you?
> > >
> > > In parallel we can try Bart's test suite that he shared earlier in this
> > > thread: https://github.com/bvanassche/srp-test
> > >
> > > README.md says:
> > > "Running these tests manually is tedious. Hence this test suite that
> > > tests the SRP initiator and target drivers by loading both drivers on
> > > the same server, by logging in using the IB loopback functionality and
> > > by sending I/O through the SRP initiator driver to a RAM disk exported
> > > by the SRP target driver."
> > >
> > > This could explain why Bart is still seeing issues. He isn't testing
> > > real hardware -- as such he is using ramdisk to expose races, etc.
> > >
> > > Mike
> >
> > Hi Mike,
> >
> > I looked at Bart's scripts; they looked fine, but I wanted a simpler way
> > to bring the error out.
> > Using ramdisk is not uncommon as an LIO backend via ib_srpt to serve LUNs.
> > That is the same way I do it when I am not connected to a large array, as
> > it is the only way I can get EDR-like speeds.
> >
> > I don't think it's racing due to the ramdisk back-end, but maybe we need
> > to ramp ours up to run more in parallel in a loop.
> >
> > I will run 21 parallel runs and see if it makes a difference tonight and
> > report back tomorrow.
> > Clearly, prior to your final patches we were escaping back to the FS layer
> > with errors, but since your patches, at least in our test harness, that is
> > resolved.
> >
> > Thanks
> > Laurence
>
> Hello,
>
> I ran 20 parallel runs with 3 loops through host deletion, and in each case
> fio survived with no hard error escaping to the FS layer.
> It's solid in our test bed.
> Keep in mind we have no ib_srpt loaded, as we have a hardware-based array
> and are connected directly to the array with EDR 100.
> I am also not removing and reloading modules as is happening in Bart's
> scripts, and not trying to delete mpath maps, etc.
>
> I focused only on the I/O error that was escaping up to the FS layer.
> I will check in with Bart tomorrow.
>
> Thanks
> Laurence

Hi Bart,

Looking back at your email, I also get these, but those are expected: we are
in the middle of doing I/O when we yank the hosts, so in-flight requests are
affected.

Aug 2 22:41:23 jumpclient kernel: device-mapper: multipath: Failing path 8:192.
Aug 2 22:41:23 jumpclient kernel: blk_update_request: I/O error, dev sdm, sector 258504
Aug 2 22:41:23 jumpclient kernel: blk_update_request: I/O error, dev sdm, sector 60320

However, I never get any of the errors you show any more (with the patches
applied):

[ 162.903284] Buffer I/O error on dev dm-0, logical block 32928, lost sync page write

I will work with you to understand why, with Mike's patches, it is now stable
here but not in your configuration.

Thanks
Laurence
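P.S. In case it helps you reproduce on your side, below is roughly the combined
form of what I ran: the 20 parallel fio runs plus 3 loops through host deletion,
followed by a check for anything that escaped up to the FS layer. Treat it as a
sketch only; the /data-1 through /data-20 mount points, the 10s sleep, the
per-run output file names, and the grep pattern are from my setup and will
likely need adjusting for yours.

#!/bin/bash
# Start 20 parallel data-integrity fio runs, one per XFS mount point
# (same fio options as in run_test.sh above).
logger "Starting Bart's test (20 parallel runs)"
for i in `seq 1 20`
do
    fio --verify=md5 -rw=randwrite --size=10M --bs=4K --loops=$((10**6)) \
        --iodepth=64 --group_reporting --sync=1 --direct=1 --ioengine=libaio \
        --directory="/data-$i" --name=data-integrity-test --thread --numjobs=16 \
        --runtime=600 --output=fio-output-$i.txt >/dev/null &
done

# While fio is running, delete the SRP hosts 3 times over; srp_daemon
# re-connects them after each deletion.
for loop in 1 2 3
do
    for i in /sys/class/srp_remote_ports/*
    do
        echo "Deleting host $i, it will re-connect via srp_daemon"
        echo 1 > $i/delete
        sleep 10
    done
done

# Wait for the fio jobs to finish, then look for errors that reached the
# FS layer.  Path failures and blk_update_request errors on the sd* paths
# are expected; "Buffer I/O error ... lost sync page write" on the dm
# device or XFS errors are not.  The pattern below is just my guess at a
# reasonable filter.
wait
dmesg | grep -iE "Buffer I/O error|XFS.*error" \
    || echo "No FS-level I/O errors seen"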