On Sunday, April 8, 2018 at 13:54:28 +0300, Sagi Grimberg wrote:
>
> > Hi Sagi,
>
> Hi Raju,
>
> > We are running an nvmf link-test (toggling the link with some delay)
> > to ensure that the connections are restored and I/O resumes on
> > link-up. This test used to work until commit e818a5b, but with this
> > commit included we see I/O errors as soon as the link goes down, and
> > when the link comes back up the nvmf disks are remounted read-only
> > and don't allow I/O to run.
>
> Is this a manual remount? Or just after a successful reconnect?

This is not a manual remount; it is an automount that happens just
after the link is brought up.

> > As per the commit message, this seems to be the expected behaviour.
> > Could you please confirm whether what we are seeing is expected?
>
> The behavior is to fail all inflight requests immediately when a link
> failure is detected. Continuation of service is expected from a
> multipath layer.
>
> When it worked for you before, what was your link toggle delay?

There was a 20 second delay between link down and link up.

> > The same behaviour is seen with other vendors too.
> >
> > Here is the commit msg:
> >
> > commit e818a5b487fea20494b0e48548c1085634abdc0d
> > Author: Sagi Grimberg <sagi@xxxxxxxxxxx>
> > Date:   Mon Jun 5 20:35:56 2017 +0300
> >
> >     nvme-rdma: fast fail incoming requests while we reconnect
> >
> >     When we encounter transport/controller errors, error recovery
> >     kicks in, which performs:
> >     1. stop io/admin queues
> >     2. move transport queues out of LIVE state
> >     3. fast fail pending io
> >     4. schedule periodic reconnects
> >
> >     But we also need to fast fail incoming IO that enters after we
> >     have already scheduled reconnects. Given that our queue is not
> >     LIVE anymore, simply restart the request queues to fail in
> >     .queue_rq
> >
> > I/O errors:
> > [Tue Apr 3 19:14:55 2018] print_req_error: I/O error, dev nvme2n1, sector 1108136
> > [Tue Apr 3 19:14:55 2018] Aborting journal on device nvme2n1-8.
> > [Tue Apr 3 19:14:55 2018] print_req_error: I/O error, dev nvme2n1, sector 1052688
> > [Tue Apr 3 19:14:55 2018] Buffer I/O error on dev nvme2n1, logical block 131586, lost sync page write
> > [Tue Apr 3 19:14:55 2018] JBD2: Error -5 detected when updating journal superblock for nvme2n1-8.
> >
> > and IO fails to resume as the devices are remounted read-only...
>
> This looks like a journal write that failed when the link failure
> occurred.
>
> Can you verify whether mount -o remount makes this go away?

The remount fails:

# mount -o remount,rw /dev/nvme0n1 /mnt/nvme0
mount: cannot remount /dev/nvme0n1 read-write, is write-protected
# mount -o remount,rw /mnt/nvme0
mount: cannot remount /dev/nvme0n1 read-write, is write-protected

But if I manually umount, format and mount again, then IO runs fine.

> If this is the case, then yes, it's expected. The driver will fail
> inflight IO at some point; today it does this immediately. We could
> add another timer for failing inflight IO, which would allow a
> reconnect to happen first, but it would require some work.
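
As a side note for readers following the thread: the fast-fail
behaviour the quoted commit message describes comes down to a liveness
check in the driver's .queue_rq handler. Below is a minimal sketch of
that pattern against the modern blk-mq API; the names (sketch_queue,
SKETCH_Q_LIVE, sketch_queue_rq) are illustrative placeholders, not the
actual nvme-rdma code.

#include <linux/blk-mq.h>
#include <linux/bitops.h>

enum sketch_queue_flags {
	SKETCH_Q_LIVE = 0,	/* queue is connected and usable */
};

struct sketch_queue {
	unsigned long flags;
};

static blk_status_t sketch_queue_rq(struct blk_mq_hw_ctx *hctx,
				    const struct blk_mq_queue_data *bd)
{
	struct sketch_queue *queue = hctx->driver_data;

	/*
	 * Error recovery cleared SKETCH_Q_LIVE when the link dropped,
	 * so anything submitted after that point is failed right away
	 * instead of hanging until a reconnect completes.
	 */
	if (!test_bit(SKETCH_Q_LIVE, &queue->flags))
		return BLK_STS_IOERR;

	/* ... map the request and post the RDMA send here ... */
	return BLK_STS_OK;
}

With the queue out of the LIVE state, restarting the request queues (as
the commit message says) drives every parked request back into
.queue_rq, where the check above completes it with an I/O error. That
error is what the filesystem journal sees, and why ext4 aborts the
journal and flips the mount read-only.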
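
The "another timer" idea at the end of the mail could be sketched as a
delayed work item: keep the queue quiesced for a grace period after the
error, and only let .queue_rq start failing requests if the reconnect
has not completed by then. This is a speculative illustration under
assumed names and an assumed 30 second grace period, not existing
driver code.

#include <linux/kernel.h>
#include <linux/blk-mq.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>

struct sketch_ctrl {
	struct request_queue *q;
	struct delayed_work fast_fail_work;
	bool reconnected;	/* set by a successful reconnect */
};

static void sketch_fast_fail_fn(struct work_struct *work)
{
	struct sketch_ctrl *ctrl = container_of(to_delayed_work(work),
						struct sketch_ctrl,
						fast_fail_work);

	/*
	 * Reconnect won the race: the reconnect path unquiesces the
	 * queue itself, and the parked requests dispatch normally.
	 */
	if (ctrl->reconnected)
		return;

	/*
	 * Grace period expired: unquiesce so .queue_rq sees the dead
	 * queue and fails the pending requests, as the current code
	 * does immediately.
	 */
	blk_mq_unquiesce_queue(ctrl->q);
}

static void sketch_start_error_recovery(struct sketch_ctrl *ctrl)
{
	ctrl->reconnected = false;
	blk_mq_quiesce_queue(ctrl->q);	/* park new submissions */
	INIT_DELAYED_WORK(&ctrl->fast_fail_work, sketch_fast_fail_fn);
	schedule_delayed_work(&ctrl->fast_fail_work, 30 * HZ);
	/* ... schedule periodic reconnect attempts in parallel ... */
}

Under such a scheme, a 20 second link toggle would be absorbed entirely
by the grace period and the journal write would never see an error; the
trade-off is that I/O stalls for up to the grace period on a real,
permanent failure.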