On Sunday, April 8, 2018 at 13:54:28 +0300, Sagi Grimberg wrote:
>
> > Hi Sagi,
>
> Hi Raju,
>
> > We are running an nvmf link-test (toggling the link with some delay)
> > to ensure that the connections are restored and I/O resumes on
> > link-up. This test used to work until commit e818a5b, but with this
> > commit included we see I/O errors as soon as the link goes down, and
> > when the link comes back up the nvmf disks are remounted read-only
> > and don't allow I/O to run.
>
> Is this a manual remount? Or just after a successful reconnect?

This is not a manual remount; it is an automount that happens just
after the link is brought up.

> > As per the commit message, this seems to be the expected behaviour.
> > Could you please confirm whether what we are seeing is expected?
>
> The behavior is to fail all inflight requests immediately when a link
> failure is detected. Continuation of service is expected from a
> multipath layer.
>
> When it worked for you before, what was your link toggle delay?

There was a 20 second delay between link down and link up.

> > The same behaviour is seen with other vendors too.
> >
> > Here is the commit msg:
> >
> > commit e818a5b487fea20494b0e48548c1085634abdc0d
> > Author: Sagi Grimberg <sagi@xxxxxxxxxxx>
> > Date:   Mon Jun 5 20:35:56 2017 +0300
> >
> >     nvme-rdma: fast fail incoming requests while we reconnect
> >
> >     When we encounter transport/controller errors, error recovery
> >     kicks in, which performs:
> >     1. stop io/admin queues
> >     2. move transport queues out of LIVE state
> >     3. fast fail pending io
> >     4. schedule periodic reconnects
> >
> >     But we also need to fast fail incoming IO that enters after we
> >     have already scheduled reconnects. Given that our queue is not
> >     LIVE anymore, simply restart the request queues to fail in
> >     .queue_rq
> >
> > I/O errors:
> > [Tue Apr 3 19:14:55 2018] print_req_error: I/O error, dev nvme2n1, sector 1108136
> > [Tue Apr 3 19:14:55 2018] Aborting journal on device nvme2n1-8.
> > [Tue Apr 3 19:14:55 2018] print_req_error: I/O error, dev nvme2n1, sector 1052688
> > [Tue Apr 3 19:14:55 2018] Buffer I/O error on dev nvme2n1, logical block 131586, lost sync page write
> > [Tue Apr 3 19:14:55 2018] JBD2: Error -5 detected when updating journal superblock for nvme2n1-8.
> >
> > and IO fails to resume as the devices are remounted read-only...
>
> This looks like a journal write that failed when the link failure
> occurred.
>
> Can you verify whether mount -o remount makes this go away?

The remount fails:

# mount -o remount,rw /dev/nvme0n1 /mnt/nvme0
mount: cannot remount /dev/nvme0n1 read-write, is write-protected
# mount -o remount,rw /mnt/nvme0
mount: cannot remount /dev/nvme0n1 read-write, is write-protected

But if I manually umount, format and mount again, then IO runs fine.

> If this is the case, then yes, it's expected. The driver will fail
> inflight IO at some point; today it does this immediately. We could
> add another timer for failing inflight IO, which would allow a
> reconnect to happen first, but it would require some work.
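
As a side note for readers following the thread: the fast-fail
behaviour the quoted commit message describes comes down to a liveness
check in the driver's .queue_rq handler. Below is a minimal sketch of
that pattern against the modern blk-mq API; the names (sketch_queue,
SKETCH_Q_LIVE, sketch_queue_rq) are illustrative placeholders, not the
actual nvme-rdma code.

#include <linux/blk-mq.h>
#include <linux/bitops.h>

enum sketch_queue_flags {
	SKETCH_Q_LIVE = 0,	/* queue is connected and usable */
};

struct sketch_queue {
	unsigned long flags;
};

static blk_status_t sketch_queue_rq(struct blk_mq_hw_ctx *hctx,
				    const struct blk_mq_queue_data *bd)
{
	struct sketch_queue *queue = hctx->driver_data;

	/*
	 * Error recovery cleared SKETCH_Q_LIVE when the link dropped,
	 * so anything submitted after that point is failed right away
	 * instead of hanging until a reconnect completes.
	 */
	if (!test_bit(SKETCH_Q_LIVE, &queue->flags))
		return BLK_STS_IOERR;

	/* ... map the request and post the RDMA send here ... */
	return BLK_STS_OK;
}

With the queue out of the LIVE state, restarting the request queues (as
the commit message says) drives every parked request back into
.queue_rq, where the check above completes it with an I/O error. That
error is what the filesystem journal sees, and why ext4 aborts the
journal and flips the mount read-only.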
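
The "another timer" idea at the end of the mail could be sketched as a
delayed work item: keep the queue quiesced for a grace period after the
error, and only let .queue_rq start failing requests if the reconnect
has not completed by then. This is a speculative illustration under
assumed names and an assumed 30 second grace period, not existing
driver code.

#include <linux/kernel.h>
#include <linux/blk-mq.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>

struct sketch_ctrl {
	struct request_queue *q;
	struct delayed_work fast_fail_work;
	bool reconnected;	/* set by a successful reconnect */
};

static void sketch_fast_fail_fn(struct work_struct *work)
{
	struct sketch_ctrl *ctrl = container_of(to_delayed_work(work),
						struct sketch_ctrl,
						fast_fail_work);

	/*
	 * Reconnect won the race: the reconnect path unquiesces the
	 * queue itself, and the parked requests dispatch normally.
	 */
	if (ctrl->reconnected)
		return;

	/*
	 * Grace period expired: unquiesce so .queue_rq sees the dead
	 * queue and fails the pending requests, as the current code
	 * does immediately.
	 */
	blk_mq_unquiesce_queue(ctrl->q);
}

static void sketch_start_error_recovery(struct sketch_ctrl *ctrl)
{
	ctrl->reconnected = false;
	blk_mq_quiesce_queue(ctrl->q);	/* park new submissions */
	INIT_DELAYED_WORK(&ctrl->fast_fail_work, sketch_fast_fail_fn);
	schedule_delayed_work(&ctrl->fast_fail_work, 30 * HZ);
	/* ... schedule periodic reconnect attempts in parallel ... */
}

Under such a scheme, a 20 second link toggle would be absorbed entirely
by the grace period and the journal write would never see an error; the
trade-off is that I/O stalls for up to the grace period on a real,
permanent failure.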