Re: [bug report][bisected] blktests nvme/tcp nvme/030 failed on latest linux-block/for-next

Sagi Grimberg <sagi@xxxxxxxxxxx> · Thu, 11 Aug 2022 15:28:21 +0300

nvme/030 triggered several errors during CKI tests on
linux-block/for-next. Please help check it, and feel free to let me
know if you need any tests or info. Thanks.

Hi Chaitanya and Yi,

This commit (submitted last February) simply exposes two read-only
attributes in sysfs.

It seems it was not the culprit, but nvme/030 passes after I revert
that commit on v5.19.

Hi Sagi

I did more testing and finally found that reverting this udev rule
change in nvme-cli fixes the nvme/030 failure. Could you check
it?

commit f86faaaa2a1ff319bde188dc8988be1ec054d238 (refs/bisect/bad)
Author: Sagi Grimberg <sagi@grimberg.m
Date:   Mon Jun 27 11:06:50 2022 +0300

      udev: re-read the discovery log page when a discovery controller reconnected

      When using persistent discovery controllers, if the discovery
      controller loses connectivity and manage to reconnect after a while,
      we need to retrieve again the discovery log page in order to learn
      about possible changes that may have occurred during this time as
      discovery log change events were lost.

      Signed-off-by: Sagi Grimberg <sagi@xxxxxxxxxxx>
      Signed-off-by: Daniel Wagner <dwagner@xxxxxxx>
      Link: https://lore.kernel.org/r/20220627080650.108936-1-sagi@grimberg.me

Yes, this change has now been reverted from nvme-cli.
I'm thinking about how we should solve the original issue; the only way I
can think of at this moment is a "reconnected" event. Does anyone have an
idea how userspace can do the right thing here without it?

Hi Sagi. We had a discussion regarding this back in January (or February?).

I needed such an event on a reconnect for my project, nvme-stas:
https://github.com/linux-nvme/nvme-stas

This event was needed so that the host could re-register with a CDC on a
reconnect (per TP8010). At your suggestion, I added "NVME_EVENT=connected"
in host/core.c. This has been working great for me. Maybe the udev rule
could be modified to look for this event.

That is exactly what it does, which is why nvme discover unexpectedly
connects to all log entries: the udev event triggers.

In order to address the problem of missed AENs while the controller was
disconnected, we need to re-issue the log page on a reconnect, not on
a first connect.
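
To make the distinction concrete, here is a rough sketch of how a
long-running userspace listener might tell a first connect from a
reconnect using only the existing "NVME_EVENT=connected" uevent. It is
an illustration only: the ReconnectTracker class, its handle() method,
and the use of DEVNAME as the controller key are hypothetical and not
part of nvme-cli or nvme-stas. It assumes the same event fires on both
the initial connect and every later reconnect.

```python
class ReconnectTracker:
    """Hypothetical helper: tell a first connect from a reconnect via uevents."""

    def __init__(self):
        # Controller names that have already reported "connected" once.
        self._seen = set()

    def handle(self, action, props):
        """Return True when the discovery log page should be re-read.

        action: uevent action string ("add", "change", "remove").
        props:  dict of uevent properties; the DEVNAME key is an assumption.
        """
        name = props.get("DEVNAME")
        if name is None:
            return False
        if action == "remove":
            # Controller is gone; a later connect under this name is a first connect.
            self._seen.discard(name)
            return False
        if action == "change" and props.get("NVME_EVENT") == "connected":
            if name in self._seen:
                # Seen before, so this is a reconnect: AENs may have been missed.
                return True
            # First "connected" for this controller: remember it, do nothing else.
            self._seen.add(name)
        return False
```

This only works for a daemon that keeps state across events. A
stateless udev rule cannot do it, and the state is lost if the daemon
restarts, which is why a dedicated "reconnected" event still looks
cleaner.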