So the one thing I'm not even sure about is if just ignoring the
errors was a good idea to start with. They obviously are if we just
did a rescan and did run into an error while rescanning a namespace
that didn't change. But what if it actually did change?
Right, we don't know, so if we failed without DNR, we assume that
we will retry again and ignore the error. The assumption is that
we will retry when we will reconnect as we don't have a retry mechanism
for these requests.
Yes. And I think for anything related to namespace (re-)scanning
we can actually trivially build a sane retry mechanism. That is give
up on the current scan_work, and just rescan one after a short wait.
There is no point in doing that if we are disconnected and will in
the future reconnect, which will trigger a scan that can actually
work.
So I think a logic like in this patch kinda makes sense, but I think
we also need to retry and scan again on these kinds of errors.
So you are OK with keeping nvme_submit_sync_cmd returning -ENODEV for
cancelled requests and have the scan flow assume that these are
cancelled requests?
How does nvme_submit_sync_cmd return -ENODEV? As far as I can tell
-ENODEV is our special escape for expected-ish errors in namespace
scanning.
One of these escapes I guess :)
At the very least we need a good comment to say what is going on there.
Absolutely.
Btw,
did you ever actually see -ENOMEM in practice? With the small
allocations that we do it really should not happen normally, so
special casing for it always felt a little strange.
Never seen it, it's there just because we have allocations in the path.
FYI, I've started rebasing various bits of work I've done to start
untangling the mess. Here is my current WIP, which in this form
is completely untested:
http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/nvme-scanning-cleanup
This does not yet contain sorting out what is discussed here correct?
No, but all the infrastructure needed to implement my above idead. Most
importanty the crazy revalidate callchains are pretty much gone and we're
down to just a few functions with reasonable call chains.
OK, that makes sense. I'm still not convinced the retry makes sense
though...