On 3/7/24 13:01, Sagi Grimberg wrote:
On 07/03/2024 13:29, Hannes Reinecke wrote:
On 3/7/24 11:10, Sagi Grimberg wrote:
On 19/02/2024 10:59, hare@xxxxxxxxxx wrote:
From: Hannes Reinecke <hare@xxxxxxx>
FPIN LI (link integrity) messages are received when the attached
fabric detects hardware errors. In response to these messages the
affected ports should not be used for I/O, and should only be put back
into service once the ports have been reset, as by then the hardware
might have been replaced.
Does this mean it cannot service any type of communication over
the wire?
It means that the service is impacted, and communication cannot be
guaranteed (CRC errors, packet loss, you name it).
So the link should be taken out of service until it's been (manually)
replaced.
OK, that's what I assumed.
This patch adds a new controller flag 'NVME_CTRL_TRANSPORT_BLOCKED'
which will be checked during multipath path selection, causing the
path to be skipped.
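
As a rough illustration of the check described above, here is a minimal, self-contained sketch of a path selector that skips blocked paths. It is not the kernel code: `struct sketch_path`, `SKETCH_CTRL_TRANSPORT_BLOCKED` (standing in for the proposed 'NVME_CTRL_TRANSPORT_BLOCKED'), and both functions are hypothetical names invented for this example, and the "first usable path wins" policy is an assumption rather than the real multipath policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the proposed NVME_CTRL_TRANSPORT_BLOCKED flag. */
enum sketch_ctrl_flag {
	SKETCH_CTRL_TRANSPORT_BLOCKED = 1 << 0,
};

/* Hypothetical, heavily simplified path; not the kernel's struct nvme_ctrl. */
struct sketch_path {
	const char *name;
	unsigned long flags;	/* bitmask of enum sketch_ctrl_flag */
	bool live;		/* models "controller state is LIVE" */
};

/* A path may carry I/O only if it is live and not blocked by an FPIN LI. */
static bool sketch_path_is_usable(const struct sketch_path *p)
{
	if (!p->live)
		return false;
	if (p->flags & SKETCH_CTRL_TRANSPORT_BLOCKED)
		return false;
	return true;
}

/*
 * Return the first usable path, or NULL if every path is down or blocked.
 * Assumption: a simple linear scan is enough to show the skip logic.
 */
static const struct sketch_path *
sketch_select_path(const struct sketch_path *paths, size_t n)
{
	assert(paths != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (sketch_path_is_usable(&paths[i]))
			return &paths[i];
	}
	return NULL;
}
```

With two live paths where only the first has the blocked flag set, `sketch_select_path()` returns the second; if both are blocked it returns NULL, which a caller would treat as "no usable path".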
While this looks sensible to me, it also looks like this introduces a
ctrl state outside of ctrl->state... Wouldn't it make sense to move the
controller to NVME_CTRL_DEAD? Or is it not a terminal state?
Actually, I was trying to model it alongside the
'devloss_tmo'/'fast_io_fail' mechanism we have in SCSI.
Technically the controller is still present, it's just that we shouldn't
send I/O to it.
Sounds like a dead controller to me.
Sort of, yes.
And I'd rather not disconnect here as we're trying to
do an autoconnect on FC, so a manual disconnect would interfere with
that and we'd probably end up in a death spiral doing disconnect/reconnect.
I suggested just transitioning the state to DEAD... Not sure how
keep-alives behave though...
Hmm. The state machine has the transition LIVE->DELETING->DEAD,
i.e. a dead controller is on the way out, with all resources being
reclaimed.
A direct transition would pretty much violate that.
If we were going that way I'd prefer to have another state
('IMPACTED' ? 'LIVE_NOIO' ?) with the transitions
LIVE->IMPACTED->DELETING->DEAD
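
To make the proposed transitions concrete, here is a minimal sketch of a transition check covering only the states named in this thread. The enum and function names are hypothetical, the real controller state machine has more states than shown, and the IMPACTED->LIVE edge (returning to service after a port reset) is an assumption that the thread implies but does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of controller states; 'IMPACTED' is only a proposal. */
enum sketch_state {
	SKETCH_LIVE,
	SKETCH_IMPACTED,	/* present, but must not be used for I/O */
	SKETCH_DELETING,
	SKETCH_DEAD,
	SKETCH_STATE_COUNT,
};

/* Return true if moving from 'from' to 'to' is allowed in this sketch. */
static bool sketch_transition_allowed(enum sketch_state from,
				      enum sketch_state to)
{
	assert(from < SKETCH_STATE_COUNT && to < SKETCH_STATE_COUNT);

	switch (to) {
	case SKETCH_IMPACTED:
		return from == SKETCH_LIVE;
	case SKETCH_DELETING:
		return from == SKETCH_LIVE || from == SKETCH_IMPACTED;
	case SKETCH_DEAD:
		/* DEAD is only reachable while the controller is going away. */
		return from == SKETCH_DELETING;
	case SKETCH_LIVE:
		/* Assumption: a port reset puts the path back into service. */
		return from == SKETCH_IMPACTED;
	default:
		return false;
	}
}

/* Only a LIVE controller is eligible for I/O in this sketch. */
static bool sketch_state_allows_io(enum sketch_state s)
{
	return s == SKETCH_LIVE;
}
```

In this model `sketch_transition_allowed(SKETCH_LIVE, SKETCH_DEAD)` is false, which is the direct transition objected to above, while LIVE->IMPACTED->DELETING->DEAD is accepted step by step.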
We could 'elevate' it to a new controller state, but I wasn't sure how big
an appetite there is. And we already have flags like 'stopped' which
seem to fall into the same category.
'stopped' is different because it is not used to determine whether the
controller is capable of I/O (admin or I/O queues). Hence it is OK for
it to be a flag.
Okay.
So yeah, we could introduce a new state, but I guess a direct transition
to 'DEAD' is not really a good idea.
Cheers,
Hannes