On 07/05/2024 09:44, Li, Eric (Honggang) wrote:
Resend this email.
-----Original Message-----
From: Li, Eric (Honggang)
Sent: Monday, May 6, 2024 9:50 AM
To: John Garry <john.g.garry@xxxxxxxxxx>; Jason Yan <yanaijie@xxxxxxxxxx>;
james.bottomley@xxxxxxxxxxxxxxxxxxxxx; Martin K . Petersen <martin.petersen@xxxxxxxxxx>
Cc: linux-scsi@xxxxxxxxxxxxxxx
Subject: RE: Issue in sas_ex_discover_dev() for multiple level of SAS expanders in a domain
-----Original Message-----
From: John Garry <john.g.garry@xxxxxxxxxx>
Sent: Friday, May 3, 2024 4:33 PM
To: Li, Eric (Honggang) <Eric.H.Li@xxxxxxxx>; Jason Yan
<yanaijie@xxxxxxxxxx>; james.bottomley@xxxxxxxxxxxxxxxxxxxxx; Martin K
. Petersen <martin.petersen@xxxxxxxxxx>
Cc: linux-scsi@xxxxxxxxxxxxxxx
Subject: Re: Issue in sas_ex_discover_dev() for multiple level of SAS
expanders in a domain
[EXTERNAL EMAIL]
On 03/05/2024 04:15, Li, Eric (Honggang) wrote:
John,
I agree that the call to sas_ex_join_wide_port() is not mandatory.
In fact, some logic here is similar to that function. We don't need
to do it again.
But just updating the phy_state may not be enough. I suppose you
still need to add that PHY into the corresponding wide port by
calling
sas_port_add_phy() and update phy->port.
Just updating the phy_state may avoid the port disabled in this
issue, but other missing piece of work may cause other issues.
If you check the commit in which that call to sas_ex_join_wide_port()
was originally added - 19252de - it was added together with the
comment "Due to races, the phy might not get added to the wide port,
so we add the phy to the wide port here". However the code to set phy_state =
PHY_STATE_DISCOVERED already existed before that commit.
When all that code was removed in a1b6fb947f923, I am just wondering
if we have kept the phy_state = PHY_STATE_DISCOVERED code.
Anyway, would we really join a wideport with the downstream expander?
I would just assume that we would be creating a new wideport, and a
subsequent scanned phy would be added to it.
[Eric: ] I don't think the code removed in a1b6fb947f923 is the right way to fix this issue.
It just happened avoiding this issue occurring.
Sure, I don't particularly like it as a fix either. But first I would
like to get your setup working again to verify that only this needs
fixing. Indeed something else may be broken since a1b6fb947f923. In
addition, if we need to backport a fix, I would only like to backport a
fix for real known broken topologies, and not theoretical issues not
experienced.
But what exact setup do you have? I am confused, as you seem to be
talking about your setup being broken, but then other setup may also
being broken, but you don't have access to another setup. What is the
other setup?
I think the root cause of this issue is the order of function calls to
sas_dev_present_in_domain() and sas_ex_join_wide_port().
My concern here is whether or not we still need to configure routing on the parent SAS
expander before calling sas_ex_join_wide_port().
This part of logic is not present previously and I don't have environment to test this.
Back to your question, I believe we do need to join a wideport to downstream expander.
Changing the phy_state to PHY_STATE_DISCOVERED will skip scanning those PHYs,
I would have thought that re-adding the code removed in a1b6fb947f923 to
set PHY_STATE_DISCOVERED would only affect the first phy scanned in that
wideport. Other phys scanned would have it set through calls to
sas_ex_join_wide_port()
by
not calling the function sas_ex_discover_dev() to subsequential PHYs within this port (so
this issue wouldn’t hit). But those PHYs are not associated with a SAS port. This may trigger
other issues (for example, any change count changed on that PHY, or SAS topology in sysfs,
etc.)
ok, this can be considered more when I understand exactly what you have
and what else you think may be broken.
Thanks,
John