Re: [patch 3/6] 2.6.18: sb1250-mac: Phylib IRQ handling fixes

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On Oct 3, 2006, at 10:18, Maciej W. Rozycki wrote:

Hello,

This patch fixes a couple of problems discovered with interrupt handling
in the phylib core, namely:

1. The driver uses timer and workqueue calls, but does not include
   <linux/timer.h> nor <linux/workqueue.h>.


Good catch.



2. The driver uses schedule_work() for handling interrupts, but does not
   make sure any pending work scheduled thus has been completed before
   driver's structures get freed from memory.  This is especially
important as interrupts may keep arriving if the line is shared with
   another PHY.

The solution is to ignore phy_interrupt() calls if the reported device
   has already been halted and calling flush_scheduled_work() from
phy_stop_interrupts() (but guarded with current_is_keventd() in case
   the function has been called through keventd from the MAC device's
   close call to avoid a deadlock on the netlink lock).


I've been trying to figure out this problem since you posted this, and I'm not sure I understand it fully (And I apologize profusely for the horror that is the PHY interrupt handling code. I'd love to rewrite it if there's some cleaner way.). But let me see if I can follow the chain of reasoning that led to this patch, and see if we can figure out a solution that doesn't involve creating a work queue just for bringing down the PHY.

1) Invoking phy_stop is meant to stop the system from looking for PHY status updates. Currently, another PHY sharing the interrupt can cause the HALTED PHY to reenable interrupts. Checking for HALTED in the interrupt handler fixes this, but it's incorrect. The phy_interrupt handler does not grab the lock, and so you could get this:

phy_stop
	lock
	clear any pending interrupts
	disable interrupts on this PHY
---> interrupt from another PHY causes this PHY's interrupt handler to be invoked
	HALTED isn't set, so phy_change is scheduled
<--- set HALTED, unlock

scheduled work is done:
	interrupt is reenabled

Sadly, I think the only way to fix this problem is to have phy_change check for HALTED, and react appropriately.

2) The PHY lib doesn't clear out remaining work in the work queue when it's bringing down a PHY. This is clearly wrong, but I'm confused how *any* driver does this? It seems to me that any network driver which has a work queue is going to be unable to flush the pending work when it is brought down. So what's the solution to this?

I'm not too enthusiastic about requiring the ethernet drivers to call phy_disconnect in a separate thread after "close" is called. Assuming there's not some sort of "squash work queue" function that can be invoked with rtnl_lock held, I think phy_disconnect should schedule itself to flush the queue. This would also require that mdiobus_unregister hold off on freeing phydevs if any of the phys were still waiting for pending flush_pending calls to finish. Which would, in turn, require mdiobus_unregister to schedule cleaning up memory for some later time.

I'm not enthusiastic about that implementation, either, but it maintains the abstractions I consider important for this code. The ethernet driver should not need to know what structures the PHY lib uses to implement its interrupt handling, and how to work around their failings, IMHO.

And now, to be more specific, a few comments on the patch, directly:


 Please consider.

  Maciej

Signed-off-by: Maciej W. Rozycki <macro@xxxxxxxxxxxxxx>

patch-mips-2.6.18-20060920-phy-irq-16
diff -up --recursive --new-file linux-mips-2.6.18-20060920.macro/ drivers/net/phy/phy.c linux-mips-2.6.18-20060920/drivers/net/phy/phy.c --- linux-mips-2.6.18-20060920.macro/drivers/net/phy/phy.c 2006-08-05 04:58:46.000000000 +0000 +++ linux-mips-2.6.18-20060920/drivers/net/phy/phy.c 2006-10-03 14:19:21.000000000 +0000
@@ -7,6 +7,7 @@
  * Author: Andy Fleming
  *
  * Copyright (c) 2004 Freescale Semiconductor, Inc.
+ * Copyright (c) 2006  Maciej W. Rozycki
  *
* This program is free software; you can redistribute it and/or modify it * under the terms of the GNU General Public License as published by the
@@ -32,6 +33,8 @@
 #include <linux/mii.h>
 #include <linux/ethtool.h>
 #include <linux/phy.h>
+#include <linux/timer.h>
+#include <linux/workqueue.h>

 #include <asm/io.h>
 #include <asm/irq.h>
@@ -484,6 +487,9 @@ static irqreturn_t phy_interrupt(int irq
 {
 	struct phy_device *phydev = phy_dat;

+	if (PHY_HALTED == phydev->state)
+		return IRQ_NONE;		/* It can't be ours.  */
+


As I mentioned, this doesn't protect it, since it doesn't grab the lock. And it can't grab the lock, or we'd have to disable interrupts while doing phy transactions. And we can't do that, because one design goal is to allow some bus drivers to use interrupts to signal that the transaction has completed. Admittedly, this is still quite broken right now. I'm looking into using semaphores, on the theory that I can sleep when I grab them. But that would still prevent taking the semaphore in the interrupt controller. This needs to be moved to phy_change (which you have done, anyway), and we just have to let the actual handler figure out whether it's safe to do anything.


 	/* The MDIO bus is not allowed to be written in interrupt
 	 * context, so we need to disable the irq here.  A work
 	 * queue will write the PHY to disable and clear the
@@ -577,6 +583,13 @@ int phy_stop_interrupts(struct phy_devic
 	if (err)
 		phy_error(phydev);

+	/*
+	 * Finish any pending work; we might have been scheduled
+	 * to be called from keventd ourselves, though.
+	 */
+	if (!current_is_keventd())
+		flush_scheduled_work();
+


And this is what is making you move your call to phy_disconnect to a work queue function, right? Does this make it so phy_stop_interrupts (and anything that calls it) can't be called with rtnl_lock held? I'd like to avoid that requirement, if at all possible.


 	free_irq(phydev->irq, phydev);

 	return err;
@@ -603,7 +616,8 @@ static void phy_change(void *data)
 	enable_irq(phydev->irq);

 	/* Reenable interrupts */
-	err = phy_config_interrupt(phydev, PHY_INTERRUPT_ENABLED);
+	if (PHY_HALTED != phydev->state)
+		err = phy_config_interrupt(phydev, PHY_INTERRUPT_ENABLED);


Eek! You're checking the phydev->state outside of a lock. That's very unsafe.



 	if (err)
 		goto irq_enable_err;
@@ -624,18 +638,24 @@ void phy_stop(struct phy_device *phydev)
 	if (PHY_HALTED == phydev->state)
 		goto out_unlock;

-	if (phydev->irq != PHY_POLL) {
-		/* Clear any pending interrupts */
-		phy_clear_interrupt(phydev);
+	phydev->state = PHY_HALTED;

+	if (phydev->irq != PHY_POLL) {
 		/* Disable PHY Interrupts */
 		phy_config_interrupt(phydev, PHY_INTERRUPT_DISABLED);
-	}

-	phydev->state = PHY_HALTED;
+		/* Clear any pending interrupts */
+		phy_clear_interrupt(phydev);
+	}


This all seems good. I was just thinking a few minutes ago that disabling the interrupts should be done before clearing the interrupts. And setting HALTED first should mean that we can move the unlock call up.



 out_unlock:
 	spin_unlock(&phydev->lock);
+
+	/*
+	 * Cannot call flush_scheduled_work() here as desired because
+	 * of rtnl_lock(), but PHY_HALTED shall guarantee phy_change()
+	 * will not reenable interrupts.
+	 */


Yeah, I don't think we need the comment, necessarily; phy_change will be protected (as long as the HALTED check is moved inside a lock).

Andy


[Index of Archives]     [Linux MIPS Home]     [LKML Archive]     [Linux ARM Kernel]     [Linux ARM]     [Linux]     [Git]     [Yosemite News]     [Linux SCSI]     [Linux Hams]

  Powered by Linux