On 11/23/2015 12:46 PM, Gabriel Krisman Bertazi wrote: > Peter Hurley <peter@xxxxxxxxxxxxxxxxxx> writes: > >> Hi Gabriel, >> >> On 11/11/2015 09:18 AM, Peter Hurley wrote: >>> On 11/11/2015 08:15 AM, Gabriel Krisman Bertazi wrote: >>>> When a PCI error is detected, reset code in 8250_pci.c suspends the uart >>>> ports to block IO while recovery procedure is performed. Nevertheless, >>>> TIOCMGET and TIOCMSET ioctls are still able to reach the device even >>>> when it is suspended, which might cause EEH recovery of serial adapters >>>> to fail on Power systems. This patch blocks any TIOCMGET/TIOCMSET calls >>>> from reaching the device if the port is suspended, to avoid racing with >>>> the recovery procedure. >>> >>> This really should be handled by the 8250_pci sub-driver ticking >>> TTY_IO_ERROR so that almost all tty operations fail (a minor bug fix >>> to properly order the clearing of ASYNC_INITIALIZED w/ setting TTY_IO_ERROR >>> in uart_shutdown() would be req'd). >>> >>> Unfortunately the machinery to do that doesn't exist. >>> I could trivially add a means given a uart_port ptr but the difficulty >>> of _safely_ getting a uart_port from line index still remains; >>> serial8250_get_port() should have just added a full ref count mechanism. >> >> Re-checking the 8250_pci sub-driver, I see it goes right through >> serial8250_suspend_port() which assumes the 8250 port will not be unregistered >> concurrently, so I assume the PCI subsystem serializes i/o error handling >> with device removal, in which case the translation from >> line => uart_8250_port => uart_port is straightforward. >> >> Below is the serial core machinery that can be used to tick TTY_IO_ERROR. > > Sorry for the late response. > > I've been testing both solutions for a while now and they still doesn't > seem to do a good job on stopping access to a device being recovered. > Ticking TTY_IO_ERROR caused my test program to reopen the port, which > went straight to serial8250_do_startup, accessed the device and crashed > the recovery again. Ah, sorry. Didn't realize that was part of the design requirements. > I fixed that, but other places also access the > device even though TTY_IO_ERROR is set. Since > uart_startup/uart_shutdown sets TTY_IO_ERROR as well, we'd need to check > if we're already in an error condition during startup/shutdown, and > abort initialization/shutdown. Considering all of that, I'm starting to > second guess the idea of using TTY_IO_ERROR for this purpose. > > I'm considering another fix that calls pciserial_remove_ports when > detecting a PCI error, which would eventually call uart_remove_one_port, > and actually detach the port from the serial layer. Then, I can > register the port again after recovery, similarly to what is done in > pciserial_init_ports during probe. This is how other serial driver > (jsm) handles PCI errors. That sounds worth exploring. You should be aware that uart_remove_one_port() is not completely bulletproof; for example, if the device is /dev/console with open file descriptors, vhangup() will return with still open, valid file descriptors and the port will not have been shutdown, but the hard detach proceeds anyway. The most serious issue stems from how a port is unlinked from the serial core; the ptr is simply reset, despite it may be in-use at the time. No ref counting or locking is performed (the lock is embedded in the port structure). It's on my todo list to correct these problems, but I haven't yet. A recent patch series of mine, "Fix driver crashes on hangup", makes concurrent access during unlink much more unlikely, but does not eliminate the problem. Just thought you'd want to know. Regards, Peter Hurley -- To unsubscribe from this list: send the line "unsubscribe linux-serial" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html