Hi, On Mon, 2024-05-06 at 16:41 +0200, Hans de Goede wrote: > Hi, > > On 5/5/24 3:07 PM, Weifeng Liu wrote: > > v3 changes: > > * better formatting, nothing special > > > > v2 changes: > > * resolves Andy's comments > > > > Original letter: > > > > Greetings, > > > > This series is intended to remedy a race condition where surface > > aggregator module (SAM) which is a serdev driver could fail to probe if > > the underlying UART port is not ready to open. In such circumstance, > > invoking serdev_device_open() gets errno -ENXIO, leading to failure in > > probing of SAM. However, if the probe is retried in a short delay, > > serdev_device_open() would work as expected and everything just goes > > fine. As a workaround, adding the serial driver (8250_dw) into > > initramfs or building it into the kernel image significantly mitigates > > the likelihood of encountering this race condition, as in this way the > > serial device would be initialized much earlier than probing of SAM. > > > > However, IMO we should reliably avoid this sort of race condition. A > > good way is returning -EPROBE_DEFER when serdev_device_open returns > > -ENXIO so that the kernel will be able to retry the probing later. This > > is what the first patch tries to do. > > > > Though this solution might be a good enough solution for this specific > > issue, I am wondering why this kind of race condition could ever happen, > > i.e., why a serdes device could be not ready yet at the moment the > > serdev driver gets called and tries to bind it. And even if this is an > > expected behavior how serdev driver works, I do feel it a little bit > > weird that we need to identify serdev_device_open() returning -ENXIO as > > non-fatal error and thus return -EPROBE_DEFER manually in such case, as > > I don't see this sort of behavior in other serdev driver. Maximilian > > and Hans suggested discussing the root cause of the race condition here. > > I will be grateful if you could provide some reasoning and insights on > > this. > > > > Following is the code path when the issue occurs: > > > > ssam_serial_hub_probe() > > serdev_device_open() > > ctrl->ops->open() /* this callback being ttyport_open() */ > > tty->ops->open() /* this callback being uart_open() */ > > tty_port_open() > > port->ops->activate() /* this callback being uart_port_activate() */ > > Find bit UPF_DEAD is set in uport->flags and fail with errno -ENXIO. > > > > I notice that flag UPF_DEAD would be set in serial_core_register_port() > > during calling serial_core_add_one_port() but don't have much idea > > what's going on under the hood. > > Thank you this explanation + Andy's questions about this were quite > useful. I think I have found the root cause of this problem and I have > attached a patch which should fix this. > > After dropping your own fix from your local kernel you should be able > to reproduce this 100% of the time by making the surface_aggregator > module builtin (CONFIG_SURFACE_AGGREGATOR=y) while making 8250_dw > a module (CONFIG_SERIAL_8250_DW=m). Although I'm not sure if the uart > will then not simply be initialzed as something generic. You could also > try building both into the kernel and see if the issue reproduces then. > As per your instructions, I * removed my fixes, * built surface_aggregator into the kernal, * built 8250_dw as a module and * removed 8250_dw from initramfs as well, and found the occurrence rate of the issue was around 30%. With your patch applied, the issue disappeared completely. Tested-by: Weifeng Liu <weifeng.liu.z@xxxxxxxxx> Really glad to see the final solution in serial core for this issue. This might also help other serdev drivers. Thank you all. Best regards, Weifeng > Once you can reproduce this reliably, give the attached patch a try > to fix this please. > > Regards, > > Hans > > > > > > > Additionally, add logs to the probe procedure of SAM in the second > > patch, hoping it could help provide context to user when something > > miserable happens. > > > > Context of this series is available in [1]. > > > > Best regards, > > Weifeng > > > > Weifeng Liu (2): > > platform/surface: aggregator: Defer probing when serdev is not ready > > platform/surface: aggregator: Log critical errors during SAM probing > > > > drivers/platform/surface/aggregator/core.c | 53 ++++++++++++++++------ > > 1 file changed, 39 insertions(+), 14 deletions(-)