Hello,

Last week at the Embedded Linux Conference in Seattle we had an "unconference session", which is a free discussion about a topic. The topic I had proposed was "Hot-Pluggable Hardware with Device Tree Overlays Runtime Loading and Unloading (yes, at runtime)".

As suggested by Saravana, here is a brief summary of the discussion.

15 people were present:

  Luca Ceresoli (Bootlin)
  Thomas Petazzoni (Bootlin)
  Alexandre Belloni (Bootlin)
  Maxime Chevallier (Bootlin)
  Krzysztof Kozlowski (Linaro)
  Bartosz Golaszewski (Linaro)
  Doug Anderson (Google)
  Chen-Yu Tsai (Google)
  Matt Coster (Imagination Technologies)
  Martino Facchin (Arduino)
  (5 more, I don't know the names)

The topic is how to support in Linux, by loading and unloading device tree overlays at runtime, any hardware add-on that:

 - can be plugged into and unplugged from a base board at runtime, without notice
 - adds hardware on non-discoverable busses
 - provides a way to detect the add-on model that gets attached.

Cold-plug and discoverable busses (e.g. USB) are out of scope.

We described 2 use cases we are working on at Bootlin.

One use case is for the LAN966x, a classic SoC that can however be started in "endpoint mode", i.e. with the CPU cores deactivated and a PCI endpoint that allows an external CPU to access all the peripherals over PCIe. In practice the whole SoC would be used as a peripheral chip providing lots of devices for another SoC where the OS runs. This use case has been described by Rob Herring and Lizhi Hou at LPC 2023 [4][5].

The other use case, which was discussed in more detail, is for an industrial product under development by a Bootlin customer. It is a regular, standalone embedded Linux system with a connector for attaching an add-on that provides additional peripherals. The add-on peripherals are on I2C, MIPI DSI and potentially other non-discoverable busses (there are also peripherals on natively hot-pluggable busses such as USB and Ethernet, but by their nature they don't need special work).

For both use cases (and perhaps others we are unaware of) runtime loading/unloading of DT overlays appears to be the most fitting technique, except that it is not yet ready for real usage. For it to work, we highlighted 3 main areas in need of work in the Linux kernel:

 1. how to describe the connector and the add-ons in device tree (bindings etc) -- only relevant for the 2nd use case
 2. implementation of DT overlays for adding/removing the add-on peripherals
 3. fixing issues with various subsystems and drivers that don't react well to device removal

* Topic 1: DT description *

I mentioned the DT structure I proposed in [0], which allows decoupling the bus segments, thus supporting both different add-on models and different base boards with different SoCs around the same connector definition (think of the BeagleBone family). No objection was raised about this approach.

Some mentioned the recently posted patches for mikroBUS support on the BeaglePlay [1], which I was unaware of. The proposed connector description appears similar to our proposal. However, I later checked the e-mail thread and, although the connector description is indeed similar, there is a big difference: in the BeaglePlay proposal the add-on is not described via DT but rather via a Greybus manifest, and the connector driver has code to parse it and populate the various devices mentioned in the manifest.

* Topic 2: Implementation of the connector and overlay (un)loading *

The proposed idea is to have a connector driver that reacts to plug events in two stages.

 - Stage 1: load a "small" overlay common to all add-on models which describes enough to get the add-on model ID, e.g. from an EEPROM on the add-on itself.
 - Stage 2: after getting the model ID, load the model-specific overlay that describes everything else.

Stage 1 could be unnecessary if the model can be detected without loading any add-on device drivers, e.g. if it is defined by pulling some GPIOs on the connector.

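As an illustration only, the two-stage flow described above can be sketched as a small userspace C model. This is not kernel code and not the driver discussed in the session: the overlay file names, the `struct connector` layout, the `eeprom_id` field standing in for the EEPROM content, and all function names are hypothetical, and "applying" an overlay here just records its name.

```c
#include <assert.h>
#include <stddef.h>

/* All names are hypothetical; this is a userspace model, not kernel code. */

enum conn_state { CONN_EMPTY, CONN_ACTIVE };

struct connector {
    enum conn_state state;
    unsigned int eeprom_id;   /* simulated model ID stored on the add-on */
    const char *applied[2];   /* overlays currently applied, oldest first */
    int n_applied;
};

struct addon_model {
    unsigned int id;
    const char *overlay;
};

/* Hypothetical table mapping a model ID to its model-specific overlay. */
static const struct addon_model models[] = {
    { 0x01, "addon-model-a.dtbo" },
    { 0x02, "addon-model-b.dtbo" },
};

/* Stand-in for applying an overlay: only records its name. */
static void overlay_apply(struct connector *c, const char *name)
{
    assert(c->n_applied < 2);
    c->applied[c->n_applied++] = name;
}

/* Stand-in for removing overlays, in reverse order of application. */
static void overlay_remove_all(struct connector *c)
{
    while (c->n_applied > 0)
        c->applied[--c->n_applied] = NULL;
}

static const char *model_overlay(unsigned int id)
{
    for (size_t i = 0; i < sizeof(models) / sizeof(models[0]); i++)
        if (models[i].id == id)
            return models[i].overlay;
    return NULL;
}

/* Plug event: returns 0 on success, -1 if busy or the model is unknown. */
int connector_plug(struct connector *c)
{
    const char *specific;

    if (c->state != CONN_EMPTY)
        return -1;

    /* Stage 1: common overlay, just enough to reach the ID storage. */
    overlay_apply(c, "addon-common.dtbo");

    /* Only now can the model ID be read (simulated by a struct field). */
    specific = model_overlay(c->eeprom_id);
    if (specific == NULL) {
        overlay_remove_all(c);   /* unknown model: undo stage 1 */
        return -1;
    }

    /* Stage 2: model-specific overlay describing everything else. */
    overlay_apply(c, specific);
    c->state = CONN_ACTIVE;
    return 0;
}

/* Unplug event: drop both overlays, most recent first. */
void connector_unplug(struct connector *c)
{
    overlay_remove_all(c);
    c->state = CONN_EMPTY;
}
```

With a connector initialised as empty and `eeprom_id` set to 0x02, `connector_plug()` would leave the common overlay and the model B overlay applied, in that order, and `connector_unplug()` would remove both. If the model can be detected without any add-on driver (e.g. from GPIOs), stage 1 would simply be skipped in such a model.
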
Overlay (un)loading is well known for triggering several issues. The largest one (in terms of lines of code involved) is the memory leak or use-after-free [6] of nodes, and especially of properties, that happens when an overlay is removed.

* Topic 3: fixing drivers/subsystems not handling removal correctly *

Bartosz raised the concern that many subsystems crash or hang or are otherwise buggy when a device is removed (I think the quote was "are you guys going to fix them all?") -- a sound concern indeed.

We plan to address issues as they appear on the busses we use, which is already a significant amount of work and is already in progress here. The others (e.g. SPI) can be addressed by whoever needs to hotplug them at any time in the future.

It's worth mentioning that Bartosz gave a BoF [2] and a talk [3] on the following day, both with useful information for those needing to make a subsystem safe against removals.

* Status *

In the end there are 3 main areas in need of work: DT description, DT overlay implementation, and fixing drivers and subsystems that don't handle removal correctly. Bootlin is actively working on all of these topics and has already sent a few patches to fix some issues that were found [7][8][9]. More is in progress and will be sent when ready.

That's all. For those present, please feel free to add any relevant details I have missed.

[0] https://lore.kernel.org/all/20240403213327.36d731ec@booty/
[1] https://lore.kernel.org/all/20240317193714.403132-2-ayushdevel1325@xxxxxxxxx/
[2] https://sched.co/1aBGK
[3] https://sched.co/1aBGf
[4] https://www.youtube.com/watch?v=MVGElnZW7BQ
[5] https://lpc.events/event/17/contributions/1421/
[6] https://elinux.org/Frank%27s_Evolving_Overlay_Thoughts#issues_and_what_needs_to_be_completed_--_Not_an_exhaustive_list
[7] https://lore.kernel.org/all/20240325152140.198219-1-herve.codina@xxxxxxxxxxx
[8] https://lore.kernel.org/all/20240227113426.253232-1-herve.codina@xxxxxxxxxxx
[9] https://lore.kernel.org/all/20240220133950.138452-1-herve.codina@xxxxxxxxxxx

Best regards,
Luca

--
Luca Ceresoli, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com