On Tue, Jul 14, 2020 at 4:54 PM Bjorn Andersson <bjorn.andersson@xxxxxxxxxx> wrote: > > On Tue 14 Jul 15:13 PDT 2020, Rob Herring wrote: > > > On Tue, Jul 14, 2020 at 10:33 AM Jeffrey Hugo <jeffrey.l.hugo@xxxxxxxxx> wrote: > > > > > > On Mon, Jul 13, 2020 at 5:50 PM Doug Anderson <dianders@xxxxxxxxxxxx> wrote: > > > > > > > > Hi, > > > > > > > > On Mon, Jul 13, 2020 at 1:25 PM Rob Herring <robh+dt@xxxxxxxxxx> wrote: > > > > > > > > > > On Mon, Jul 13, 2020 at 9:08 AM Doug Anderson <dianders@xxxxxxxxxxxx> wrote: > > > > > > > > > > > > Hi, > > > > > > > > > > > > On Mon, Jul 13, 2020 at 7:11 AM Rob Herring <robh+dt@xxxxxxxxxx> wrote: > > > > > > > > > > > > > > On Fri, Jul 10, 2020 at 5:02 PM Douglas Anderson <dianders@xxxxxxxxxxxx> wrote: > > > > > > > > > > > > > > > > I found that if I ever had a little mistake in my kernel config, > > > > > > > > or device tree, or graphics driver that my system would sit in a loop > > > > > > > > at bootup trying again and again and again. An example log was: > > > > > > > > > > > > > > Why do we care about optimizing the error case? > > > > > > > > > > > > It actually results in a _fully_ infinite loop. That is: if anything > > > > > > small causes a component of DRM to fail to probe then the whole system > > > > > > doesn't boot because it just loops trying to probe over and over > > > > > > again. The messages I put in the commit message are printed over and > > > > > > over and over again. > > > > > > > > > > Sounds like a bug as that's not what should happen. > > > > > > > > > > If you defer during boot (initcalls), then you'll be on the deferred > > > > > list until late_initcall and everything is retried. After > > > > > late_initcall, only devices getting added should trigger probing. But > > > > > maybe the adding and then removing a device is causing a re-trigger. > > > > > > > > Right, I'm nearly certain that the adding and then removing is causing > > > > a re-trigger. I believe the loop would happen for any case where we > > > > have a probe function that: > > > > > > > > 1. Adds devices. > > > > 2. After adding devices it decides that it needs to defer. > > > > 3. Removes the devices it added. > > > > 4. Return -EPROBE_DEFER from its probe function. > > > > > > > > Specifically from what I know about how -EPROBE_DEFER works I'm not > > > > sure how it wouldn't cause an infinite loop in that case. > > > > > > > > Perhaps the missing part of my explanation, though, is why it never > > > > gets out of this infinite loop. In my case I purposely made the > > > > bridge chip "ti-sn65dsi86.c" return an error (-EINVAL) in its probe > > > > every time. Obviously I wasn't going to get a display up like this, > > > > but I just wanted to not loop forever at bootup. I tracked down > > > > exactly why we get an - EPROBE_DEFER over and over in this case. > > > > > > > > You can see it in msm_dsi_host_register(). If some components haven't > > > > shown up when that function runs it will _always_ return > > > > -EPROBE_DEFER. > > > > > > > > In my case, since I caused the bridge to fail to probe, those > > > > components will _never_ show up. That means that > > > > msm_dsi_host_register() will _always_ return -EPROBE_DEFER. > > > > > > > > I haven't dug through all the DRM code enough, but it doesn't > > > > necessarily seem like the wrong behavior. If the bridge driver or a > > > > panel was a module then (presumably) they could show up later and so > > > > it should be OK for it to defer, right? > > > > > > > > So with all that, it doesn't really feel like this is a bug so much as > > > > it's an unsupported use case. The current deferral logic simply can't > > > > handle the case we're throwing at it. You cannot return -EPROBE_DEFER > > > > if your probe function adds devices each time through the probe > > > > function. > > > > > > > > Assuming all the above makes sense, that means we're stuck with: > > > > > > > > a) This patch series, which makes us not add devices. > > > > > > > > b) Some other patch series which rearchitects the MSM graphics stack > > > > to not return -EPROBE_DEFER in this case. > > > > > > This isn't a MSM specific issue. This is an issue with how the DSI > > > interface works, and how software is structured in Linux. I would > > > expect that pretty much any DSI host in the kernel would have some > > > version of this issue. > > > > > > The problem is that DSI is not "hot pluggable", so to give the DRM > > > stack the info it needs, we need both the DSI controller (aka the MSM > > > graphics stack in your case), and the thing it connects to (in your > > > case, the TI bridge, normally the actual panel) because the DRM stack > > > expects that if init completes, it has certain information > > > (resolution, etc), and some of that information is in the DSI > > > controller, and some of it is on the DSI device. > > > > Ah yes, DRM's lack of hot-plug and discrete component support... Is > > that not improved with some of the bridge rework? > > > > Anyways, given there is a child dependency on the parent, I don't > > think we should work-around DRM deficiencies in DT. > > > > BTW, There's also a deferred probe timeout you can use which stops > > deferring probe some number of seconds after late_initcall. > > > > I don't think we can rely on the deferred probe timeout, given that it > was reverted back to 0 seconds past late_initcall - which given that > most of the involved components are modules, means that without the > opt-in command line option we would likely fail to bring up the display. I meant just as a way to make progress booting in this case where the display is never coming up. We're talking only about a better experience for an error case. Maybe a simple solution is just having some delay inserted between delayed probe triggers so progress is made. Rob