On Sun, Sep 28, 2014 at 07:07:24PM +0200, Tom Gundersen wrote: > On Fri, Sep 26, 2014 at 11:57 PM, Luis R. Rodriguez > <mcgrof@xxxxxxxxxxxxxxxx> wrote: > > From: "Luis R. Rodriguez" <mcgrof@xxxxxxxx> > > Systemd has a general timeout for all workers currently set to 180 > > seconds after which it will send a sigkill signal. Systemd now has a > > warning which is issued once it reaches 1/3 of the timeout. The original > > motivation for the systemd timeout was to help track device drivers > > which do not use asynch firmware loading on init() and the timeout was > > originally set to 30 seconds. > > Please note that the motivation for the timeout in systemd had nothing > to do with async firmware loading (that was just the case where > problems cropped up). *Part *of the original kill logic, according to the commit log, was actually due to the assumption that the issues observed *were* synchronous firmware loading on module init(): commit e64fae5573e566ce4fd9b23c68ac8f3096603314 Author: Kay Sievers <kay.sievers@xxxxxxxx> Date: Wed Jan 18 05:06:18 2012 +0100 udevd: kill hanging event processes after 30 seconds Some broken kernel drivers load firmware synchronously in the module init path and block modprobe until the firmware request is fulfilled. <...> My point here is not to point fingers but to explain why we went on with this and how we failed to realize only until later that the driver core ran probe together with init. When a few folks pointed out the issues with the kill the issue was punted back to kernel developers and the assumption even among some kernel maintainers was that it was init paths with sync behaviour that was causing some delays and they were broken drivers. It is important to highlight these assumptions ended up setting us off on the wrong path for a while in a hunt to try to fix this issue either in driver or elsewhere. > The motivation was to not allow udev-workers to > stay around indefinitely, and hence put an upper-bound on > their duration (initially 180 s). At some point the bound was reduced > to 30 seconds to make sure module-loading would bail out before the > kernel's firmware loading timeout would bail out (60s I believe). Sure, part of it was that, but folks beat on driver developer about the kill insisting it was drivers that were broken. It was only until Chelsie folks called bloody murder becuase their delays were on probe that we realized there was a bit more to this than what was being pushed back on to driver developers. > That > is no longer relevant, which is why it was safe to reset the timeout > to 180 s. Indeed :D > > Since systemd + kernel are heavily tied in for the purposes of this > > patch it is assumed you have merged on systemd the following > > commits: > > > > 671174136525ddf208cdbe75d6d6bd159afa961f udev: timeout - warn after a third of the timeout before killing > > b5338a19864ac3f5632aee48069a669479621dca udev: timeout - increase timeout > > 2e92633dbae52f5ac9b7b2e068935990d475d2cd udev: bump event timeout to 60 seconds > > be2ea723b1d023b3d385d3b791ee4607cbfb20ca udev: remove userspace firmware loading support > > 9f20a8a376f924c8eb5423cfc1f98644fc1e2d1a udev: fixup commit > > dd5eddd28a74a49607a8fffcaf960040dba98479 udev: unify event timeout handling > > 9719859c07aa13539ed2cd4b31972cd30f678543 udevd: add --event-timeout commandline option > > > > Since we bundle together serially driver init() and probe() > > on module initialiation systemd's imposed timeout put a limit on the > > amount of time a driver init() and probe routines can take. There's a > > few overlooked issues with this and the timeout in general: > > > > 0) Not all drivers are killed, the signal is just sent and > > the kill will only be acted upoon if the driver you loaded > > happens to have some code path that either uses kthreads (which > > as of 786235ee are now killable), or uses some code which checks for > > fatal_signal_pending() on the kernel somewhere -- i.e: pci_read_vpd(). > > Shouldn't this be seen as something to be fixed in the kernel? That's a great question. In practice now after CVE-2012-4398 and its series of patches added which enabled OOM to kill things followed by 786235ee to also handle OOM on kthreads it seems imperative we strive towards this, in practive however if you're getting OOMs on boot you have far more serious issue to be concerned over than handling CVE-2012-4398. Another issue is that even if we wanted to address this a critical right now on module loading driver error paths tend to be pretty buggy and we'd probably end up causing more issues than fixing anything if the sigkill that triggered this was an arbitrary timeout, specially if the timeout is not properly justified. Addressing sigkill due to OOM is important, but as noted if you're running out of memory at load time you have a bit other problems to be concerned over. So extending the kill onto more drivers *because* of the timeout is probably not a good reason as it would probably create more issue than fix anything right now. > I mean, > do we not want userspace to have the possibility to kill udev/modprobe > even disregarding the worker timeouts (say at shutdown, or before > switching from the initrd)? That's a good point and I think the merit to handle a kill due to the other reasons (shutdown, switching from the initrd) should be addressed separately. I mean that validating addressing the kill for the other reasons does not validate the existing kill on timeout for synchronous probing. If its important to handle the kill on shutdown / switching initrd that should be dealt with orthogonally. > > 1) Since systemd is the only one logging the sigkill debugging that > > drivers are not loaded or in the worst case *failed to boot* because > > of a sigkill has proven hard to debug. > > Care to clarify this a bit? Are the udev logs somehow unclear? Sure, so the problem is that folks debugging were not aware of what systemd was doing. Let me be clear that the original 30 second sigkill timeout thing was passed down onto driver maintainers as a non-documented new kernel policy slap-in-the-face-you-must-obviously-be-doing-something-wrong (TM) approach. This was a policy decision passed down as a *reactive* measure, not many folks were aware of it and of what systemd was doing. What made the situation even worse was that as noted on 1) even though the sigkill was being sent since commit e64fae55 (January 2012) on systemd the sigkill was not being picked up on many drivers. To be clear the sigkill was being picked up if you had a driver that by chance had some code on init / probe that by chance checked for fatal_signal_pending(), and even when that triggered folks debugging were in no way shape or form expecting a sigkill from userspace on modprobe as it was not well known that this was part of the policy they should follow. Shit started to hit the fan a bit more widely when kernel commit 786235ee (Nov 2013) was merged upstream which allowed kthreads to be killed, and more drivers started failing. An example of an ancient bug that no one handled until recently: https://bugzilla.kernel.org/show_bug.cgi?id=59581 There is a proper fix to this now but the kill was what was causing this in the first place. The kill was justified as theese drivers *should* be using async probe but by no means does that mean the kill was justified for all subsystems / drivers. The bug also really also sent people on the wrong track and it was only until Alexander poked me about the issue we were seeing on cxbg4 likely being related that we started to really zeroe in on the real issue. The first driver reported / studied due to the kill from system was mptsas: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705 A full bisect was done to even try to understand what the issue was.. Then there was the tug of war between either reverting the patch that allowed the kthread to be killed or if this was systemd issue which required increasing the timeout. This was still a storage driver, and increasing the timeout arbitrarily really would not have helped address the root cause of the issue. The next non-storage driver bug that was reported and heavily debugged was cxgb4 and it wasn't easy to debug: https://bugzilla.suse.com/show_bug.cgi?id=877622 Conclusion then is that folks were simply not aware of this new de-facto policy, it was obviously incorrect but well intentioned, and no one really was paying attention to systemd-udevd logs. If we want chatty behaviour that people will pick up we probably instead want a WARN() on the kernel specially before we kill a driver and even then I'm sure this can irritate some folks. > If you think we can improve the logging from udev, please ping me about that > and I'll sort it out. I think the logging done on systemd is fine, there are a few issues with the way things trickled down and what we now need to do. First and foremost there was general communication issue about this new timing policy and obviously it would have helped if this also had more design / review from others. Its no one's fault, but we should learn from it. Design policies on systemd that can affect the kernel / drivers could likely use some bit more review from a wider audience and probably include folks who are probably going to be more critical than those who likely would typically be favorable. Without wider review we could fail to end up with something like a filter bubble [0] but applied to engineering, a design filter bubble, if you will. So apart from addressing logging its important to reflect on this issue and try to aim for having something like a Red Team [1] on design involving systemd and kernel. This is specially true if we are to really marry these two together more and more. The more critical people can be the better, but of course those need to provide constructive criticism, not just rants. In terms of logging: Do we know if distributions / users are reviewing systemd-udevd logs for certain types of issues with as much dilligence as they put to kernel logs when systemd makes decision affecting the kernel? If not we should consider a way so that that happens. In this case the fact that drivers were being killed while being loaded was missed since it was unexpected that would happen so folks didn't know to look for that option, but apart from that the *reason* for the kill probably could have helped too. To help both of these we have to consider if we are going to keep the sigkill on systemd on module loading due to a timeout. As you clarified the goal of the timeout is to avoid having udev workers stay around indefinitely, but I think we need to give kmod workers a bit more consideration. The point of this patch set was partly to give systemd what it assumed was there, but clearly we can't assume all drivers can be loaded asynchronously without issues right now. That means that even with this functionality merged systemd will have to cope with the fact that some drivers will be loaded with synchronous probe. A general timeout and specially with a sigkill is probably not a good idea then, unless of course: 0) those device drivers / subsystem maintainer want a timeout 1) the above decision can distinguish between sync probe / async probe being done To address 0) perhaps one solution is that if subsystem maintainers feel this is needed they can express this on data structure somewhere, perhaps on the bus and/or have a driver value override, for example. For 1) we could expose what we end up doing through sysfs. Of course userspace could also simply want to put in place some requirements but in terms of a timeout / kill it would have to also accept that it cannot get what it might want. For instance we now know it may be that an async probe is not possible on some drivers. Perhaps its best to think about this differently and address now a way to do that efficiently instead of reactively. Apart form having the ability to let systemd ask for async probe, what else do we want to accomplish? [0] http://en.wikipedia.org/wiki/Filter_bubble [1] http://en.wikipedia.org/wiki/Red_team > > 2) When and if the signal is received by the driver somehow > > the driver may fail at different points in its initialization > > and unless all error paths on the driver are implemented > > perfectly this could mean leaving a device in a half > > initialized state. > > > > 3) The timeout is penalizing device drivers that take long on > > probe(), this wasn't the original motivation. Systemd seems > > to have been under assumption that probe was asynchronous, > > this perhaps is true as an *objective* and goal for *some > > subsystems* but by no means is it true that we've been on a wide > > crusade to ensure this for all device drivers. It may be a good > > idea for *many* device drivers but penalizing them with a kill > > for taking long on probe is simply unacceptable specially > > when the timeout is completely arbitrary. > > The point is really not to "penalize" anything, we just need to make > sure we put some sort of restrictions on our workers so they don't > hang around forever. Thanks for clarifying this, can you explain what issues could arise from making an exception to allowing kmod workers to hang around completing init + probe over a certain defined amount of time without being killed? Luis -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html