On Mon, Sep 16, 2019 at 6:32 PM Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote: > > On Mon, 16 Sep 2019, Andrey Konovalov wrote: > > > On Fri, Sep 13, 2019 at 10:35 PM Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote: > > > > > > On Fri, 13 Sep 2019, syzbot wrote: > > > > > > > syzbot has found a reproducer for the following crash on: > > > > > > > > HEAD commit: f0df5c1b usb-fuzzer: main usb gadget fuzzer driver > > > > git tree: https://github.com/google/kasan.git usb-fuzzer > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=1146550d600000 > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=5c6633fa4ed00be5 > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=b24d736f18a1541ad550 > > > > compiler: gcc (GCC) 9.0.0 20181231 (experimental) > > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=11203fa5600000 > > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=162cd335600000 > > > > > > > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > > > > Reported-by: syzbot+b24d736f18a1541ad550@xxxxxxxxxxxxxxxxxxxxxxxxx > > > > > > > > yurex 3-1:0.101: yurex_interrupt - unknown status received: -71 > > > > yurex 5-1:0.101: yurex_interrupt - unknown status received: -71 > > > > yurex 6-1:0.101: yurex_interrupt - unknown status received: -71 > > > > rcu: INFO: rcu_sched self-detected stall on CPU > > > > > > Andrey: > > > > > > This problem may be a result of overloading dummy_timer. The kernel > > > config you are using has CONFIG_HZ=100, but dummy-hcd needs > > > CONFIG_HZ=1000 (see the comment on line 1789). That is, lower values > > > of HZ will occasionally lead to trouble, and this may be an example. > > > > > > Can you change the config value for HZ and see if the bug still > > > reproduces? > > > > Hi Alan, > > > > I've tried running the reproducer with CONFIG_HZ=1000 and still got > > the same stall message. It's accompanied by countless "yurex > > 6-1:0.101: yurex_interrupt - unknown status received: -71" messages, > > so I believe this is an issue in the yurex driver. > > Maybe. Depends on exactly what the reproducer is doing, something > which is not at all easy to figure out from the scripts or programs. > > I got the impression that the reproducer connects an emulated yurex > device and then disconnects it -- but maybe that's not right at all. > Maybe the key point is that the reproducer sends a descriptor listing > an endpoint address that doesn't actually exist; that would have a > similar effect. Can you tell? (Trying to understand exactly what a > syzkaller test program does is not for the faint of heart.) > > As far as I can remember, the USB spec doesn't say what a device should > do when the host sends a packet to a non-existent endpoint. Which > means that some devices will do nothing at all, leading to the -71 > (-EPROTO) errors you see in the log. Indeed, there's only one place in > dummy_hcd.c where -EPROTO occurs -- for the case where an URB is sent > to an endpoint not supported by the gadget. > > This leads to the question: How should the yurex driver (or any USB > class driver, in fact) respond to a -EPROTO or similar error? The > thing is, this sort of error typically arises in two circumstances: > > The device was just unplugged, so of course it can't send > any packets back to the host; > > Noise on the bus caused a packet to be lost or corrupted. > > In the first case, it doesn't much matter what the driver does because > the disconnection will be noticed and acted on within a few hundred > milliseconds (although I suppose a driver could generate a lot of > kernel-log spam during that time). > > In the second case, retrying the lost/corrupted packet is the right > response. > > But retrying is _not_ the right response in cases where the device is > never going to respond because the endpoint address is invalid. This > can happen only in situations where the device provides incorrect > information (bad descriptors or something of that sort). The only > suitable approach I can think of is to limit the number of retries. > > Retry-limiting is not the sort of thing we want to add to each > individual USB class driver. Maybe it can be handled in the USB core; > I'll try to write some code for it. Under normal circumstances the > issue just doesn't arise, because normal devices aren't malicious. > > > Why does dumy_hcd require CONFIG_HZ=1000? The comment doesn't really > > explain the reason. > > Oh, that's simple enough. USB events tend to happen at millisecond > intervals. The data on the USB bus is organized into frames (and > microframes for high speed and SuperSpeed); a frame lasts one > millisecond (and a microframe lasts 1/8 ms). Many host controllers > report important events when a frame boundary occurs (that's how > dummy-hcd works). > > So for proper timing of the emulation, dummy-hcd requires timer > interrupts with millisecond resolution. I suppose the driver could be > changed to use a high-res timer instead of a normal kernel timer, but > for now that doesn't seem particularly important. So what are the practical differences between using CONFIG_HZ=100 and 1000 for dummy-hcd? Is is going to be slower or faster? Or can it get overloaded with data and cause stalls? Or something else? We're somewhat hesitant to change CONFIG_HZ as we don't know how it will affect other parts of the kernel (at some point the USB fuzzer will become a part of the main syzbot instance that doesn't only fuzz USB).