On 4/25/18 12:02 PM, Paolo Valente wrote:
>
>
>> On 25 Apr 2018, at 19:34, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>
>> On 4/25/18 11:25 AM, Paolo Valente wrote:
>>>
>>>
>>>> On 25 Apr 2018, at 19:06, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>
>>>> On 4/25/18 11:03 AM, Paolo Valente wrote:
>>>>>
>>>>>
>>>>>> On 25 Apr 2018, at 18:50, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>>>
>>>>>> Hi Linus,
>>>>>>
>>>>>> I ended up sitting on this about a week longer than I wanted to,
>>>>>> since we were hashing out details with a timeout change. I've now
>>>>>> killed that patch, so we can flush the existing queue in due time.
>>>>>>
>>>>>> This pull request contains:
>>>>>>
>>>>>> - Fix for an old regression, where entering the queue can be disturbed
>>>>>>   by a signal to the process. This can cause spurious EIO. Fix from Alan
>>>>>>   Jenkins.
>>>>>>
>>>>>> - cdrom information leak fix from Dan.
>>>>>>
>>>>>> - Trivial helper for testing queue FUA from Dave Chinner, part of his
>>>>>>   O_DIRECT FUA series.
>>>>>>
>>>>>> - Series of swim fixes from Finn that actually makes it work again.
>>>>>>
>>>>>> - Loop O_DIRECT corruption fix, which caused data corruption in
>>>>>>   production for us. From me.
>>>>>>
>>>>>> - BFQ crash fix from me.
>>>>>>
>>>>>
>>>>> For what it's worth, I disagree with this patch. This change is based
>>>>> on an apparently buggy motivation, as I wrote in my reply in the
>>>>> thread where Jens proposed it. As such, I think it might even bury
>>>>> the actual bug that causes the crash more deeply (although of course
>>>>> this patch does eliminate the crash for the use case reported in that
>>>>> thread).
>>>>
>>>> The patch fixes the issue and I've explained why.
>>>
>>> Definitely, but I wrote you that your explanation seems wrong.
>>>
>>>> If you have a
>>>> motivation to fix it differently, for whatever reason, then by all
>>>> means submit a patch.
>>>> So far I haven't seen it, and we still have
>>>> the known crash that people are actually hitting.
>>>>
>>>
>>> Unfortunately I don't have a solution, because I don't know what the
>>> bug is. I only know that there's a bug in your explanation for the
>>> bug you want to fix.
>>>
>>>> I'll be happy to work with you on that later in the week, when
>>>> the LSFMM conference has wound down.
>>>>
>>>
>>> I do thank you for that. If you can, please start by answering my
>>> concerns about your explanation. Maybe your explanation can be fixed
>>> (or I'm simply wrong), and all will be fine. Or, if your explanation
>>> is really buggy and we don't find the actual bug in the code, we can
>>> just enrich your proposed change with a comment, and state that this
>>> is a crutch to walk on while waiting for the actual bug to show up.
>>
>> The reproduction case was there, so it should not be hard to run with
>> that and add some instrumentation to verify or disprove theories.
>
> A simple code check seems enough for that, see below.
>
>> For the flush case, you start out with a regular request, and assign
>> the pointers. Then when we transform that into a flush, it's
>> bypass inserted, and later retrieved in __bfq_dispatch_request().
>
> Before the transformation, the request undergoes a finish or a
> requeue, right? If so, bfq clears the pointers there.

If the request has an end_io hook, we go through there and don't
finish the request. The request is reused.

>> As I said earlier, you could either check that in there (add a
>> ->elv.icq non-NULL check before doing the bfqq check),
>
> They must already be cleared, because of the above.

Wrong.

>> or you can have it cleaner and just always clear the pointers if
>> ->elv.icq isn't assigned.
>>
>
> That is the cleanest solution, if it is OK that those pointers are
> dirtied (for reasons I don't know) before or during the
> transformation.

Of course it is; they are not valid if ->elv.icq isn't valid.
You must clear them, or they can be stale.

-- 
Jens Axboe