> Il giorno 25 apr 2018, alle ore 20:18, Jens Axboe <axboe@xxxxxxxxx> ha scritto: > > On 4/25/18 12:02 PM, Paolo Valente wrote: >> >> >>> Il giorno 25 apr 2018, alle ore 19:34, Jens Axboe <axboe@xxxxxxxxx> ha scritto: >>> >>> On 4/25/18 11:25 AM, Paolo Valente wrote: >>>> >>>> >>>>> Il giorno 25 apr 2018, alle ore 19:06, Jens Axboe <axboe@xxxxxxxxx> ha scritto: >>>>> >>>>> On 4/25/18 11:03 AM, Paolo Valente wrote: >>>>>> >>>>>> >>>>>>> Il giorno 25 apr 2018, alle ore 18:50, Jens Axboe <axboe@xxxxxxxxx> ha scritto: >>>>>>> >>>>>>> Hi Linus, >>>>>>> >>>>>>> I ended up sitting on this about a week longer than I wanted to, >>>>>>> since we were hashing out details with a timeout change. I've now >>>>>>> killed that patch, so we can flush the existing queue in due time. >>>>>>> >>>>>>> This pull request contains: >>>>>>> >>>>>>> - Fix for an old regression, where entering the queue can be disturbed >>>>>>> by a signal to the process. This can cause spurious EIO. Fix from Alan >>>>>>> Jenkins. >>>>>>> >>>>>>> - cdrom information leak fix from Dan. >>>>>>> >>>>>>> - Trivial helper for testing queue FUA from Dave Chinner, part of his >>>>>>> O_DIRECT FUA series. >>>>>>> >>>>>>> - Series of swim fixes from Finn that actually makes it work again. >>>>>>> >>>>>>> - Loop O_DIRECT corruption fix, which caused data corruption in >>>>>>> production for us. From me. >>>>>>> >>>>>>> - BFQ crash fix from me. >>>>>>> >>>>>> >>>>>> For what it's worth, I disagree with this patch. This change is based >>>>>> on an apparently buggy motivation, as I wrote in my reply in the >>>>>> thread where Jens proposed it. As such, I think it might even bury >>>>>> more deeply the actual bug that causes the crash (although of course >>>>>> this patch does eliminate the crash for the use case reported in that >>>>>> thread). >>>>> >>>>> The patch fixes the issue and I've explained why. >>>> >>>> Definitely, but I wrote you that your explanation seems wrong. >>>> >>>>> If you have a >>>>> motivation to fix it differently, for whatever reason, then by all >>>>> means submit a patch. So far I haven't seen it, and we still have >>>>> the known crash that people are actually hitting. >>>>> >>>> >>>> Unfortunately I don't have a solution, because I don't know what the >>>> bug is. I only know that there's a bug in your explanation for the >>>> bug you want to fix. >>>> >>>>> I'll be happy to work with you on that later in the week, when >>>>> the LSFMM conference has wound down. >>>>> >>>> >>>> I do thank you for that. If you can, please start by answering my >>>> concerns on your explanation. Maybe your explanation can be fixed (or >>>> I'm simply wrong), and all will be fine. Or, if your explanation is >>>> really buggy and we don't find the actual bug in the code, we can >>>> just enrich your proposed change with a comment, and state that this >>>> is a crutch to walk on, while waiting for the actual bug to show up. >>> >>> The reproduction case was there, so should not be hard to run with >>> that and add some instrumentation to verify or disprove theories. >> >> Simple code check seems enough for that, see below. >> >>> For the flush case, you start out with a regular request, and assign >>> the pointers. Then when we transform that into a flush, it's >>> bypass inserted, and later retrieved in __bfq_dispatch_request(). >> >> Before the transformation, the request undergoes a finish or a >> requeue, right? If so, bfq clears the pointers there. > > If the request has an end_io hook, we go through there and don't > finish the request. The request is reused. > Then, neither elv->finish_request or elv->requeue_request hooks are invoked, right? This is one of the possibilities I feared, because it just breaks the balance of bfq counters, making bfq fail in more or less unpredictable ways. So, if I did get you right, then I'll have a look at the code involved outside bfq, to fully see and learn what you wrote above; after that, I'll start working on how to make bfq comply with this new possibility. Thanks for answering my questions, Paolo >>> As >>> I said earlier, you could either check that in there (add a ->elv.icq >>> non-NULL check before doing the bfqq check), >> >> They must be already cleared, because of the above. > > Wrong > >>> or you can have it >>> cleaner and just always clear the pointers if ->elv.icq isn't assigned. >>> >> >> It is the cleanest solution, if it is ok that those pointers are >> dirtied (for reasons I don't know) before or during the >> transformation. > > Of course it is, they are not valid if ->elv.icq isn't valid. You > must clear them, or they can be stale. > > -- > Jens Axboe