On Tue, Mar 05 2019 at 4:46am -0500,
Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:

>
>
> On Mon, 4 Mar 2019, Mike Snitzer wrote:
>
> > Hi,
> >
> > Alexander reported this same boot hang in another thread.  I was able to
> > reproduce using an x86_64 .config that Alexander provided.
> >
> > I've pushed this fix out to linux-next:
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.1&id=be2c5301817833f692aadb2e5fa209582db01d3b
> >
> > If you could verify this fix works for you I'd appreciate it.
> >
> > Thanks,
> > Mike
>
> So, remove this no-clone optimization and stage it for the next merge
> window. There's something we don't understand, so don't merge it. This
> patch is just papering over the problem.

I layered changes that extended your initial noclone support and that
seems to have upset you.  Yes, the evolution could've been cleaner, but
to say this is papering over anything is simply wrong.

Clearly we're re-entering dm_noclone_process_bio(), and the relaxed
negative check, which only concerned itself with whether we were in
make_request_fn, lost sight of the possibility of losing a 'struct
dm_noclone' that was already attached to the bio (there's a toy sketch
of that failure mode at the end of this mail).  _THAT_ is the cause of
the hang.  Full stop.

I didn't have time yesterday to sort out why we re-entered
dm_noclone_process_bio().  But I can easily do so today, just to fully
appreciate _why_ it happened.

To categorize any of this as papering over is just _wrong_, and I don't
understand why you think it's OK to accuse me of that.  Rather than dig
in to help, you've sat back and attacked me.

> "Stacking noclone targets creates more complexity than is tolerable" just
> means that no one knows what is happening there.

Not constructive.  What it means is: I wrote the code that enables
splitting + stacking of no_clone + dm_work_fn re-entry in
dm_process_bio.  The duality of use in the shared code paths, and the
flag day to support all of it, made this noclone optimization brittle.
I'll grant you that.  The mental gymnastics I had to do to reason
through what _could_ be going on in different stacking scenarios made me
uneasy about supporting such noclone complexity from the start.  We
could revisit allowing stacking at a later date though.

> Meanwhile, we could get access to the system that reports hangs and test it
> there.

What part of "I was able to reproduce using an x86_64 .config that
Alexander provided" don't you understand?

This wasn't code that was superficially broken; it required Alexander's
.config to tease out the problem.  It took quite a while to get my kvm
guest testbed sorted out and to zero in on reproducing it.  I did that
work.  I then asked those who reported the problem to confirm the fix
resolves the issue for them.
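
To make the failure mode above concrete, here's a throwaway userspace
toy.  This is NOT the DM code: the names, types and control flow are
purely illustrative assumptions.  It just shows how per-bio state
attached on the first pass gets silently replaced when the same
processing function is re-entered by a check that never looks for
already-attached state:

#include <stdio.h>
#include <stdlib.h>

/* Toy stand-ins only -- these names do not exist in the kernel. */
struct toy_noclone { int pass; };
struct toy_bio { struct toy_noclone *noclone; };

static void toy_process_bio(struct toy_bio *bio, int pass)
{
	/*
	 * A check that, like the "relaxed negative check" described
	 * above, never looks at bio->noclone falls straight through
	 * here on the re-entrant pass.  The commented-out guard is the
	 * piece that was effectively missing.
	 */
	/* if (bio->noclone) return; */

	struct toy_noclone *nc = malloc(sizeof(*nc));
	nc->pass = pass;
	bio->noclone = nc;	/* pass 2 overwrites (and leaks) pass 1's */

	if (pass == 1)
		toy_process_bio(bio, 2);	/* models the unexpected re-entry */
}

int main(void)
{
	struct toy_bio bio = { 0 };

	toy_process_bio(&bio, 1);

	/*
	 * The state attached on pass 1 is gone, so whatever completion
	 * it was responsible for never runs -- the shape of the hang.
	 */
	printf("surviving noclone came from pass %d\n", bio.noclone->pass);
	return 0;
}

Run it and the surviving noclone is from pass 2; the pass-1 state, and
the completion it owed, is simply gone.  That's the hang.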