Hi,

On Mon, May 13, 2019 at 10:15 AM Mike Snitzer <snitzer@xxxxxxxxxx> wrote:
>
> On Mon, May 13 2019 at 12:18pm -0400,
> Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
>
> > Hi,
> >
> > I wanted to jump on the bandwagon of people reporting problems with
> > commit a1b89132dc4f ("dm crypt: use WQ_HIGHPRI for the IO and crypt
> > workqueues").
> >
> > Specifically I've been tracking down communication errors when talking
> > to our Embedded Controller (EC) over SPI.  I found that communication
> > errors happened _much_ more frequently on newer kernels than older
> > ones.  Using ftrace I managed to track the problem down to the dm
> > crypt patch.  ...and, indeed, reverting that patch gets rid of the
> > vast majority of my errors.
> >
> > If you want to see the ftrace of my high priority worker getting
> > blocked for 7.5 ms, you can see:
> >
> > https://bugs.chromium.org/p/chromium/issues/attachmentText?aid=392715
> >
> > In my case I'm looking at solving my problems by bumping the CrOS EC
> > transfers fully up to real time priority.  ...but given that there are
> > other reports of problems with the dm-crypt priority (notably I found
> > https://bugzilla.kernel.org/show_bug.cgi?id=199857) maybe we should
> > also come up with a different solution for dm-crypt?
>
> Any chance you can test how behaviour changes if you remove
> WQ_CPU_INTENSIVE?  e.g.:
>
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 692cddf3fe2a..c97d5d807311 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -2827,8 +2827,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>
>         ret = -ENOMEM;
>         cc->io_queue = alloc_workqueue("kcryptd_io/%s",
> -                                      WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM,
> -                                      1, devname);
> +                                      WQ_HIGHPRI | WQ_MEM_RECLAIM, 1, devname);
>         if (!cc->io_queue) {
>                 ti->error = "Couldn't create kcryptd io queue";
>                 goto bad;
> @@ -2836,11 +2835,10 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>
>         if (test_bit(DM_CRYPT_SAME_CPU, &cc->flags))
>                 cc->crypt_queue = alloc_workqueue("kcryptd/%s",
> -                                                 WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM,
> -                                                 1, devname);
> +                                                 WQ_HIGHPRI | WQ_MEM_RECLAIM, 1, devname);
>         else
>                 cc->crypt_queue = alloc_workqueue("kcryptd/%s",
> -                                                 WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM | WQ_UNBOUND,
> +                                                 WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND,
>                                                   num_online_cpus(), devname);
>         if (!cc->crypt_queue) {
>                 ti->error = "Couldn't create kcryptd queue";

It's not totally trivial for me to test.  My previous failure cases
involved leaving a few devices "idle" over a long period of time.  I did
that on 3 machines last night and didn't see any failures, so removing
WQ_CPU_INTENSIVE may have made things better.  Before I say that for
sure I'd want to test for longer / redo the test a few times, since I've
seen the problem go away on its own before (just by timing/luck) and
then re-appear.

Do you have a theory about why removing WQ_CPU_INTENSIVE would help?

---

NOTE: in trying to reproduce the problems more quickly I actually came
up with a better test case for the problem I was seeing.  I found that
I can reproduce my own problems much more reliably with this test:

  dd if=/dev/zero of=/var/log/foo.txt bs=4M count=512 &
  while true; do ectool version > /dev/null; done

It should be noted that "/var" is on encrypted stateful on my system,
so throwing data at it stresses dm-crypt.
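In case it helps, here is roughly what that test looks like wrapped up
as a one-shot script.  This is only a sketch of the manual loop above:
it assumes "ectool version" exits non-zero when an EC transfer fails,
and the /var/log/foo.txt path and 2 GiB write size are specific to my
setup.

  #!/bin/sh
  # Hammer dm-crypt (and the loop device) with writes while polling the
  # EC, then report how many of the EC transfers failed.
  # Assumption: "ectool version" exits non-zero when an EC transfer fails.

  flag=/tmp/dd_running   # sentinel so the poll loop knows when dd is done
  fail=0
  total=0

  touch "$flag"

  # Background writer: push ~2 GiB through the encrypted /var mount,
  # then remove the sentinel file.
  ( dd if=/dev/zero of=/var/log/foo.txt bs=4M count=512; rm -f "$flag" ) &

  # Poll the EC as fast as we can while the writer is running.
  while [ -e "$flag" ]; do
          total=$((total + 1))
          ectool version > /dev/null 2>&1 || fail=$((fail + 1))
  done

  wait
  echo "ectool failures: $fail / $total"

The failure count just gives a crude number to compare kernels with, in
the same spirit as the pass/fail results below.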
It should also be noted that somehow "/var" also ends up traversing
through a loopback device (this becomes relevant below).

With the above test:

1. With a mainline kernel that has commit 37a186225a0c
   ("platform/chrome: cros_ec_spi: Transfer messages at high priority"):
   I see failures.

2. With a mainline kernel that has commit 37a186225a0c plus removing
   WQ_CPU_INTENSIVE in dm-crypt: I still see failures.

3. With a mainline kernel that has commit 37a186225a0c plus removing
   high priority (but keeping CPU intensive) in dm-crypt: I still see
   failures.

4. With a mainline kernel that has commit 37a186225a0c plus removing
   high priority (but keeping CPU intensive) in dm-crypt plus removing
   set_user_nice() in loop_prepare_queue(): I get a pass!

5. With a mainline kernel that has commit 37a186225a0c plus removing
   set_user_nice() in loop_prepare_queue() plus leaving dm-crypt alone:
   I see failures.

6. With a mainline kernel that has commit 37a186225a0c plus removing
   set_user_nice() in loop_prepare_queue() plus removing
   WQ_CPU_INTENSIVE in dm-crypt: I still see failures.

7. With my new "cros_ec at realtime" series and no other patches:
   I get a pass!

tl;dr: High priority (even without CPU_INTENSIVE) definitely causes
interference with my high priority work, starving it for > 8 ms, but
dm-crypt isn't unique here--loopback devices also have problems.

-Doug

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel