On Fri, Oct 8, 2010 at 7:02 PM, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hello, again.
>
> On 10/07/2010 10:13 PM, Milan Broz wrote:
>> Yes, XFS is very good at showing up problems in dm-crypt :)
>>
>> But there was no change in dm-crypt which can itself cause such a problem,
>> planned workqueue changes are not in 2.6.36 yet.
>> Code is basically the same for the last few releases.
>>
>> So it seems that workqueue processing really changed here under memory pressure.
>>
>> Milan
>>
>> p.s.
>> Anyway, if you are able to reproduce it and you think that there is a problem
>> in the per-device dm-crypt workqueue, there are patches from Andi for a shared
>> per-cpu workqueue, maybe it can help here. (But this is really not RC material.)
>>
>> Unfortunately not yet in the dm-devel tree, but I have them here ready for review:
>> http://mbroz.fedorapeople.org/dm-crypt/2.6.36-devel/
>> (all 4 patches must be applied, I hope Alasdair will put them in the dm quilt soon.)
>
> Okay, spent the whole day reproducing the problem and trying to
> determine what's going on.  In the process, I've found a bug and a
> potential issue (not sure whether it's an actual issue which should be
> fixed for this release yet) but the hang doesn't seem to have anything
> to do with the workqueue update.  All the queues are behaving exactly
> as expected during the hang.
>
> Also, it isn't a regression.  I can reliably trigger the same deadlock
> on v2.6.35.
>
> Here's the setup I used to trigger the problem, which should be mostly
> similar to Torsten's.
>
> The machine is a dual quad-core Opteron (8 phys cores) w/ 4GiB memory.
>
> * 80GB raid1 of two SATA disks
> * On top of that, a luks encrypted device w/ twofish-cbc-essiv:sha256
> * In the encrypted device, an xfs filesystem which hosts an 8GiB swapfile
> * 12GiB tmpfs
>
> The workload is a v2.6.35 allyesconfig -j 128 build in the tmpfs.  Not
> too long after swap starts being used (several tens of secs), the
> system hangs.  IRQ handling and all are fine but no IO gets through,
> with a lot of tasks stuck in bio allocation somewhere.
>
> I suspected that with md and dm stacked together, something in the
> upper layer ended up exhausting a shared bio pool and tried a couple
> of things but haven't succeeded at finding where the culprit is.  It
> probably would be best to run blktrace together and analyze how IO
> gets stuck.
>
> So, well, we seem to be broken the same way as before.  No need to
> delay the release for this one.

I instrumented mm/mempool.c, trying to find out which shared pool gets
exhausted.  On the last run it looked like the fs_bio_set from fs/bio.c
was the one running dry.

As far as I can see, that pool is used by bio_alloc() and bio_clone().
Above bio_alloc() a dire warning says that any bio allocated that way
needs to be submitted for IO before allocating another one, otherwise
the system could livelock.  bio_clone() does not carry this warning,
but since it uses the same pool in the same way, I would expect the
same rule to apply.

Looking for users of bio_alloc() and bio_clone() in drivers/md, it
looks like dm-crypt uses its own pools and not fs_bio_set.  But
drivers/md/raid1.c does use this pool, and in my eyes it uses it
wrongly: when writing to a RAID1 array, make_request() in raid1.c
calls bio_clone() once for each drive (lines 967-1001 in 2.6.36-rc7),
and only after all the bios have been allocated are they merged into
the pending_bio_list.  So a RAID1 with 3 mirrors is a sure way to lock
up the system as soon as the mempool is actually needed?
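To make the pattern concrete, here is a rough sketch of what I mean.
It is not the actual raid1.c code; the function name and the
mirror_bdev argument are invented for illustration, I only want to
show the shape of the allocation loop as I read it:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/gfp.h>

/*
 * Simplified illustration, NOT raid1.c itself: one incoming write bio
 * is cloned once per mirror before any of the clones is submitted.
 */
static void clone_one_write_per_mirror(struct bio *master_bio,
                                       struct block_device *mirror_bdev[],
                                       int nr_mirrors)
{
        struct bio_list pending;
        struct bio *clone;
        int i;

        bio_list_init(&pending);

        /*
         * bio_clone() allocates from the shared fs_bio_set.  Under
         * memory pressure the GFP_NOIO allocation falls back to the
         * pool's reserved entries and simply sleeps once they are
         * gone -- while this caller still holds earlier clones that
         * it has not submitted yet.
         */
        for (i = 0; i < nr_mirrors; i++) {
                clone = bio_clone(master_bio, GFP_NOIO);
                clone->bi_bdev = mirror_bdev[i];
                bio_list_add(&pending, clone);
        }

        /* Only now do the clones go down to the lower devices. */
        while ((clone = bio_list_pop(&pending)))
                generic_make_request(clone);
}

With BIO_POOL_SIZE == 2, a three-mirror array needs more reserved
entries than the pool guarantees, so the third bio_clone() can sleep
forever on a pool whose only two entries are held by the very caller
that is waiting.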
(The fs_bio_set pool only reserves BIO_POOL_SIZE entries, and that is
defined as 2.)

From the use of atomic_inc(&r1_bio->remaining) and of
spin_lock_irqsave(&conf->device_lock, flags) when merging the bio
list, I would suspect that it is even possible for multiple CPUs to
enter this allocation loop concurrently, or that the use of multiple
RAID1 devices, each with only 2 drives, could lock up in the same way.

What am I missing, or is this use of bio_clone() really the wrong
thing?

Torsten
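p.s.
For comparison, dm-crypt avoids fs_bio_set by allocating its bios from
its own bio_set.  Purely to illustrate that direction (the names r1_bs,
r1_sketch_init and r1_sketch_clone are made up, this is not a patch),
cloning through a private pool could look roughly like this:

#include <linux/bio.h>
#include <linux/errno.h>
#include <linux/gfp.h>

/* One pool per array, sized so one write can clone a bio per mirror. */
static struct bio_set *r1_bs;

static int r1_sketch_init(int nr_mirrors)
{
        r1_bs = bioset_create(nr_mirrors, 0);
        return r1_bs ? 0 : -ENOMEM;
}

static struct bio *r1_sketch_clone(struct bio *bio)
{
        /* What bio_clone() does, but backed by r1_bs, not fs_bio_set. */
        struct bio *clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs,
                                             r1_bs);

        __bio_clone(clone, bio);
        return clone;
}

That would only stop raid1 from competing with everyone else for the
two fs_bio_set entries; whether holding several clones from one pool
before submitting any of them is safe would still need the same
scrutiny.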