On Tue, Oct 19, 2010 at 1:11 AM, Neil Brown <neilb@xxxxxxx> wrote:
> On Mon, 18 Oct 2010 12:58:17 +0200
> Torsten Kaiser <just.for.lkml@xxxxxxxxxxxxxx> wrote:
>
>> On Mon, Oct 18, 2010 at 6:14 AM, Neil Brown <neilb@xxxxxxx> wrote:
>> > Testing shows that this patch seems to work.
>> > The test load (essentially kernbench) doesn't deadlock any more, though it
>> > does get bogged down thrashing in swap so it doesn't make a lot more
>> > progress :-) I guess that is to be expected.
>>
>> I just noticed this thread, as your mail from today pushed it up.
>>
>> In your original mail you wrote: " I recently had a customer (running
>> 2.6.32) report a deadlock during very intensive IO with lots of
>> processes. " and " Some threads that are blocked there, hold some IO
>> lock (probably in the filesystem) and are trying to allocate memory
>> inside the block device (md/raid1 to be precise) which is allocating
>> with GFP_NOIO and has a mempool to fall back on."
>>
>> I recently had the same problem (intense IO due to a swapstorm created
>> by 20 gcc processes hung my system) and after I initially blamed the
>> workqueue changes in 2.6.36, Tejun Heo determined that my problem was
>> not the workqueues getting locked up, but that it was caused by an
>> exhausted mempool:
>> http://marc.info/?l=linux-kernel&m=128655737012549&w=2
>>
>> Instrumenting mm/mempool.c and retrying my workload showed that
>> fs_bio_set from fs/bio.c looked like the mempool to blame and the code
>> in drivers/md/raid1.c to be the misuser:
>> http://marc.info/?l=linux-kernel&m=128671179817823&w=2
>>
>> I was even able to reproduce this hang by using only a normal RAID1
>> md device as swapspace and then using dd to fill a tmpfs until
>> swapping was needed:
>> http://marc.info/?l=linux-raid&m=128699402805191&w=2
>>
>> Looking back in the history of raid1.c and bio.c I found the following
>> interesting parts:
>>
>> * the change to allocate more than one bio via bio_clone() is from
>> 2005, but it looks like it was OK back then, because at that point the
>> fs_bio_set was allocating 256 entries
>> * in 2007 the size of the mempool was changed from 256 to only 2
>> entries (5972511b77809cb7c9ccdb79b825c54921c5c546 "A single unit is
>> enough, lets scale it down to 2 just to be on the safe side.")
>> * only in 2009 was the comment "To make this work, callers must never
>> allocate more than 1 bio at the time from this pool. Callers that need
>> to allocate more than 1 bio must always submit the previously allocate
>> bio for IO before attempting to allocate a new one. Failure to do so
>> can cause livelocks under memory pressure." added to bio_alloc();
>> that comment is the basis for my reasoning that raid1.c is broken.
>> (And no such comment was added to bio_clone(), although both calls
>> use the same mempool)
>>
>> So could someone please look into raid1.c to confirm or deny that
>> using multiple bio_clone() calls (one per drive) before submitting
>> them together could also cause such deadlocks?
>>
>> Thanks for looking
>>
>> Torsten
>
> Yes, thanks for the report.
> This is a real bug exactly as you describe.
>
> This is how I think I will fix it, though it needs a bit of review and
> testing before I can be certain.
> Also I need to check raid10 etc to see if they can suffer too.
>
> If you can test it I would really appreciate it.

I did test it, but while it seemed to fix the deadlock, the system
still became unusable.
The still-running "vmstat 1" showed that the swapout was still
progressing, but only in bursts of ~20k every 5 to 20 seconds.

I also tried adding Wu's patch on top of it:

--- linux-next.orig/mm/vmscan.c 2010-10-13 12:35:14.000000000 +0800
+++ linux-next/mm/vmscan.c 2010-10-19 00:13:04.000000000 +0800
@@ -1163,6 +1163,13 @@ static int too_many_isolated(struct zone
 		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
 	}
 
+	/*
+	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
+	 * they won't get blocked by normal ones and form circular deadlock