On Tue, 27 Feb 2007 23:38:13 +0100 Miklos Szeredi <miklos@xxxxxxxxxx> wrote:

> From: Miklos Szeredi <mszeredi@xxxxxxx>
>
> This deadlock is similar to the one in balance_dirty_pages, but
> instead of waiting in balance_dirty_pages after submitting a write
> request, it happens during a memory allocation for filesystem B before
> submitting a write request.
>
> It is easy to reproduce on a machine with not too much memory.
> E.g. try this on 2.6.21-rc1 UML with 32MB (works on physical hw as
> well):
>
>   dd if=/dev/zero of=/tmp/tmp.img bs=1048576 count=40
>   mke2fs -j -F /tmp/tmp.img
>   mkdir /tmp/img
>   mount -oloop /tmp/tmp.img /tmp/img
>   bash-shared-mapping /tmp/img/foo 30000000
>
> The deadlock doesn't happen immediately, sometimes only after a few
> minutes.
>
> Simplified stack trace for bash-shared-mapping after the deadlock:
>
>   io_schedule_timeout
>   congestion_wait
>   balance_dirty_pages
>   balance_dirty_pages_ratelimited_nr
>   generic_file_buffered_write
>   __generic_file_aio_write_nolock
>   generic_file_aio_write
>   ext3_file_write
>   do_sync_write
>   vfs_write
>   sys_pwrite64
>
> and for [loop0]:
>
>   io_schedule_timeout
>   congestion_wait
>   throttle_vm_writeout
>   shrink_zone
>   shrink_zones
>   try_to_free_pages
>   __alloc_pages
>   find_or_create_page
>   do_lo_send_aops
>   lo_send
>   do_bio_filebacked
>   loop_thread
>
> The requirement for the deadlock is that
>
>   nr_writeback > dirty_thresh * 1.1 + margin
>
> Again, the margin seems to be in the 100-page range.
>
> The task of throttle_vm_writeout is to limit the rate at which
> under-writeback pages are created due to swapping.  There's no other
> way direct reclaim can increase nr_writeback + nr_file_dirty.
>
> So when there are few or no under-swap pages, it is safe for this
> function to return.  This ensures that there's progress with writing
> back dirty pages.

Would this also be solved by the below just-submitted bugfix?

I guess not.  I think the basic problem here is that the loop thread is
responsible for cleaning memory, but in throttle_vm_writeout() the loop
thread can get stuck waiting for some other thread to clean memory, and
that ain't going to happen.

throttle_vm_writeout() wasn't very well thought through, I suspect.

I suspect we can just delete throttle_vm_writeout() now.  The original
rationale was:

    [PATCH] vm: pageout throttling

    With silly pageout testcases it is possible to place huge amounts
    of memory under I/O.  With a large request queue (CFQ uses 8192
    requests) it is possible to place _all_ memory under I/O at the
    same time.

    This means that all memory is pinned and unreclaimable and the VM
    gets upset and goes oom.

    The patch limits the amount of memory which is under pageout
    writeout to be a little more than the amount of memory at which
    balance_dirty_pages() callers will synchronously throttle.

    This means that heavy pageout activity can starve heavy writeback
    activity completely, but heavy writeback activity will not cause
    starvation of pageout.  Because we don't want a simple `dd' to be
    causing excessive latencies in page reclaim.

but now that we limit the amount of dirty MAP_SHARED memory, and given
that the various pieces of the dirty-memory limiting puzzle also take
the number of under-writeback pages into account, we should no longer
be able to get into the situation where the total number of
writeback+dirty pages exceeds dirty_ratio.
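For reference, the loop being patched looks like this in 2.6.20's
mm/page-writeback.c (approximately -- retyped here, so check the tree
before relying on details).  The `dirty_thresh += dirty_thresh / 10`
boost is where the 1.1 factor in Miklos's analysis comes from:

void throttle_vm_writeout(void)
{
	long background_thresh;
	long dirty_thresh;

	for ( ; ; ) {
		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);

		/*
		 * Boost the allowable dirty threshold a bit for page
		 * allocators so they don't get DoS'ed by heavy writers
		 */
		dirty_thresh += dirty_thresh / 10;	/* wheeee... */

		if (global_page_state(NR_UNSTABLE_NFS) +
			global_page_state(NR_WRITEBACK) <= dirty_thresh)
				break;
		congestion_wait(WRITE, HZ/10);
	}
}

Note that the only way out of the loop is for the dirty levels to
subside -- which is exactly what the stuck loop thread is supposed to
be making happen.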
From: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>

throttle_vm_writeout() is designed to wait for the dirty levels to
subside.  But if the caller holds IO or FS locks, we might be holding
up that writeout.

So change it to take a single nap to give other devices a chance to
clean some memory, then return.

Cc: Nick Piggin <nickpiggin@xxxxxxxxxxxx>
Cc: OGAWA Hirofumi <hirofumi@xxxxxxxxxxxxxxxxxx>
Cc: Kumar Gala <galak@xxxxxxxxxxxxxxxxxxx>
Cc: Pete Zaitcev <zaitcev@xxxxxxxxxx>
Cc: <stable@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 include/linux/writeback.h |    2 +-
 mm/page-writeback.c       |   13 +++++++++++--
 mm/vmscan.c               |    2 +-
 3 files changed, 13 insertions(+), 4 deletions(-)

diff -puN include/linux/writeback.h~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations include/linux/writeback.h
--- a/include/linux/writeback.h~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations
+++ a/include/linux/writeback.h
@@ -84,7 +84,7 @@ static inline void wait_on_inode(struct
 int wakeup_pdflush(long nr_pages);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
-void throttle_vm_writeout(void);
+void throttle_vm_writeout(gfp_t gfp_mask);
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
diff -puN mm/page-writeback.c~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations mm/page-writeback.c
--- a/mm/page-writeback.c~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations
+++ a/mm/page-writeback.c
@@ -296,11 +296,21 @@ void balance_dirty_pages_ratelimited_nr(
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
-void throttle_vm_writeout(void)
+void throttle_vm_writeout(gfp_t gfp_mask)
 {
 	long background_thresh;
 	long dirty_thresh;
 
+	if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
+		/*
+		 * The caller might hold locks which can prevent IO completion
+		 * or progress in the filesystem.  So we cannot just sit here
+		 * waiting for IO to complete.
+		 */
+		congestion_wait(WRITE, HZ/10);
+		return;
+	}
+
 	for ( ; ; ) {
 		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
 
@@ -317,7 +327,6 @@ void throttle_vm_writeout(void)
 	}
 }
 
-
 /*
  * writeback at least _min_pages, and keep writing until the amount of dirty
  * memory is less than the background threshold, or until we're all clean.
diff -puN mm/vmscan.c~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations mm/vmscan.c
--- a/mm/vmscan.c~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations
+++ a/mm/vmscan.c
@@ -952,7 +952,7 @@ static unsigned long shrink_zone(int pri
 		}
 	}
 
-	throttle_vm_writeout();
+	throttle_vm_writeout(sc->gfp_mask);
 
 	atomic_dec(&zone->reclaim_in_progress);
 	return nr_reclaimed;
_
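To make the effect of the new guard concrete, here is a small
illustrative helper (hypothetical -- not part of the patch) showing
which callers still enter the full throttle loop.  GFP_KERNEL carries
both __GFP_IO and __GFP_FS, so ordinary allocators still throttle;
GFP_NOFS and GFP_NOIO callers (such as the loop thread's writeout
path) now take the single nap and return:

static inline int takes_full_throttle(gfp_t gfp_mask)
{
	/* Matches the test at the top of throttle_vm_writeout() above. */
	return (gfp_mask & (__GFP_FS|__GFP_IO)) == (__GFP_FS|__GFP_IO);
}

/*
 * takes_full_throttle(GFP_KERNEL) == 1: loop until dirty levels subside
 * takes_full_throttle(GFP_NOFS)   == 0: single congestion_wait(), return
 * takes_full_throttle(GFP_NOIO)   == 0: single congestion_wait(), return
 */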