On 12/21/18 7:48 PM, James Bottomley wrote: > On Wed, 2018-12-12 at 06:46 -0700, Jens Axboe wrote: >> When boxes are run near (or to) OOM, we have a problem with the >> discard page allocation in sd. If we fail allocating the special >> page, we return busy, and it'll get retried. But since ordering is >> honored for dispatch requests, we can keep retrying this same IO and >> failing. Behind that IO could be requests that want to free memory, >> but they never get the chance. This means you get repeated spews of >> traces like this: >> >> [1201401.625972] Call Trace: >> [1201401.631748] dump_stack+0x4d/0x65 >> [1201401.639445] warn_alloc+0xec/0x190 >> [1201401.647335] __alloc_pages_slowpath+0xe84/0xf30 >> [1201401.657722] ? get_page_from_freelist+0x11b/0xb10 >> [1201401.668475] ? __alloc_pages_slowpath+0x2e/0xf30 >> [1201401.679054] __alloc_pages_nodemask+0x1f9/0x210 >> [1201401.689424] alloc_pages_current+0x8c/0x110 >> [1201401.699025] sd_setup_write_same16_cmnd+0x51/0x150 >> [1201401.709987] sd_init_command+0x49c/0xb70 >> [1201401.719029] scsi_setup_cmnd+0x9c/0x160 >> [1201401.727877] scsi_queue_rq+0x4d9/0x610 >> [1201401.736535] blk_mq_dispatch_rq_list+0x19a/0x360 >> [1201401.747113] blk_mq_sched_dispatch_requests+0xff/0x190 >> [1201401.758844] __blk_mq_run_hw_queue+0x95/0xa0 >> [1201401.768653] blk_mq_run_work_fn+0x2c/0x30 >> [1201401.777886] process_one_work+0x14b/0x400 >> [1201401.787119] worker_thread+0x4b/0x470 >> [1201401.795586] kthread+0x110/0x150 >> [1201401.803089] ? rescuer_thread+0x320/0x320 >> [1201401.812322] ? kthread_park+0x90/0x90 >> [1201401.820787] ? do_syscall_64+0x53/0x150 >> [1201401.829635] ret_from_fork+0x29/0x40 >> >> Ensure that the discard page allocation has a mempool backing, so we >> know we can make progress. >> >> Cc: stable@xxxxxxxxxxxxxxx >> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx> >> >> --- >> >> We actually hit this in production, it's not a theoretical issue. >> >> >> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c >> index 4a6ed2fc8c71..a1a44f52e0e8 100644 >> --- a/drivers/scsi/sd.c >> +++ b/drivers/scsi/sd.c > [...] >> @@ -759,9 +760,10 @@ static blk_status_t sd_setup_unmap_cmnd(struct >> scsi_cmnd *cmd) >> unsigned int data_len = 24; >> char *buf; >> >> - rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | >> __GFP_ZERO); >> + rq->special_vec.bv_page = mempool_alloc(sd_page_pool, >> GFP_ATOMIC); >> if (!rq->special_vec.bv_page) >> return BLK_STS_RESOURCE; >> + clear_highpage(rq->special_vec.bv_page); > > Since the kernel never accesses this page and you take pains to kmap it > when clearing, shouldn't the allocation have __GFP_HIGHMEM? It's not going to great lengths to kmap it, clear_page() conveniently takes an address, not a page. So it's either clear_page(page_address(rq->special_vec.bv_page)); or the current clear_highpage(). Shrug. -- Jens Axboe