On Wed, 2018-12-12 at 06:46 -0700, Jens Axboe wrote:
> When boxes are run near (or to) OOM, we have a problem with the
> discard page allocation in sd. If we fail allocating the special
> page, we return busy, and it'll get retried. But since ordering is
> honored for dispatch requests, we can keep retrying this same IO and
> failing. Behind that IO could be requests that want to free memory,
> but they never get the chance. This means you get repeated spews of
> traces like this:
>
> [1201401.625972] Call Trace:
> [1201401.631748]  dump_stack+0x4d/0x65
> [1201401.639445]  warn_alloc+0xec/0x190
> [1201401.647335]  __alloc_pages_slowpath+0xe84/0xf30
> [1201401.657722]  ? get_page_from_freelist+0x11b/0xb10
> [1201401.668475]  ? __alloc_pages_slowpath+0x2e/0xf30
> [1201401.679054]  __alloc_pages_nodemask+0x1f9/0x210
> [1201401.689424]  alloc_pages_current+0x8c/0x110
> [1201401.699025]  sd_setup_write_same16_cmnd+0x51/0x150
> [1201401.709987]  sd_init_command+0x49c/0xb70
> [1201401.719029]  scsi_setup_cmnd+0x9c/0x160
> [1201401.727877]  scsi_queue_rq+0x4d9/0x610
> [1201401.736535]  blk_mq_dispatch_rq_list+0x19a/0x360
> [1201401.747113]  blk_mq_sched_dispatch_requests+0xff/0x190
> [1201401.758844]  __blk_mq_run_hw_queue+0x95/0xa0
> [1201401.768653]  blk_mq_run_work_fn+0x2c/0x30
> [1201401.777886]  process_one_work+0x14b/0x400
> [1201401.787119]  worker_thread+0x4b/0x470
> [1201401.795586]  kthread+0x110/0x150
> [1201401.803089]  ? rescuer_thread+0x320/0x320
> [1201401.812322]  ? kthread_park+0x90/0x90
> [1201401.820787]  ? do_syscall_64+0x53/0x150
> [1201401.829635]  ret_from_fork+0x29/0x40
>
> Ensure that the discard page allocation has a mempool backing, so we
> know we can make progress.
>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>
> ---
>
> We actually hit this in production, it's not a theoretical issue.
>
>
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 4a6ed2fc8c71..a1a44f52e0e8 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
[...]
> @@ -759,9 +760,10 @@ static blk_status_t sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
>  	unsigned int data_len = 24;
>  	char *buf;
>
> -	rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
> +	rq->special_vec.bv_page = mempool_alloc(sd_page_pool, GFP_ATOMIC);
>  	if (!rq->special_vec.bv_page)
>  		return BLK_STS_RESOURCE;
> +	clear_highpage(rq->special_vec.bv_page);

Since the kernel never accesses this page and you take pains to kmap it
when clearing, shouldn't the allocation have __GFP_HIGHMEM?

James
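
For anyone following the thread without the mempool code in front of
them, here is a rough user-space sketch of the forward-progress idea the
patch relies on. It is not the kernel's mempool implementation: every
name in it (struct page_pool, pool_init, pool_alloc, pool_free,
simulate_oom, PAGE_SIZE_BYTES, RESERVE_PAGES) is made up for
illustration. The assumption is that the pool tries the normal allocator
first, falls back to a small pre-filled reserve when that fails, and
refills the reserve before handing freed pages back to the allocator.
Because a page taken from the reserve is recycled, it has to be zeroed
explicitly, which is presumably why the patch adds clear_highpage()
after dropping __GFP_ZERO.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE_BYTES 4096 /* hypothetical page size for this sketch */
#define RESERVE_PAGES   2    /* hypothetical reserve depth */

/* Hypothetical user-space stand-in for a kernel mempool of pages. */
struct page_pool {
	void *reserve[RESERVE_PAGES]; /* pre-allocated emergency pages */
	int   nr_free;                /* number of filled reserve slots */
	bool  simulate_oom;           /* test hook: normal allocator fails */
};

/* Fill the reserve up front, while memory is still plentiful. */
static bool pool_init(struct page_pool *pool)
{
	pool->nr_free = 0;
	pool->simulate_oom = false;

	for (int i = 0; i < RESERVE_PAGES; i++) {
		void *page = malloc(PAGE_SIZE_BYTES);

		if (!page) {
			/* Undo the partial fill before reporting failure. */
			while (pool->nr_free > 0)
				free(pool->reserve[--pool->nr_free]);
			return false;
		}
		pool->reserve[pool->nr_free++] = page;
	}
	return true;
}

/* Try the normal allocator first; fall back to the reserve on failure. */
static void *pool_alloc(struct page_pool *pool)
{
	void *page = pool->simulate_oom ? NULL : malloc(PAGE_SIZE_BYTES);

	if (!page && pool->nr_free > 0)
		page = pool->reserve[--pool->nr_free];

	/* Reserve pages are recycled, so zero every page explicitly. */
	if (page)
		memset(page, 0, PAGE_SIZE_BYTES);
	return page;
}

/* Refill the reserve first; only surplus pages go back to the allocator. */
static void pool_free(struct page_pool *pool, void *page)
{
	assert(page != NULL);

	if (pool->nr_free < RESERVE_PAGES)
		pool->reserve[pool->nr_free++] = page;
	else
		free(page);
}
```

With simulate_oom set, the first RESERVE_PAGES calls to pool_alloc()
still succeed, and each page returned through pool_free() makes one more
allocation possible. That is the property that lets a stuck discard
request eventually get its page instead of being retried indefinitely.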