Re: [PATCH] sd: use mempool for discard special page

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Fri, 21 Dec 2018 18:48:07 -0800

On Wed, 2018-12-12 at 06:46 -0700, Jens Axboe wrote:
> When boxes are run near (or to) OOM, we have a problem with the
> discard page allocation in sd. If we fail allocating the special
> page, we return busy, and it'll get retried. But since ordering is
> honored for dispatch requests, we can keep retrying this same IO and
> failing. Behind that IO could be requests that want to free memory,
> but they never get the chance. This means you get repeated spews of
> traces like this:
> 
> [1201401.625972] Call Trace:
> [1201401.631748]  dump_stack+0x4d/0x65
> [1201401.639445]  warn_alloc+0xec/0x190
> [1201401.647335]  __alloc_pages_slowpath+0xe84/0xf30
> [1201401.657722]  ? get_page_from_freelist+0x11b/0xb10
> [1201401.668475]  ? __alloc_pages_slowpath+0x2e/0xf30
> [1201401.679054]  __alloc_pages_nodemask+0x1f9/0x210
> [1201401.689424]  alloc_pages_current+0x8c/0x110
> [1201401.699025]  sd_setup_write_same16_cmnd+0x51/0x150
> [1201401.709987]  sd_init_command+0x49c/0xb70
> [1201401.719029]  scsi_setup_cmnd+0x9c/0x160
> [1201401.727877]  scsi_queue_rq+0x4d9/0x610
> [1201401.736535]  blk_mq_dispatch_rq_list+0x19a/0x360
> [1201401.747113]  blk_mq_sched_dispatch_requests+0xff/0x190
> [1201401.758844]  __blk_mq_run_hw_queue+0x95/0xa0
> [1201401.768653]  blk_mq_run_work_fn+0x2c/0x30
> [1201401.777886]  process_one_work+0x14b/0x400
> [1201401.787119]  worker_thread+0x4b/0x470
> [1201401.795586]  kthread+0x110/0x150
> [1201401.803089]  ? rescuer_thread+0x320/0x320
> [1201401.812322]  ? kthread_park+0x90/0x90
> [1201401.820787]  ? do_syscall_64+0x53/0x150
> [1201401.829635]  ret_from_fork+0x29/0x40
> 
> Ensure that the discard page allocation has a mempool backing, so we
> know we can make progress.
> 
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> 
> ---
> 
> We actually hit this in production, it's not a theoretical issue.
> 
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 4a6ed2fc8c71..a1a44f52e0e8 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
[...]
> @@ -759,9 +760,10 @@ static blk_status_t sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
>  	unsigned int data_len = 24;
>  	char *buf;
>  
> -	rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
> +	rq->special_vec.bv_page = mempool_alloc(sd_page_pool, GFP_ATOMIC);
>  	if (!rq->special_vec.bv_page)
>  		return BLK_STS_RESOURCE;
> +	clear_highpage(rq->special_vec.bv_page);

Since the kernel never accesses this page and you take pains to kmap it
when clearing, shouldn't the allocation have __GFP_HIGHMEM?
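
For anyone following along, the pattern under discussion can be modelled
in user space. This is only a rough sketch of the mempool idea (try the
normal allocator first, fall back to a small pre-allocated reserve, and
clear the page by hand because a reserve element may hold stale data).
Every toy_* name and the reserve size of 2 are made up for illustration;
none of this is the kernel mempool API or the code from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_POOL_MIN  2	/* hypothetical reserve size, not taken from the patch */

struct toy_pool {
	void *reserve[TOY_POOL_MIN];	/* pages set aside at init time */
	size_t nr_reserved;		/* how many reserve slots are filled */
	bool simulate_oom;		/* when true, the backing allocator fails */
};

/* Stand-in for the regular page allocator; fails on demand. */
static void *toy_backing_alloc(const struct toy_pool *pool)
{
	return pool->simulate_oom ? NULL : malloc(TOY_PAGE_SIZE);
}

static void toy_pool_destroy(struct toy_pool *pool)
{
	while (pool->nr_reserved > 0)
		free(pool->reserve[--pool->nr_reserved]);
}

/* Fill the reserve up front, while memory is still available. */
static bool toy_pool_init(struct toy_pool *pool)
{
	pool->nr_reserved = 0;
	pool->simulate_oom = false;
	while (pool->nr_reserved < TOY_POOL_MIN) {
		void *page = malloc(TOY_PAGE_SIZE);

		if (!page) {
			toy_pool_destroy(pool);
			return false;
		}
		pool->reserve[pool->nr_reserved++] = page;
	}
	return true;
}

/*
 * Try the backing allocator first and dip into the reserve only when it
 * fails. This models a non-sleeping allocation: if the reserve is empty
 * too, the caller still gets NULL and has to retry later.
 */
static void *toy_pool_alloc(struct toy_pool *pool)
{
	void *page = toy_backing_alloc(pool);

	if (!page && pool->nr_reserved > 0)
		page = pool->reserve[--pool->nr_reserved];
	if (page)
		memset(page, 0, TOY_PAGE_SIZE);	/* explicit clear, reserve pages are reused */
	return page;
}

/* Refill the reserve before handing memory back to the system. */
static void toy_pool_free(struct toy_pool *pool, void *page)
{
	if (pool->nr_reserved < TOY_POOL_MIN)
		pool->reserve[pool->nr_reserved++] = page;
	else
		free(page);
}
```

With simulate_oom set, the first TOY_POOL_MIN calls to toy_pool_alloc()
still succeed from the reserve and the next one returns NULL; a
toy_pool_free() then makes one more allocation possible. That is the
forward-progress property the patch description is after.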

James