Re: [PATCH] sd: use mempool for discard special page

Jens Axboe <axboe@xxxxxxxxx> · Fri, 21 Dec 2018 20:49:16 -0700

On 12/21/18 7:48 PM, James Bottomley wrote:
> On Wed, 2018-12-12 at 06:46 -0700, Jens Axboe wrote:
>> When boxes are run near (or to) OOM, we have a problem with the
>> discard page allocation in sd. If we fail allocating the special
>> page, we return busy, and it'll get retried. But since ordering is
>> honored for dispatch requests, we can keep retrying this same IO and
>> failing. Behind that IO could be requests that want to free memory,
>> but they never get the chance. This means you get repeated spews of
>> traces like this:
>>
>> [1201401.625972] Call Trace:
>> [1201401.631748]  dump_stack+0x4d/0x65
>> [1201401.639445]  warn_alloc+0xec/0x190
>> [1201401.647335]  __alloc_pages_slowpath+0xe84/0xf30
>> [1201401.657722]  ? get_page_from_freelist+0x11b/0xb10
>> [1201401.668475]  ? __alloc_pages_slowpath+0x2e/0xf30
>> [1201401.679054]  __alloc_pages_nodemask+0x1f9/0x210
>> [1201401.689424]  alloc_pages_current+0x8c/0x110
>> [1201401.699025]  sd_setup_write_same16_cmnd+0x51/0x150
>> [1201401.709987]  sd_init_command+0x49c/0xb70
>> [1201401.719029]  scsi_setup_cmnd+0x9c/0x160
>> [1201401.727877]  scsi_queue_rq+0x4d9/0x610
>> [1201401.736535]  blk_mq_dispatch_rq_list+0x19a/0x360
>> [1201401.747113]  blk_mq_sched_dispatch_requests+0xff/0x190
>> [1201401.758844]  __blk_mq_run_hw_queue+0x95/0xa0
>> [1201401.768653]  blk_mq_run_work_fn+0x2c/0x30
>> [1201401.777886]  process_one_work+0x14b/0x400
>> [1201401.787119]  worker_thread+0x4b/0x470
>> [1201401.795586]  kthread+0x110/0x150
>> [1201401.803089]  ? rescuer_thread+0x320/0x320
>> [1201401.812322]  ? kthread_park+0x90/0x90
>> [1201401.820787]  ? do_syscall_64+0x53/0x150
>> [1201401.829635]  ret_from_fork+0x29/0x40
>>
>> Ensure that the discard page allocation has a mempool backing, so we
>> know we can make progress.
>>
>> Cc: stable@xxxxxxxxxxxxxxx
>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>
>> ---
>>
>> We actually hit this in production, it's not a theoretical issue.
>>
>>
>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index 4a6ed2fc8c71..a1a44f52e0e8 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
> [...]
>> @@ -759,9 +760,10 @@ static blk_status_t sd_setup_unmap_cmnd(struct
>> scsi_cmnd *cmd)
>>  	unsigned int data_len = 24;
>>  	char *buf;
>>  
>> -	rq->special_vec.bv_page = alloc_page(GFP_ATOMIC |
>> __GFP_ZERO);
>> +	rq->special_vec.bv_page = mempool_alloc(sd_page_pool,
>> GFP_ATOMIC);
>>  	if (!rq->special_vec.bv_page)
>>  		return BLK_STS_RESOURCE;
>> +	clear_highpage(rq->special_vec.bv_page);
> 
> Since the kernel never accesses this page and you take pains to kmap it
> when clearing, shouldn't the allocation have __GFP_HIGHMEM?

It's not going to great lengths to kmap it, clear_page() conveniently
takes an address, not a page. So it's either

clear_page(page_address(rq->special_vec.bv_page));

or the current clear_highpage(). Shrug.

-- 
Jens Axboe