On Fri, 2018-12-21 at 20:49 -0700, Jens Axboe wrote:
> On 12/21/18 7:48 PM, James Bottomley wrote:
> > On Wed, 2018-12-12 at 06:46 -0700, Jens Axboe wrote:
> > > When boxes are run near (or to) OOM, we have a problem with the
> > > discard page allocation in sd. If we fail allocating the special
> > > page, we return busy, and it'll get retried. But since ordering is
> > > honored for dispatch requests, we can keep retrying this same IO
> > > and failing. Behind that IO could be requests that want to free
> > > memory, but they never get the chance. This means you get repeated
> > > spews of traces like this:
> > >
> > > [1201401.625972] Call Trace:
> > > [1201401.631748]  dump_stack+0x4d/0x65
> > > [1201401.639445]  warn_alloc+0xec/0x190
> > > [1201401.647335]  __alloc_pages_slowpath+0xe84/0xf30
> > > [1201401.657722]  ? get_page_from_freelist+0x11b/0xb10
> > > [1201401.668475]  ? __alloc_pages_slowpath+0x2e/0xf30
> > > [1201401.679054]  __alloc_pages_nodemask+0x1f9/0x210
> > > [1201401.689424]  alloc_pages_current+0x8c/0x110
> > > [1201401.699025]  sd_setup_write_same16_cmnd+0x51/0x150
> > > [1201401.709987]  sd_init_command+0x49c/0xb70
> > > [1201401.719029]  scsi_setup_cmnd+0x9c/0x160
> > > [1201401.727877]  scsi_queue_rq+0x4d9/0x610
> > > [1201401.736535]  blk_mq_dispatch_rq_list+0x19a/0x360
> > > [1201401.747113]  blk_mq_sched_dispatch_requests+0xff/0x190
> > > [1201401.758844]  __blk_mq_run_hw_queue+0x95/0xa0
> > > [1201401.768653]  blk_mq_run_work_fn+0x2c/0x30
> > > [1201401.777886]  process_one_work+0x14b/0x400
> > > [1201401.787119]  worker_thread+0x4b/0x470
> > > [1201401.795586]  kthread+0x110/0x150
> > > [1201401.803089]  ? rescuer_thread+0x320/0x320
> > > [1201401.812322]  ? kthread_park+0x90/0x90
> > > [1201401.820787]  ? do_syscall_64+0x53/0x150
> > > [1201401.829635]  ret_from_fork+0x29/0x40
> > >
> > > Ensure that the discard page allocation has a mempool backing, so
> > > we know we can make progress.
> > >
> > > Cc: stable@xxxxxxxxxxxxxxx
> > > Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> > >
> > > ---
> > >
> > > We actually hit this in production, it's not a theoretical issue.
> > >
> > > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> > > index 4a6ed2fc8c71..a1a44f52e0e8 100644
> > > --- a/drivers/scsi/sd.c
> > > +++ b/drivers/scsi/sd.c
> > [...]
> > > @@ -759,9 +760,10 @@ static blk_status_t sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
> > >  	unsigned int data_len = 24;
> > >  	char *buf;
> > >
> > > -	rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
> > > +	rq->special_vec.bv_page = mempool_alloc(sd_page_pool, GFP_ATOMIC);
> > >  	if (!rq->special_vec.bv_page)
> > >  		return BLK_STS_RESOURCE;
> > > +	clear_highpage(rq->special_vec.bv_page);
> >
> > Since the kernel never accesses this page and you take pains to
> > kmap it when clearing, shouldn't the allocation have __GFP_HIGHMEM?
>
> It's not going to great lengths to kmap it, clear_page() conveniently
> takes an address, not a page. So it's either
>
> clear_page(page_address(rq->special_vec.bv_page));
>
> or the current clear_highpage(). Shrug.

Yes, I guessed that and agree our semantics are slightly problematic.
However, the question was: since this is about starvation, why not use
__GFP_HIGHMEM in the allocation? I realise it makes no difference on
64-bit, but it does on 32-bit, where it allows access to the HIGHMEM
zone, which is often quite large and would lead to better behaviour
under memory pressure.

James
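
The quoted hunk only shows the allocation side. For completeness, here
is a minimal sketch of the pool backing it relies on, assuming a small
reserve of order-0 pages and illustrative helper names (the reserve
size, the setup/teardown hooks, and the free-side helper below are
assumptions for illustration, not taken from the quoted patch):

#include <linux/mempool.h>
#include <linux/highmem.h>

/* Assumed reserve size; illustrative, not taken from the patch. */
#define SD_MEMPOOL_SIZE	2

static mempool_t *sd_page_pool;

/*
 * Created once at driver init (hook name illustrative): pre-allocates
 * SD_MEMPOOL_SIZE order-0 pages, so mempool_alloc() can fall back to
 * this reserve when a normal GFP_ATOMIC page allocation fails.
 */
static int sd_pool_setup(void)
{
	sd_page_pool = mempool_create_page_pool(SD_MEMPOOL_SIZE, 0);
	return sd_page_pool ? 0 : -ENOMEM;
}

/*
 * Allocation side, as in the quoted hunk: with GFP_ATOMIC the call can
 * still return NULL if the reserve is temporarily exhausted (hence the
 * BLK_STS_RESOURCE retry), and the page must be cleared explicitly
 * because mempool pages are not handed out zeroed.
 */
static struct page *sd_discard_page_alloc(void)
{
	struct page *page = mempool_alloc(sd_page_pool, GFP_ATOMIC);

	if (page)
		clear_highpage(page);
	return page;
}

/*
 * Completion side: pages must go back via mempool_free(), not
 * __free_page(), or the reserve drains over time.
 */
static void sd_discard_page_free(struct page *page)
{
	mempool_free(page, sd_page_pool);
}

static void sd_pool_teardown(void)
{
	mempool_destroy(sd_page_pool);
}

Whether or not the allocation gfp grows __GFP_HIGHMEM as discussed
above does not change this structure; clear_highpage() already copes
with a highmem page.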