Re: [PATCH] xfsprogs: Issue smaller discards at mkfs

Keith Busch <keith.busch@xxxxxxxxx> · Thu, 26 Oct 2017 15:24:15 -0600

On Thu, Oct 26, 2017 at 12:59:23PM -0700, Darrick J. Wong wrote:
> 
> Sure, but now you have to go fix mke2fs and everything /else/ that
> issues BLKDISCARD (or FALLOC_FL_PUNCH) on a large file / device, and
> until you fix every program to work around this weird thing in the
> kernel there'll still be someone somewhere with this timeout problem...

e2fsprogs already splits large discards in a loop. ;)
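
For illustration, splitting a discard in a loop looks roughly like the
sketch below. This is not e2fsprogs' actual code: the function names and
the 2 GiB default chunk size are made up for the example. The BLKDISCARD
value (0x1277) and its {offset, length} argument layout are from
<linux/fs.h>.

```python
import fcntl
import struct

BLKDISCARD = 0x1277  # _IO(0x12, 119) from <linux/fs.h>


def discard_chunks(start, length, chunk):
    """Yield (offset, nbytes) pieces covering [start, start + length)."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    end = start + length
    while start < end:
        nbytes = min(chunk, end - start)
        yield start, nbytes
        start += nbytes


def discard_in_loop(fd, start, length, chunk=2 << 30):
    """Issue one BLKDISCARD per chunk, waiting for each before the next.

    The 2 GiB default is an arbitrary choice for this sketch.
    """
    for offset, nbytes in discard_chunks(start, length, chunk):
        # BLKDISCARD takes a uint64_t[2]: {byte offset, byte length}.
        fcntl.ioctl(fd, BLKDISCARD, struct.pack("=QQ", offset, nbytes))
```

Each ioctl returns only after its range has completed, so the device
never has more than one chunk's worth of discard outstanding.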

> ...so I started digging into what the kernel does with a BLKDISCARD
> request, which is to say that I looked at blkdev_issue_discard.  That
> function uses blk_*_plug() to wrap __blkdev_issue_discard, which in turn
> splits the request into a chain of UINT_MAX-sized struct bios.
> 
> 128G's worth of 4G ios == 32 chained bios.
> 
> 2T worth of 4G ios == 512 chained bios.
> 
> So now I'm wondering, is the problem more that the first bio in the
> chain times out because the last one hasn't finished yet, so the whole
> thing gets aborted because we chained too much work together?

You're sort of on the right track. Each request in the chain gets its
own timeout; there is no single timeout covering the entire chain.

All the bios in the chain get turned into 'struct request' and sent
to the low-level driver. The driver calls blk_mq_start_request before
sending to hardware. That starts the timer on _that_ request,
independent of the other requests in the chain.

NVMe supports very large queues. A 4TB discard becomes 1024 individual
requests started at nearly the same time. The last ones in the queue are
the ones that risk timeout.
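
The request counts follow from simple division. A quick sketch, assuming
each bio covers 4 GiB as in the arithmetic quoted above and treating the
sizes as binary units (the function name is made up for the example):

```python
GIB = 1 << 30
TIB = 1 << 40
BIO_BYTES = 4 * GIB  # assumed per-bio discard size, per the 4G figure above


def chained_requests(discard_bytes, bio_bytes=BIO_BYTES):
    """Number of requests one discard is split into (ceiling division)."""
    return -(-discard_bytes // bio_bytes)


for size in (128 * GIB, 2 * TIB, 4 * TIB):
    print(size // GIB, "GiB ->", chained_requests(size), "requests")
```

That gives 32, 512 and 1024 requests for 128G, 2T and 4T respectively.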

When we're doing read/write, latencies at the same depth are well within
tolerance, and high queue depths are good for throughput. When doing
discard, though, tail latencies fall outside the timeout tolerance at
the same queue depth.
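
A toy model of why the tail is what times out. Everything here is an
assumption made for illustration: all timers start at t=0, the device
completes requests strictly one after another, the timeout is 30 seconds,
and the per-request service times (100us for a read/write, 50ms for a
discard) are invented numbers, not measurements.

```python
def timed_out(n_requests, service_us, timeout_us=30_000_000):
    """Count requests that complete after their own timer has expired.

    Toy model: every request's timer starts at t=0 (all are dispatched
    together) and the device finishes them one after another, each taking
    service_us microseconds.
    """
    late = 0
    for i in range(n_requests):
        completes_at = (i + 1) * service_us
        if completes_at > timeout_us:
            late += 1
    return late


# Same queue depth, hypothetical service times.
print("read/write:", timed_out(1024, 100))
print("discard:   ", timed_out(1024, 50_000))
```

With these made-up numbers no read/write request is late, while the last
424 of the 1024 discards exceed the timeout even though the first 600
complete in time.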

> Would it make sense to fix __blkdev_issue_discard to chain fewer bios
> together?  Or just issue the bios independently and track the
> completions individually?
>
> > > I'm not wise in the ways of queueing and throttling, but from my naive
> > > perspective, it seems like something to be fixed in the kernel, or if it
> > > can't, export some new "maximum discard request size" which can be trusted?
> 
> I would've thought that's what max_discard_sectors was for, but... eh.
> 
> How big is the device you were trying to mkfs, anyway?

The largest single SSD I have is 6.4TB. Other folks I know have larger.