On Thu, Oct 26, 2017 at 12:59:23PM -0700, Darrick J. Wong wrote: > > Sure, but now you have to go fix mke2fs and everything /else/ that > issues BLKDISCARD (or FALLOC_FL_PUNCH) on a large file / device, and > until you fix every program to work around this weird thing in the > kernel there'll still be someone somewhere with this timeout problem... e2progs already splits large discards in a loop. ;) > ...so I started digging into what the kernel does with a BLKDISCARD > request, which is to say that I looked at blkdev_issue_discard. That > function uses blk_*_plug() to wrap __blkdev_issue_discard, which in turn > splits the request into a chain of UINT_MAX-sized struct bios. > > 128G's worth of 4G ios == 32 chained bios. > > 2T worth of 4G ios == 512 chained bios. > > So now I'm wondering, is the problem more that the first bio in the > chain times out because the last one hasn't finished yet, so the whole > thing gets aborted because we chained too much work together? You're sort of on the right track. The timeouts are set on an individual request in the chain rather than one timeout for the entire chain. All the bios in the chain get turned into 'struct request' and sent to the low-level driver. The driver calls blk_mq_start_request before sending to hardware. That starts the timer on _that_ request, independent of the other requests in the chain. NVMe supports very large queues. A 4TB discard becomes 1024 individual requests started at nearly the same time. The last ones in the queue are the ones that risk timeout. When we're doing read/write, latencies at the same depth are well within tolerance, and high queue depths are good for throughput. When doing discard, though, tail latencies fall outside the timeout tolerance at the same queue depth. > Would it make sense to fix __blkdev_issue_discard to chain fewer bios > together? Or just issue the bios independently and track the > completions individually? > > > > I'm not wise in the ways of queueing and throttling, but from my naiive > > > perspective, it seems like something to be fixed in the kernel, or if it > > > can't, export some new "maximum discard request size" which can be trusted? > > I would've thought that's what max_discard_sectors was for, but... eh. > > How big is the device you were trying to mkfs, anyway? The largest single SSD I have is 6.4TB. Other folks I know have larger. -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html