On 2019/07/29 8:42, Dave Chinner wrote:
> On Sat, Jul 27, 2019 at 02:59:59AM +0000, Damien Le Moal wrote:
>> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
>>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>>>
>>>>> This looks like something that could hit every file systems, so
>>>>> shouldn't we fix this in common code? We could also look into
>>>>> just using memalloc_nofs_save for the page cache allocation path
>>>>> instead of the per-mapping gfp_mask.
>>>>
>>>> I think it has to be the entire IO path - any allocation from the
>>>> underlying filesystem could recurse into the top level filesystem
>>>> and then deadlock if the memory reclaim submits IO or blocks on
>>>> IO completion from the upper filesystem. That's a bloody big hammer
>>>> for something that is only necessary when there are stacked
>>>> filesystems like this....
>>>
>>> Yeah.... that's why using memalloc_nofs_save() probably makes the most
>>> sense, and dm_zoned should use that before it calls into ext4.
>>
>> Unfortunately, with this particular setup, that will not solve the problem.
>> dm-zoned submit BIOs to its backend drive in response to XFS activity. The
>> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
>> that HBA command ring. The commands themselves are read from the ring and
>> executed by the tcmu-runner user process which executes them doing
>> pread()/pwrite() to the ext4 file. The tcmu-runner process being a different
>> context than the dm-zoned worker thread issuing the BIO,
>> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.
>
> Right, I'm talking about using memalloc_nofs_save() as a huge hammer
> in the pread/pwrite() calling context, not the bio submission
> context (which is typically GFP_NOFS above submit_bio() and GFP_NOIO
> below).

Yes, I understood your point. And I agree that it indeed would be a big hammer.
We should be able to do better than that :)

>> One simple hack would be an fcntl() or mount option to tell the FS to use
>> GFP_NOFS unconditionally, but avoiding the bug would mean making sure that the
>> applications or system setup is correct. So not so safe.
>
> Wasn't there discussion at some point in the past about an interface
> for special processes to be able to mark themselves as PF_MEMALLOC
> (some kind of prctl, I think) for things like FUSE daemons? That
> would prevent direct reclaim recursion for these userspace daemons
> that are in the kernel memory reclaim IO path. It's the same
> situation there, isn't it? How does fuse deal with this problem?

I do not recall such discussion. But indeed FUSE may give some hints. Good idea.
Thanks. I will check.

-- 
Damien Le Moal
Western Digital Research