On Mon, Oct 14, 2024 at 4:57 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote: > > On Mon, Oct 14, 2024 at 02:04:07PM GMT, Joanne Koong wrote: > > On Mon, Oct 14, 2024 at 11:38 AM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote: > > > > > > On Mon, Oct 14, 2024 at 11:22:27AM GMT, Joanne Koong wrote: > > > > Currently in shrink_folio_list(), reclaim for folios under writeback > > > > falls into 3 different cases: > > > > 1) Reclaim is encountering an excessive number of folios under > > > > writeback and this folio has both the writeback and reclaim flags > > > > set > > > > 2) Dirty throttling is enabled (this happens if reclaim through cgroup > > > > is not enabled, if reclaim through cgroupv2 memcg is enabled, or > > > > if reclaim is on the root cgroup), or if the folio is not marked for > > > > immediate reclaim, or if the caller does not have __GFP_FS (or > > > > __GFP_IO if it's going to swap) set > > > > 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag > > > > set and the caller did not have __GFP_FS (or __GFP_IO if swap) set > > > > > > > > In cases 1) and 2), we activate the folio and skip reclaiming it while > > > > in case 3), we wait for writeback to finish on the folio and then try > > > > to reclaim the folio again. In case 3, we wait on writeback because > > > > cgroupv1 does not have dirty folio throttling, as such this is a > > > > mitigation against the case where there are too many folios in writeback > > > > with nothing else to reclaim. > > > > > > > > The issue is that for filesystems where writeback may block, sub-optimal > > > > workarounds need to be put in place to avoid potential deadlocks that may > > > > arise from the case where reclaim waits on writeback. (Even though case > > > > 3 above is rare given that legacy cgroupv1 is on its way to being > > > > deprecated, this case still needs to be accounted for) > > > > > > > > For example, for FUSE filesystems, when a writeback is triggered on a > > > > folio, a temporary folio is allocated and the pages are copied over to > > > > this temporary folio so that writeback can be immediately cleared on the > > > > original folio. This additionally requires an internal rb tree to keep > > > > track of writeback state on the temporary folios. Benchmarks show > > > > roughly a ~20% decrease in throughput from the overhead incurred with 4k > > > > block size writes. The temporary folio is needed here in order to avoid > > > > the following deadlock if reclaim waits on writeback: > > > > * single-threaded FUSE server is in the middle of handling a request that > > > > needs a memory allocation > > > > * memory allocation triggers direct reclaim > > > > * direct reclaim waits on a folio under writeback (eg falls into case 3 > > > > above) that needs to be written back to the fuse server > > > > * the FUSE server can't write back the folio since it's stuck in direct > > > > reclaim > > > > > > > > This commit adds a new flag, AS_NO_WRITEBACK_RECLAIM, to "enum > > > > mapping_flags" which filesystems can set to signify that reclaim > > > > should not happen when the folio is already in writeback. This only has > > > > effects on the case where cgroupv1 memcg encounters a folio under > > > > writeback that already has the reclaim flag set (eg case 3 above), and > > > > allows for the suboptimal workarounds added to address the "reclaim wait > > > > on writeback" deadlock scenario to be removed. > > > > > > > > Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx> > > > > --- > > > > include/linux/pagemap.h | 11 +++++++++++ > > > > mm/vmscan.c | 6 ++++-- > > > > 2 files changed, 15 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > > > > index 68a5f1ff3301..513a72b8451b 100644 > > > > --- a/include/linux/pagemap.h > > > > +++ b/include/linux/pagemap.h > > > > @@ -210,6 +210,7 @@ enum mapping_flags { > > > > AS_STABLE_WRITES = 7, /* must wait for writeback before modifying > > > > folio contents */ > > > > AS_INACCESSIBLE = 8, /* Do not attempt direct R/W access to the mapping */ > > > > + AS_NO_WRITEBACK_RECLAIM = 9, /* Do not reclaim folios under writeback */ > > > > > > Isn't it "Do not wait for writeback completion for folios of this > > > mapping during reclaim"? > > > > I think if we make this "don't wait for writeback completion for > > folios of this mapping during reclaim", then the > > mapping_no_writeback_reclaim check in shrink_folio_list() below would > > need to be something like this instead: > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 885d496ae652..37108d633d21 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1190,7 +1190,8 @@ static unsigned int shrink_folio_list(struct > > list_head *folio_list, > > /* Case 3 above */ > > } else { > > folio_unlock(folio); > > - folio_wait_writeback(folio); > > + if (mapping && > > !mapping_no_writeback_reclaim(mapping)) > > + folio_wait_writeback(folio); > > /* then go back and try same folio again */ > > list_add_tail(&folio->lru, folio_list); > > continue; > > The difference between the outcome for Case 2 and Case 3 is that in Case > 2 the kernel is putting the folio in an active list and thus the kernel > will not try to reclaim it in near future but in Case 3, the kernel is > putting back in the list from which it is currently reclaiming meaning > the next iteration will try to reclaim the same folio. > > We definitely don't want it in Case 3. > > > > > which I'm not sure if that would be the correct logic here or not. > > I'm not too familiar with vmscan, but it seems like if we are going to > > reclaim the folio then we should wait on it or else we would just keep > > trying the same folio again and again and wasting cpu cycles. In this > > current patch (if I'm understanding this mm code correctly), we skip > > reclaiming the folio altogether if it's under writeback. > > > > Either one (don't wait for writeback during reclaim or don't reclaim > > under writeback) works for mitigating the potential fuse deadlock, > > but I was thinking "don't reclaim under writeback" might also be more > > generalizable to other filesystems. > > > > I'm happy to go with whichever you think would be best. > > Just to be clear that we are on the same page that this scenario should > be handled in Case 2. Our difference is on how to describe the scenario. > To me the reason we are taking the path of Case 2 is because we don't > want what Case 3 is doing and thus wrote that. Anyways I don't think it > is that importatnt, use whatever working seems reasonable to you. Gotcha, thanks for clarifying. Your point makes sense to me - if we go this route we should also probably change the name to AS_NO_RECLAIM_WAIT_WRITEBACK or something like that to make it more congruent. For now, I'll keep it as AS_NO_WRITEBACK_RECLAIM because I think that might be more generalizable of a use case for other filesystems too. > > BTW you will need to update the comment for Case 2 which is above code > block. Great point, I will do this in v3. Thanks, Joanne > > thanks, > Shakeel