On Thu, Oct 12, 2023 at 7:13 AM 贺中坤 <hezhongkun.hzk@xxxxxxxxxxxxx> wrote:
>
> Hi Nhat, thanks for your detailed reply.
>
> > We're currently trying to solve this exact problem. Our approach is to
> > add a shrinker that automatically shrinks the size of the zswap pool:
> >
> > https://lore.kernel.org/lkml/20230919171447.2712746-1-nphamcs@xxxxxxxxx/
> >
> > It is triggered on memory pressure, and can perform reclaim in a
> > workload-specific manner.
> >
> > I'm currently working on v3 of this patch series, but in the meantime,
> > could you take a look and see if it will address your issues as well?
> >
> > Comments and suggestions are always welcome, of course :)
> >
>
> Thanks, I've seen both patches. But we hope to be able to reclaim memory
> in advance, regardless of memory pressure, like memory.reclaim in memcg,
> so we can offload memory in different tiers.

As Johannes pointed out, with a zswap shrinker we can just push on the
memory.reclaim knob, and the reclaim will automatically get pushed down
the pipeline:

memory -> swap -> zswap

That seems a bit more natural and user-friendly to me than making users
manually decide when to push zswap entries out to swap.

My ideal vision of how all of this should go is that users provide an
abstract declaration of their requirements, and the specific decision of
what to do is left to the kernel, as transparently to the user as
possible. This philosophy extends to multi-tier memory management in
general, not just the above 3-tier model.
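For concreteness, here is a minimal sketch of what pushing on that knob
looks like from userspace. The cgroup path "/sys/fs/cgroup/workload" and
the 128M amount are made-up placeholders (not from any of the patches
discussed here); the point is just that a single memory.reclaim write is
the whole interface, and with a zswap shrinker in place it trickles down
the memory -> swap -> zswap pipeline:

/*
 * Hypothetical example: ask the kernel to proactively reclaim 128M from
 * an example cgroup by writing to its cgroup v2 memory.reclaim file.
 * The cgroup name "workload" and the amount are placeholders.
 */
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/workload/memory.reclaim";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* memparse-style size string, same as "echo 128M > memory.reclaim" */
	if (fprintf(f, "128M\n") < 0)
		perror("fprintf");
	return fclose(f) ? 1 : 0;
}

The proactive reclaimer can issue this on whatever cadence it already
runs, without having to reason about which tier the pages end up in.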
> > My concern with this approach is that this value seems rather arbitrary.
> > I imagine that it is workload- and memory access pattern-dependent,
> > and will have to be tuned. Other than a couple of big users, no one
> > will have the resources to do this.
> >
> > And since this is a one-off knob, there's another parameter users
> > will have to decide - frequency, i.e. how often should the userspace
> > agent trigger this reclaim action. This is again very hard to determine
> > a priori, and most likely has to be tuned as well.
> >
>
> I totally agree with you, this is the key point of this approach. It depends
> on how we define cold pages, which are usually measured in time,
> such as not being accessed for 600 seconds, etc. So the frequency
> should be greater than 600 seconds.

I guess my main concern here is - how do you determine the value of 600
seconds in the first place? And yes, the frequency should be greater than
the oldness cutoff, but how much greater?

We can run experiments to decide which cutoff hurts performance the least
(or improves it the most), but that value will be specific to our workload
and memory access patterns. Other users might need an entirely different
value, and they might not have the resources to find out.

If it's just a binary decision (on or off), then at least it could be a
single A/B experiment (per workload/service). But the range here could
vary wildly. Is there at least a default value that works decently well
across workloads/services, in your experience?

> > I think there might be some issues with just storing the store time here
> > as well. IIUC, there might be cases where the zswap entry
> > is accessed and brought into memory, but that entry (with the associated
> > compressed memory) still hangs around. For e.g. and more context,
> > see this patch that enables exclusive loads:
> >
> > https://lore.kernel.org/lkml/20230607195143.1473802-1-yosryahmed@xxxxxxxxxx/
> >
> > If that happens, this sto_time field does not tell the full story, right?
> > For instance, if an object is stored a long time ago, but has been
> > accessed since, it shouldn't be considered a cold object that should be
> > a candidate for reclaim. But the old sto_time would indicate otherwise.
> >
>
> Thanks for your review, we should update the store time when it was loaded.
> But it confused me, there are two copies of the same page in memory
> (compressed and uncompressed) after faulting in a page from zswap if
> 'zswap_exclusive_loads_enabled' was disabled. I didn't notice any difference
> when turning that option on or off because the frontswap_ops has been removed
> and there is no frontswap_map anymore. Sorry, am I missing something?

I believe Johannes has explained the case where this could happen. But
yeah, this should be fixable by updating the stored time field on access
(and maybe renaming it to something a bit more fitting -
last_accessed_time?). A rough sketch of that bookkeeping is below.

Regardless, it is incredibly validating to see that other parties share
the same problems as us :) It's not a super invasive change either. I just
don't think it solves the issue that well for every zswap user.
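To make the "update the time field on access" idea concrete, here is a
rough userspace sketch of the bookkeeping, assuming each entry keeps a
timestamp. This is not the actual struct zswap_entry or the code from the
patch - the struct, helpers and field name below are made up for
illustration. The timestamp is set at store time and refreshed on load,
so a reclaim pass comparing it against an oldness cutoff will not treat a
recently faulted (but still resident) compressed entry as cold:

/*
 * Illustrative sketch only - hypothetical types and helpers, not kernel
 * code. Shows the "refresh the timestamp on access" bookkeeping.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct entry {
	time_t last_accessed_time;	/* was: sto_time, set only at store */
};

static void entry_store(struct entry *e)
{
	e->last_accessed_time = time(NULL);	/* stamped when compressed */
}

static void entry_load(struct entry *e)
{
	/*
	 * With non-exclusive loads, the compressed copy can outlive the
	 * fault; refreshing the stamp keeps it from looking cold.
	 */
	e->last_accessed_time = time(NULL);
}

static bool entry_is_cold(const struct entry *e, time_t cutoff_secs)
{
	return time(NULL) - e->last_accessed_time > cutoff_secs;
}

int main(void)
{
	struct entry e;

	entry_store(&e);
	entry_load(&e);	/* fault the page back in; entry sticks around */
	printf("cold at a 600s cutoff: %d\n", entry_is_cold(&e, 600));
	return 0;
}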