On Wed, May 31, 2023 at 3:06 AM Chris Li <chrisl@xxxxxxxxxx> wrote:
>
> On Tue, May 30, 2023 at 02:54:51PM -0400, Johannes Weiner wrote:
> > > Maybe ENOMEM is a bad example. How about if the swap device
> > > just went bad and can't complete new IO writes?
> >
> > This is actually outside the scope of zswap, and handled by the
> > swapcache (end_swap_bio_write).
> >
> > Once the IO is submitted, zswap will ax its copy and leave the rest
> > to the swapcache. It behaves the same way as if zswap had never been
> > involved to begin with when the swap out fails on IO errors.
> >
> > From a zswap perspective, there are no persistent errors in moving a
> > zswap entry back into the swapcache. Not just currently, but generally.
>
> Again, you are right that this zswap writeback is async, so the
> writeback error is NOT going to propagate to the shrink function.
>
> With the current three pool backends that I looked at (zbud, z3fold,
> zsmalloc), they all retry 8 times internally. Adding more retries did
> not fundamentally change the existing behavior.
>
> I looked at all the possible error codes generated inside the reclaim
> code; the only noticeable errors are ENOMEM and those from a
> concurrent swap invalidation or a racing swapin fault.
>
> BTW, zswap reclaim consumes memory. Continuing to loop on ENOMEM
> might cause more OOM, but that can happen with the current code as
> well.
>
> > > > Aside from -ENOMEM, writeback_entry will fail on concurrent swap
> > > > invalidation or a racing swapin fault. In both cases we should
> > > > absolutely keep trying other entries until the goal is met.
> > >
> > > How about a narrower fix recognizing those error cases and making
> > > the inner loop continue on those errors?
> >
> > Right, I just don't really see the value proposition of this
> > complication, and I see some downsides (see below). No single entry
> > error should ever cause us to stop the wider reclaim loop.
>
> That is, until the current LRU list has been gone through once.
> I expect repeating the same list to yield fewer reclaimed pages.
>
> > > > > > extreme case where it's the only page left on the list, I again don't
> > > > > > see how retrying a few times will make the situation worse.
> > > > > >
> > > > > > In practice, IMO there is little upside in trying to be more
> > > > > > discerning about the error codes. Simple seems better here.
> > > > >
> > > > > Just trying to think about what should be the precise loop
> > > > > termination condition here.
> > > > >
> > > > > I still feel blindly trying a few times is a very imprecise condition.
> > > >
> > > > The precise termination condition is when can_accept() returns true
> > > > again. The safety cap is only added as a precaution to avoid infinite
> > > > loops if something goes wrong or unexpected, now or in the future.
> > >
> > > In my mind, that statement already suggests can_accept() is not
> > > *precise*, considering the need to avoid an infinite loop.
> > > e.g. Do we know what the optimal cap value is, and why that value
> > > is optimal?
> >
> > Oh but it is precise. That's the goal we want to accomplish.
>
> I understand it is the goal; the precise condition I am talking about
> is the loop termination condition, and can_accept() is not the only
> one. Anyway, let's move on.
>
> > The cap is just so that in case something unexpectedly goes wrong (a
> > bug), we fail gracefully and don't lock up the machine. The same
> > reason we prefer WARN_ONs over BUG_ONs if we can: to avoid crashes.
> > That's really all there is to it, and it strikes me as a reasonable
> > and robust design choice. It's fine to limp along or be suboptimal
> > after such a bug happens; the bar is avoiding an infinite loop,
> > nothing else.
> >
> > Your suggestion of whitelisting certain errors is more complicated,
> > but also less robust: in case an entry error does by some accident
> > become persistent for the whole LRU, we're locking up the host. We'd
> > rather catch a bug like this by seeing spikes in the reclaim failure
> > rate than by losing production machines.
> >
> > > Putting the definition of precise aside, I do see that the
> > > unconditional retry can have unwanted effects.
> >
> > I hope I could address this above. But if not, please share your
> > concerns.
>
> Thanks for the discussion. I am less concerned about the retry now.
> Retrying on EAGAIN might be the simplest way to proceed.
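
Agreed. To make the termination condition concrete, the shrink worker
loop I have in mind boils down to roughly the shape below, with the
EAGAIN check folded in. Take it as a sketch rather than the literal
patch: zswap_can_accept() and zpool_shrink() are the existing helpers,
zswap_reject_reclaim_fail the existing counter, while the cap name
MAX_RECLAIM_RETRIES and its value are placeholders.

static void shrink_worker(struct work_struct *w)
{
        struct zswap_pool *pool = container_of(w, typeof(*pool),
                                               shrink_work);
        int ret, failures = 0;

        do {
                /* Try to write one LRU entry back to the swap device. */
                ret = zpool_shrink(pool->zpool, 1, NULL);
                if (ret) {
                        zswap_reject_reclaim_fail++;
                        /* Only keep looping on transient (race) errors. */
                        if (ret != -EAGAIN)
                                break;
                        /* Safety cap so a bug can't turn into a livelock. */
                        if (++failures == MAX_RECLAIM_RETRIES)
                                break;
                }
                cond_resched();
        } while (!zswap_can_accept());

        zswap_pool_put(pool);
}

The loop exits as soon as zswap_can_accept() returns true again; the
failure counter is only there to fail gracefully if something else
goes wrong, as Johannes described above.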
>
> Outside of the scope of this patch, I am still surprised to see such
> a high number of retries caused by race conditions. There are 8
> inner-loop retries already, so the actual number of pages retried
> needs to be multiplied by 8.
>
> If there is a reproducer script, I want to reproduce this locally to
> understand it better. I wish there were ways to reduce the retries.

With `stress` you can reproduce the errors quite easily. If you're not
going to see many (if any at all) writebacks _without_ this patch, you
can let the zpool fill, keeping the default max_pool_percent value of
20, and then, once it is full, switch max_pool_percent to 1 (a rough,
runnable version of these steps is in the P.S. below).

>
> Another idea is that we can start the shrinking once the pool max
> was reached. Try to reduce to the threshold earlier.

The shrinking already starts when the pool max is reached, with and
without the proposed patch. If you're referring to a new threshold
that would kick off the shrinking, that could be a nice improvement
to the overall mechanism, and it would fit nicely on top of this
patch.

>
> Chris
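
P.S. A rough, runnable version of the reproduction steps above, in
case it helps with reproducing this locally. This is purely
illustrative: it assumes root, zswap enabled, the `stress` tool
installed, and the sizes and timings are arbitrary values to tune to
the machine.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Write a value to zswap's max_pool_percent module parameter. */
static void set_max_pool_percent(const char *val)
{
        FILE *f = fopen("/sys/module/zswap/parameters/max_pool_percent", "w");

        if (!f) {
                perror("max_pool_percent");
                exit(1);
        }
        fputs(val, f);
        fclose(f);
}

int main(void)
{
        /* Keep the default 20% limit while the zpool fills up. */
        set_max_pool_percent("20");

        /* Generate anonymous memory pressure in the background. */
        if (system("stress --vm 4 --vm-bytes 1G --vm-keep --timeout 120 &"))
                fprintf(stderr, "failed to launch stress\n");

        /*
         * Crude wait for the pool to fill; polling the pool_total_size
         * counter in debugfs would be more precise.
         */
        sleep(60);

        /* Drop the limit to 1% and watch the writeback/reject counters. */
        set_max_pool_percent("1");
        return 0;
}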