On Wed, Aug 27, 2014 at 03:04:32PM -0400, Dan Streetman wrote:
> On Wed, Aug 27, 2014 at 12:59 PM, David Horner <ds2horner@xxxxxxxxx> wrote:
> > On Wed, Aug 27, 2014 at 12:29 PM, Dan Streetman <ddstreet@xxxxxxxx> wrote:
> >> On Wed, Aug 27, 2014 at 11:35 AM, David Horner <ds2horner@xxxxxxxxx> wrote:
> >>> On Wed, Aug 27, 2014 at 11:14 AM, Dan Streetman <ddstreet@xxxxxxxx> wrote:
> >>>> On Wed, Aug 27, 2014 at 10:44 AM, David Horner <ds2horner@xxxxxxxxx> wrote:
> >>>>> On Wed, Aug 27, 2014 at 10:03 AM, Dan Streetman <ddstreet@xxxxxxxx> wrote:
> >>>>>> On Tue, Aug 26, 2014 at 10:51 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
> >>>>>>> Hey Joonsoo,
> >>>>>>>
> >>>>>>> On Wed, Aug 27, 2014 at 10:26:11AM +0900, Joonsoo Kim wrote:
> >>>>>>>> Hello, Minchan and David.
> >>>>>>>>
> >>>>>>>> On Tue, Aug 26, 2014 at 08:22:29AM -0400, David Horner wrote:
> >>>>>>>> > On Tue, Aug 26, 2014 at 3:55 AM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
> >>>>>>>> > > Hey Joonsoo,
> >>>>>>>> > >
> >>>>>>>> > > On Tue, Aug 26, 2014 at 04:37:30PM +0900, Joonsoo Kim wrote:
> >>>>>>>> > >> On Mon, Aug 25, 2014 at 09:05:55AM +0900, Minchan Kim wrote:
> >>>>>>>> > >> > @@ -513,6 +540,14 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> >>>>>>>> > >> >  		ret = -ENOMEM;
> >>>>>>>> > >> >  		goto out;
> >>>>>>>> > >> >  	}
> >>>>>>>> > >> > +
> >>>>>>>> > >> > +	if (zram->limit_pages &&
> >>>>>>>> > >> > +	    zs_get_total_pages(meta->mem_pool) > zram->limit_pages) {
> >>>>>>>> > >> > +		zs_free(meta->mem_pool, handle);
> >>>>>>>> > >> > +		ret = -ENOMEM;
> >>>>>>>> > >> > +		goto out;
> >>>>>>>> > >> > +	}
> >>>>>>>> > >> > +
> >>>>>>>> > >> >  	cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
> >>>>>>>> > >>
> >>>>>>>> > >> Hello,
> >>>>>>>> > >>
> >>>>>>>> > >> I haven't followed the previous discussion, so I could be wrong.
> >>>>>>>> > >> Why should this enforcement be here?
> >>>>>>>> > >>
> >>>>>>>> > >> I think this has two problems.
> >>>>>>>> > >> 1) alloc/free happens unnecessarily if we have already used memory over the limit.
> >>>>>>>> > >
> >>>>>>>> > > True, but I first implemented the logic in zsmalloc, not zram. As I described in the cover letter, it's not a requirement of zsmalloc but of zram, so it should be in zram. If every user wants it in the future, then we could move the function into zsmalloc. That's what we concluded in the previous discussion.
> >>>>>>>>
> >>>>>>>> Hmm...
> >>>>>>>> The problem is that we can't avoid this unnecessary overhead with this implementation. If we can implement this feature in zram efficiently, it's okay. But I don't think the current form is.
> >>>>>>>
> >>>>>>> If we could add it in zsmalloc, it would be cleaner and more efficient for zram, but as I said, at the moment I didn't want to put zram's requirement into zsmalloc because, to me, it's weird to enforce a max limit in the allocator. It's the client's role, I think.
> >>>>>>>
> >>>>>>> If the current implementation were expensive and rather hard to follow, that would be one reason to move the feature into zsmalloc, but I don't think it causes critical trouble in the zram use case. See below.
> >>>>>>>
> >>>>>>> But I am still open and will wait for others' opinions. If other people think zsmalloc is the better place, I am willing to move it into zsmalloc.
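For reference, here is a minimal annotated sketch of the enforcement path in the hunk quoted above. The zs_malloc() context lines are paraphrased from zram_bvec_write() and may not match the patch exactly; only the limit check itself is taken verbatim from the hunk:

        handle = zs_malloc(meta->mem_pool, clen);
        if (!handle) {
                ret = -ENOMEM;
                goto out;
        }

        /*
         * zram cannot know in advance how many pages zs_malloc() will add
         * to the pool (that depends on which size class has to grow), so
         * the limit is checked only after the allocation; on overshoot the
         * object is freed again and the write fails with -ENOMEM.
         */
        if (zram->limit_pages &&
            zs_get_total_pages(meta->mem_pool) > zram->limit_pages) {
                zs_free(meta->mem_pool, handle);
                ret = -ENOMEM;
                goto out;
        }

        cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
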
> >>>>>>
> >>>>>> Moving it into zsmalloc would allow rejecting new zsmallocs before actually crossing the limit, since it can calculate that internally. However, with the current patches the limit will only be briefly crossed, and it should not be crossed by a large amount. Now, if this is happening repeatedly and quickly during extreme memory pressure, the constant alloc/free will clearly be worse than a simple internal calculation and failure. But would it ever happen repeatedly once the zram limit is reached?
> >>>>>>
> >>>>>> Now that I'm thinking about the limit from the perspective of the zram user, I wonder what really will happen. If zram is being used for swap space, then when swap starts getting errors trying to write pages, how damaging will that be to the system? I haven't checked what swap does when it encounters disk errors. Of course, with no zram limit, continually writing to zram until memory is totally consumed isn't good either. But in any case, I would hope that swap would not repeatedly hammer on a disk when it's getting write failures from it.
> >>>>>>
> >>>>>> Alternately, if zram was being used as a compressed ram disk for regular file storage, it's entirely up to the application to handle write failures, so it may continue to try to write to a full zram disk.
> >>>>>>
> >>>>>> As far as what the zsmalloc API would look like with the limit added, it would need a setter and getter function (adding it as a param to the create function would be optional, I think). But more importantly, it would need to handle multiple ways of specifying the limit. In our specific current use cases, zram and zswap, each handles its internal limit differently - zswap currently uses a % of total ram as its limit (defaulting to 20%), while with these patches zram will use a specific number of bytes as its limit (defaulting to no limit). If the limiting mechanism is moved into zsmalloc (and possibly zbud), then either both users need to use the same units (bytes or % of ram), or zsmalloc/zbud need to be able to set their limit in either unit. It seems to me that keeping the limit in zram/zswap is currently preferable, at least without both using the same limit units.
> >>>>>
> >>>>> zswap knows what 20% (or whatever % it currently uses, and perhaps it too will become a tuning knob) of memory is in bytes.
> >>>>>
> >>>>> So, if the interface to establish a limit for a pool (or pool set, or whatever zsmalloc sets up for its allocation mechanism) is stipulated in bytes (whether it actually uses pages internally, or vice versa), then both can use that interface: zram with its native page stipulation, and zswap with its calculated % of memory.
> >>>>
> >>>> No, unless zswap monitors memory hotplug and updates the limit on each hotplug event, 20% of the *current* total ram at zswap initialization is not equal to an actual 20%-of-ram limit. zswap checks its size against totalram_pages for each new allocation. I don't think we would prefer adding memory hotplug monitoring to zswap just to update the zpool size limit.
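To make that concrete, the check zswap performs at store time looks roughly like the sketch below (identifiers are illustrative rather than copied from mm/zswap.c). Because totalram_pages is re-read on every store, the percentage limit tracks memory hotplug without any notifier:

        static unsigned int zswap_max_pool_percent = 20;   /* limit as % of RAM */

        static bool zswap_is_full(unsigned long pool_pages)
        {
                /*
                 * Re-reading totalram_pages here is what makes this a true
                 * "20% of current RAM" check: the comparison always uses
                 * the present total, so no hotplug notifier is needed.
                 */
                return totalram_pages * zswap_max_pool_percent / 100 < pool_pages;
        }

At store time zswap simply refuses the store with an error when this returns true, which is the pre-check/fail-on-full behaviour discussed further down in the thread.
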
> >>>
> >>> OK - I see the need to retain the limits where they are in the using components, so that zsmalloc is not unnecessarily complicated (keeping track of two limit methods).
> >>>
> >>> So, zswap has the same race conditions and possible transient over-allocations? It looks like I will have to check on how zswap implements it. But perhaps you can answer the question that is not in the code: has thrashing behaviour around the 20% limit been reported for zswap?
> >>
> >> zswap does a simple over-allocation check before allocating anything. So during page store, it checks if (total_ram * 0.20) < used. This actually places the effective limit higher than the specified limit, but only by a single allocation. This approach could be taken with zram as well.
> >>
> >> The amount of over-allocation (past the specified limit) would vary between zsmalloc and zbud. Since zbud grows itself in page increments, any over-allocation past the zswap limit would be by only 1 page. However, zsmalloc is variable in its allocation increments, as it depends on which class needs to be grown; zsmalloc is divided into many "classes", each of which contains some number of "zspages" that try to precisely contain some number of N-sized areas; e.g. one class might use zspages that are 2 pages long to store 3 separate areas of 2/3 of a page each; if that class needed to be grown, it would add one zspage that is 2 pages. The max number of actual pages per zspage is defined by ZS_MAX_PAGES_PER_ZSPAGE, which is currently set to 1<<2, so 4.
> >>
> >> So zswap will over-allocate memory past its specified limit, by up to 1 page (with zbud) or up to 4 pages (with zsmalloc). zram could do the same: simply check if its size > limit before each write, and fail if so; that would remove the alloc/free issue, and would only over-allocate by at most 4 pages (with the current zsmalloc settings). Alternately, zram could check if (current_size + 4 pages) > limit, which would then stop it short of the limit by up to 4 pages. Really though, 4 pages either above or under the limit probably doesn't matter.
> >
> > I agree that it isn't the over-allocation of a few pages per se, but if there is potential for concurrent allocation through either zram or zswap, there is the possibility of high-contention thrashing as the processes jockey for those final pages. Perhaps not a thundering herd, but it could readily become a hot spot as the pressure is on and there is no explicit mechanism to throttle back retries and new efforts.
> >
> > I agree that a pre-check before the zsmalloc request in zram may be a valuable optimization, but only if there is a real possibility of threshold thrashing. I suggested before that it might be best to wait and see.... Do you have any suggestions on a stress-test approach to empirically assess the risk?
>
> Minchan is the expert here so he may have suggestions and/or existing tests, but I think just creating a zram device with a limit, putting any filesystem on it and mounting it, then filling up the device, would be a good basic test. Maybe once it gets close to the limit, try writing to it with multiple threads concurrently.
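(On the pre-check idea above: to make it concrete, here is a rough sketch of what the pre-check/fail-on-full variant could look like near the top of the allocation path in zram_bvec_write(). This is only an illustration; the exact placement and context lines are assumptions, not a patch.)

        /*
         * Pre-check/fail-on-full: refuse the write when the pool is already
         * at the limit instead of allocating and then freeing on overshoot.
         * Like zswap's check, this can still overshoot by one zspage, i.e.
         * by up to ZS_MAX_PAGES_PER_ZSPAGE (currently 4) pages.
         */
        if (zram->limit_pages &&
            zs_get_total_pages(meta->mem_pool) >= zram->limit_pages) {
                ret = -ENOMEM;
                goto out;
        }

        handle = zs_malloc(meta->mem_pool, clen);
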
> Also, while near/at the limit, try concurrently writing to it and deleting from it - probably writing many small files as well as deleting many small files would be best to keep it right at its limit. However, I think zram_bvec_write() is already protected inside the init_lock semaphore (via zram_make_request), so it seems like there will never be any concurrency issue.

Strictly speaking, it's a read-side lock, so writes can still run concurrently.

> Since the threads writing to files at the limit will be getting (expected) errors during write, what the test is really looking at is whether there is any significant performance difference between an alloc/free-on-full approach and a pre-check/fail-on-full approach. That might not become apparent until there is significant memory pressure, so you might set the zram limit high enough, or you could otherwise artificially use up system memory (e.g. stress -m).
>
> Also, as a general test of the new zram limit, it would be interesting to see how various filesystem drivers handle unexpected write failures... I'm no fs expert, but I would expect that this is different than a typical ENOSPC condition, which I assume comes from the filesystem, not the block layer. This error would be more like a hardware error.

True. Actually, I hit a BUG_ON in ext4 but haven't looked into it in depth yet. Maybe I will report it.

> I'd also be interested in what happens when a zram-backed swap device reaches its limit... I would hope swap device write errors would lead to a normal OOM condition and invoke the oom-killer, but from page_io.c end_swap_bio_write:
>
>  * We failed to write the page out to swap-space.
>  * Re-dirty the page in order to avoid it being reclaimed.
>  * Also print a dire warning that things will go BAD (tm)
>  * very quickly.

Yep, that's bad. As I said, the VM doesn't have congestion control for swap at the moment; it could be improved, and I am looking at that.

> >
> > Thanks again.
> >
> >>>
> >>> thanks.
> >>>
> >>>>> Both would need a mechanism to change the max as needs change, so the API has to handle this.
> >>>>>
> >>>>> Or am I way off base?
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>> > >
> >>>>>>>> > > Another idea is we could call zs_get_total_pages right before zs_malloc, but the problem is we cannot know in advance how many pages zsmalloc will allocate. IOW, zram should be blind to zsmalloc's internals.
> >>>>>>>> >
> >>>>>>>> > We did however suggest that we could check beforehand to see if the max was already exceeded, as an optimization (possibly with a guess on usage, but at least using the minimum of 1 page). In the contested case, the max may already be exceeded transiently and therefore we know this one _could_ fail (it could also pass, but the odds aren't good). As Minchan mentions, this was discussed before - but not in great detail. Testing should be done to determine the possible benefit. And as he also mentions, the better place for it may be in zsmalloc, but that requires an API change.
> >>>>>>>>
> >>>>>>>> Why do we hesitate to change the zsmalloc API? It is an in-kernel API and there are just two users now, zswap and zram. We can change it easily. I think we just need the following simple API change in zsmalloc.c:
> >>>>>>>>
> >>>>>>>>   zs_zpool_create(gfp_t gfp, struct zpool_ops *zpool_op)
> >>>>>>>>     =>
> >>>>>>>>   zs_zpool_create(unsigned long limit, gfp_t gfp, struct zpool_ops *zpool_op)
> >>>>>>>>
> >>>>>>>> It's a pool allocator, so there is no obstacle to limiting maximum memory usage in zsmalloc. It's a natural idea to limit memory usage for a pool allocator.
> >>>>>>>>
> >>>>>>>> > Certainly a detailed suggestion could happen on this thread and I'm also interested in your thoughts, but this patchset should be able to go in as is. Memory-exhaustion avoidance probably trumps the possible thrashing at the threshold.
> >>>>>>>> > >
> >>>>>>>> > > About the alloc/free cost once the limit is exceeded, I don't think it's important to consider. Do you have any scenario in mind where the alloc/free cost matters once the limit is over?
> >>>>>>>> > >
> >>>>>>>> > >> 2) Even if this request doesn't do a new allocation, it could fail due to someone else's allocation. There is a time gap between allocation and free, so a legitimate user who wants to use already-allocated zsmalloc memory could also see this condition as true and then fail.
> >>>>>>>> > >
> >>>>>>>> > > Yeb, we already discussed that. :) Such a false positive shouldn't be a severe problem if we can keep the promise that a zram user cannot exceed mem_limit.
> >>>>>>>>
> >>>>>>>> If we can keep such a promise, why do we need to limit memory usage? I guess this limit feature is useful for users who can't keep such a promise. So we should assume that this false positive happens frequently.
> >>>>>>>
> >>>>>>> The goal is to limit memory usage within some threshold, so a false positive shouldn't be harmful unless it exceeds the threshold. In addition, if such false positives happen frequently, it means zram is in serious trouble, so the user would see lots of write-failure messages and sometimes a really slow system if zram is used for swap. If we protect just one write from the race, how much does it help this situation? I don't think it's a critical problem.
> >>>>>>>
> >>>>>>>> > And we cannot avoid the race, nor can we avoid transient inconsistent states in a low-overhead, competitive, concurrent process. Different views for different observers. They are a consequence of the theory of "Special Computational Relativity". I am working on a String Unification Theory of Quantum and General CR in LISP. ;-)
> >>>>>>>>
> >>>>>>>> If we move the limit logic into zsmalloc, we can avoid the race by committing the needed memory size before the actual allocation attempt. This committing serializes concurrent processes, so there is no race here. There is a possibility of failing to allocate, but I think this is better than allocating and freeing blindly based on inconsistent state.
> >>>>>>>
> >>>>>>> Normally, zs_malloc/zs_free allocates objects from the existing pool, so it's not a big overhead, and if someone keeps trying to write once the limit is full, the other overhead (vfs, fs, block) would be bigger than zsmalloc's, so it's not a problem, I think.
> >>>>>>>
> >>>>>>>> Thanks.
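If the limit ever does move into zsmalloc, the proposals above roughly amount to a per-pool setter/getter plus a check made before the pool grows. A hedged sketch, purely for illustration - none of these functions or the limit_pages field exist today, and the names (zs_set_limit_pages, zs_can_grow) are made up; only zs_get_total_pages comes from the patch under discussion, and pages_per_zspage reflects the per-class zspage size described earlier in the thread:

        /* Hypothetical zsmalloc-side limit (sketch only, not existing API). */
        void zs_set_limit_pages(struct zs_pool *pool, unsigned long limit)
        {
                pool->limit_pages = limit;      /* 0 means "no limit" */
        }

        unsigned long zs_get_limit_pages(struct zs_pool *pool)
        {
                return pool->limit_pages;
        }

        /* Would be called from zs_malloc() when a size class must grow. */
        static bool zs_can_grow(struct zs_pool *pool, struct size_class *class)
        {
                /*
                 * zsmalloc knows exactly how many pages the new zspage for
                 * this class would take, so the limit can be enforced before
                 * any allocation happens - no alloc-then-free on overshoot,
                 * and no race on a transiently exceeded limit.
                 */
                return !pool->limit_pages ||
                       zs_get_total_pages(pool) + class->pages_per_zspage <=
                                pool->limit_pages;
        }
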
> >>>>>>>
> >>>>>>> --
> >>>>>>> Kind regards,
> >>>>>>> Minchan Kim

--
Kind regards,
Minchan Kim