Re: 3.13-rc breaks MEMCG_SWAP

Hugh Dickins <hughd@xxxxxxxxxx> · Mon, 16 Dec 2013 18:26:18 -0800 (PST)

On Mon, 16 Dec 2013, Michal Hocko wrote:
> On Mon 16-12-13 00:36:05, Hugh Dickins wrote:
> [...]
> 
> OK, I went through the patch and it looks good except for suspicious
> ctrl->lock handling in swap_cgroup_reassign (see below). I am just
> suggesting to split it into 4 parts

Thanks a lot for studying it, and responding so quickly.
As I remark in reply to Tejun, I'm not nearly so keen on this
approach as you are, and would prefer something short and sweet
from the cgroup end, at least for now; but let's go through your
comments, to keep both options open until we're surer.

> 
> * swap_cgroup_mutex -> swap_cgroup_lock
> * swapon cleanup
> * drop irqsave when taking ctrl->lock
> * mem_cgroup_reparent_swap
> 
> but I will leave the split up to you. Just make sure that the fix is a
> separate patch, please.

Yes indeed, some split like that, maybe even more pieces.  The difficult
part is the ordering: for going into 3.13, we'd prefer a small fix first
and cleanup to follow after in 3.14, but it will be hard to force myself
not to do the cleanup ones first, and until getting down to it I won't
recall how much of the cleanup was essential e.g. to avoid lock ordering
problems or long lock hold times, perhaps.

> 
> [...]
> > --- 3.13-rc4/mm/page_cgroup.c	2013-02-18 15:58:34.000000000 -0800
> > +++ linux/mm/page_cgroup.c	2013-12-15 14:34:36.312485960 -0800
> > @@ -322,7 +322,8 @@ void __meminit pgdat_page_cgroup_init(st
> >  
> >  #ifdef CONFIG_MEMCG_SWAP
> >  
> > -static DEFINE_MUTEX(swap_cgroup_mutex);
> > +static DEFINE_SPINLOCK(swap_cgroup_lock);
> > +
> 
> This one is worth a separate patch IMO.

Agreed.

> 
> >  struct swap_cgroup_ctrl {
> >  	struct page **map;
> >  	unsigned long length;
> > @@ -353,14 +354,11 @@ struct swap_cgroup {
> >  /*
> >   * allocate buffer for swap_cgroup.
> >   */
> > -static int swap_cgroup_prepare(int type)
> > +static int swap_cgroup_prepare(struct swap_cgroup_ctrl *ctrl)
> >  {
> >  	struct page *page;
> > -	struct swap_cgroup_ctrl *ctrl;
> >  	unsigned long idx, max;
> >  
> > -	ctrl = &swap_cgroup_ctrl[type];
> > -
> >  	for (idx = 0; idx < ctrl->length; idx++) {
> >  		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >  		if (!page)
> 
> This with swap_cgroup_swapon should be in a separate patch as a cleanup.

Agreed.

> 
> > @@ -407,18 +405,17 @@ unsigned short swap_cgroup_cmpxchg(swp_e
> >  {
> >  	struct swap_cgroup_ctrl *ctrl;
> >  	struct swap_cgroup *sc;
> > -	unsigned long flags;
> >  	unsigned short retval;
> >  
> >  	sc = lookup_swap_cgroup(ent, &ctrl);
> >  
> > -	spin_lock_irqsave(&ctrl->lock, flags);
> > +	spin_lock(&ctrl->lock);
> >  	retval = sc->id;
> >  	if (retval == old)
> >  		sc->id = new;
> >  	else
> >  		retval = 0;
> > -	spin_unlock_irqrestore(&ctrl->lock, flags);
> > +	spin_unlock(&ctrl->lock);
> >  	return retval;
> >  }
> >  
> > @@ -435,14 +432,13 @@ unsigned short swap_cgroup_record(swp_en
> >  	struct swap_cgroup_ctrl *ctrl;
> >  	struct swap_cgroup *sc;
> >  	unsigned short old;
> > -	unsigned long flags;
> >  
> >  	sc = lookup_swap_cgroup(ent, &ctrl);
> >  
> > -	spin_lock_irqsave(&ctrl->lock, flags);
> > +	spin_lock(&ctrl->lock);
> >  	old = sc->id;
> >  	sc->id = id;
> > -	spin_unlock_irqrestore(&ctrl->lock, flags);
> > +	spin_unlock(&ctrl->lock);
> >  
> >  	return old;
> >  }
> 
> I would prefer these two in a separate patch as well. I have no idea why
> these were IRQ aware as this was never needed AFAICS.
> e9e58a4ec3b10 is not very specific...

Agreed.  Yes, I couldn't work out any justification for the _irqsave
variants, and preferred to avoid that clutter rather than add to it.

> 
> > @@ -451,19 +447,60 @@ unsigned short swap_cgroup_record(swp_en
> >   * lookup_swap_cgroup_id - lookup mem_cgroup id tied to swap entry
> >   * @ent: swap entry to be looked up.
> >   *
> > - * Returns CSS ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
> > + * Returns ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
> >   */
> >  unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
> >  {
> >  	return lookup_swap_cgroup(ent, NULL)->id;
> >  }
> >  
> > +/**
> > + * swap_cgroup_reassign - assign all old entries to new (before old is freed).
> > + * @old: id of emptied memcg whose entries are now to be reassigned
> > + * @new: id of parent memcg to which those entries are to be assigned
> > + *
> > + * Returns number of entries reassigned, for debugging or for statistics.
> > + */
> > +long swap_cgroup_reassign(unsigned short old, unsigned short new)
> > +{
> > +	long reassigned = 0;
> > +	int type;
> > +
> > +	for (type = 0; type < MAX_SWAPFILES; type++) {
> > +		struct swap_cgroup_ctrl *ctrl = &swap_cgroup_ctrl[type];
> > +		unsigned long idx;
> > +
> > +		for (idx = 0; idx < ACCESS_ONCE(ctrl->length); idx++) {
> > +			struct swap_cgroup *sc, *scend;
> > +
> > +			spin_lock(&swap_cgroup_lock);
> > +			if (idx >= ACCESS_ONCE(ctrl->length))
> > +				goto unlock;
> > +			sc = page_address(ctrl->map[idx]);
> > +			for (scend = sc + SC_PER_PAGE; sc < scend; sc++) {
> > +				if (sc->id != old)
> > +					continue;
> 
> Is this safe? What prevents from race when id is set to old?

I am assuming that when this is called, we shall not be making any
new charges to old (if you see what I mean :) - this is called after
mem_cgroup_reparent_charges() (the main call - I've not yet wrapped
my head around Hannes's later safety-net call; perhaps these patches
would even make that one redundant - dunno).

Quite a lot is made simpler by the fact that we do not need to call
this from mem_cgroup_force_empty(), with all the races that would
entail: there was never any attempt to move these swap charges before,
so it's a pleasure not to have to deal with that possibility now.

> 
> > +				spin_lock(&ctrl->lock);
> > +				if (sc->id == old) {
> 
> Also it seems that compiler is free to optimize this test away, no?
> You need ACCESS_ONCE here as well, I guess.

Oh dear, you're asking me to read again through memory-barriers.txt
(Xmas 2013 edition), and come to a conclusion.  I think this pattern
of test outside spinlock, spinlock, test again inside spinlock is
used in very many places, without any ACCESS_ONCE.  I'll have to
go and search through the precedents.

I've probably brought this upon myself with the ACCESS_ONCE(ctrl->length)
a few lines above, which I added at a late stage to match the one above
it; but now I'm arguing that's unnecessary.

One of the problems with ACCESS_ONCE is that one easily falls into
a mistaken state in which it seems to be necessary everywhere;
but that illusion must be resisted.

The spinlock should make it unnecessary, but I'll have to muse on
semi-permeable membranes, osmosis, stuff like that.

> 
> > +					sc->id = new;
> > +					reassigned++;
> > +				}
> > +				spin_unlock(&ctrl->lock);
> > +			}
> > +unlock:
> > +			spin_unlock(&swap_cgroup_lock);
> > +			cond_resched();
> > +		}
> > +	}
> > +	return reassigned;
> > +}
> > +
> >  int swap_cgroup_swapon(int type, unsigned long max_pages)
> >  {
> >  	void *array;
> >  	unsigned long array_size;
> >  	unsigned long length;
> > -	struct swap_cgroup_ctrl *ctrl;
> > +	struct swap_cgroup_ctrl ctrl;
> >  
> >  	if (!do_swap_account)
> >  		return 0;
> [...]
> -- 
> Michal Hocko
> SUSE Labs

Thanks: let's see if Tejun and Zefan can come up with something
simpler than this at their end.  If we do all decide that this
swap_cgroup_reassign() kind of change is desirable, it would still
be better to make it later on as a cleanup than rush it in now.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>