Re: [PATCH v2] vmpressure: make sure memcg stays alive until all users are signaled

Tejun Heo <tj@xxxxxxxxxx> · Fri, 12 Jul 2013 11:34:04 -0700

Hello, Michal.

On Fri, Jul 12, 2013 at 10:40:39AM +0200, Michal Hocko wrote:
> > It's not something white and black but for things which can be made
> > trivially synchrnous, it usually is better to do it that way,
> 
> True in general but it is also true (in general) that once we have a
> reference counting for controlling life cycle for an object we should
> not bypass it.

When you have a reference count with staged destruction like cgroup,
the object goes two stagages before being released.  It first gets
deactivated or removed from further access and the base ref is
dropped.  This state presists until the left-over references drain.
When all the references are dropped, the object is actually released.

Now, between the initiation of destruction and actual release, the
object often isn't in the same state as before the destruction
started.  It depends on the specifics of the subsystem but usually
only subset of the original functionalities are accessible.  At times,
it becomes tricky discerning what's safe and what's not and we do end
up with latent subtle bugs which are only triggered when tricky
conditions are met - things like certain operations stretching across
destruction point.

It is important to distinguish the boundary between things which
remain accessible after destruction point and it often is easier and
less dangerous to avoid expanding that set unless necessary, so it is
not "bypassing" an existing mechanism at all.  It is an inherent part
of that model and various kernel subsystems have been doing that
forever.

> So I guess we are safe with the code as is but this all is really
> _tricky_ and deserves a fat comment. So rather than adding flushing work
> item code we should document it properly.
> 
> Or am I missing something?

Two things.

* I have no idea what would prevent the work item execution kicking in
  way after the inode is released.  The execution is not flushed and
  the work item isn't pinning the underlying data structure.  There's
  nothing preventing the work item execution to happen after or
  stretch across inode release.

* You're turning something which has a clearly established pattern of
  usage - work item shutdown which is usually done by disabling
  issuance and then flushing / canceling - into something really
  tricky and deserving a fat comment.  In general, kernel engineering
  is never about finding the most optimal / minimal / ingenious
  solution for each encountered instances of problems.  In most cases,
  it's usually about identifying established patterns and applying
  them so that the code is not only safe and correct but also remains
  accessible to the fellow kernel developers.

  Here, people who are familiary with workqueue usage would
  automatically look for shutdown sequence as the work item is
  contained in a dynamic object.  There are cases where
  synchronization happens externally and you don't necessarily need
  flush / cancel, but even then, if you consider readability and
  maintainability, following the established pattern is often then
  right thing to do.  It makes the code easier to follow and verify
  for others and avoids unintentional breakages afterwards as you
  separate out work item draining from the external synchronization
  which may change later on without noticing how it's interlocked with
  work item draining.

> OK, I haven't realized the action waits for finishing. /me is not
> regular work_queue user...

So, please, follow the established patterns.  This really isn't the
place for ingenuity.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html