Re: [PATCH] fsfreeze: tell hung_task about processes put to sleep

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 16 Oct 2012 08:02:17 +1100

On Mon, Oct 15, 2012 at 03:51:34PM +0900, Fernando Luis Vazquez Cao wrote:
> On 2012/10/15 15:36, Dave Chinner wrote:
> >On Mon, Oct 15, 2012 at 12:24:59PM +0900, Fernando Luis Vazquez Cao wrote:
> >>On 2012/10/13 10:06, Dave Chinner wrote:
> >>>On Fri, Oct 12, 2012 at 06:47:32PM +0900, Fernando Luis Vázquez Cao wrote:
> >>>>Any process attempting to write to a frozen filesystem uninterruptibly and
> >>>>unkillably waits for the filesystem to be thawed. This wait is of unbounded
> >>>>length. Ignore such waits in the hung_task detector.
> >>>Filesystems should not be frozen for long enough to trigger the hung
> >>>task detector under normal usage. IMO, if you are freezing a
> >>>filesystem for that long, then you're either doing something wrong
> >>>or something has gone wrong, and in either case I think we should be
> >>>emitting warnings...
> >>The problem is that in production systems situations where
> >>a filesystem remains frozen for long periods are not uncommon.
> >>A typical example is as follows: the control daemon or script that
> >>controls the freeze/thaw using the fsfreeze ioctls dies, the next
> >There's your problem. Fix that, don't turn off useful warnings that
> >indicate something has gone wrong.
> 
> It is not my problem. It is the enterprise distro's user's problem.

It's your problem because you are trying to change the code :)

> As I mentioned in my previous email if you want to emit a
> warning do it in the right place and make sure that it is
> something informative. hung_check certainly isn't the
> right place to do it.

So, how do we now know when a freeze fails to complete, as opposed
to a thaw that hasn't occurred? We won't get any reports from
threads that are stuck waiting for the freeze to complete, and so
we'll end up with a silent hang. This is *exactly* what the hung
task messages are supposed to avoid by being verbose - we know what
hung rather than having stuff just silently stop.

If you want to remove verbose warnings, replace them with concise,
targeted and *equivalent* warnings before removing the only warnings
we currently have that indicate a problem....

> A failure in a user space script should not lead to a kernel
> panic or to a flood of process stack dumps in the system log

An administrator can cause that to happen in many, many ways by
having a script or a daemon fail to do the right thing. freeze/thaw
is not unique in that respect.  Removing an entire class of warnings
because something is broken in userspace and fails to be handled
correctly is not the right solution.

Indeed, if you have a daemon that freezes the filesystem, and you
haven't architected it with a watchdog to handle restarts due to
failures, then you don't have a resilient system at all, regardless
of these warnings. If it's a HA daemon/agent that doesn't get
restarted and clean up its mess automatically, then IMO it is
fundamentally broken and that's the problem that needs fixing.
Removing kernel warnings doesn't change the fact that the
application doing freeze/thaw is broken by design...
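
As a rough illustration of what "cleaning up its mess" could look like on
the userspace side, here is a minimal Python sketch that pairs every freeze
with a thaw, even when the work done in between fails. FIFREEZE and FITHAW
are the ioctl numbers from <linux/fs.h>; the frozen() helper and its
ioctl/opener/closer parameters are hypothetical names, made injectable only
so the logic can be exercised without root. This covers only failures the
process can catch. A daemon that is killed outright still needs an external
supervisor or watchdog to issue the thaw.

```python
import fcntl
import os
from contextlib import contextmanager

# ioctl request numbers from <linux/fs.h>:
#   FIFREEZE = _IOWR('X', 119, int), FITHAW = _IOWR('X', 120, int)
FIFREEZE = 0xC0045877
FITHAW = 0xC0045878


@contextmanager
def frozen(mountpoint, ioctl=fcntl.ioctl, opener=os.open, closer=os.close):
    """Freeze `mountpoint` for the with-block, then always attempt a thaw.

    Hypothetical helper: the keyword arguments exist only so that the
    system calls can be replaced with fakes in a test.
    """
    fd = opener(mountpoint, os.O_RDONLY)
    try:
        # If the freeze itself fails, no thaw is attempted.
        ioctl(fd, FIFREEZE, 0)
        try:
            yield fd
        finally:
            # Runs on normal exit and on any exception in the with-block.
            ioctl(fd, FITHAW, 0)
    finally:
        closer(fd)
```

Usage would be along the lines of "with frozen('/mnt/data'): take_snapshot()",
where take_snapshot() stands in for whatever the daemon does while the
filesystem is frozen.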

> administrators cannot interpret (a common complaint from
> our customers). This is the behaviour this patch is trying to
> fix.

Educate your customers through documentation then - FAQs exist for a
reason.

Removing warnings that we (developers) rely on for debugging issues
with freeze/thaw because customers don't understand what they are is
a terrible solution.  It means we don't hear about problems (because
there are no warnings), and when we do hear about silent hangs we
can't diagnose them (because there are no warnings). It's a
lose/lose situation.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx