Re: [PATCH] xfs: fix incorrect log_flushed on fsync

Amir Goldstein <amir73il@xxxxxxxxx> · Tue, 19 Sep 2017 08:31:37 +0300

On Tue, Sep 19, 2017 at 12:24 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, Sep 18, 2017 at 09:00:30PM +0300, Amir Goldstein wrote:
>> On Mon, Sep 18, 2017 at 8:11 PM, Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
>> > On Fri, Sep 15, 2017 at 03:40:24PM +0300, Amir Goldstein wrote:
>> >> The disclosure of the security bug fix (commit b31ff3cdf5) made me wonder
>> >> whether the possible data loss bug should also be disclosed in some distro
>> >> forum. I bet some users would care more about the latter than the former.
>> >> Coincidentally, the same commit fixes both the data loss and security bugs.
>> >
>> > Yes, the patch ought to get sent on to stable with a Fixes tag.  One
>> > would hope that the distros will pick up the stable fixes from there.
>
> Yup, that's the normal process for data integrity/fs corruption
> bugs.

Makes sense. I'm convinced that the normal process is sufficient for this
sort of bug fix.

>
>> > That said, it's been in the kernel for 12 years without widespread
>> > complaints about corruption, so I'm not sure this warrants public
>> > disclosure via CVE/Phoronix vs. just fixing it.
>> >
>>
>> I'm not sure either.
>> My intuition tells me that the chances of hitting the data loss bug
>> given a power failure are not slim, but the chances of users knowing
>> about the data loss are slim.
>
> The chances of hitting it are slim. Power-fail vs fsync data
> integrity testing is something we actually run as part of QE, and
> have for many years.  In all that time we have never tripped over
> this problem, so I think the likelihood that a user will hit it is
> extremely small.

This sentence makes me uneasy.
Who is "we", and what QE testing are you referring to?
Are those tests in xfstests or any other public repository?
My first reaction to the corruption was "no way, I need to check the test".
My second reaction, after checking the test, was "this must be very, very
hard to hit".
But on closer inspection, it looks like it doesn't take more than running
a couple of fsyncs in parallel to get to the "at risk" state, which may persist
for seconds.
Of course the chances of users being unlucky enough to also get a power
failure during the "at risk" state are low, but I am puzzled as to how the
power-fail tests you say exist didn't catch this sooner.
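
For illustration only, here is a minimal sketch of the kind of workload I mean
by "a couple of fsyncs in parallel". It is not the test that found the bug: it
does not simulate a power failure, does not inspect the log, and cannot tell
whether the filesystem is in the "at risk" state. The function names, file
names and sizes are all made up for the example.

```python
import os
import tempfile
import threading


def write_and_fsync(path, payload, rounds):
    """Append payload to path, calling fsync after every write."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for _ in range(rounds):
            os.write(fd, payload)
            os.fsync(fd)  # ask the kernel to make this write durable
    finally:
        os.close(fd)


def run_parallel_fsync(directory, workers=2, rounds=100, payload=b"x" * 4096):
    """Run `workers` threads, each doing write+fsync on its own file.

    Returns a dict mapping each (hypothetical) file path to its final size.
    """
    paths = [os.path.join(directory, f"fsync-worker-{i}") for i in range(workers)]
    threads = [
        threading.Thread(target=write_and_fsync, args=(p, payload, rounds))
        for p in paths
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {p: os.path.getsize(p) for p in paths}


if __name__ == "__main__":
    # Point the directory at a mounted test filesystem to exercise it;
    # a temporary directory is used here only to keep the sketch runnable.
    with tempfile.TemporaryDirectory() as d:
        for path, size in run_parallel_fsync(d).items():
            print(path, size)
```

To say anything about durability, a workload like this would have to be
combined with something that records and replays the block-level writes,
which is where dm-log-writes comes in.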

Anyway, I'm not sure there is much more to discuss. I just wanted to see
if there is a post-mortem lesson to be learned from this, beyond the fact that
dm-log-writes is a valuable testing tool.

Cheers,
Amir.