On Thu, Feb 11 2016, Song Liu wrote:

> Summary:
>
> Resending the patch to see whether we can get another chance...
>
> When testing current SATA SSDs as the journal device, we have
> seen 2 challenges: throughput of long sequential writes, and
> SSD life time.
>
> To ease the burden on the SSD, we tested bypassing the journal for
> full stripe writes. We understand that bypassing the journal will
> re-introduce the write hole to the md layer. However, with a
> well-designed application and file system, such write holes
> should not result in any data loss.

"should not" isn't very encouraging.

I've always thought that the "write hole" was more of a theoretical
than a practical concern.  But if we go to the trouble of preventing
them, I really think we need "will not".

I don't suppose we could ask the filesystem to tag certain requests
with a flag which says "If the system crashes before this write
completes, I promise to re-write the entire region (possibly with
different data) before trusting any data I read from here".
That is what you really need.  But it wouldn't help for in-place
overwrites.

>
> Our test systems have 2 RAID-6 arrays per server and 15 HDDs
> per array. These 2 arrays shared 1 SSD as the journal (2
> partitions). Btrfs is created on both arrays.
>
> For sequential write benchmarks, we observe significant
> performance gain (250MB/s per volume vs. 150MB/s) from
> bypassing the journal for full stripes.

You need a faster SSD :-)

>
> We also performed power cycle tests on these systems while
> running a write workload. For more than 50 power cycles,
> we have seen zero data loss.

That doesn't fill me with any confidence.  To get even close to a
credible test, you would need to instrument the kernel so that at
some (random) point it would never send another IO request until a
reboot, and so that this random point was always somewhere in the
middle of submitting a set of writes to a stripe.

Instrument a kernel to do this, demonstrate that the write-hole
causes problems, then demonstrate that it doesn't with the various
patches applied.  Then do this across a range of tests
(write-in-place, append, etc) on a range of filesystems.

Testing for data integrity is *hard*.  And as you say, it may be up
to the application to ensure it doesn't use corrupt data (like a
partial write).  Testing that is not really possible.

If you were to make it possible to do this, you would need to be very
explicit about what sort of write-hole corruption could still occur,
so that users could make an informed decision.

(I assume your test had a degraded array... you didn't explicitly say
so, but the test would be meaningless on a non-degraded array so you
probably did).

>
> +static ssize_t
> +r5l_show_bypass_full_stripe(struct mddev *mddev, char *page)
> +{
> +	struct r5conf *conf;
> +	int ret = 0;
> +
> +	spin_lock(&mddev->lock);
> +	conf = mddev->private;
> +	if (conf) {
> +		if (conf->log)
> +			ret = sprintf(page, "%d\n",
> +				      conf->log->bypass_full_stripe);
> +		else
> +			ret = sprintf(page, "n/a\n");
> +	}

Given that this is a boolean, I would much prefer it was consistently
a boolean value, never "n/a".  Not sure if 0/1 Y/N T/F is best, but
one of those.

Otherwise the code looks sensible (assuming the idea is sensible,
which I'm not 100% convinced of).

Thanks,
NeilBrown
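[Editor's sketch, for illustration only] One way the quoted show
function could follow the "always a boolean" suggestion above is to
report 0/1 unconditionally, falling back to 0 when no journal is
configured.  This is untested; it assumes the struct r5conf /
struct r5l_log layout and the bypass_full_stripe field introduced by
the patch under review (not mainline fields), and picking "0" rather
than "N" or "F" for the no-journal case is just one of the options
mentioned.

```c
static ssize_t
r5l_show_bypass_full_stripe(struct mddev *mddev, char *page)
{
	struct r5conf *conf;
	int ret = 0;

	spin_lock(&mddev->lock);
	conf = mddev->private;
	if (conf) {
		/*
		 * Always report a boolean: the flag's value when a
		 * journal is configured, 0 when there is none.
		 * "!!" just normalises the field to 0/1.
		 */
		if (conf->log)
			ret = sprintf(page, "%d\n",
				      !!conf->log->bypass_full_stripe);
		else
			ret = sprintf(page, "0\n");
	}
	spin_unlock(&mddev->lock);

	return ret;
}
```

Leaving ret at 0 when conf is NULL, and taking mddev->lock around the
access to mddev->private, follows the pattern of the existing raid5
sysfs show helpers that the quoted patch appears to copy.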