On Thu, Feb 11 2016, Song Liu wrote:

> Summary:
>
> Resending the patch to see whether we can get another chance...
>
> When testing current SATA SSDs as the journal device, we have
> seen 2 challenges: throughput of long sequential writes, and
> SSD life time.
>
> To ease the burden on the SSD, we tested bypassing the journal for
> full stripe writes. We understand that bypassing the journal will
> re-introduce the write hole to the md layer. However, with a
> well-designed application and file system, such write holes
> should not result in any data loss.

"should not" isn't very encouraging.

I've always thought that the "write hole" was more of a theoretical
than a practical concern.  But if we go to the trouble of preventing
them, I really think we need "will not".

I don't suppose we could ask the filesystem to tag certain requests
with a flag which says "If the system crashes before this write
completes, I promise to re-write the entire region (possibly with
different data) before trusting any data I read from here".
That is what you really need.  But it wouldn't help for in-place
overwrites.

>
> Our test systems have 2 RAID-6 arrays per server and 15 HDDs
> per array. These 2 arrays shared 1 SSD as the journal (2
> partitions). Btrfs is created on both arrays.
>
> For sequential write benchmarks, we observe significant
> performance gain (250MB/s per volume vs. 150MB/s) from
> bypassing the journal for full stripes.

You need a faster SSD :-)

>
> We also performed power cycle tests on these systems while
> running a write workload. For more than 50 power cycles,
> we have seen zero data loss.

That doesn't fill me with any confidence.  To get even close to a
credible test, you would need to instrument the kernel so that at
some (random) point it would never send another IO request until a
reboot, and so that this random point was always somewhere in the
middle of submitting a set of writes to a stripe.

Instrument a kernel to do this, demonstrate that the write-hole
causes problems, then demonstrate that it doesn't with the various
patches applied.  Then do this across a range of tests
(write-in-place, append, etc) on a range of filesystems.

Testing for data integrity is *hard*.  And as you say, it may be up
to the application to ensure it doesn't use corrupt data (like a
partial write).  Testing that is not really possible.

If you were to make it possible to do this, you would need to be very
explicit about what sort of write-hole corruption could still occur,
so that users could make an informed decision.

(I assume your test had a degraded array... you didn't explicitly say
so, but the test would be meaningless on a non-degraded array so you
probably did).

>
> +static ssize_t
> +r5l_show_bypass_full_stripe(struct mddev *mddev, char *page)
> +{
> +	struct r5conf *conf;
> +	int ret = 0;
> +
> +	spin_lock(&mddev->lock);
> +	conf = mddev->private;
> +	if (conf) {
> +		if (conf->log)
> +			ret = sprintf(page, "%d\n",
> +				      conf->log->bypass_full_stripe);
> +		else
> +			ret = sprintf(page, "n/a\n");
> +	}

Given that this is a boolean, I would much prefer it was consistently
a boolean value, never "n/a".  Not sure if 0/1 Y/N T/F is best, but
one of those.

Otherwise the code looks sensible (assuming the idea is sensible,
which I'm not 100% convinced of).

Thanks,
NeilBrown
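[Editor's sketch, for illustration only] One way the quoted show
function could follow the "always a boolean" suggestion above is to
report 0/1 unconditionally, falling back to 0 when no journal is
configured.  This is untested; it assumes the struct r5conf /
struct r5l_log layout and the bypass_full_stripe field introduced by
the patch under review (not mainline fields), and picking "0" rather
than "N" or "F" for the no-journal case is just one of the options
mentioned.

```c
static ssize_t
r5l_show_bypass_full_stripe(struct mddev *mddev, char *page)
{
	struct r5conf *conf;
	int ret = 0;

	spin_lock(&mddev->lock);
	conf = mddev->private;
	if (conf) {
		/*
		 * Always report a boolean: the flag's value when a
		 * journal is configured, 0 when there is none.
		 * "!!" just normalises the field to 0/1.
		 */
		if (conf->log)
			ret = sprintf(page, "%d\n",
				      !!conf->log->bypass_full_stripe);
		else
			ret = sprintf(page, "0\n");
	}
	spin_unlock(&mddev->lock);

	return ret;
}
```

Leaving ret at 0 when conf is NULL, and taking mddev->lock around the
access to mddev->private, follows the pattern of the existing raid5
sysfs show helpers that the quoted patch appears to copy.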