On 2013-06-16 14:27, Peter Grandi wrote:

>> I know this doesn't have anything to do with the filesystem--
>> I was able to reproduce the behavior on a test system, writing
>> directly to an otherwise unused array, using a single 768 MB
>> write() call.
>
> Usually writes via a filesystem are more likely to avoid RMW
> issues, as suitably chosen filesystem designs take into account
> stripe alignment.

Yeah... but if I issue a single write() directly to the array
without seeking anywhere, it should be aligned, or at least that's
what I would expect.

> Some time ago I did some tests and I was also writing to a
> '/dev/md' device, but I found I got RMW only if using
> 'O_DIRECT', while buffered writes ended up being aligned.
> Without going into details, it looked like the Linux IO
> subsystem does significant reordering of requests, sometimes
> surprisingly, when directly accessing the block device, but not
> when writing files after creating a filesystem on that block
> device. Perhaps currently MD expects to be fronted by a
> filesystem.

Hmm. I just tried some more simple tests with dd, for a 512 KB chunk
size. (I'm only reporting single runs here, but I ran them a few
times to make sure they were representative.)

Without O_DIRECT:

icebox:~# dd if=/dev/zero bs=768M count=1 of=/dev/md3 conv=fdatasync
1+0 records in
1+0 records out
805306368 bytes (805 MB) copied, 16.1938 s, 49.7 MB/s

With O_DIRECT:

icebox:~# dd if=/dev/zero bs=768M count=1 of=/dev/md3 oflag=direct conv=fdatasync
1+0 records in
1+0 records out
805306368 bytes (805 MB) copied, 18.1964 s, 44.3 MB/s

So that doesn't seem to help here. Interestingly, it did help when I
tested with a 16384 KB chunk size (34 MB/s -> 39 MB/s), but I'll
stick with 512 KB chunks for now.

>> I measured chunk sizes at each power of 2 from 2^2 to 2^14
>> KB. The results of this are that smaller chunks performed the
>> best, [ ... ]
>
> Your Perl script is a bit convoluted. I prefer to keep it simple
> and use 'dd' advisedly to get upper boundaries.

Yeah... I had used dd for the early testing, but I wanted to try
random data, and /dev/urandom was so slow that it added a large
baseline to each test unless I read it into memory before starting
the time measurements. The rest of the script is just setup,
reporting, etc.

> Anyhow, try using a stripe-aware filesystem like XFS, and also
> perhaps increase significantly the size of the stripe cache.
> That seems to help scheduling too. Changing the elevator on the
> member devices sometimes helps too (but is not necessarily
> related to RMW issues).

The underlying devices were using cfq. noop and deadline were
available, but they didn't make a noticeable difference.

The stripe cache, however, made a huge difference. It was 256 (KB,
right?) by default. Here are some average-of-three dd results
(without O_DIRECT, as above):

   256 KB:  50.2 MB/s
   512 KB:  61.0 MB/s
  1024 KB:  72.7 MB/s
  2048 KB:  79.6 MB/s
  4096 KB:  87.5 MB/s
  8192 KB:  87.3 MB/s
 16384 KB:  89.8 MB/s
 32768 KB:  91.3 MB/s

...then I tried O_DIRECT again with the 32768 KB stripe cache, and it
consistently gets slightly better results: 92.7 MB/s.

This is just some old dissimilar drives stuffed into my old desktop,
so I'm not expecting stellar performance. sdd is the slowest, and it
only writes at 54.6 MB/s on its own, so 92.7 MB/s is not too shabby
for the RAID, especially compared to the 49.7 MB/s I was getting
before.
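For reference, here is roughly how I was flipping those two knobs on
the test box (just a quick sketch; device names are from my setup,
and if I'm reading Documentation/md.txt right, stripe_cache_size
actually counts cache entries (pages per member device) rather than
KB, so the memory cost is roughly page_size * nr_disks * entries):

# check / change the elevator on a member device
cat /sys/block/sdd/queue/scheduler
echo deadline > /sys/block/sdd/queue/scheduler

# check / enlarge the raid5/6 stripe cache (default 256, max 32768)
cat /sys/block/md3/md/stripe_cache_size
echo 32768 > /sys/block/md3/md/stripe_cache_size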
I've been watching a dstat run this whole time, and increasing the
stripe cache size does indeed result in fewer reads, until they go
away entirely at 32768 KB (except for a few reads at the end, which
appear to be unrelated to RAID). 32768 seems to be the maximum for
the stripe cache. I'm quite happy to spend 32 MB for this. 256 KB
seems quite low, especially since it's only half the default chunk
size.

Out of curiosity, I did some tests with xfs vs. ext4 on an empty
filesystem. I'm not familiar with xfs, so I may be missing out on
some tuning. This isn't really meant to be a comprehensive benchmark,
but for a single sequential write with dd:

mkfs.xfs /dev/md3
  direct:     89.8 MB/s
  not direct: 90.0 MB/s

mkfs.ext4 /dev/md3 -E lazy_itable_init=0
  direct:     86.2 MB/s
  not direct: 85.4 MB/s

mkfs.ext4 /dev/md3 -E lazy_itable_init=0,stride=128,stripe_width=256
  direct:     89.0 MB/s
  not direct: 85.6 MB/s

Anyway, xfs did indeed do slightly better, so I may evaluate it
further next time I rebuild my main array (the one that actually
matters for all this testing. :)

As far as that array goes, write performance went from 30.2 MB/s to
53.1 MB/s. Not that great, unfortunately, but I'm using dm-crypt and
that may be the bottleneck now. Reads are still present during a
write (due to fragmentation, perhaps?), but they are minimal.

Thanks Peter, your email was a great help to me. I'm still interested
if you or anyone else has anything to comment on here, but I'm
satisfied that I've managed to eliminate unnecessary read-modify-write
as a source of slowness.

Thanks,
Corey
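P.S. For the xfs test above I just used the mkfs.xfs defaults. If I
revisit it, I'll probably spell out the array geometry explicitly,
something like the line below. This assumes the same layout I gave
ext4 (512 KB chunk, two data disks), and I believe mkfs.xfs usually
detects md geometry on its own anyway, so it may well be redundant:

mkfs.xfs -d su=512k,sw=2 /dev/md3

Here su is the stripe unit (the chunk size) and sw is the number of
data disks in the stripe.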