On 18-09-12 22:06, Peter Grandi wrote:
> [ ... ]
>
>> I noticed in iostat something I personally find very weird.
>> All the disks in the RAID set (minus the spare) seem to read
>> 6-7 times as much as they write. Since there is no other I/O
>> (so there aren't really any reads issued besides some very
>> occasional overhead for NTFS perhaps once in a while) I find
>> this really weird. Note also that iostat doesn't show the
>> reads on the md device (which is the case if the initiator
>> issues reads) but only on the active disks in the RAID set,
>> which to me (unknowing as I am :)) indicates md in the kernel
>> is issuing those reads. [ ... ]

> It is not at all weird. The performance of MD ('mdadm' is just
> the user level tool to configure it) is pretty good in this case
> even if the speed is pretty low. MD is working as expected when
> read-modify-write (or some kind of resync or degraded operation)
> is occurring. BTW I like your use of the term "RAID set" because
> that's what I use myself (because "RAID array" is redundant
> :->).

Yes, with read-modify-write this is to be expected. However, I'm
copying a large file (largely sequential, especially since there
was no other I/O), which is buffered, in which case most of the
writes should be entire stripes. I wouldn't have mentioned it if
IETD didn't perform this much better.

Since I'm not aware of kernel internals, I figured it might have
something to do with the way the buffers are committed. As in:
with IETD it recognizes it's a large contiguous run of data and
consolidates it into a single full-stripe write, whereas with LIO
it writes 64k chunks synchronously, causing it to write, say,
chunk 1 out of 6, read chunks 2-6, calculate parity and write
chunk 1 plus the parity block, and thus need a lot more operations
to write the same stripe. It is also why, for example, VMware,
which does nothing but sync I/O, has extremely lousy performance
on non-caching RAID controllers. By far the easiest way to show
this to people, in my experience.

Anyway, to me it looks like that's what's going on - but I didn't
want to jump to conclusions without in-depth kernel knowledge. As
far as I know there aren't any separate buffers though, so LIO and
IETD would be using the same buffers / infrastructure, in which
case my assumption would be embarrassingly wrong.

Concerning the "RAID set" terminology, I didn't even realise that.

> Apparently awareness of the effects of RMW (or resyncing or
> degraded operation) is sort of (euphemism) unspecial RAID
> knowledge, but only the very elite of sysadms seem to be aware
> of it :-). A recent similar enquiry was the (euphemism) strange
> concern about dire speed by someone who had (euphemism) bravely
> setup RAID6 running deliberately in degraded mode.

Even fewer seem to be aware of the write-hole :D

> My usual refrain is: if you don't know better, never use parity
> RAID, only use RAID1 or RAID10 (if you want redundancy).
>
> But while the performance of MD you report is good, the speed is
> bad even for a mere RMW/resync/degraded issue, so this detail
> matters:
>
>> Do note - I'm running somewhat unorthodox. I've created a
>> RAID-5 of 7 disks + hotspare

> One could (euphemism) wonder how well a 6x stripe/stripelet size
> is going to play with 4KiB aligned NTFS operations...

It's formatted with a 64KiB block size (or cluster size, in NTFS
terminology). This is also the RAID's chunk size.
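
To put rough numbers on the read-modify-write theory above with
this geometry (6 data chunks of 64KiB plus parity per stripe),
here's a quick sketch. Purely illustrative - I'm not claiming this
is exactly how md picks its write path (it can also do the
read-old-data-plus-old-parity variant); it just models the "write
one chunk, read the other five, recompute parity" case I described
versus a buffered full-stripe write. The names and numbers are
mine, not anything out of the md code:

# Illustrative only: chunk-level I/O per stripe for a partial write
# handled as described above (write the new chunk(s), read the
# untouched data chunks, recompute parity) versus a full-stripe write.
# 7-disk RAID-5: 6 data chunks of 64KiB + 1 parity chunk per stripe.

CHUNK = 64 * 1024
DATA_DISKS = 6
STRIPE = CHUNK * DATA_DISKS        # 384KiB of data per stripe

def stripe_io(write_bytes):
    """Chunk-level (reads, writes) for one stripe-aligned write."""
    dirty = -(-write_bytes // CHUNK)       # chunks being rewritten
    if dirty >= DATA_DISKS:                # full-stripe write:
        return 0, DATA_DISKS + 1           # no reads, 6 data + parity
    reads = DATA_DISKS - dirty             # untouched chunks to read
    writes = dirty + 1                     # new data + new parity
    return reads, writes

print(stripe_io(64 * 1024))   # one 64KiB chunk synchronously: (5, 2)
print(stripe_io(STRIPE))      # whole 384KiB stripe buffered:  (0, 7)

So if LIO really does push each 64KiB chunk down synchronously
while IETD lets the writes coalesce into full stripes, the member
disks end up doing a pile of reads that never show up on the md
device itself - which would at least be consistent with what
iostat shows here.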
>> (it was originally a RAID-6 w/o hotspare but converted it to
>> RAID-5 in hopes of improving performance).

> A rather (euphemism) audacious operation, especially because of
> the expectation that reshaping a RAID set leaves the content in
> an optimal stripe layout. I am guessing that you reshaped rather
> than recreated because you did not want to dump/reload the
> content, rather (euphemism) optimistically.

Correct. Also, with the exception of 3 machines, the content on it
is itself a back-up; other jobs replicate to it. If it really did
go wrong I'd have to drive by a customer or 6 to seed-load the
other ~50 servers onto it again. Given the risk (I've never had
issues with reshaping) we took it. Losing the data is thus not
life threatening, just very very annoying / time consuming. So I
was either guaranteed to lose a lot of time moving the data (plus
an investment in something to store it on), or could take the risk
of losing a bit more time. It turned out well :). I did use a
backup file for the reshaping btw, so it could survive reboots
(I would have had to manually restart / bring it up).

> There are likely to be other (euphemism) peculiarities in your
> setup, probably to do with network flow control, but the above
> seems enough...
>
> Sometimes it is difficult for me to find sufficiently mild yet
> suggestive euphemisms to describe some of the stuff that gets
> reported here. This is one of those cases.
>
> Unless you are absolutely sure you know better:
>
> * Never grow or reshape a RAID set or a filetree.
> * Just use RAID1 or RAID10 (or a 3 member RAID5 in some cases
>   where writes are rare).
> * Don't partition the member or array devices or use GPT for
>   both if you must.

Not really an option to use msdos partitioning (way larger than
2TiB) :). I did look carefully at the offsets being a multiple of
64 so a partition doesn't start somewhere in the middle of a
chunk. Parted also reports the alignment is proper (I do seriously
despise these kinds of tools, which work with block sizes
specifically, using decimal instead of binary k/M/G's etc.).

> If you are absolutely sure you know better then you will not
> need to ask for help here :-).

Do note I'm specifically asking about the interaction with LIO :).
I don't have the benchmarks any more, but several local tests (on
the machine itself, without the iSCSI layer in between) showed
very acceptable numbers. I just need decent performance (~50MB/s
sequential and on the order of 10MB/s random - the local benches
were way faster than that) and a lot of storage in this case, as
it's just there to store back-ups (image based - very large files)
and their incrementals. On a daily basis some incrementals are
merged, but a lot of the I/O should be sequential and large (often
causing entire stripe writes when buffered/cached - or well, they
should, and it seems to do this just fine with IET, but not with
LIO). I'll never put anything (random) I/O intensive on anything
but 10 :). Or there must be some really fancy new developments
(haven't dug into ZFS and the like deeply enough yet).

>> This disk is about 12TB. It's partitioned with GPT in ~9TB

> At least you used GPT partitioning, which is commendable, even
> if you regret it below...

Yea, not that I had a choice :P.

>> and ~2.5TB (there's huge rounding differences at these sizes,
>> 1000 vs 1024 et al :)).

> It is very nearly 5%/7% depending which way.

>> With msdos partitions I could easily mess with it myself. [ ... ]

> MSDOS style labels are fraught with subtle problems that require
> careful handling.

But they're very easy to backup/restore with dd :).

Thanks for the re' :).
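
PS: for the record, roughly the arithmetic I used when checking
whether the partition offsets land on a chunk boundary - just a
quick sketch with example numbers, assuming 512-byte sectors and
the start sector as parted reports it; the start sectors below are
illustrative, not my actual layout:

# Quick check: does a partition starting at a given sector land on a
# chunk (and, as a bonus, a full data-stripe) boundary of the md set?
# Example numbers only - 512-byte sectors, 64KiB chunk, 6 data disks.

SECTOR_BYTES = 512
CHUNK_BYTES = 64 * 1024            # md chunk size
STRIPE_BYTES = CHUNK_BYTES * 6     # data per full RAID-5 stripe (384KiB)

def check_alignment(start_sector):
    offset = start_sector * SECTOR_BYTES
    print("start %ds = %d bytes: chunk-aligned=%s, stripe-aligned=%s"
          % (start_sector, offset,
             offset % CHUNK_BYTES == 0,
             offset % STRIPE_BYTES == 0))

check_alignment(2048)   # 1MiB (parted's usual default): on a chunk
                        # boundary, but not on a 384KiB stripe boundary
check_alignment(768)    # 384KiB: aligned to both

For keeping 64KiB NTFS clusters from straddling two chunks only
the chunk alignment should matter; whether lining partitions up on
full-stripe boundaries buys anything extra on top of that I
honestly don't know.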