On Fri, Oct 31, 2014 at 04:35:11PM +0530, Anshuman Aggarwal wrote:
> Hi pg,
> With MD raid striping all the writes, not only does it keep ALL disks
> spinning to read/write the current content, it also leads to
> catastrophic data loss in case the rebuild/disk failure exceeds the
> number of parity disks.

Hi Anshuman,

yes, but do you have hard evidence that this is a common RAID-6
problem?

Considering that we now have the bad block list, the write intent
bitmap and proactive replacement, a triple failure in RAID-6 does not
seem to me to be the main issue.
Considering that libraries for more than 2 parities are available, I
think the multiple-failure case is quite a rarity.

Furthermore, I suspect there are other types of catastrophic events
(lightning, for example) that can destroy an array completely.

> But more importantly, I find myself setting up multiple RAID levels
> (at least RAID6 and now thinking of more) just to make sure that MD
> raid will recover my data and not lose the whole cluster if an
> additional disk fails beyond the number of parity disks!!! The biggest
> advantage of the scheme that I have outlined is that with a single
> checksum I am mostly assured of a failed disk's restoration and, worst
> case, only the media (movies/music) on the failing disk are lost, not
> the whole cluster.

Will each disk have its own filesystem?
If this is not the case, you cannot say that a single disk failure
will lose only some files.

> Also, in my experience with disks and usage, what you are saying was
> true a while ago, when storage capacity had not hit multiple TBs.
> Now, if I am buying 3-4 TB disks, they are likely to last a while,
> especially since the incremental % growth in sizes seems to be
> slowing down.

As written above, you can safely replace disks before they fail,
without compromising the array.

bye,

pg

> Regards,
> Anshuman
>
> On 30 October 2014 22:55, Piergiorgio Sartor
> <piergiorgio.sartor@xxxxxxxx> wrote:
> > On Thu, Oct 30, 2014 at 08:27:27PM +0530, Anshuman Aggarwal wrote:
> >> What you are suggesting will work for delaying writing the checksum
> >> (but still making 2 disks work non-stop, leading to failure, cost,
> >> etc.).
> >
> > Hi Anshuman,
> >
> > I'm a bit missing the point here.
> >
> > In my experience, with my storage systems, I change
> > disks because they're too small, long before they
> > are too old (long before they fail).
> > That's why I end up with a collection of small HDDs,
> > which, in turn, I recycle in some custom storage
> > system (using disks of different sizes, as explained
> > in one of the links posted before).
> >
> > Honestly, the only reason to spin down the disks, still
> > in my experience, is to reduce power consumption.
> > And this can be done with a RAID-6 without problems
> > and in an extremely flexible way.
> >
> > So, the bottom line, still in my experience, is that
> > what you're describing seems quite a nice situation.
> >
> > Or, I did not understand what you're proposing.
> >
> > Thanks,
> >
> > bye,
> >
> > pg
> >
> >> I am proposing N independent disks which are rarely accessed. When
> >> parity has to be written to the remaining 1, 2, ... X disks, it is
> >> batched up (bcache is feasible) and written out once in a while,
> >> depending on how much writing is happening.
> >> N-1 disks stay spun down and
> >> only X disks wake up periodically to get the checksum written to them
> >> (this would be tweaked by the user based on how up to date he needs
> >> the parity to be, trading tolerance of rebuilding parity in case of a
> >> crash against disk access for each parity write).
> >>
> >> It can't be done using any RAID6, because RAID5/6 will stripe all the
> >> data across the devices, making any read access wake up all the
> >> devices. Ditto for writing to parity on every write to a single disk.
> >>
> >> The architecture being proposed is a lazy write to manage parity for
> >> individual disks, which won't suffer from catastrophic RAID data loss
> >> and concurrent disk access.
> >>
> >>
> >> On 30 October 2014 00:57, Ethan Wilson <ethan.wilson@xxxxxxxxxxxxx> wrote:
> >> > On 29/10/2014 10:25, Anshuman Aggarwal wrote:
> >> >>
> >> >> Right on most counts but please see comments below.
> >> >>
> >> >> On 29 October 2014 14:35, NeilBrown <neilb@xxxxxxx> wrote:
> >> >>>
> >> >>> Just to be sure I understand, you would have N + X devices. Each of
> >> >>> the N devices contains an independent filesystem and could be
> >> >>> accessed directly if needed. Each of the X devices contains some
> >> >>> codes so that if at most X devices in total died, you would still
> >> >>> be able to recover all of the data.
> >> >>> If more than X devices failed, you would still get complete data
> >> >>> from the working devices.
> >> >>>
> >> >>> Every update would only write to the particular N device on which
> >> >>> it is relevant, and to all of the X devices. So N needs to be quite
> >> >>> a bit bigger than X for the spin-down to be really worth it.
> >> >>>
> >> >>> Am I right so far?
> >> >>
> >> >> Perfectly right so far. I typically have an N to X ratio of 4 (4
> >> >> data devices to 1 parity device), so spin-down is totally worth it
> >> >> for data protection, but more on that below.
> >> >>
> >> >>> For some reason the writes to X are delayed... I don't really
> >> >>> understand that part.
> >> >>
> >> >> This delay is basically designed around archival devices which are
> >> >> rarely read from and even more rarely written to. By delaying writes
> >> >> on 2 criteria (the designated cache buffer filling up, or a preset
> >> >> time duration from the last write expiring) we can significantly
> >> >> reduce the writes on the parity device. This assumes that we are OK
> >> >> with losing a movie or two in case the parity disk is not totally up
> >> >> to date, being more interested in device longevity.
> >> >>
> >> >>> Sounds like multi-parity RAID6 with no parity rotation and
> >> >>> chunksize == devicesize
> >> >>
> >> >> RAID6 would present us with a joint device and currently only allows
> >> >> writes to that directly, yes? Any writes will be striped.
> >> >
> >> >
> >> > I am not totally sure I understand your design, but it seems to me
> >> > that the following solution could work for you:
> >> >
> >> > MD raid-6, maybe multi-parity (multi-parity is not implemented in MD
> >> > yet, but just do a periodic scrub and 2 parities can be fine; wake-up
> >> > is not so expensive that you can't scrub).
> >> >
> >> > Over that you put a raid1 of 2 x 4TB disks as a bcache cache device
> >> > (those two will never spin down), in writeback mode with
> >> > writeback_running=off.
> >> > This will prevent writes to the backend and leave the backend array
> >> > spun down.
> >> > When bcache is almost full (poll dirty_data), switch to
> >> > writeback_running=on and writethrough: it will wake up the backend
> >> > raid6 array and flush all dirty data. You can then revert to
> >> > writeback and writeback_running=off.
> >> > After this you can spin down the backend array again.
> >> >
> >> > You also get read caching for free, which helps the backend array to
> >> > stay spun down as much as possible.
> >> >
> >> > Maybe you can modify bcache slightly so as to implement an automatic
> >> > switch between the modes as described above, instead of polling the
> >> > state from outside.
> >> >
> >> > Would that work, or are you asking something different?
> >> >
> >> > EW
> >
> > --
> >
> > piergiorgio

--

piergiorgio
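
For reference, a rough, untested sketch (in Python) of the flush cycle
Ethan describes above. It assumes the backing device shows up as
/sys/block/bcache0, that dirty_data is reported in bcache's
human-readable form ("0.0k", "4.2G", ...), and that the threshold and
polling intervals are only placeholders to tune:

#!/usr/bin/env python3
# Sketch (not a tested tool) of the flush cycle described above: leave
# the bcache backing array spun down, and wake it only to flush once
# the writeback cache has accumulated enough dirty data. Run as root.

import time

BCACHE = "/sys/block/bcache0/bcache"    # assumed backing device path
FLUSH_THRESHOLD = 64 * 1024**3          # flush once ~64 GiB is dirty (assumed policy)
SUFFIX = {"": 1, "k": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def read_attr(name):
    with open(f"{BCACHE}/{name}") as f:
        return f.read().strip()

def write_attr(name, value):
    with open(f"{BCACHE}/{name}", "w") as f:
        f.write(str(value))

def dirty_bytes():
    # dirty_data looks like "4.2G": split the number from the unit suffix
    text = read_attr("dirty_data")
    num = text.rstrip("kMGTPEZY")
    unit = text[len(num):]
    return float(num) * SUFFIX.get(unit, 1)

def flush_cycle():
    # Wake the backend and push all dirty data out ...
    write_attr("cache_mode", "writethrough")
    write_attr("writeback_running", 1)
    while dirty_bytes() > 0:
        time.sleep(30)
    # ... then go back to caching writes with writeback paused, so the
    # RAID-6 backend can spin down again.
    write_attr("cache_mode", "writeback")
    write_attr("writeback_running", 0)

if __name__ == "__main__":
    while True:
        if dirty_bytes() >= FLUSH_THRESHOLD:
            flush_cycle()
        time.sleep(600)   # poll every 10 minutes

The same toggling can of course be done with a few shell one-liners
or, as Ethan suggests, inside bcache itself; the point is only to show
how little logic the mode switching needs.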
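
And a similarly hedged sketch of the proactive replacement mentioned
near the top of this mail, assuming mdadm >= 3.3 (which provides
--replace/--with) and placeholder device names:

#!/usr/bin/env python3
# Minimal sketch of proactive disk replacement: the array rebuilds the
# new disk directly from the suspect one (plus redundancy where
# needed), so full protection is kept during the copy. Run as root.
# /dev/md0, /dev/sdd1 (old) and /dev/sde1 (new) are placeholder names.

import subprocess

ARRAY, OLD, NEW = "/dev/md0", "/dev/sdd1", "/dev/sde1"

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run("mdadm", ARRAY, "--add", NEW)                      # new disk joins as a spare
run("mdadm", ARRAY, "--replace", OLD, "--with", NEW)   # copy, then retire the old disk

# A periodic scrub ("echo check > /sys/block/md0/md/sync_action") keeps
# the bad block list and parity honest between replacements.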