Just some points that we shouldn't forget, thinking like an end user of mdadm,
not as a developer...

A disk failure happens roughly once every 2 years of heavy use on a desktop
SATA disk. Do we need a complex structure just to save the one minute of
mdadm --remove, mdadm --add? End users should accept it; it's just 1 minute
out of 2 years. 2 years = 730 days = 17520 hours = 1051200 minutes, in other
words 1 minute ~= 1/1,000,000 = 0.0001% of downtime, or 99.9999% of online
time. If we consider turning the server off, adding the new disk and removing
the old one, say 10 minutes, that's 0.001% downtime = 99.999% online time,
which is well accepted for desktops and for servers.

For raid1 and linear: I don't see the need for really complex logic recording
which block isn't OK; just a counter telling which disk has the more recent
data is welcome.

For raid10, raid5 and raid6: OK, we can allow block-specific tracking, since
we could treat a bad disk as many bad blocks plus many good blocks (on the
good disk).

2011/12/15 NeilBrown <neilb@xxxxxxx>:
> On Thu, 15 Dec 2011 16:29:12 +0200 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
> wrote:
>
>> Neil,
>> thanks for the review, and for the detailed answers to my questions.
>>
>> > When we mark a device 'failed' it should stay marked as 'failed'. When the
>> > array is optimal again it is safe to convert all 'failed' slots to
>> > 'spare/missing' but not before.
>> I did not understand all that reasoning. When you say "slot", you mean
>> index in the dev_roles[] array, correct? If yes, I don't see what
>> importance the index has, compared to the value of the entry itself
>> (which is "role" in your terminology).
>> Currently, 0xFFFE means both "failed" and "missing", and that makes
>> perfect sense to me. Basically this means that this entry of
>> dev_roles[] is unused. When a device fails, it is kicked out of the
>> array, so its entry in dev_roles[] becomes available.
>> (You once mentioned that for older arrays, their dev_roles[] index was
>> also their role; perhaps you are concerned about those too.)
>> In any case, I will be watching for changes in this area, if you
>> decide to make them (although I think this might break backwards
>> compatibility, unless a new version of the superblock is used).
>
> Maybe... as I said, "confusing" is a relevant word in this area.
>
>>
>> > If you have a working array and you initiate a write of a data block and the
>> > parity block, and if one of those writes fails, then you no longer have a
>> > working array. Some data blocks in that stripe cannot be recovered.
>> > So we need to make sure the admin knows the array is dead and doesn't just
>> > re-assemble and think everything is OK.
>> I see your point. I don't know what's better: to know the "last known
>> good" configuration, or to know that the array has failed. I guess I
>> am just used to the former.
>
> Possibly an 'array-has-failed' flag in the metadata would allow us to keep
> the last known-good config. But as it isn't any good any more I don't really
> see the point.
>
>
>>
>> > I think to resolve this issue we need 2 things.
>> >
>> > 1/ when assembling an array, if any device thinks that the 'chosen' device has
>> > failed, then don't trust that device.
>> I think that if any device thinks that "chosen" has failed, then
>> either it has a more recent superblock, and then this device should be
>> "chosen" and not the other; or the "chosen" device's superblock is
>> the one that counts, and then it doesn't matter what the current device
>> thinks, because the array will be assembled according to the "chosen"
>> superblock.
>
> This is exactly what the current code does, and it allows you to assemble an
> array after a split-brain experience. This is bad. Checking what other
> devices think of the chosen device lets you detect the effect of a
> split-brain.
>
>
>>
>> > 2/ Don't erase 'failed' status from dev_roles[] until the array is
>> > optimal.
>>
>> Neil, I think both these points don't resolve the following simple
>> scenario: RAID1 with drives A and B. Drive A fails, and the array continues
>> to operate on drive B. After reboot, only drive A is accessible. If we go
>> ahead and assemble, we will see stale data. If, however, after reboot we
>> see only drive B, then (since A is "faulty" in B's superblock) we can go
>> ahead and assemble. The change I suggested will abort in the first case,
>> but will assemble in the second case.
>
> Using --no-degraded will do what you want in both cases. So no code change
> is needed!
>
>>
>> But obviously, you know better what MD users expect and want.
>
> Don't bet on it.
> So far I have one vote - from you - that --no-degraded should be the default
> (I think that is what you are saying). If others agree I'll certainly
> consider it more.
>
> Note that "--no-degraded" doesn't exactly mean "don't assemble a degraded
> array". It means "don't assemble an array more degraded than it was the last
> time it was working", i.e. require that all devices that are working
> according to the metadata are actually available.
>
> NeilBrown
>
>
>
>> Thanks again for taking the time to review the proposal! And yes, next
>> time I will put everything in the email.
>>
>> Alex.
>

--
Roberto Spadim
Spadim Technology / SPAEmpresarial
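For reference, a minimal sketch of the one-minute replacement described
above, plus an assemble with --no-degraded as Neil describes it. The array
and partition names (/dev/md0, /dev/sdb1, /dev/sdc1) are hypothetical
placeholders, and the explicit --fail step is an assumption (the kernel
usually marks the member faulty on its own); adjust for the real setup.

# Replace a failed member of a hypothetical array /dev/md0.
# Mark the old disk faulty if the kernel has not already done so,
# then remove it from the array:
mdadm /dev/md0 --fail /dev/sdb1
mdadm /dev/md0 --remove /dev/sdb1

# Add the replacement disk; md rebuilds onto it in the background:
mdadm /dev/md0 --add /dev/sdc1

# Watch the resync/rebuild progress:
cat /proc/mdstat

# At assembly time, refuse to start arrays that are more degraded than
# the last time they were running (Neil's description of --no-degraded):
mdadm --assemble --scan --no-degraded

Note that with hot-swappable disks the array stays online (degraded) through
the whole remove/add sequence; the 10-minute power-off figure only applies
when the machine has to be shut down to swap the drive.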