On Tuesday April 28, dan.j.williams@xxxxxxxxx wrote: > Neil Brown wrote: > > So I'm leaning towards: > > - a kernel change so that you can only set 'inactive' if the > > array is already clean or readonly. > > i.e. if (mddev->ro || mddev->in_sync) > > which possible is the same as simply > > if (mddev->in_sync) > > - an mdadm change so that is doesn't just set 'inactive' > > but rather > > sysfs_set_str(mdi, NULL, "array_state", "clean"); > > ping_monitor(mdi->text_version); > > sysfs_set_str(mdi, NULL, "array_state", "inactive") > > I recall that when implementing WaitClean there was a reason not to > write 'clean' from mdadm... /me looks in the logs. Yes, from commit > 0dd3ba30: > > "--wait-clean: shorten timeout > > Set the safemode timeout to a small value to get the array marked clean > as soon as possible. We don't write 'clean' directly as it may cause > mdmon to miss a 'write-pending' event." > > So we do not want a thread to hang because we missed its write-pending > event, or will it receive an error after the device has been torn down? I don't see that writing 'clean' can cause a 'write-pending' state to be missed. If the array is in 'write-pending', it is because some thread is in md_write_start waiting for MD_CHANGE_CLEAN to be cleared. So ->writes_pending is elevated. So an attempt to write 'clean' will result in -EBUSY. ??? > > > - make sure that if mdmon sees the array as 'clean', it will > > update it's metadata etc so that it feels no further urge > > to mark the array clean. > > > > Also, we need to fix the WARNING, which I think means moving > > list_for_each_entry(rdev, &mddev->disks, same_set) > > if (rdev->raid_disk >= 0) { > > char nm[20]; > > sprintf(nm, "rd%d", rdev->raid_disk); > > sysfs_remove_link(&mddev->kobj, nm); > > } > > It would also involve a flush_scheduled_work() somewhere to make sure > the md_redundancy_group has been deleted. But it occurs to me that > do_md_run() should still be failing in the case where userspace gets the > ordering of 'inactive' wrong... > Uhmm..... I think we should split the removal of md_redundancy_group out in to a separate delayed_work. But yes, we need to deal with that somehow. > > > > in do_md_stop up in to the > > case 0: /* disassemble */ > > case 2: /* stop */ > > > > section. > > > > My one concern about this conclusion is that it seems to sidestep the > > core problem rather than address it. > > The core problem is that setting the state to 'clean' means very > > different thing depending on which side you come from, so an app which > > writes 'clean' might get more than it bargained for. All I'm doing is > > making that confusion avoidable rather than impossible.... > > > > I guess I could invent a syntax: > > writing "a->b" to array_state sets the state to 'b' but only if it > > was already in 'a'. > > But that feels needlessly complex. > > > > So: do you agree with my leaning? or my concern? or both or neither? > > First let me say that I really appreciate when you do these > implementation contingency brain dumps, it really helps to get all the > cards on the table (something I would do well to emulate). > > I agree with your concern and I am currently more in line with the idea > that the confusion should be impossible rather than avoidable. I now > think it is a bug that the 'inactive' state has two meanings. A new > state like 'shutdown' or 'defunct' would imply that 'clear' is the only > valid next state, all other writes return an error. > > So, it is not quite as ugly as randomly preventing some > 'inactive->clean' transitions in that userspace can see why its attempt > to write 'clean' is returning -EINVAL. > > Thoughts? I don't like the idea of a trap-door where once you get to 'shutdown' the only way out is 'clear'. But I could possibly live with a state like 'shutdown' which can transition to either 'clear' or 'inactive', but not directly to 'clean'. I cannot think of an immediate use for the 'shutdown'->'inactive' transition, but I wouldn't want to exclude it. But I really think the 'problem' is more around 'clean' then it is around 'inactive'. I think I would like writing 'clean' never to start the array. If we want to start the array in the 'clean' state, then write something else. I think we can just write 'active'. That will actually leave the array in a 'clean' state I think. I'd need to check that... So now I think we just arrange that 'clean' returns -EINVAL if mddev->pers == NULL. We never currently write 'clean' to start an array. We either use RUN_ARRAY, or mdmon writes either read_auto or active. So that is safe. So we need to: - remove the rd%d links earlier - remove the redundancy group earlier - disable starting an array by writing 'clean'. I think I've convinced myself. Have I convinced you? NeilBrown -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html