Hello everyone,

Stefan Bader and I are currently looking into the device mapper and its mirror target, with the intention of improving its response characteristics. (We already got some early feedback from Alasdair and others, so just assume that the bad ideas are ours and the good ones come from those helpful people. :-)

The current implementation of device mapper relies entirely on the underlying devices to report success or failure within a short time frame. With SAN devices, however, there may be longer delays due to network congestion or error recovery within the storage device or the network. For the mirror target this means that a single delayed physical device will delay the whole virtual device. Applications, on the other hand, also expect certain response times before they take recovery actions of their own.

We want to improve the response time of a virtual device by allowing the mirror volume to go out of sync to a certain degree: if one of the underlying devices doesn't respond within a user-defined time frame but other devices in the mirror group do, the unresponsive device becomes 'degraded' and is not used for further reading or writing until all of its outstanding requests have completed.

Of course there are several issues that have to be solved.

First issue: Risk of data corruption.

Bios that are still running on some of the devices but have already been reported as finished to the upper layer may cause data inconsistency: the memory referred to by the bios may be changed before the device has actually performed the read or write, so the device would update memory or disk in an inconsistent way. We see two ways to deal with this problem:

1. Implement a way to control a request that is still running on a device, so that it can be stopped before it finishes. A possible implementation would be to introduce new functionality in the block layer that allows a submitted request to be cancelled.
If one of the low-level devices doesn't return within the requested time frame, the device mapper could actively cancel the request on that device. The low-level driver of that device must make sure that a cancelled request will not be executed and thus will not change disk or memory in any unexpected way.

2. Make sure that the memory used by any request running on a device is not changed. This can be achieved by making a deep copy of every bio and submitting those clones to the devices instead of the original bios. This could be implemented within device mapper or the mirror target itself: instead of just creating a set of bios that all refer to the original memory pages, the memory pages would be copied as well. When one of the low-level devices does not return within the requested time frame, that device can simply be ignored, and device mapper can return the original bio without risk of data inconsistency. Once this happens to a device, we should stop cloning bios for it until we know it is fully operational again and the devices have been resynchronized.

Some further considerations:

- Cloning is expensive, and we probably want a cancellation method in the long term. The cancellation approach, though, will need a lot of discussion, and it may take some time before the needed infrastructure is there. We should design the interfaces in a way that allows us to do either cloning or cancelling.

- Cloning, or rather caching of data, is something that is needed in other places as well. The raid4 and raid5 targets already implement a data cache. This code might be moved out of those targets for generic use.

Second issue: Where should the timeout be handled?

We want to improve the mirror target, but other targets might profit from a timeout interface as well. For example, on a multipath target a bio is sent to the target via one path; if that path fails, the bio must be resent via a different path. But if a path does not react at all, the bio can't be resent.
In such a stuck-path situation, a timeout could help to break the deadlock. So the timeout interface should be implemented in a generic way that makes it usable by all device mapper targets.

Third issue: Logging and error recovery.

With our approach, disks will go out of sync not just in error situations but also because of bad performance, which is much more likely to happen. So the mechanisms for resynchronization must be able to deal with this.

We also need to think about error recovery. What happens if we reboot a system, the master disk of a mirror fails, and only an out-of-sync copy comes up? What happens if our log device fails? As long as we have only one log, we have a single point of failure; but if we had two logs, how would we recognize the up-to-date log during error recovery?

Well, these are our ideas and intentions so far. Do you think this makes sense at all? Are there any issues that we overlooked? We appreciate any kind of feedback. :-)

As the next step we will think about the new interfaces that would be needed and what they could look like.

Best Regards / Mit freundlichen Gruessen

Stefan Weinhuber

-------------------------------------------------------------------
IBM Deutschland Entwicklung GmbH
Linux for zSeries Development & Services

--
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel