On Thu, 27 Oct 2016, Tomasz Majchrzak wrote: > If an update of acknowledged bad blocks file is notified, read entire > bad block list from sysfs file and compare it against local list of bad > blocks. If any obsolete entries are found, remove them from metadata. > > As mdmon cannot perform any memory allocation, new superswitch method > get_bad_blocks is expected to return a list of bad blocks in metadata > without allocating memory. It's up to metadata handler to allocate all > required memory in advance. hi, only 5 years late to this party :-) I recently had cause the look at this code and ... there are problems. Primarily, it assumes that the "bad_blocks" file contains a complete list of bad blocks known to the kernel. This is not correct. As the documentation and nearby comments say, the contents of this file is truncated to PAGE_SIZE. It is not meant to be a complete list, only an indicative list. There is no way to get a complete list from the kernel once the list gets too large. Probably we should design and implement a reliable way to extract this information. I imagine it would be something like unacknowledged_bad_blocks, in that mdmon could read some information, then confirm that it has been read, then read some more. But until that it done, this code should be careful not to assume that the list is complete - at least not without checking. Secondly, the interface with the metadata handler is a bit odd. The 'check_for_cleared_bb' essentially does: - call ->record_bad_block for all blocks known to the kernel - call ->clear_bad_block for all blocks that were in the metadata in that order. It isn't quite that simple as there are optimisations: if a range from kernel exactly matches a range in metadata, the range is neither recorded or cleared If a range from the kernel is a subrange of a range in metadata, then the larger range is cleared before the new range is added, AS WELL AS after. If there are other overlaps, then the kernel range is recorded before the metadata range is cleared. This *seems* wrong. I would expect this to clear part of the range that had just been added. However, it doesn't. The ->clear_bad_block interface *DOESN'T* remove all the block in the range from the bbl. Rather, if the exact range given appears as one of the ranges in the bbl, then that range is deleted. Otherwise no change happens. These semantics are surprising. The net result is that the code probably works with the imsm backend. However if someone else wrote a different backend which implemented ->clear_bad_block to actually remove the entire range from the bbl, then it would clear more blocks than it should. I think it would be really good to re-implement this code in a way that was more maintainable. I don't think "check_for_cleared_bb()" should *ever* record new bad block ranges. They get recorded through the unacknowledged_bad_block processing. "check_for_cleared_bb()" should ONLY delete blocks from the bbl, and it should ONLY do that if it certain that the information in "bad_blocks" is complete. It should read bad_blocks in a single read(). If the returned data ends with a newline, and is not a power-of-2 in size, then it is safe to assume that it is complete. If it doesn't end with a newline, then it is definitely not complete. If it is a power-of-2 less than 4096, then it can be assumed to be complete. If it is exactly 4096 bytes, or a larger power of two, then it is not safe to assume that it is complete. Thanks, NeilBrown