Hi Neil,
Thanks for your analysis and suggestions.
I will try to address them.
We are testing the current implementation and it is working (at least for
the scenarios tested so far).
Thanks,
Mariusz
On 29.09.2021 01:56, NeilBrown wrote:
On Thu, 27 Oct 2016, Tomasz Majchrzak wrote:
If an update of the acknowledged bad blocks file is notified, read the
entire bad block list from the sysfs file and compare it against the local
list of bad blocks. If any obsolete entries are found, remove them from the
metadata. As mdmon cannot perform any memory allocation, the new superswitch
method get_bad_blocks is expected to return the list of bad blocks held in
the metadata without allocating memory. It is up to the metadata handler to
allocate all required memory in advance.
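(For illustration only: the contract just described is roughly of the
following shape. The structure names, the superswitch_bb wrapper and the
field layout are made up for this sketch and are not mdadm's exact
definitions.)

/* Illustrative sketch of the "no allocation at runtime" contract: the
 * metadata handler allocates the entry array up front, and mdmon only
 * ever reads or rewrites entries inside it. */
struct bb_entry {
	unsigned long long sector;
	int length;
};

struct bb_list {
	int count;                 /* entries currently valid */
	int capacity;              /* fixed, allocated in advance */
	struct bb_entry *entries;  /* pre-allocated by the handler */
};

struct active_array;               /* opaque here */

struct superswitch_bb {
	/* hypothetical hook: return the metadata's bad block list for
	 * one member device without calling malloc() at call time */
	struct bb_list *(*get_bad_blocks)(struct active_array *a, int slot);
};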
hi,
only 5 years late to this party :-)
I recently had cause to look at this code and ... there are problems.
Primarily, it assumes that the "bad_blocks" file contains a complete
list of bad blocks known to the kernel. This is not correct.
As the documentation and nearby comments say, the contents of this file
are truncated to PAGE_SIZE. It is not meant to be a complete list, only
an indicative one.
There is no way to get a complete list from the kernel once the list
gets too large. Probably we should design and implement a reliable way
to extract this information. I imagine it would be something like
unacknowledged_bad_blocks, in that mdmon could read some information,
then confirm that it has been read, then read some more. But until
that is done, this code should be careful not to assume that the list
is complete - at least not without checking.
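Purely as a sketch of that idea (nothing below exists today; the
"complete_bad_blocks" file name and the "ack" write are invented for
illustration), the userspace side might look like:

/* HYPOTHETICAL: a "read a chunk, acknowledge it, read the next chunk"
 * loop for a future interface; current kernels provide nothing like it. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void record_ranges(const char *buf, ssize_t len)
{
	/* parse "sector length\n" pairs and note them locally */
	(void)buf;
	(void)len;
}

static int read_complete_bad_blocks(const char *sysfs_dir)
{
	char path[256], buf[4096];

	snprintf(path, sizeof(path), "%s/complete_bad_blocks", sysfs_dir);

	for (;;) {
		int fd = open(path, O_RDWR);
		ssize_t n;

		if (fd < 0)
			return -1;
		n = read(fd, buf, sizeof(buf));
		if (n <= 0) {
			/* nothing left: the whole list has been seen */
			close(fd);
			return n < 0 ? -1 : 0;
		}
		record_ranges(buf, n);
		/* confirm this chunk was consumed so the next open/read
		 * returns the following part of the list */
		if (write(fd, "ack\n", 4) != 4) {
			close(fd);
			return -1;
		}
		close(fd);
	}
}

The crucial property is the explicit acknowledgement step, so mdmon never
has to guess whether a PAGE_SIZE-bounded read was truncated.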
Secondly, the interface with the metadata handler is a bit odd.
The 'check_for_cleared_bb' essentially does:
- call ->record_bad_block for all blocks known to the kernel
- call ->clear_bad_block for all blocks that were in the metadata
in that order.
It isn't quite that simple, as there are optimisations:
if a range from the kernel exactly matches a range in the metadata, the
range is neither recorded nor cleared.
If a range from the kernel is a subrange of a range in metadata,
then the larger range is cleared before the new range is added,
AS WELL AS after.
If there are other overlaps, then the kernel range is recorded
before the metadata range is cleared. This *seems* wrong. I would
expect this to clear part of the range that had just been added.
However, it doesn't. The ->clear_bad_block interface *DOESN'T* remove
all the blocks in the range from the bbl. Rather, if the exact range
given appears as one of the ranges in the bbl, then that range is
deleted. Otherwise no change happens.
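To make that concrete, the behaviour the current code depends on is roughly
the following (the structures and the function name are invented for this
sketch; the real imsm bbl layout is different):

/* Sketch of the "exact match or nothing" semantics described above. */
struct bb_range {
	unsigned long long sector;
	int length;
};

struct bbl {
	int count;
	struct bb_range r[256];
};

/* Delete an entry only when the exact (sector, length) pair is present
 * in the list; any other overlap leaves the list untouched. */
static int clear_exact_match(struct bbl *bbl, unsigned long long sector,
			     int length)
{
	int i;

	for (i = 0; i < bbl->count; i++) {
		if (bbl->r[i].sector == sector &&
		    bbl->r[i].length == length) {
			bbl->r[i] = bbl->r[--bbl->count];
			return 1;	/* that one range removed */
		}
	}
	return 0;			/* no exact match: nothing cleared */
}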
These semantics are surprising. The net result is that the code
probably works with the imsm backend. However, if someone else wrote a
different backend which implemented ->clear_bad_block to actually
remove the entire range from the bbl, then it would clear more blocks
than it should.
I think it would be really good to re-implement this code in a way
that was more maintainable.
I don't think "check_for_cleared_bb()" should *ever* record new bad
block ranges. They get recorded through the unacknowledged_bad_block
processing. "check_for_cleared_bb()" should ONLY delete blocks from the
bbl, and it should ONLY do that if it is certain that the information in
"bad_blocks" is complete.
It should read bad_blocks in a single read(). If the returned data
ends with a newline, and is not a power-of-2 in size, then it is
safe to assume that it is complete.
If it doesn't end with a newline, then it is definitely not complete.
If it is a power-of-2 less than 4096, then it can be assumed to be
complete. If it is exactly 4096 bytes, or a larger power of two, then
it is not safe to assume that it is complete.
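A minimal sketch of that check, assuming the caller already has the raw
bytes from a single read() in hand (the function name is made up, not
something that exists in mdadm):

#include <stdbool.h>
#include <stddef.h>

/* Heuristic described above: can one read() of "bad_blocks" be treated
 * as the complete list? */
static bool bad_blocks_read_is_complete(const char *buf, size_t len)
{
	bool pow2 = len != 0 && (len & (len - 1)) == 0;

	if (len == 0)
		return true;	/* empty file: no bad blocks at all */
	if (buf[len - 1] != '\n')
		return false;	/* cut off mid-line: definitely incomplete */
	if (pow2 && len >= 4096)
		return false;	/* may have hit the PAGE_SIZE limit */
	return true;		/* ends in '\n' and is safely short */
}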
Thanks,
NeilBrown