Hi Neil,
Thanks for your analysis and suggestions.
I will try to address them.
We are testing the current implementation and it is working (at least for
the scenarios tested so far).
Thanks,
Mariusz
On 29.09.2021 01:56, NeilBrown wrote:
On Thu, 27 Oct 2016, Tomasz Majchrzak wrote:
If an update of the acknowledged bad blocks file is notified, read the
entire bad block list from the sysfs file and compare it against the local
list of bad blocks. If any obsolete entries are found, remove them from the
metadata. As mdmon cannot perform any memory allocation, the new superswitch
method get_bad_blocks is expected to return the list of bad blocks held in
the metadata without allocating memory. It is up to the metadata handler to
allocate all required memory in advance.
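(For illustration only: the contract just described is roughly of the
following shape. The structure names, the superswitch_bb wrapper and the
field layout are made up for this sketch and are not mdadm's exact
definitions.)

/* Illustrative sketch of the "no allocation at runtime" contract: the
 * metadata handler allocates the entry array up front, and mdmon only
 * ever reads or rewrites entries inside it. */
struct bb_entry {
	unsigned long long sector;
	int length;
};

struct bb_list {
	int count;                 /* entries currently valid */
	int capacity;              /* fixed, allocated in advance */
	struct bb_entry *entries;  /* pre-allocated by the handler */
};

struct active_array;               /* opaque here */

struct superswitch_bb {
	/* hypothetical hook: return the metadata's bad block list for
	 * one member device without calling malloc() at call time */
	struct bb_list *(*get_bad_blocks)(struct active_array *a, int slot);
};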
hi,
only 5 years late to this party :-)
I recently had cause to look at this code and ... there are problems.
Primarily, it assumes that the "bad_blocks" file contains a complete
list of bad blocks known to the kernel. This is not correct.
As the documentation and nearby comments say, the contents of this file
are truncated to PAGE_SIZE. It is not meant to be a complete list, only
an indicative one.
There is no way to get a complete list from the kernel once the list
gets too large. Probably we should design and implement a reliable way
to extract this information. I imagine it would be something like
unacknowledged_bad_blocks, in that mdmon could read some information,
then confirm that it has been read, then read some more. But until
that is done, this code should be careful not to assume that the list
is complete - at least not without checking.
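Purely as a sketch of that idea (nothing below exists today; the
"complete_bad_blocks" file name and the "ack" write are invented for
illustration), the userspace side might look like:

/* HYPOTHETICAL: a "read a chunk, acknowledge it, read the next chunk"
 * loop for a future interface; current kernels provide nothing like it. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void record_ranges(const char *buf, ssize_t len)
{
	/* parse "sector length\n" pairs and note them locally */
	(void)buf;
	(void)len;
}

static int read_complete_bad_blocks(const char *sysfs_dir)
{
	char path[256], buf[4096];

	snprintf(path, sizeof(path), "%s/complete_bad_blocks", sysfs_dir);

	for (;;) {
		int fd = open(path, O_RDWR);
		ssize_t n;

		if (fd < 0)
			return -1;
		n = read(fd, buf, sizeof(buf));
		if (n <= 0) {
			/* nothing left: the whole list has been seen */
			close(fd);
			return n < 0 ? -1 : 0;
		}
		record_ranges(buf, n);
		/* confirm this chunk was consumed so the next open/read
		 * returns the following part of the list */
		if (write(fd, "ack\n", 4) != 4) {
			close(fd);
			return -1;
		}
		close(fd);
	}
}

The crucial property is the explicit acknowledgement step, so mdmon never
has to guess whether a PAGE_SIZE-bounded read was truncated.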
Secondly, the interface with the metadata handler is a bit odd.
The 'check_for_cleared_bb' essentially does:
- call ->record_bad_block for all blocks known to the kernel
- call ->clear_bad_block for all blocks that were in the metadata
in that order.
It isn't quite that simple, as there are optimisations:
if a range from the kernel exactly matches a range in the metadata, the
range is neither recorded nor cleared.
If a range from the kernel is a subrange of a range in metadata,
then the larger range is cleared before the new range is added,
AS WELL AS after.
If there are other overlaps, then the kernel range is recorded
before the metadata range is cleared. This *seems* wrong. I would
expect this to clear part of the range that had just been added.
However, it doesn't. The ->clear_bad_block interface *DOESN'T* remove
all the blocks in the range from the bbl. Rather, if the exact range
given appears as one of the ranges in the bbl, then that range is
deleted. Otherwise no change happens.
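To make that concrete, the behaviour the current code depends on is roughly
the following (the structures and the function name are invented for this
sketch; the real imsm bbl layout is different):

/* Sketch of the "exact match or nothing" semantics described above. */
struct bb_range {
	unsigned long long sector;
	int length;
};

struct bbl {
	int count;
	struct bb_range r[256];
};

/* Delete an entry only when the exact (sector, length) pair is present
 * in the list; any other overlap leaves the list untouched. */
static int clear_exact_match(struct bbl *bbl, unsigned long long sector,
			     int length)
{
	int i;

	for (i = 0; i < bbl->count; i++) {
		if (bbl->r[i].sector == sector &&
		    bbl->r[i].length == length) {
			bbl->r[i] = bbl->r[--bbl->count];
			return 1;	/* that one range removed */
		}
	}
	return 0;			/* no exact match: nothing cleared */
}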
These semantics are surprising. The net result is that the code
probably works with the imsm backend. However, if someone else wrote a
different backend which implemented ->clear_bad_block to actually
remove the entire range from the bbl, then it would clear more blocks
than it should.
I think it would be really good to re-implement this code in a way
that was more maintainable.
I don't think "check_for_cleared_bb()" should *ever* record new bad
block ranges. They get recorded through the unacknowledged_bad_block
processing. "check_for_cleared_bb()" should ONLY delete blocks from the
bbl, and it should ONLY do that if it is certain that the information in
"bad_blocks" is complete.
It should read bad_blocks in a single read(). If the returned data
ends with a newline, and is not a power-of-2 in size, then it is
safe to assume that it is complete.
If it doesn't end with a newline, then it is definitely not complete.
If it is a power-of-2 less than 4096, then it can be assumed to be
complete. If it is exactly 4096 bytes, or a larger power of two, then
it is not safe to assume that it is complete.
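A minimal sketch of that check, assuming the caller already has the raw
bytes from a single read() in hand (the function name is made up, not
something that exists in mdadm):

#include <stdbool.h>
#include <stddef.h>

/* Heuristic described above: can one read() of "bad_blocks" be treated
 * as the complete list? */
static bool bad_blocks_read_is_complete(const char *buf, size_t len)
{
	bool pow2 = len != 0 && (len & (len - 1)) == 0;

	if (len == 0)
		return true;	/* empty file: no bad blocks at all */
	if (buf[len - 1] != '\n')
		return false;	/* cut off mid-line: definitely incomplete */
	if (pow2 && len >= 4096)
		return false;	/* may have hit the PAGE_SIZE limit */
	return true;		/* ends in '\n' and is safely short */
}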
Thanks,
NeilBrown