Re: [PATCH 00/24] Clustered MD RAID1

On Tue, 10 Feb 2015 11:00:31 -0600 Goldwyn Rodrigues <rgoldwyn@xxxxxxx> wrote:

> Hi Neil,
> 
> >
> >
> > hi Goldwyn,
> >   thanks for these - and sorry for the long delay.  Lots of leave over the
> >   southern summer, and then lots of email etc. to deal with.
> >
> > This patch set is very close and I am tempted to just apply it and then
> > fix things up with subsequent patches.  In order to allow that, could you
> > please:
> >    - rebase against current upstream
> >    - fix the checkpatch.pl errors and warnings.
> >      The "WARNING: line over 80 characters" are often a judgement call
> >      so I'm not particularly worried about those.  Most, if not all, of
> >      the others should be followed just to have consistent layout.
> 
> Done.

ERROR: code indent should use tabs where possible

  When you use spaces, they show up in red for me!!
Ditto for
WARNING: please, no space before tabs

WARNING: quoted string split across lines
   It really is best to fix those, even though it makes the line long.
   When grepping to find out where a message comes from, it is very annoying
   if the grep fails because the line was split.
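   e.g. it is better to have (the message text here is purely illustrative,
   not from your patches):

	pr_err("md-cluster: failed to get lock on the message resource\n");

   than

	pr_err("md-cluster: failed to get lock on "
	       "the message resource\n");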

WARNING: Missing a blank line after declarations
   Worth fixing I think.

WARNING: printk() should include KERN_ facility level
   Definitely should be fixed, maybe make it pr_warn()??
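   i.e. (again, an illustrative message only):

	printk("md-cluster: joined cluster %s\n", str);

   becomes

	pr_warn("md-cluster: joined cluster %s\n", str);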

> 
> >
> > Then I'll queue them up for 3.21, providing I don't find anything that would
> > hurt non-cluster usage ....
> >   On that topic: why initialise rv to -EINVAL in "metadata_update sends
> >   message...".  That looks wrong.
> 
> Yes, this is fixed.
> 
> >
> > I noticed that a number of times a patch will revert something that a
> > previous patch added.  It would be much nicer to fold these changes back into
> > the original patch.  Often this is just extra blank lines, but occasionally
> > variable names are changed (md -> mddev).  It should be given the final name
> > when introduced.  Every chunk in every patch should be directly relevant to
> > that patch.
> 
> I have cross-checked this and I did not find anything with respect to 
> variable names. I did some cleanup with respect to the code though.
> 
> There is one instance where I have used a variable: cluster_setup_done 
> and then removed it. I think this is required to understand the patch 
> and a smooth transition to subsequent patches. However, if you want me 
> to aggressively remove that part, I should be able to do that.

No, "cluster_setup_done" makes sense. It is scaffolding that you later need
to remove.

I'm probably letting my OCD tendencies get carried away, but these are some of
the things that I noticed:


 "Introduce md_cluster_info" moves 'bast' to a new location in
 dlm_lock_resource for no apparent reason.
 Also 'leave()' has a parameter which is changed from 'md' to 'mddev',
 as does 'join'.

 "Add node recovery callbacks" adds a comment to the 'nodes' field of 'struct
 mddev'.  Why not add the comment when the field is added?
 Oh, and it mis-spells "unmber".


 In "Gather on-going resync information of other nodes" you have:

 static struct md_cluster_operations cluster_ops = {
        .join   = join,
        .leave  = leave,
-       .slot_number = slot_number
+       .slot_number = slot_number,
+       .resync_info_update = resync_info_update
 };


It is really best to put a comma at the end of each entry, even the last.
Then the patch would have been:

 static struct md_cluster_operations cluster_ops = {
        .join   = join,
        .leave  = leave,
        .slot_number = slot_number,
+       .resync_info_update = resync_info_update,
 };

which is much nicer to read.  You finally get this right in
  "Suspend writes in RAID1 if within range" :-)

> 
> >
> > Some other issues, that could possibly be fixed up afterwards:
> >
> >   - Is a clustername 64 bytes or 63 bytes?  I would have thought 64,
> >     but the use of strlcpy makes it 63 plus a nul.  Is that really what is
> >     wanted?
> 
> Yes, it is 64 bytes. I haven't fixed this as yet.
> 
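To spell out what I meant: strlcpy() always NUL-terminates, so it copies at
most size-1 bytes of the source.  i.e. (purely illustrative):

	char name[64];

	/* at most 63 bytes of the cluster name survive, plus a NUL */
	strlcpy(name, buf, sizeof(name));

If all 64 bytes of the name are significant, memcpy() would be needed
instead.
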
> >
> >   - Based on https://lkml.org/lkml/2012/10/23/580 it might be good to add
> >     "default n" to Kconfig, and possible add a WARN() if anyone tries to use
> >     the code.
> 
> Done. Added pr_warn while loading the module.
> 
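For the record, something like this at module init is what I had in mind
(the function name and message text are purely illustrative):

	static int __init cluster_init(void)
	{
		/* warn early adopters that this code is experimental */
		pr_warn("md-cluster: EXPERIMENTAL. Use with caution\n");
		return 0;
	}
	module_init(cluster_init);
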
> >
> >   - I'm a bit concerned about the behaviour on node failure.
> >     When a node fails, two things must happen w.r.t the bits in that node's
> >     bitmap.
> >     1/ The corresponding regions of the array need to be resynced.  You do have
> >        code to do this.
> >     2/ Other nodes must avoid read-balancing on those regions until the
> >        resync has completed.
> >
> >     You do have code for this second bit, but it looks wrong.  It avoids
> >     read-balancing if ->area_resyncing().  That isn't sufficient.
> >     The "area_resyncing" is always (I assume) a relatively small region of
> >     the array which will be completely resynced quite quickly.  It must be
> >     because writes are blocked to this area.  However the region in which
> >     we must disable re-balancing can be much larger.  It covers *all* bits
> >     that are set in any unsynced bitmap.  So it isn't just the area that is
> >     currently being synced, but all areas that will be synced.
> 
> What are unsynced bitmaps? Are they bitmaps which are associated with an 
> active node or dirty bitmaps with dead nodes? If it is the former, I 
> agree this is not enough. If it is the latter, all nodes maintain a linked
> list of all the nodes which are currently performing resync (probably
> because multiple nodes died simultaneously). One node performs the
> recovery (aka bitmap resync) of exactly one "dead" node at a time. 
> area_resyncing goes through all the nodes which are performing resync.

The latter - bitmaps associated with a dead node.
Bitmaps associated with an active node contain transient information, and the
filesystem will ensure that it never reads from somewhere that someone else
might be writing (or if it does, it will know that the data cannot be
trusted).

I looked at the code again, and discovered that I had the problem backwards.
But there is still a problem.

When any node is resyncing, your code blocks writes for the entire span of the
array from wherever the resync is up to, to the end of the device.
So a write to a location near the end of the device will hang until all
resyncs finish.  This could be a much longer time than you would like writes
to hang for.

I think that the resyncing host should only report that it is resyncing a
relatively small range of the array, maybe 100Meg.  Maybe 1G.
Then that would only block access to that small part of the array, which
should clear in just a few seconds at most.
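
In code terms, I imagine something like this where the resync range is
published (a sketch only: RESYNC_WINDOW_SECTORS is a hypothetical constant,
and I'm assuming resync_info_update() takes a lo/hi range):

	/*
	 * Advertise a bounded window rather than everything from the
	 * current position to the end of the device, so that writes
	 * elsewhere in the array aren't blocked for the whole resync.
	 * 2*1024*1024 512-byte sectors == 1GB.
	 */
	#define RESYNC_WINDOW_SECTORS	(2 * 1024 * 1024)

	sector_t lo = mddev->curr_resync;
	sector_t hi = min_t(sector_t, lo + RESYNC_WINDOW_SECTORS,
			    mddev->resync_max_sectors);
	md_cluster_ops->resync_info_update(mddev, lo, hi);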

This information on the range being synced is not enough to limit
read-balancing.
I imagined that *every* node would read the bitmap for a failed node, and
would use that information to limit read-balancing.  There are some
complexities in this though.
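
Very roughly, I imagine the check in read_balance() looking something like
this ('unsynced_region' being a new, purely hypothetical op that would
consult the dead nodes' bitmaps):

	/*
	 * Don't read-balance if the sector falls in a region that is
	 * dirty in some dead node's bitmap and hasn't been resynced
	 * yet; just read from the first working device instead.
	 */
	if (mddev_is_clustered(conf->mddev) &&
	    md_cluster_ops->unsynced_region(conf->mddev, this_sector, sectors))
		choose_first = 1;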


So the current code isn't "wrong" exactly, but I think it could cause severe
delays in some (unusual) circumstances.

> 
> >
> >   - I think md_reload_sb() might be too simple.  It probably should check that
> >     nothing serious has changed.  The "mddev->raid_disks = 0" looks suspicious.
> >     I'll have to think about this a bit more.
> 
> Yes, I get that feeling as well. However, I am not sure how to perform 
> an exact comparison to understand what has changed. Perhaps it needs a 
> new flag?

Probably.  I haven't thought much about it.

> 
> >
> > That's all I can see for now.  I'll have another look once I have it all in my tree.
> >
> 
> I have put all the changes in my git:
> https://github.com/goldwynr/linux
> The branch cluster-md is against the latest upstream. I also performed a 
> small sanity test to check everything is working properly.
> 
> Let me know if you would want me to repost the entire patchset to the 
> mailing list.
> 
> 

I don't think there is any need for that.
I won't pull it in just yet - to give you a chance to resolve the last of the
checkpatch problems.
Then I'll double check that there is no risk to non-cluster users and try to
get it into -next after 3.20-rc2 is out.

Thanks,
NeilBrown
