Re: [RFC PATCH] md: towards a replacement for the raid5 STRIPE_OP_* flags

Neil Brown <neilb@xxxxxxx> · Tue, 6 May 2008 14:10:28 +1000

On Monday May 5, dan.j.williams@xxxxxxxxx wrote:
> The STRIPE_OP_* flags record the state of stripe operations which are
> performed outside the stripe lock.  Their use in indicating which
> operations need to be run is straightforward; however, interpolating what
> the next state of the stripe should be based on a given combination of
> these flags is not straightforward, and has led to bugs.  An easier to read
> implementation with minimal degrees of freedom is needed.

Thanks a lot for looking at this.

Yes, fewer degrees of freedom does seem like a good direction.

I cannot quite decide if I like what you have done or not.  That fact
that you found a bug speaks highly in favour of the approach you
took.  However it doesn't seem to be removing any flag bits, which
bothers me.  I had hoped that at least one and possibly two of
ops.{pending,ack,complete} could go.

You have made the state-machine aspect of the code more explicit by
introducing labels for the high-level states.
When you have a multi-threads machine as we do, I think it is
sometimes easier to encode just the low-level state of the data (which
is harder to get wrong due to races) and each time make a decision
based on that.

This is what we were previously doing, but it seems to be just a
little to complex.  Maybe that is telling us the for sufficiently
complex machines, focussing on the high-level states is the best way
to solve the complexity... but I'm not 100% convinced yet and am keen
to explore.

One aspect of your states that seems a bit odd is that you have both
"idle" and "done" states.  As states, these seem to me to mean much
the same thing - nothing is happening.  You do some "finish up"
tasking in the "done" state.  Maybe these should be done on the
transition from "run" back to "idle".... but maybe locking
requirements make that awkward...

As I said, I would like to see only one bitfield in place of the
current pending,ack,complete.

The way I imagine this is as follows:
  When the machine enters some state (xxx_run) there are some number
  of tasks that need to be accomplished.
  We set "state = xxx_run" and ops.pending = set.of.bits.

  Then outside the lock we doing:
    if (test_and_clean(bit, &ops.pending))
	atomic_inc(&sh->count);
	start 'bit' operation
  for each bit.
  When the operation completes, sh->count is decremented.

  So when ops.pending is 0, and sh->count is zero, xxx_run must have
  completed.
  Possibly we inc sh->count when we set the bit in pending rather than
  when we clear it.  That makes it universal that if sh->count is 0,
  then it is time to assess the current state and move on.

I remains to be seen if the various states, particularly of parity
check and repair, can be adequately deduced from the general state of
the stripe, or whether we need an explicit state variable like you
introduced.

Would you be interested in trying to remove .ack and .complete as
described above and see where that leave us?

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html