[ ... ]

>> * Suppose you have a 2+1 array which is full. Now you add a
>> disk and that means that almost all free space is on a single
>> disk. The MD subsystem has two options as to where to add
>> that lump of space, consider why neither is very pleasant.

> No, only one, at the end of the md device and the "free space"
> will be evenly distributed among the drives.

Not necessarily, however let's assume that happens. Since the
free space will have a different distribution than before, the
used space will too, so the physical layout will evolve like
this when growing a 3+1 out of a 2+1+1:

    2+1+1        3+1
   a b c d     a b c d
   -------     -------
   0 1 P F     0 1 2 Q     P: old parity
   P 2 3 F     Q 3 4 5     F: free block
   4 P 5 F     6 Q 7 8     Q: new parity
   .......     .......
               F F F F

How will the free space become evenly distributed among the
drives? Well, it sounds like 3 drives will be read (2 if not
checking parity) and 4 drives written; while on a 3+1 a mere
parity rebuild only writes to 1 drive at a time, even if it
reads from 3, and a recovery reads from 3 and writes to 2
drives.

Is that a pleasant option? To me it looks like begging for
trouble. For one thing the highest likelihood of failure is
when a lot of disks start running together doing much the same
things. RAID is based on the idea of uncorrelated failures...
An aside: in my innocence I realized only recently that online
redundancy and uncorrelated failures are somewhat contradictory.

Never mind that, since one is changing the layout, an
interruption in the process may leave the array unusable, even
if with no loss of data, and even if recent MD versions mostly
cope; from a recent 'man' page for 'mdadm':

  «Increasing the number of active devices in a RAID5 is much
  more effort. Every block in the array will need to be read
  and written back to a new location. From 2.6.17, the Linux
  Kernel is able to do this safely, including restarting an
  interrupted "reshape".

  When relocating the first few stripes on a raid5, it is not
  possible to keep the data on disk completely consistent and
  crash-proof. To provide the required safety, mdadm disables
  writes to the array while this "critical section" is
  reshaped, and takes a backup of the data that is in that
  section. This backup is normally stored in any spare devices
  that the array has, however it can also be stored in a
  separate file specified with the --backup-file option.»

Since the reshape reads from N drives *and then writes* to N+1
drives at almost the same time, things are going to be a bit
slower than a mere rebuild or recover: each stripe will be read
from the N existing drives and then written back to N+1 *while
the next stripe is being read from N* (or not...).

>> * How fast is doing unaligned writes with a 13+1 or a 12+2
>> stripe? How often is that going to happen, especially on an
>> array that started as a 2+1?

> They are all the same speed with raid5 no matter what you
> started with.

But I asked two questions that are not "how does the speed
differ". The answers to the two questions I asked are very
different from "the same speed" (they are "very slow" and
"rather often"):

* Doing unaligned writes on a 13+1 or 12+2 is catastrophically
  slow because of the RMW cycle. This is of course independent
  of how one got to something like a 13+1 or a 12+2 (see the
  sketch after this list).

* Unfortunately the frequency of unaligned writes *does*
  usually depend on how dementedly one got to the 13+1 or 12+2
  case: because a filesystem that lays out files so that
  misalignment is minimised with a 2+1 stripe just about
  guarantees that when one switches to a 3+1 stripe all
  previously written data is misaligned, and so on -- and never
  mind that every time one adds a disk a reshape is done that
  shuffles stuff around. There is a saving grace as to the
  latter point: many programs don't overwrite files in place
  but truncate and recreate them (which is not so good in
  general, but helps in this case).
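To make the arithmetic behind those two points concrete, here is
a minimal sketch in Python of the chunk-level cost of RAID5
writes, assuming only the two classic partial-write strategies
(read-modify-write and reconstruct-write); it is an illustration
of the argument, not a description of what the md/raid5 code
actually does:

    # Simplified model of RAID5 write cost at chunk granularity.

    def raid5_write_ios(n_data, k):
        """Chunk I/Os to update k of the n_data data chunks in one stripe.

        Returns (reads, writes) for the cheaper of:
          - read-modify-write: read the k old data chunks and old parity,
            write k new data chunks and new parity;
          - reconstruct-write: read the n_data - k untouched data chunks,
            write k new data chunks and new parity.
        A full-stripe write (k == n_data) needs no reads at all.
        """
        assert 1 <= k <= n_data
        if k == n_data:                      # aligned full-stripe write
            return (0, n_data + 1)
        rmw_reads = k + 1
        rcw_reads = n_data - k
        return (min(rmw_reads, rcw_reads), k + 1)

    for n_data in (2, 13):
        for k in (1, n_data):                # one-chunk write vs full stripe
            r, w = raid5_write_ios(n_data, k)
            print(f"{n_data}+1 array, {k:2d} chunk(s) written: "
                  f"{r} reads + {w} writes -> {(r + w) / k:.1f} I/Os per data chunk")

Under this model the overhead of a small write stays at a few
I/Os (plus the read-before-write latency) whatever the width;
the only cheap case is the full aligned stripe write, which on a
13+1 needs 13 contiguous, aligned chunks of data, and that is
exactly what becomes rare once the filesystem was laid out for a
narrower stripe and every reshape since has shuffled the
alignment.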
> You read two blocks and you write two blocks. (not even chunks
> mind you)

But we are talking about a *reshape* here, and to a RAID5. If
you add a drive to a RAID5 and redistribute in the obvious way
then existing stripes have to be rewritten, as the periodicity
of the parity changes from every N to every N+1.

>> * How long does it take to rebuild parity with a 13+1 array
>> or a 12+2 array in case of single disk failure? What happens
>> if a disk fails during rebuild?

> Depends on how much data the controllers can push. But at
> least with my hpt2320 the limiting factor is the disk speed

But here we are on the Linux RAID mailing list and we are
talking about software RAID. With software RAID a reshape with
14 disks needs to shuffle around the *host bus* (not merely the
host adapter as with hw RAID) almost 5 times as much data as
with 3 (say 14x80MB/s ~= 1GB/s sustained in both directions at
the outer tracks). The host adapter also has to be able to run
14 operations in parallel.

It can be done -- it is just somewhat expensive, but then what's
the point of a 14 wide RAID if the host bus and host adapter
cannot handle the full parallel bandwidth of 14 drives? Yet in
some cases RAID sets are built for capacity more than speed, and
with cheap hw it may not be possible to read or write 14 drives
in parallel, but something like 3-4. Then look at the
alternatives:

* Grow from a 2+1 to a 13+1 a drive at a time: every time the
  whole array is both read and written, and if the host cannot
  handle more than say 4 drives at once, the array will be
  reshaping for 3-4 times longer towards the end than at the
  beginning (something like 8 hours instead of 2; some rough
  numbers are in the sketch further down).

* Grow from 2+1 by adding say another 2+1 and two 3+1s: every
  time that involves just a few drives, existing drives are not
  touched, and a drive failure while building a new array is
  not an issue, because if the build fails there is no data on
  the failed array yet, and indeed the previously built arrays
  just continue to work.

At this point some very clever readers will shake their heads,
count the 1 drive wasted for resiliency in one case and 4 in
the other, and realize smugly how much more cost effective
their single-big-array scheme is. Good luck! :-)

> and that doesn't change whether I have 2 disks or 12.

Not quite, but another thing that changes is the probability of
a disk failure during a reshape. Neil Brown wrote recently in
this list (Feb 17th) this very wise bit of advice:

  «It is really best to avoid degraded raid4/5/6 arrays when at
  all possible.
  NeilBrown»

Repeatedly expanding an array means deliberately doing something
similar...

One amusing detail is the number of companies advertising disk
recovery services for RAID sets. They have RAID5 to thank for a
lot of their business, but array reshapes may well help too :-).
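Here are the rough numbers promised above for the
grow-a-drive-at-a-time path, as a small Python sketch; the drive
size, per-drive transfer rate and the host bandwidth cap are
assumptions picked purely for illustration:

    DRIVE_GB    = 500   # assumed drive capacity (GB)
    DRIVE_MBS   = 80    # assumed sustained rate per drive (MB/s)
    HOST_DRIVES = 4     # assumed: host bus/adapter can stream ~4 drives' worth

    step_hours = []
    moved_gb = 0.0
    for n_data in range(2, 13):            # reshape 2+1 -> 3+1, ..., 12+1 -> 13+1
        step_gb = 2 * n_data * DRIVE_GB    # all existing data read, then written back
        moved_gb += step_gb
        # each step is limited either by the drives taking part or by what
        # the host bus/adapter can actually stream, whichever is smaller
        bandwidth_mbs = min(n_data + 2, HOST_DRIVES) * DRIVE_MBS
        step_hours.append(step_gb * 1024 / bandwidth_mbs / 3600)

    print(f"first reshape ~{step_hours[0]:.1f} h, last reshape ~{step_hours[-1]:.1f} h")
    print(f"total shuffled ~{moved_gb / 1024:.0f} TB over ~{sum(step_hours):.0f} h,")
    print("all spent re-reading and re-writing data that was already safely on disk")

With those (made up) numbers that is on the order of 75TB
shuffled over a few days of reshaping just to end up with about
6.5TB of usable space, while the alternative of adding a 2+1 and
two 3+1s never reads or rewrites a byte of existing data: each
new array only costs the initial parity sync of its own few
drives.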
[ ... ]

>> [ ... ] In your stated applications it is hard to see why
>> you'd want to split your arrays into very many block devices
>> or why you'd want to resize them.

> I think the idea is to be able to have more than just one
> device to put a filesystem on. For example a / filesystem,
> swap and maybe something like /storage comes to mind.

Well, for a small number of volumes like that a reasonable
strategy is to partition the disks and then RAID those
partitions (a small sketch of such a layout is at the end of
this message). This can be done on a few disks at a time.

For archiving stuff as it accumulates (''digital attic'') just
adding disks and creating a large single partition on each disk
seems simplest and easiest. Even RAID is not that useful there
(because RAID, especially parity RAID, is not a substitute for
backups). But a few small (2+1, 3+1, in a desperate case even
4+1) mostly read-only RAID5s may be reasonable for that (as long
as there are backups anyhow).

> Yes, one could do that with partitioning but lvm was made for
> this so why not use it.

The problem with LVM is that it adds an extra layer of
complications and dependencies to things like booting and system
management. It can be fully automated, but then the list of
things that can go wrong increases.

BTW, good news: DM/LVM2 are largely no longer necessary, as one
can achieve much the same effect, including much the same
performance, by using the loop device on large files on a good
filesystem that supports extents, like JFS or XFS. To the point
that in a (slightly dubious) test some guy got better
performance out of Oracle tablespaces as large files than with
the usually recommended raw volumes/partitions...
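As for the partition-then-RAID suggestion above, here is the
promised sketch of what such a layout might look like, in Python
just to keep the arithmetic honest; the device names, partition
sizes and RAID levels are hypothetical, only meant to show the
shape of the idea:

    DISKS = ["sda", "sdb", "sdc"]          # hypothetical disk names
    # one (volume, partition size in GB, RAID level) per partition slot
    PLAN = [("/",        20, 1),           # small RAID1 root, easy to boot from
            ("swap",      4, 1),           # mirrored swap survives a dead disk
            ("/storage", 400, 5)]          # the bulk of the space as a 2+1 RAID5

    for slot, (vol, size_gb, level) in enumerate(PLAN, start=1):
        members = [f"{disk}{slot}" for disk in DISKS]    # e.g. sda1, sdb1, sdc1
        # usable space: one member's worth for RAID1, members-1 for RAID5
        usable = size_gb if level == 1 else (len(members) - 1) * size_gb
        print(f"md for {vol:9s}: RAID{level} over {', '.join(members)}"
              f" -> {usable} GB usable")

Each volume gets its own small md array, the arrays can be
created (or later replaced) a few partitions at a time, and none
of it needs an LVM layer underneath.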