On Tue, 04 May 2010 14:18:25 +0200 Mickaël Canévet <canevet@xxxxxxx> wrote:
> Hi,
>
> I'm testing ceph on 4 old servers.
>
> As there is more than one disk per server available for data (2 with 6
> disks and 2 with 10 disks, for a total of 32 disks over 4 nodes), I was
> wondering how to define OSDs.
>
> I have the choice between one OSD per disk (32 OSDs on the cluster) or
> one OSD per server with one btrfs filesystem over all disks of the
> server (4 OSDs on the cluster). Which one is the best solution?
>
> In the first case, if I lose one disk, I lose only a small part of the
> available space. In the other case, if I lose one disk, I lose the whole
> server (as the btrfs filesystem is striped), which is much more space.

Hi,

I too am facing a similar dilemma:

Scenario 1: I can set up an MD raid6 array for each OSD box, so I can
afford up to 2 simultaneous disk failures without Ceph noticing anything
wrong. When the 3rd drive fails, a long time will be spent redistributing
data across the cluster (though much less time than a plain 25TB raid6
rebuild). This setup should be quite simple, and a 16-disk raid6 should
generally give nice performance. I would probably use 2-way data
replication (in the Ceph config) in this case.

Scenario 2: I can configure 1 OSD per disk. As soon as a drive fails,
there will be data redistribution across the remaining OSDs - but this
should be quite fast, as only the contents of a single drive (or slightly
more, in the worst case) have to be redistributed across the cluster. In
this case I would use 3-way replication, for added protection against
simultaneous double drive failures and to compensate for the OSDs not
having a raid array underneath them.

I can see several potential advantages in "Scenario 2":

* Greater simplicity and ease of administration, as there's no need to
  worry about RAID arrays, their configuration and their possible bugs.
  You have one less layer in the stack to worry about, and that has to
  be good news.

* You can replace failed drives with different drives without worrying
  about wasted capacity when they are bigger (as you would on raid), and
  you can even take advantage of older, smaller drives that would
  otherwise go to the trash can. Overall, this gives greater freedom when
  upgrading hardware.

* Degradation of available cluster capacity and bandwidth would be much
  more gradual. In fact, assuming you don't have many power supplies or
  mainboards burning up, the cluster will maintain redundancy as drives
  fail: as long as you have more raw capacity than
  (amount_of_data * replication_level), the cluster will probably stay
  in a good, fully redundant state. That should make for better sleep at
  night. (I've put some rough numbers on the two scenarios right after
  this list.)

* Workloads with small, scattered writes should perform better. In a
  RAID array those can cause entire stripes to be read, requiring data
  chunks to be read from a lot of disks just to recompute the redundancy
  chunks. This should be quite an advantage for big mail server
  workloads, which is one of the workloads I'm interested in.

* Large write performance should be no worse than with raid, since Ceph
  also spreads chunks across OSDs.
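Just to put rough numbers on the capacity/redundancy trade-off, here is
the back-of-the-envelope arithmetic I have in mind. It is only an
illustrative Python sketch with made-up disk sizes and a simplified
8-disks-per-node layout (the real boxes have 6 or 10 disks); nothing in
it is Ceph-specific:

    # Back-of-the-envelope usable capacity for the two scenarios.
    # Illustrative numbers only: 4 nodes, 8 x 1TB disks each.
    NODES = 4
    DISKS_PER_NODE = 8
    DISK_TB = 1.0

    raw_tb = NODES * DISKS_PER_NODE * DISK_TB

    # Scenario 1: one raid6 array per node (loses 2 disks to parity),
    # one OSD per node, 2-way Ceph replication.
    s1_usable = NODES * (DISKS_PER_NODE - 2) * DISK_TB / 2

    # Scenario 2: one OSD per disk, no parity overhead, 3-way replication.
    s2_usable = raw_tb / 3

    print("raw capacity      : %5.1f TB" % raw_tb)        # 32.0 TB
    print("scenario 1 usable : %5.1f TB" % s1_usable)     # 12.0 TB
    print("scenario 2 usable : %5.1f TB" % s2_usable)     # ~10.7 TB

    # Size of the failure unit that has to be re-replicated:
    #   scenario 1: a whole node's worth of disks (8 TB raw here)
    #   scenario 2: a single disk (1 TB)
    print("scenario 1 rebuild unit: %.1f TB" % (DISKS_PER_NODE * DISK_TB))
    print("scenario 2 rebuild unit: %.1f TB" % DISK_TB)

So Scenario 2 costs a bit of usable space compared to raid6 + 2x
replication, but the unit that has to be re-replicated after a failure is
a single disk instead of a whole node, which is exactly the behaviour I'm
after.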
Having said that, there are some aspects of how Ceph would behave in
Scenario 2 that I still have to investigate:

* Whether multiple OSDs per node is a well supported option. Do multiple
  OSDs per node play well with each other and with a node's resources?

* Whether there are issues with network ports/addresses when setting up
  more than 1 OSD per node.

* OSD behaviour when getting I/O errors from its drive -- this is really
  the most complex and important one, and the one I would most like to
  hear your opinions about. Usually, in a RAID array, when there is a
  fatal failure, the upper layers just get permanent I/O errors and you
  can assume that storage area is dead and get on with life. However,
  this is frequently not true when you consider single drives as in
  Scenario 2, at least for reads: the drive may return read errors for a
  small region but still be quite OK for the remaining data. So, ideally,
  a Ceph OSD receiving a read error from the filesystem would request a
  copy of the object in question from another OSD and try to rewrite it
  several times before giving up and declaring the drive dead (1). This
  is actually what Linux MD does on recent kernels, and I know from
  experience that it increases array survivability a lot. Background
  data scrubbing would help a lot with the above, and I guess btrfs
  checksumming will simplify things here. (There is a small sketch of
  this logic in the P.S. below.)

Sorry for the huge email, but I hope these are valid points towards
making Ceph more robust, and I'd like to know what you think about them.

Notes:
(1) Better yet, if the error repeats, the OSD could leave the old backing
file alone and try to allocate a new one for that object, thus avoiding
declaring the drive completely dead too early.

Best regards and thanks
Cláudio
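P.S. To make the read-error idea above a bit more concrete, here is a
tiny Python-ish sketch of the policy I'm suggesting. None of this is
actual Ceph code -- the osd object and its read_local /
fetch_from_replica / rewrite_* methods are hypothetical placeholders --
it is just the retry-then-relocate logic, roughly what Linux MD already
does for its member drives:

    MAX_REWRITE_ATTEMPTS = 3

    class DriveFailed(Exception):
        """Only raised after repeated, unrecoverable errors."""

    def read_object(osd, obj):
        """Hypothetical OSD read path with the suggested error handling."""
        try:
            return osd.read_local(obj)
        except IOError:
            # A read error on one region does not mean the whole drive is
            # dead: fetch a good copy of this object from another replica.
            good_copy = osd.fetch_from_replica(obj)

            # Try rewriting in place a few times; a successful write
            # usually means the drive remapped the bad sector.
            for _ in range(MAX_REWRITE_ATTEMPTS):
                try:
                    osd.rewrite_in_place(obj, good_copy)
                    return good_copy
                except IOError:
                    continue

            # Note (1): as a last resort, give the object a fresh backing
            # file elsewhere on the same filesystem instead of declaring
            # the whole drive dead right away.
            try:
                osd.rewrite_to_new_file(obj, good_copy)
                return good_copy
            except IOError:
                raise DriveFailed("persistent I/O errors, take this OSD out")

Background scrubbing would be the natural place to trigger the same path
proactively, instead of waiting for a client read to hit the bad region.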