Re: Questions about a possible Ceph setup

Hi,

On Thu, 2010-05-20 at 17:09 +0000, Sage Weil wrote:
> > > I'd prefer the situation where I'd stripe over all 4 disks, giving me 
> > > an extra pro. In this situation I could configure my node to panic 
> > > whenever a disk starts to give errors, so my cluster can take 
> > > over immediately.
> > > 
> > > Am I right? Is this "the way to go"?
> > 
> > I don't know the way to go. But I think that in the 1st case (1 OSD per 
> > hard disk) when a hard disk fails, it gets replicated elsewhere. During 
> > that time the other 3 OSDs on the same machine are still working fine 
> > and serving requests. And then some time later, you've got a brand new 
> > disk, you shut down the machine, and that's 3 more OSDs down. In the 2nd 
> > case, as soon as 1 disk starts failing, your OSD (which is 4 disks) gets 
> > taken down, that's approximately equivalent to 4 OSDs going down at the 
> > same time if we compare to your 1st case.
> 
> The other 3 osds don't have to rereplicate if you swap the failed disk 
> quickly, or otherwise inform the system that the failure is temporary.  By 
> default there is a 5 minute timeout.  That can be adjusted, or we can add 
> other administrative hooks to 'suspend' any declarations of permanent 
> failure for this sort of case.

Ok, so upping this timeout to something like 10 minutes would be
sufficient for swapping an OSD's disk.

This is done via the mon_osd_down_out_interval parameter, I assume
(found in config.cc).
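
For reference, a minimal sketch of how that could look in ceph.conf,
assuming the name from config.cc can be used as-is and that the value is
in seconds (600 = 10 minutes instead of the 5 minute default):

    [mon]
        ; assumption: value is in seconds, option name taken from config.cc
        mon_osd_down_out_interval = 600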

> 
> > About your 2nd case: as cheap as the hardware may be, having 3 perfectly 
> > operational disks not working has a cost...
> > 
> > What I don't know, in both cases, is: when the machine gets back 
> > online, will it hold the same data as before being shut down, or will 
> > it get entirely new data? I remember that CRUSH was quite stable 
> > and designed to avoid that kind of full cluster rebalance on a 
> > failing/new OSD...
> 
> For independent disks, the old data will still be there on the 3 that 
> weren't replaced.  If the 4 are striped without redundancy, all will be 
> lost due to the single failure.  If they are striped with redundancy (btrfs 
> mirroring, DM/MD or hardware raid, etc.), all 4 disks' data will be there 
> (though there will be some internal rebuild load in the raid).
> 
> > I don't understand how - in the 2nd case - the btrfs pool of 4 disks 
> > would "repair" its missing data, so that the data on the 3 good disks 
> > does not need to get replicated over the network.
> 
> Currently it won't, unless you enable data mirroring (off by default).  
> The problem is that costs you 2x the storage space, which combined with 
> ceph mirroring means you're buying 4x the disk.

My intention was never to use replication in btrfs; I meant using
"RAID-0". So when one disk fails, you lose all the data on that
particular node.
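
Just to make that concrete, the kind of non-redundant pool I have in
mind would be created roughly like this (device names and mount point
are only an example):

    # data and metadata striped (raid0) across all four disks;
    # a single failed disk makes the whole pool unusable
    mkfs.btrfs -d raid0 -m raid0 /dev/sdb /dev/sdc /dev/sdd /dev/sde
    mount /dev/sdb /data/osd0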

> 
> I lean toward a cosd per disk in this situation.  The downside is there is 
> a bit more work in setting up the CRUSH hierarchy and rules properly.  
> User friendly tools to do this would certainly be nice.

Yes, the CRUSH map is somewhat more work, and the Ceph config also needs
some extra attention.
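
For what it's worth, a rough sketch of the per-disk layout as I picture
it in a decompiled CRUSH map (bucket names, weights and the exact rule
syntax are only illustrative and may differ per version):

    # four cosd's on one host, one per disk
    host node0 {
        id -2
        alg straw
        hash 0
        item osd0 weight 1.000
        item osd1 weight 1.000
        item osd2 weight 1.000
        item osd3 weight 1.000
    }

    root default {
        id -1
        alg straw
        hash 0
        item node0 weight 4.000
        # ... more hosts go here
    }

    rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        # spread replicas across hosts, not just across disks
        step chooseleaf firstn 0 type host
        step emit
    }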

About running more than one OSD on one machine: is there a way to bind
an OSD to a specific IP? I can't seem to find any configuration option
for this.

I assume you will need one IP per OSD on that machine?
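
To illustrate what I mean, something along these lines is what I was
looking for in the config (the "public addr" option name and the
addresses are just my guess at how it might look, not something I found
in config.cc):

    [osd0]
        host = node0
        ; hypothetical: bind this daemon to its own address
        public addr = 192.168.0.10

    [osd1]
        host = node0
        public addr = 192.168.0.11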

And my journaling question: any views on that topic?

Thanks!

> 
> sage


-- 
Kind regards,

Wido den Hollander
Head of System Administration / CSO
Telephone Support Netherlands: 0900 9633 (45 cpm)
Telephone Support Belgium: 0900 70312 (45 cpm)
Telephone Direct: (+31) (0)20 50 60 104
Fax: +31 (0)20 50 60 111
E-mail: support@xxxxxxxxxxxx
Website: http://www.pcextreme.nl
Knowledge base: http://support.pcextreme.nl/
Network status: http://nmc.pcextreme.nl
