On Thursday 20 May 2010 11:59:08 Fred Ar wrote:
> --- On Thu, 5/20/10, Wido den Hollander <wido@xxxxxxxxxxxx> wrote:
> > Hi,
> >
> > <snip>
> >
> > Am I right? Is this "the way to go"?
>
> I don't know the way to go.
> But I think that in the 1st case (1 OSD per hard disk), when a hard disk
> fails, its data gets re-replicated elsewhere. During that time the other
> 3 OSDs on the same machine are still working fine and serving requests.
> Then, some time later, you've got a brand new disk, you shut down the
> machine, and that's 3 more OSDs down. In the 2nd case, as soon as 1 disk
> starts failing, your OSD (which spans 4 disks) gets taken down; that's
> roughly equivalent to 4 OSDs going down at the same time, compared to
> your 1st case.
>
> So in both cases you have to shut down 1 machine, but in the 1st case your
> cluster gets re-replicated in 2 stages: first the failing OSD, then the 3
> others (when you change the disk). And before the 2nd stage, the 3 disks
> that stayed alive still work... If the network is a bottleneck, the 1st
> case might be better, because less data gets replicated at the same time.

You could even plan it out and decommission each remaining OSD one at a
time at off-peak hours, to minimize disruption and the risk of stressing
the system so much that something else fails. (A rough sketch of how that
might look is at the end of this mail.)

> About your 2nd case: as cheap as the hardware may be, having 3 perfectly
> operational disks sitting idle has a cost...
>
> What I don't know, in either case, is: when a machine gets back online,
> will it hold the same data as before being shut down, or will it get
> entirely new data? I seem to remember that CRUSH was quite stable and
> designed to avoid that kind of full-cluster rebalance on a failing/new
> OSD...
>
> I don't understand how - in the 2nd case - the btrfs pool of 4 disks
> would "repair" its missing data, so that the data on the 3 good disks
> does not need to get replicated over the network.
>
> > Then there is the journaling topic.
> >
> > When creating a filesystem you get a big warning if the drive cache is
> > enabled on the journaling partition. Imho you don't want to have a
> > drive cache on your journal, but you do want to have one on your data
> > partition.
> >
> > This forces you to use a separate disk for your journaling. Assume that
> > I would have 4 disks in a btrfs stripe; would a fifth disk for
> > journaling only be sufficient? I assume so, since it only has to hold
> > data for a few seconds.
>
> Let me just copy/paste a question asked yesterday on IRC, and Sage's
> answer:
>
> me> sagewk, what is the best: - a journal on a partition, same disk as
> osd data and disk write caching off, or - journal on a filesystem, same
> disk as osd data, write caching on?
> sagewk> partition with write cache off, i suspect.
> sagewk> hopefully someday we'll be able to flush the disk cache from
> userspace and that annoyance will go away
> me> so you don't expect a performance penalty running btrfs on a disk
> with caching deactivated
> sagewk> not really. the writer threads should keep the disk busy, and the
> commit sequence has barriers that flush the cache anyway.
>
> My question was not exactly the same as yours, but I think the answer
> Sage gave is also valid in your case.
>
> Fred
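For what it's worth, here is roughly how I would wire up the journal side
of that. This is only a minimal sketch: the fifth disk at /dev/sde, the
osd data path and the exact ceph.conf option names are assumptions from
memory, so double-check them against the wiki for your version. It just
shows the combination Sage suggested: journal on a raw partition with the
drive cache off, data on the btrfs stripe with its cache left on.

  # turn the write cache off on the journal disk only
  # (/dev/sde is just an example device name);
  # leave the cache enabled on the data disks
  hdparm -W 0 /dev/sde

  # then point the osd at a raw partition on that disk, e.g. in ceph.conf:
  [osd]
      osd data = /data/osd$id      # the btrfs mount (the 4-disk stripe)
      osd journal = /dev/sde1      # raw partition, cache disabled above

Since the journal only has to absorb a few seconds of writes before they
are committed to the data disks, a small partition on that fifth disk
should indeed be plenty.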
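And to make the staged decommission I mentioned above a bit more concrete,
a rough sketch. It assumes the "ceph osd out <id>" command and that osds
2, 3 and 4 are the ones left on the machine with the failed disk; both of
those are assumptions, so adapt the ids and the (crude) "is recovery done"
check to your setup and your version of the tools:

  #!/bin/sh
  # mark the surviving osds on the affected host out, one at a time,
  # so only one osd's worth of data is migrating at any given moment
  for id in 2 3 4; do
      ceph osd out $id
      # crude check: wait until the status no longer mentions
      # degraded/recovering pgs before touching the next osd
      while ceph -s | grep -Eq 'degraded|recovering'; do
          sleep 300
      done
  done

Kick it off at an off-peak hour and the cluster only ever has to
re-replicate one OSD's worth of data at a time, instead of three at once
when you power the machine down to swap the disk.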