Re: [EXTERNAL] Re: 2x replication: A BIG warning

"Will.Boege" <Will.Boege@xxxxxxxxxx> · Wed, 7 Dec 2016 20:43:35 +0000

Thanks for the explanation.  I guess this case you outlined explains why the Ceph developers chose to make this a ‘safe’ default.

Two OSDs are transiently down and the third fails hard. The PGs on the third OSD with no more replicas are marked unfound.  You bring up 1 and 2, and these PGs will remain unfound because they were stale; at that point you can either revert or delete those PGs. Am I understanding that correctly?
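
To make that sequence concrete, here is a minimal Python sketch of the scenario Wido describes below. It is a toy model, not Ceph's peering or recovery logic: the names `Replica`, `ToyPG` and `run_scenario` are made up for illustration, and it assumes every replica that is up is fully in sync.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of a placement group; `version` is the last write it applied."""
    name: str
    version: int = 0
    up: bool = True


@dataclass
class ToyPG:
    """Toy replicated placement group (hypothetical, not Ceph's real logic)."""
    min_size: int
    replicas: list = field(
        default_factory=lambda: [Replica("osd1"), Replica("osd2"), Replica("osd3")]
    )
    acked: int = 0  # number of writes acknowledged to clients

    def write(self) -> bool:
        """Apply one write to every up replica, or block if too few are up."""
        up = [r for r in self.replicas if r.up]
        if len(up) < self.min_size:
            return False  # I/O is blocked, nothing changes on disk
        self.acked += 1
        for r in up:
            r.version = self.acked
        return True

    def lost_writes(self) -> int:
        """Acknowledged writes that no currently-up replica holds."""
        newest = max((r.version for r in self.replicas if r.up), default=0)
        return self.acked - newest


def run_scenario(min_size: int) -> int:
    """Replay the failure sequence and return the number of lost writes."""
    pg = ToyPG(min_size=min_size)
    osd1, osd2, osd3 = pg.replicas
    pg.write()        # all three replicas hold write 1
    osd1.up = False   # osd1 fails
    osd2.up = False   # osd2 goes down transiently (host outage, cable, ...)
    pg.write()        # accepted only if min_size == 1
    osd3.up = False   # osd3 fails hard and never returns
    osd2.up = True    # osd2 comes back with whatever it had on disk
    return pg.lost_writes()


print("min_size=1 lost writes:", run_scenario(1))
print("min_size=2 lost writes:", run_scenario(2))
```

With `min_size=1` the write that only osd3 accepted is gone once osd3 dies; with `min_size=2` that write was never acknowledged, so osd2 still holds everything clients were told was safe.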

I still think there is a cost/benefit conversation you can have around this setting.  A 2 OSD failure situation will be far more probable than the ‘sequence of events’ type failure you outlined above.  Several blocked IO events per year have an availability cost, and you pay it to protect against a data loss event that might be a once every three years type thing.

I guess it’s just where you want to put that needle on the spectrum of availability vs integrity.
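
For reference, the needle in question is a per-pool setting that can be changed at runtime with `ceph osd pool set <pool> min_size <n>`, which is what makes the "run with 2, drop to 1 by hand" approach quoted below workable. A small sketch that only builds the commands and does not run them; the helper `min_size_command` and the pool name `volumes` are hypothetical.

```python
import shlex


def min_size_command(pool: str, min_size: int) -> list:
    """Build the argv for `ceph osd pool set <pool> min_size <n>`; does not run it."""
    if min_size < 1:
        raise ValueError("min_size must be at least 1")
    return ["ceph", "osd", "pool", "set", pool, "min_size", str(min_size)]


# "volumes" is a placeholder pool name; substitute your own.
normal = min_size_command("volumes", 2)     # day-to-day setting
emergency = min_size_command("volumes", 1)  # deliberate, manual override

print(shlex.join(normal))
print(shlex.join(emergency))
```

Keeping the override as an explicit operator step means someone decides, at that moment, whether accepting writes on the last remaining copy is worth the risk.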

On 12/7/16, 2:10 PM, "Wido den Hollander" <wido@xxxxxxxx> wrote:

    > On 7 December 2016 at 21:04, "Will.Boege" <Will.Boege@xxxxxxxxxx> wrote:
    > 
    > 
    > Hi Wido,
    > 
    > Just curious how blocking IO to the final replica provides protection from data loss?  I’ve never really understood why this is a Ceph best practice.  In my head all 3 replicas would be on devices that have roughly the same odds of physically failing or getting logically corrupted in any given minute.  Not sure how blocking IO prevents this.
    > 

    Say, disk #1 fails and you have #2 and #3 left. Now #2 fails leaving only #3 left.

    By blocking I/O you know that #2 and #3 still have the same data. Although #2 failed, it could be that only the host went down and the disk itself is just fine. Maybe the SATA cable broke, you never know.

    If disk #3 now fails you can still continue your operation if you bring #2 back. It has the same data on disk as #3 had before it failed, since you didn't allow any I/O on #3 when #2 went down earlier.

    If you had accepted writes on #3 while #1 and #2 were gone, #2 would have invalid/old data by the time it comes back.

    Writes were made on #3 but that one really broke down. You managed to get #2 back, but it doesn't have the changes which #3 had.

    The result is corrupted data.

    Does this make sense?

    Wido

    > On 12/7/16, 9:11 AM, "ceph-users on behalf of LOIC DEVULDER" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of loic.devulder@xxxxxxxx> wrote:
    > 
    >     > -----Original Message-----
    >     > From: Wido den Hollander [mailto:wido@xxxxxxxx]
    >     > Sent: Wednesday, 7 December 2016 16:01
    >     > To: ceph-users@xxxxxxxx; LOIC DEVULDER - U329683 <loic.devulder@xxxxxxxx>
    >     > Subject: RE:  2x replication: A BIG warning
    >     > 
    >     > 
    >     > > On 7 December 2016 at 15:54, LOIC DEVULDER
    >     > <loic.devulder@xxxxxxxx> wrote:
    >     > >
    >     > >
    >     > > Hi Wido,
    >     > >
    >     > > > As a Ceph consultant I get numerous calls throughout the year to
    >     > > > help people with getting their broken Ceph clusters back online.
    >     > > >
    >     > > > The causes of downtime vary vastly, but one of the biggest causes is
    >     > > > that people use replication 2x. size = 2, min_size = 1.
    >     > >
    >     > > We are building a Ceph cluster for our OpenStack and for data integrity
    >     > reasons we have chosen to set size=3. But we want to continue to access
    >     > data if 2 of our 3 OSD servers are dead, so we decided to set min_size=1.
    >     > >
    >     > > Is it a (very) bad idea?
    >     > >
    >     > 
    >     > I would say so. Yes, downtime is annoying on your cloud, but data loss is
    >     > even worse, much worse.
    >     > 
    >     > I would always run with min_size = 2 and manually switch to min_size = 1
    >     > if the situation really requires it at that moment.
    >     > 
    >     > Losing two disks at the same time is something which doesn't happen that
    >     > often, but if it happens you don't want to modify any data on the only copy
    >     > which you still have left.
    >     > 
    >     > Setting min_size to 1 should be a manual action imho when size = 3 and you
    >     > lose two copies. In that case YOU decide at that moment if it is the
    >     > right course of action.
    >     > 
    >     > Wido
    >     
    >     Thanks for your quick response!
    >     
    >     That makes sense, I will try to convince my colleagues :-)
    >     
    >     Loic
    >     _______________________________________________
    >     ceph-users mailing list
    >     ceph-users@xxxxxxxxxxxxxx
    >     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
    >     
    > 
    >
