Seems like that would be helpful. I'm not really familiar with ceph-disk though.
-Sam

On Wed, Nov 23, 2016 at 5:24 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> Hi Sam,
>
> Would a check in ceph-disk for "nobarrier" in the osd_mount_options_{fstype} variable be a good idea? It could either strip it out or
> fail to start the OSD unless an override flag is specified somewhere.
>
> Looking at the ceph-disk code, I would imagine around here would be the right place to put the check:
> https://github.com/ceph/ceph/blob/master/src/ceph-disk/ceph_disk/main.py#L2642
>
> I don't mind trying to get this done if it's felt to be worthwhile.
>
> Nick
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Samuel Just
>> Sent: 19 November 2016 00:31
>> To: Nick Fisk <nick@xxxxxxxxxx>
>> Cc: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: how possible is that ceph cluster crash
>>
>> Many reasons:
>>
>> 1) You will eventually get a DC-wide power event anyway, at which point probably most of the OSDs will have hopelessly corrupted
>> internal xfs structures (yes, I have seen this happen to a poor soul with a DC with redundant power).
>> 2) Even in the case of a single rack/node power failure, the biggest danger isn't that the OSDs don't start. It's that they *do start*,
>> but have forgotten or arbitrarily corrupted a random subset of the transactions they told other OSDs and clients they had committed.
>> The exact impact would be random, but for sure, any guarantees Ceph normally provides would be out the window. RBD devices could have
>> random byte ranges zapped back in time (not great if they're the offsets assigned to your database or fs journal...) for instance.
>> 3) Deliberately power-cycling a node counts as a power failure if you don't stop services, sync, etc. first.
>>
>> In other words, don't mess with the definition of "committing a transaction" if you value your data.
>> -Sam "just say no" Just
>>
>> On Fri, Nov 18, 2016 at 4:04 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> > Yes, because these things happen
>> >
>> > http://www.theregister.co.uk/2016/11/15/memset_power_cut_service_interruption/
>> >
>> > We had customers who had kit in this DC.
>> >
>> > To use your analogy, it's like crossing the road at traffic lights but
>> > not checking that cars have stopped. You might be OK 99% of the time,
>> > but sooner or later it will bite you in the arse and it won't be pretty.
>> >
>> > ________________________________
>> > From: "Brian ::" <bc@xxxxxxxx>
>> > Sent: 18 Nov 2016 11:52 p.m.
>> > To: sjust@xxxxxxxxxx
>> > Cc: Craig Chi; ceph-users@xxxxxxxxxxxxxx; Nick Fisk
>> > Subject: Re: how possible is that ceph cluster crash
>> >
>> >> This is like your mother telling you not to cross the road when you
>> >> were 4 years of age but not telling you it was because you could be
>> >> flattened by a car :)
>> >>
>> >> Can you expand on your answer? If you are in a DC with AB power,
>> >> redundant UPS, dual feed from the electric company, onsite
>> >> generators and dual-PSU servers, is it still a bad idea?
>> >>
>> >> On Fri, Nov 18, 2016 at 6:52 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> >>>
>> >>> Never *ever* use nobarrier with ceph under *any* circumstances. I
>> >>> cannot stress this enough.
>> >>> -Sam
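
For reference, here is a rough sketch of the kind of check Nick proposes at the top of this thread. This is illustrative only and not actual ceph-disk code: the function name, the choice to raise an error rather than silently strip the option, and the shape of the override flag are all assumptions made for the example.

    def reject_nobarrier(mount_options, allow_unsafe=False):
        """Refuse barrier-disabling mount options unless explicitly overridden.

        mount_options is a comma-separated string, e.g. the value of
        osd_mount_options_xfs.
        """
        opts = [o.strip() for o in mount_options.split(',') if o.strip()]
        # 'nobarrier' is the xfs spelling; ext4 also accepts 'barrier=0'.
        unsafe = [o for o in opts if o in ('nobarrier', 'barrier=0')]
        if not unsafe:
            return mount_options
        if allow_unsafe:
            # Operator explicitly accepted the risk; leave the options alone.
            return mount_options
        raise ValueError('refusing to mount OSD with unsafe mount options: %s'
                         % ','.join(unsafe))

    # reject_nobarrier('rw,noatime')            -> returns 'rw,noatime'
    # reject_nobarrier('rw,noatime,nobarrier')  -> raises ValueError

Failing loudly, rather than silently stripping the option, at least makes it obvious to the operator that the configuration asked for something unsafe.
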
>> >>>
>> >>> On Fri, Nov 18, 2016 at 10:39 AM, Craig Chi <craigchi@xxxxxxxxxxxx> wrote:
>> >>>>
>> >>>> Hi Nick and other Cephers,
>> >>>>
>> >>>> Thanks for your reply.
>> >>>>
>> >>>>> 2) Config Errors
>> >>>>> This can be an easy one to say you are safe from. But I would say
>> >>>>> most outages and data loss incidents I have seen on the mailing
>> >>>>> lists have been due to poor hardware choice or configuring options
>> >>>>> such as size=2, min_size=1 or enabling stuff like nobarriers.
>> >>>>
>> >>>> I am wondering about the pros and cons of the nobarrier option used by Ceph.
>> >>>>
>> >>>> It is well known that nobarrier is dangerous when a power outage
>> >>>> happens, but if we already have replicas in different racks or on
>> >>>> different PDUs, will Ceph reduce the risk of data loss with this option?
>> >>>>
>> >>>> I have seen many performance tuning articles recommending the
>> >>>> nobarrier option for xfs, but not many of them mention the
>> >>>> trade-off of nobarrier.
>> >>>>
>> >>>> Is it really unacceptable to use nobarrier in a production
>> >>>> environment? I would be much grateful if you guys are willing to
>> >>>> share any experiences about nobarrier and xfs.
>> >>>>
>> >>>> Sincerely,
>> >>>> Craig Chi (Product Developer)
>> >>>> Synology Inc. Taipei, Taiwan. Ext. 361
>> >>>>
>> >>>> On 2016-11-17 05:04, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >>>>
>> >>>>> -----Original Message-----
>> >>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
>> >>>>> Behalf Of Pedro Benites
>> >>>>> Sent: 16 November 2016 17:51
>> >>>>> To: ceph-users@xxxxxxxxxxxxxx
>> >>>>> Subject: how possible is that ceph cluster crash
>> >>>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> I have a ceph cluster with 50 TB, with 15 osds. It has been working
>> >>>>> fine for one year and I would like to grow it and migrate all my old
>> >>>>> storage, about 100 TB, to ceph, but I have a doubt. How possible is
>> >>>>> it that the cluster fails and everything goes very bad?
>> >>>>
>> >>>> Everything is possible. I think there are 3 main risks:
>> >>>>
>> >>>> 1) Hardware failure
>> >>>> I would say Ceph is probably one of the safest options with regard to
>> >>>> hardware failures, certainly if you start using 4TB+ disks.
>> >>>>
>> >>>> 2) Config Errors
>> >>>> This can be an easy one to say you are safe from. But I would say most
>> >>>> outages and data loss incidents I have seen on the mailing
>> >>>> lists have been due to poor hardware choice or configuring options such as
>> >>>> size=2, min_size=1 or enabling stuff like nobarriers.
>> >>>>
>> >>>> 3) Ceph Bugs
>> >>>> Probably the rarest, but potentially the scariest, as you have less
>> >>>> control. They do happen and they are something to be aware of.
>> >>>>
>> >>>>> How reliable is ceph?
>> >>>>>
>> >>>>> What is the risk of losing my data? Is it necessary to back up my data?
>> >>>>
>> >>>> Yes, always back up your data, no matter what solution you use. Just as
>> >>>> RAID != backup, neither is Ceph.
>> >>>>
>> >>>>> Regards.
>> >>>>> Pedro.
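
A practical footnote to the config-errors point above: barriers (cache flushes) are what make an acknowledged write actually durable across a power cut, so replicas on other racks or PDUs do not remove the risk; in a DC-wide event, every replica mounted with nobarrier can lose writes it already acknowledged. Below is a small, illustrative sketch of auditing a host for the option. The /var/lib/ceph/osd prefix is only the common default OSD data path, an assumption to adjust for your deployment; this is an operational check of my own, not something shipped with Ceph.

    # Flag mounted filesystems under the OSD data path that carry
    # 'nobarrier' (xfs) or 'barrier=0' (ext4).
    UNSAFE = {'nobarrier', 'barrier=0'}

    def unsafe_osd_mounts(mounts_file='/proc/mounts', prefix='/var/lib/ceph/osd'):
        hits = []
        with open(mounts_file) as f:
            for line in f:
                # /proc/mounts fields: device, mountpoint, fstype, options, ...
                device, mountpoint, fstype, options = line.split()[:4]
                if mountpoint.startswith(prefix) and UNSAFE & set(options.split(',')):
                    hits.append((mountpoint, options))
        return hits

    if __name__ == '__main__':
        for mountpoint, options in unsafe_osd_mounts():
            print('%s mounted with unsafe options: %s' % (mountpoint, options))
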
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com