Re: 2x replication: A BIG warning

Hello,

On Wed, 7 Dec 2016 14:49:28 +0300 Дмитрий Глушенок wrote:

> RAID10 will also suffer from LSE on big disks, won't it?
>
If LSE stands for latent sector errors, then yes, but that's not limited
to large disks per se.
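
For anyone who wants to sanity-check numbers like the 5.8% quoted further
down, here is a minimal sketch of the usual back-of-the-envelope estimate;
the error rate and disk size are assumptions you have to plug in yourself:

    import math

    ber = 1e-15            # assumed unrecoverable read error rate per bit
    bits_read = 8e12 * 8   # bits read to rebuild one 8 TB (8*10^12 byte) disk

    # P(at least one LSE) = 1 - (1 - ber)^bits_read, via log1p/expm1 to
    # avoid floating point cancellation
    p_fail = -math.expm1(bits_read * math.log1p(-ber))
    print("%.1f%%" % (p_fail * 100))   # ~6.2%, same ballpark as the 5.8%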

And you counter it by having another replica and checksums like in ZFS
or hopefully in Bluestore.
The current scrubbing in Ceph is a pretty weak defense against it unless
it's 100% clear from drive SMART checks which drive is the bad source.
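
For reference, checking whether a given drive is the likely bad source and
then repairing the PG looks roughly like this (device name and PG ID are
placeholders); keep in mind that (at least with Filestore) repair tends to
favor the primary's copy, which is why knowing the bad source matters:

    smartctl -a /dev/sdX | egrep -i 'reallocated|pending|uncorrect'
    ceph pg repair 2.1f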

Christian
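
PS: for completeness, acting on the size = 3 / min_size = 2 advice Wido gives
below is a per-pool one-liner (pool name "rbd" is just an example); note that
the cluster needs the extra capacity and will backfill the additional copies:

    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 2
    ceph osd pool get rbd size        # verify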
 
> > On 7 Dec 2016, at 13:35, Christian Balzer <chibi@xxxxxxx> wrote:
> > 
> > 
> > 
> > Hello,
> > 
> > On Wed, 7 Dec 2016 13:16:45 +0300 Дмитрий Глушенок wrote:
> > 
> >> Hi,
> >> 
> >> Let me add a little math to your warning: with an LSE rate of 1 in 10^15 on modern 8 TB disks there is a 5.8% chance of hitting an LSE during recovery of an 8 TB disk. So roughly every 18th recovery will probably fail. Similarly to RAID6 (two parity disks), size=3 mitigates the problem.
> > 
> > Indeed.
> > That math changes significantly of course if you have very reliable,
> > high-endurance, well-monitored and fast SSDs that are not too big.
> > Something that will recover in less than an hour.
> > 
> > So people with SSD pools might have an acceptable risk.
> > 
> > That being said, I'd prefer size 3 for my SSD pool as well; alas, both cost
> > and the increased latency stopped me this time.
> > Next round I'll upgrade my HW requirements and budget.
> > 
> >> By the way - why is it a common opinion that using RAID (RAID6) with Ceph (size=2) is a bad idea? It is cheaper than size=3, all hardware disk errors are handled by the RAID (instead of the OS/Ceph), it decreases the OSD count, adds a battery-backed cache and increases the performance of a single OSD.
> >> 
> > 
> > I did run something like that and, if your IOPS needs are low enough, it
> > works well (the larger the HW cache the better).
> > But once you exceed what the HW cache can coalesce, it degrades badly,
> > something that's usually triggered by very mixed R/W ops and/or deep
> > scrubs.
> > It also depends on your cluster size: if you have dozens of OSDs based on
> > such a design, it will work a lot better than with just a few.
> > 
> > I changed it to RAID10s with 4 HDDs each since I needed the speed (IOPS)
> > and didn't require all the space.
> > 
> > Christian
> > 
> >>> On 7 Dec 2016, at 11:08, Wido den Hollander <wido@xxxxxxxx> wrote:
> >>> 
> >>> Hi,
> >>> 
> >>> As a Ceph consultant I get numerous calls throughout the year to help people with getting their broken Ceph clusters back online.
> >>> 
> >>> The causes of downtime vary vastly, but one of the biggest is that people use 2x replication: size = 2, min_size = 1.
> >>> 
> >>> In 2016 the number of cases I saw where data was lost due to these settings grew exponentially.
> >>> 
> >>> Usually a disk fails, recovery kicks in, and while recovery is happening a second disk fails, causing PGs to become incomplete.
> >>> 
> >>> There have been too many times where I had to use xfs_repair on broken disks and use ceph-objectstore-tool to export/import PGs.
> >>> 
> >>> I really don't like these cases, mainly because they can be prevented easily by using size = 3 and min_size = 2 for all pools.
> >>> 
> >>> With size = 2 you go into the danger zone as soon as a single disk/daemon fails. With size = 3 you always have two additional copies left, thus keeping your data safe(r).
> >>> 
> >>> If you are running CephFS, at least consider running the 'metadata' pool with size = 3 to keep the MDS happy.
> >>> 
> >>> Please, let this be a big warning to everybody who is running with size = 2. The downtime and problems caused by missing objects/replicas are usually big and it takes days to recover from them. But very often data is lost and/or corrupted, which causes even more problems.
> >>> 
> >>> I can't stress this enough. Running with size = 2 in production is a SERIOUS hazard and should not be done imho.
> >>> 
> >>> To anyone out there running with size = 2, please reconsider this!
> >>> 
> >>> Thanks,
> >>> 
> >>> Wido
> >>> _______________________________________________
> >>> ceph-users mailing list
> >>> ceph-users@xxxxxxxxxxxxxx
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> 
> >> --
> >> Dmitry Glushenok
> >> Jet Infosystems
> >> 
> > 
> > 
> > -- 
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> --
> --
> Dmitry Glushenok
> Jet Infosystems
> +7-910-453-2568
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



