Re: 2x replication: A BIG warning

Hi,

The assumptions are:
- OSDs are nearly full
- the HDD vendor is not hiding a much better real LSE (latent sector error) rate, such as 1 in 10^18, behind the spec of "no more than 1 unrecoverable error per 10^15 bits read"

In case of a disk (OSD) failure Ceph has to read the copy of that disk's data from the other nodes (to restore redundancy). The more you read, the higher the chance that an LSE will happen on one of the nodes Ceph is reading from (all of them have the same LSE rate). When an LSE occurs, Ceph cannot recover the data, because the error is unrecoverable and there is no other place to read it from (in contrast to size=3, where the third copy can be used to recover from the error).
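
For reference, a rough back-of-the-envelope sketch of that probability in Python, assuming an 8 TB disk, the 1-in-10^15-bits spec and independent bit errors, lands at roughly 6% per recovery, close to the 5.8% figure quoted below (the exact number depends on how you count the capacity):

import math

def p_lse(capacity_bytes, uber_per_bit=1e-15):
    # Probability of at least one unrecoverable read error when reading
    # capacity_bytes once: 1 - (1 - uber)^bits ~= 1 - exp(-bits * uber)
    bits = capacity_bytes * 8
    return 1.0 - math.exp(-bits * uber_per_bit)

disk = 8e12  # 8 TB disk, in bytes

p2 = p_lse(disk)       # size=2: any LSE on the single surviving copy loses data
p3 = p_lse(disk) ** 2  # size=3: crude upper bound -- both surviving copies would
                       # each need an LSE, and only overlapping objects are lost

print("size=2: %.1f%% chance per 8 TB recovery (about 1 in %.0f)" % (100 * p2, 1 / p2))
print("size=3: below %.2f%%" % (100 * p3))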


On Dec 7, 2016, at 15:07, Wolfgang Link <w.link@xxxxxxxxxxx> wrote:

Hi

I'm very interested in this calculation.
What assumptions have you made?
Network speed, OSD fill level, etc.?

Thanks

Wolfgang

On 12/07/2016 11:16 AM, Дмитрий Глушенок wrote:
Hi,

Let me add a little math to your warning: with an LSE rate of 1 in 10^15 on
modern 8 TB disks there is a 5.8% chance of hitting an LSE during recovery of
an 8 TB disk. So roughly every 18th recovery will fail. Similarly to RAID6
(two parity disks), size=3 mitigates the problem.
By the way, why is it a common opinion that using RAID (RAID6) under Ceph
(with size=2) is a bad idea? It is cheaper than size=3, all hardware disk
errors are handled by the RAID controller (instead of OS/Ceph), it decreases
the OSD count, adds some battery-backed cache and increases the performance
of a single OSD.

On Dec 7, 2016, at 11:08, Wido den Hollander <wido@xxxxxxxx> wrote:

Hi,

As a Ceph consultant I get numerous calls throughout the year to help
people with getting their broken Ceph clusters back online.

The causes of downtime vary widely, but one of the biggest is that
people use 2x replication: size = 2, min_size = 1.

In 2016 the number of cases I saw where data was lost due to these
settings grew exponentially.

Usually a disk fails, recovery kicks in, and while recovery is
happening a second disk fails, causing PGs to become incomplete.

There have been too many times where I had to use xfs_repair on broken
disks and ceph-objectstore-tool to export/import PGs.

I really don't like these cases, mainly because they can be prevented
easily by using size = 3 and min_size = 2 for all pools.
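
For reference, checking and changing this per pool usually looks something
like the following ('mypool' is just a placeholder name):

  ceph osd pool get mypool size
  ceph osd pool set mypool size 3
  ceph osd pool set mypool min_size 2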

With size = 2 you go into the danger zone as soon as a single
disk/daemon fails. With size = 3 you still have two copies left after a
single failure, which keeps your data safe(r).

If you are running CephFS, at least consider running the 'metadata'
pool with size = 3 to keep the MDS happy.

Please let this be a big warning to everybody who is running with
size = 2. The downtime and problems caused by missing objects/replicas
are usually big, and it takes days to recover from them. Very often
data is also lost and/or corrupted, which causes even more problems.

I can't stress this enough. Running with size = 2 in production is a
SERIOUS hazard and should not be done imho.

To anyone out there running with size = 2, please reconsider this!

Thanks,

Wido

--
Dmitry Glushenok
Jet Infosystems






--
Dmitry Glushenok
Jet Infosystems

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
