Re: 2x replica with NVMe

Whether or not 2x replication is viable has little to do with the technology and EVERYTHING to do with your use case.  How redundant is your hardware, for instance?  You might have the best drives in the world, ones that will never fail after constant use over 100 years... but if you don't have redundant power, a bonded network, are running on used hardware, or are in a cheap datacenter that doesn't guarantee 99.999% uptime, etc., then you are going to lose hosts regardless of what your disks are.

As Wido was quoted saying, the biggest problem with 2x replication is that people use it with min_size=1.  That is cancer and will eventually leave you with inconsistent data and most likely data loss.  OTOH, min_size=2 and size=2 means that you need to schedule downtime to restart your Ceph hosts for kernel updates, upgrading Ceph, restarting the daemons with new config file options that can't be injected, etc.  You can get around that by temporarily dropping to min_size=1 while you perform the scheduled maintenance.  If for any reason you ever lose a server, an NVMe drive, etc. while running with 2 replicas and min_size=2, then you have unscheduled downtime.
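
For what it's worth, that maintenance dance is just a couple of pool settings.  A rough sketch (the pool name "rbd" is only an example, adjust for your pools):

    # normal operation: both copies required before I/O is acknowledged
    ceph osd pool set rbd size 2
    ceph osd pool set rbd min_size 2

    # right before planned maintenance on one host
    ceph osd pool set rbd min_size 1

    # ...reboot/upgrade the host, wait for HEALTH_OK...

    # put the safety net back as soon as recovery finishes
    ceph osd pool set rbd min_size 2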

Running with 2x replication right now is possible.  For that matter, people run with 1x replication all the time (especially in testing).  You will never get anyone to tell you that it is the optimal configuration, because that is and will always be a lie for general use cases no matter how robust and bulletproof your drives are.  The primary problems are that nodes need to be restarted, power goes out, and Murphy's law applies.  Is your use case such that losing a certain percentage of data is acceptable?  Then run size=2 and min_size=1 and assume that you will eventually lose data.  Does your use case allow for unexpected downtime?  Then run size=2 and min_size=2.  If you cannot lose data no matter what and must maintain as high an uptime as possible, then you should be asking questions about multi-site replication and the downsides of running 4x replication... 2x replication shouldn't even cross your mind.
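
If you want to see where a cluster currently stands before making that call, the settings are easy to inspect (again, "rbd" is just an example pool name):

    ceph osd pool get rbd size
    ceph osd pool get rbd min_size
    # or dump the replication settings of every pool at once
    ceph osd dump | grep 'replicated size'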

Now I'm assuming that you're broaching the topic because a 3x replica NVMe cluster is super expensive.  I think all of us feel your pain there, otherwise we'd all be running it.  A topic that has come up on the ML a couple of times is to use primary_affinity and an interesting arrangement of buckets in your CRUSH map to build a cluster with both SSD/NVMe storage and HDD storage in a way that your data is well protected, but all reads and writes hit flash.  What you would do here is create 3 "racks" in your CRUSH map and use a rack failure domain.  1 rack has all of your SSD/NVMe hosts, and your HDD hosts with SSD/NVMe journals (matching what the flash nodes have) are split between the other 2 racks.  Now you set primary_affinity=0 for all of your HDD OSDs, forcing Ceph to use an SSD/NVMe OSD as the primary for every PG.  What you end up with is a 3-replica setup where 1, and only 1, copy goes onto flash and 2 copies go onto HDDs.

Once you have this set up, writes still go to every OSD in a PG, so 2 of the 3 writes land on HDD-backed OSDs, but each write acks as soon as it hits the SSD/NVMe journal, so your writes effectively complete at flash speed.  Reads are only ever served by the primary OSD of a PG, so all reads come from the SSD/NVMe OSDs.  Recovery/backfilling will be slower since you'll be reading a fair amount of data from HDDs, but that's a fairly insignificant sacrifice for what you are gaining.  For every 1TB of flash storage you need 2TB of HDD storage; any HDD capacity beyond that ratio is wasted and won't be used.
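
To make that concrete, here is a rough sketch of the moving parts.  The bucket and host names (rack-flash, rack-hdd-a, nvme-host-1, ...) are made up for illustration, and on pre-Luminous releases you also need "mon osd allow primary affinity = true" in your ceph.conf before the affinity setting takes effect:

    # one logical "rack" for the flash hosts, two for the HDD hosts
    # (each HDD host still has SSD/NVMe journals)
    ceph osd crush add-bucket rack-flash rack
    ceph osd crush add-bucket rack-hdd-a rack
    ceph osd crush add-bucket rack-hdd-b rack
    ceph osd crush move rack-flash root=default
    ceph osd crush move rack-hdd-a root=default
    ceph osd crush move rack-hdd-b root=default
    ceph osd crush move nvme-host-1 rack=rack-flash
    ceph osd crush move hdd-host-1 rack=rack-hdd-a
    ceph osd crush move hdd-host-2 rack=rack-hdd-b

    # CRUSH rule (decompiled map syntax) putting one replica in each rack
    rule flash_primary {
        ruleset 1
        type replicated
        min_size 3
        max_size 3
        step take default
        step chooseleaf firstn 0 type rack
        step emit
    }

    # and make sure the HDD OSDs are never chosen as primary
    ceph osd primary-affinity osd.10 0
    ceph osd primary-affinity osd.11 0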

To recap... the problem with 2x replication isn't the disk failure rate or how bulletproof your hardware is.  Unless downtime or data loss is acceptable, just don't talk about 2x replication.  But you can have 3 replicas that run as fast as all-flash while only having 1 replica on flash storage plus enough flash journals for the slower HDD replicas.  The trade-off is that you limit future customizations to your CRUSH map if you later want to configure real logical racks for a growing/large cluster, and you generally have increased complexity when adding new storage nodes.

If downtime or data loss is not an acceptable running state, and running with a complex CRUSH map is not viable because of who will be in charge of adding storage... then you're back to 3x replicas on the same type of storage.
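
At that point the pool settings are the boring, safe defaults, e.g. (pool name again only an example):

    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 2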

On Thu, Jun 8, 2017 at 9:32 AM <info@xxxxxxxxx> wrote:
I'm thinking of delaying this project until the Luminous release to have Bluestore support.

So are you telling me that checksum capability will be present in Bluestore, and that therefore using NVMe with 2x replication for production data could be considered possible?



From: "nick" <nick@xxxxxxxxxx>
To: "Vy Nguyen Tan" <vynt.kenshiro@xxxxxxxxx>, info@xxxxxxxxx
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Thursday, June 8, 2017 3:19:20 PM
Subject: RE: 2x replica with NVMe

There are two main concerns with using 2x replicas: recovery speed and coming across inconsistent objects.

 

With spinning disks, the ratio of their capacity to their access speed means recovery can take a long time, which increases the chance that additional failures happen during the recovery process. NVMe will recover a lot faster, so this risk is greatly reduced, meaning that using 2x replicas may be possible.

 

However, with Filestore there are no checksums, so in the event of inconsistent objects there is no way to determine which copy is corrupt. So even with NVMe, I would not feel 100% confident using 2x replicas. With Bluestore this problem will go away.
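
(For what it's worth, Bluestore checksums every write by default; as of the Bluestore work targeted at Luminous the relevant ceph.conf knob looks roughly like the following, and the crc32c default is normally what you want:)

    [osd]
    # Bluestore data checksum algorithm; crc32c is the default
    bluestore csum type = crc32c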

 

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Vy Nguyen Tan
Sent: 08 June 2017 13:47
To: info@xxxxxxxxx
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: 2x replica with NVMe

 

Hi,

 

I think the risks of 2x replication on HDD and SSD are the same. You should read the quote from Wido below:

 

""Hi,


As a Ceph consultant I get numerous calls throughout the year to help people with getting their broken Ceph clusters back online.

The causes of downtime vary widely, but one of the biggest is that people use 2x replication: size = 2, min_size = 1.

In 2016 the number of cases I handled where data was lost due to these settings grew exponentially.

Usually a disk fails, recovery kicks in, and while recovery is happening a second disk fails, causing PGs to become incomplete.

There have been too many times where I had to use xfs_repair on broken disks and ceph-objectstore-tool to export/import PGs.

I really don't like these cases, mainly because they can be prevented easily by using size = 3 and min_size = 2 for all pools.

With size = 2 you go into the danger zone as soon as a single disk/daemon fails. With size = 3 you always have two additional copies left thus keeping your data safe(r).

If you are running CephFS, at least consider running the 'metadata' pool with size = 3 to keep the MDS happy.

Please, let this be a big warning to everybody who is running with size = 2. The downtime and problems caused by missing objects/replicas are usually big and it takes days to recover from those. But very often data is lost and/or corrupted which causes even more problems.

I can't stress this enough. Running with size = 2 in production is a SERIOUS hazard and should not be done imho.

To anyone out there running with size = 2, please reconsider this!

Thanks,

Wido""

 

On Thu, Jun 8, 2017 at 5:32 PM, <info@xxxxxxxxx> wrote:

Hi all,

 

I'm going to build an all-flash Ceph cluster. Looking around the existing documentation, I see lots of guides and use case scenarios from various vendors testing Ceph with 2x replication.

 

Now, I'm an old-school Ceph user; I've always considered 2x replication really dangerous for production data, especially when neither OSD can tell which replica is the good one.

Why do all the NVMe storage vendors and partners use only 2x replication?

They claim it's safe because NVMe is better at handling errors, but I usually don't trust marketing claims :)

Is it true? Can someone confirm that NVMe is different enough from HDD that 2x replication can be considered safe for production?

 

Many Thanks

Giordano


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 



