Re: Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

Hi,

On 10/01/2017 at 19:32, Brian Andrus wrote:
[...]


I think the main point I'm trying to address is - as long as the backing OSD isn't egregiously handling large amounts of writes and it has a good journal in front of it (that properly handles O_DSYNC [not D_SYNC as Sebastien's article states]), it is unlikely inconsistencies will occur upon a crash and subsequent restart.

I don't see how you can assume it is "unlikely". If you need SSDs you are probably handling relatively large amounts of I/O (so large amounts of writes aren't unlikely), otherwise you would have used cheap 7200rpm or even slower drives.
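
For what it's worth, a drive's O_DSYNC behaviour is easy to measure. Below is a minimal sketch (assuming Linux, Python 3, and a hypothetical scratch path on the SSD under test) of the kind of test the usual fio-based journal benchmarks boil down to: a DC-class SSD with power-loss protection sustains thousands of these per second, while a consumer model either collapses to a few hundred or posts suspiciously high numbers because it acknowledges writes from volatile cache.

    # Minimal sketch, assuming Linux + Python 3. TEST_PATH is a hypothetical
    # scratch file on the SSD under test -- adjust before running.
    import os, time

    TEST_PATH = "/mnt/ssd-under-test/odsync-test.dat"
    WRITE_SIZE = 4096          # 4 KiB, roughly the granularity of journal writes
    COUNT = 1000

    fd = os.open(TEST_PATH, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    os.ftruncate(fd, COUNT * WRITE_SIZE)   # preallocate so mostly data syncs are timed
    buf = b"\0" * WRITE_SIZE

    start = time.monotonic()
    for i in range(COUNT):
        os.pwrite(fd, buf, i * WRITE_SIZE)  # each call returns only after the sync
    elapsed = time.monotonic() - start
    os.close(fd)

    print("%.0f O_DSYNC writes/s, %.2f ms average latency"
          % (COUNT / elapsed, 1000 * elapsed / COUNT))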

Remember that in the default configuration, if any 3 OSDs fail at the same time you risk losing data. For <30 OSDs and size=3 this is highly probable, as there are only a few thousand possible combinations of 3 OSDs (and you typically have a thousand or two PGs picking OSDs in a more or less random pattern).
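
To put rough numbers on this (a crude model assuming 30 OSDs, 2048 PGs, and treating PG placements as independent uniform picks, which CRUSH is not exactly, so take it as an order-of-magnitude illustration only):

    # Back-of-the-envelope check: how likely is it that a random set of 3
    # failed OSDs covers all 3 copies of at least one PG?
    from math import comb

    osds, pgs = 30, 2048
    triples = comb(osds, 3)                 # 4060 possible 3-OSD failure sets
    p_no_pg_hit = (1 - 1 / triples) ** pgs  # chance the failed triple maps to no PG
    print("%d possible triples; P(some PG loses all 3 copies) ~ %.0f%%"
          % (triples, 100 * (1 - p_no_pg_hit)))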

With SSDs that don't handle write barriers properly, I wouldn't bet on recovering the filesystems of all OSDs after a cluster-wide power loss that shuts down all the SSDs at the same time... In fact, since the hardware lies about which data is actually stored, the filesystem might not even detect the crash properly and might replay its own journal on top of outdated data, leading to unexpected results.
So losing data is a real possibility, and testing for it is almost impossible (you'd have to reproduce all the different access patterns your Ceph cluster could experience at the time of a power loss and trigger a power loss in each case).
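
The kind of test meant here is the classic acknowledged-write check, in the spirit of tools like diskchecker.pl: log every record only after its synchronous write returns, pull the plug, then verify the log against the disk. A rough sketch with hypothetical paths, just to show the shape of it; note that each run still only covers one access pattern, which is exactly the problem.

    # Assumptions: DATA_PATH is a scratch file on the SSD under test, LOG_PATH
    # lives on a *different*, trusted machine or disk that keeps power.
    # Run writer(), cut power to the SSD host mid-run, reboot, run verifier():
    # every logged record was acknowledged as durable by an O_DSYNC write and
    # must still be intact, otherwise the drive discards "committed" data.
    import os, struct, zlib

    DATA_PATH = "/mnt/ssd-under-test/powerloss.dat"
    LOG_PATH = "/mnt/trusted/acknowledged.log"
    RECORD = 4096

    def writer(count=1000000):
        fd = os.open(DATA_PATH, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
        log = open(LOG_PATH, "a", buffering=1)      # line-buffered ack log
        for seq in range(count):
            payload = struct.pack("<Q", seq) * (RECORD // 8)
            os.pwrite(fd, payload, seq * RECORD)    # returns only once "durable"
            log.write("%d %08x\n" % (seq, zlib.crc32(payload)))

    def verifier():
        fd = os.open(DATA_PATH, os.O_RDONLY)
        for line in open(LOG_PATH):
            seq, crc = line.split()
            block = os.pread(fd, RECORD, int(seq) * RECORD)
            if zlib.crc32(block) != int(crc, 16):
                print("LOST acknowledged record", seq)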


Therefore - while not ideal to rely on journals to maintain consistency,

Ceph journals aren't designed to maintain the consistency of the backing filesystem. They *might* restrict the access patterns to the filesystems in such a way that running fsck on them after a "let's throw away committed data" crash has better chances of restoring enough data, but if so it's only a happy coincidence (and you would have to run these fscks *manually*, as the filesystem can't detect the inconsistencies by itself).

that is what they are there for.

No. They are there for Ceph's internal consistency, not the consistency of the filesystem backing the filestore. Ceph relies on both the journal and a filesystem that maintains its own internal consistency and supports syncfs; if either the journal or the filesystem fails, the OSD is damaged. If 3 OSDs are damaged at the same time on a size=3 pool, you enter "probable data loss" territory.
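
To make the division of labour concrete, here is a toy sketch (my own illustration, not Ceph code) of the ordering the journal buys you, and of why it assumes the underlying devices keep their promises:

    # A transaction is made durable in the journal (O_DSYNC) before it is
    # acknowledged, then applied to the backing filesystem, whose own
    # durability is deferred to a periodic sync. If either the journal device
    # or the filesystem silently drops writes it claimed to have committed,
    # replay starts from a wrong state and the OSD is damaged.
    import json, os

    class ToyFilestore:
        def __init__(self, journal_path, data_dir):
            self.journal_fd = os.open(journal_path,
                                      os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC,
                                      0o600)
            self.data_dir = data_dir

        def submit(self, obj, offset, data):
            # 1. Journal first: durable before the write is acknowledged.
            entry = json.dumps({"obj": obj, "off": offset, "data": data.hex()})
            os.write(self.journal_fd, (entry + "\n").encode())
            # 2. Apply to the backing filesystem; durability deferred until sync().
            path = os.path.join(self.data_dir, obj)
            mode = "r+b" if os.path.exists(path) else "wb"
            with open(path, mode) as f:
                f.seek(offset)
                f.write(data)

        def sync(self):
            # 3. Flush the filestore filesystem (os.sync() as a stand-in for
            #    syncfs()); only then could journal entries be trimmed.
            os.sync()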

There is a situation where "consumer-grade" SSDs could be used as OSDs. While not ideal, it can and has been done before, and may be preferable to tossing out $500k of SSDs (Seen it firsthand!)

For these I'd like to know:
- which SSD models were used?
- how long did the SSDs survive (some consumer SSDs not only lie to the system about write completions, they usually don't handle large amounts of writes nearly as well as DC models)?
- how many cluster-wide power losses did the cluster survive?
- what were the access patterns on the cluster during the power losses?

Unless a model not rated for sync writes has survived dozens of power losses on clusters under heavy load, without any problem detected in the following week (think deep-scrub), using it is playing Russian roulette with your data.

AFAIK there have only been reports of data loss and/or heavy maintenance down the line when people tried to use consumer SSDs (admittedly mainly for journals). I've yet to see a long-running, robust cluster built with consumer SSDs.

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
