Hi,
On 10/01/2017 at 19:32, Brian Andrus wrote:
I don't see how you can guess whether it is "unlikely". If you need SSDs you are probably handling relatively large amounts of accesses (so large amounts of writes aren't unlikely either), or you would have used cheap 7200rpm or even slower drives.

Remember that in the default configuration, if any 3 OSDs fail at the same time, you risk losing data. For <30 OSDs and size=3 this is highly probable, as there are only a few thousand possible combinations of 3 OSDs (and you typically have a thousand or two PGs picking OSDs in a more or less random pattern).

With SSDs not handling write barriers properly, I wouldn't bet on recovering the filesystems of all OSDs after a cluster-wide power loss shutting down all the SSDs at the same time. In fact, as the hardware lies about what data is actually stored, the filesystem might not even detect the crash properly and might replay its own journal on top of outdated data, leading to unexpected results. So losing data is a real possibility, and testing for it is almost impossible (you would have to reproduce all the access patterns your Ceph cluster could experience at the time of a power loss and trigger a power loss in each case).
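To give a rough feel for the odds, here is a back-of-the-envelope sketch (the 2048 PG count and the OSD counts are assumptions picked for illustration, not measurements from a real cluster, and PG placement is approximated as independent uniform picks):

    # Rough estimate: chance that losing 3 OSDs at once destroys at least
    # one PG, assuming size=3 and PGs mapped to (roughly) random 3-OSD sets.
    # All numbers below are illustrative assumptions.
    from math import comb

    def p_loss(n_osds, n_pgs=2048, size=3):
        combos = comb(n_osds, size)  # distinct possible 3-OSD sets
        # probability that at least one PG uses exactly the 3 failed OSDs
        return 1 - (1 - 1 / combos) ** n_pgs

    for n in (12, 20, 30):
        print(f"{n} OSDs: ~{p_loss(n):.0%} chance a triple failure hits a PG")

On a small cluster a triple failure almost certainly maps onto at least one PG, and even around 30 OSDs the odds stay far too high to dismiss.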
Ceph journals aren't designed to maintain the consistency of the filesystem backing the filestore. They *might* restrict the access patterns to the filesystems in such a way that running fsck on them after a "let's throw away committed data" crash has better chances of restoring enough data, but if that's the case it's only a happy coincidence (and you will have to run these fscks *manually*, as the filesystem can't detect the inconsistencies by itself).
No. They are there for Ceph's internal consistency, not for the consistency of the filesystem backing the filestore. Ceph relies on both the journal and a filesystem able to maintain its own internal consistency and supporting syncfs; if either the journal or the filesystem fails, the OSD is damaged. If 3 OSDs are damaged at the same time on a size=3 pool, you enter "probable data loss" territory.
For these I'd like to know:
- which SSD models were used?
- how long did the SSDs survive (some consumer SSDs not only lie to the system about write completions, they usually don't handle large amounts of writes nearly as well as DC models)?
- how many cluster-wide power losses did the cluster survive?
- what were the access patterns on the cluster during the power losses?

If, for a model not guaranteed for sync writes, there haven't been dozens of power losses on clusters under heavy load without any problem detected in the following week (think deep-scrub), using them is playing Russian roulette with your data. AFAIK there have only been reports of data losses and/or heavy maintenance later on when people tried to use consumer SSDs (admittedly mainly for journals). I've yet to spot long-running, robust clusters built with consumer SSDs.

Lionel