Since I'm trying to write an article about distributed systems, this thread is really valuable to me. I'm quite sure my thoughts are not always perfect. #inline

When I played with Lustre (2.6.x) I saw data inconsistency many times, especially in metadata, and with GPFS (2.6.18) as well; its CoW was really good but sometimes really bad. Both of them were running on kernel 2.6.32-431 and connected to clients over FDR InfiniBand. My conclusions at the moment are inline below; in summary, there are a bunch of things that need to be considered.

----- Original Message -----
From: "Gregory Farnum" <gfarnum@xxxxxxxxxx>
To: "Owen Synge" <osynge@xxxxxxxx>
Cc: "zhao mingyue" <zhao.mingyue@xxxxxxx>, "Xiaoxi Chen" <xiaoxi.chen@xxxxxxxxx>, "huang jun" <hjwsm1989@xxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx
Sent: Wednesday, September 16, 2015 1:09:20 AM
Subject: Re: Brewer's theorem also known as CAP theorem

Congratulations, you've just hit on my biggest pet peeve in distributed systems discussions. Sorry if this gets a little hot. :)

On Tue, Sep 15, 2015 at 5:38 AM, Owen Synge <osynge@xxxxxxxx> wrote:
>> On Mon, 14 Sep 2015 13:57:26 -0700
>> Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>
>> The OSD is supposed to stay down if any of the networks are missing.
>> Ceph is a CP system in CAP parlance; there's no such thing as a CA
>> system. ;)
>>
>> I know I am being fussy, but within my team your email was cited to say
>> that you cannot consider Ceph a CA system. Hence I make my argument in
>> public so I can be humbled in public.
>>
>> Just to clarify your opinion, I cite
>>
>> http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
>>
>> which suggests:
>>
>> <quote>
>> The CAP theorem states that any networked shared-data system can have
>> at most two of three desirable properties:
>>
>> * consistency (C), equivalent to having a single up-to-date copy of the data;
>> * high availability (A) of that data (for updates);
>> * tolerance to network partitions (P).
>> </quote>
>>
>> So I dispute that a CA system cannot exist.
>
> Right, you can create a system that assumes no partitions and thus is
> always consistent and available in the absence of partitions. The
> problem is that partitions *do* exist and happen. The stereotypical
> example is that a simple hard power-off on one server is
> indistinguishable from a (very small) partition to the other nodes.
> Even leaving aside stuff like that, networks partition, or undergo
> partition-like events (huge packet loss over some link). When that
> happens, your system is going to...do something. If you don't want
> that something to be widespread data corruption, it will be designed
> to handle partitions.
>
> There are no real "CA" systems in the world.
>
>> I think you are too absolute even in interpretation of this vague
>> theory. A further quote from the author of said theorem from the same
>> article:
>>
>> <quote>
>> The "2 of 3" formulation was always misleading because it tended to
>> oversimplify the tensions among properties.
>> </quote>
>
> Right. This is the cause of a lot of problems for students of
> distributed systems. Another quote from that article:
>
>> CAP prohibits only a tiny part of the design space: perfect
>> availability and consistency in the presence of partitions, which
>> are rare.
>
> Lots of users forget that the CAP theorem is very precise, and that
> precision is important.
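This matches what I saw on Lustre/GPFS: from the other nodes' point of view, a server that lost power and a server behind a small partition look exactly the same. A minimal sketch of why, in Python (the names here are made up for illustration, not anything from Ceph's code): a heartbeat-based failure detector only ever observes "no message arrived in time", whichever of the two actually happened.

import time

class HeartbeatMonitor:
    """Toy failure detector: suspects a peer after a silence timeout."""

    def __init__(self, timeout_s=5.0):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node id -> monotonic timestamp of last heartbeat

    def heartbeat(self, node_id):
        # Called whenever a heartbeat message from node_id arrives.
        self.last_seen[node_id] = time.monotonic()

    def suspected_down(self, node_id, now=None):
        # True if we have not heard from node_id recently. A crashed node and
        # a node cut off by a partition produce exactly the same evidence here.
        now = time.monotonic() if now is None else now
        seen = self.last_seen.get(node_id)
        return seen is None or (now - seen) > self.timeout_s

mon = HeartbeatMonitor(timeout_s=5.0)
mon.heartbeat("node-a")  # "node-a" is a hypothetical peer name
print(mon.suspected_down("node-a", now=time.monotonic() + 10.0))  # True: crash or partition?

So any design that calls itself "CA" still has to decide what the callers of suspected_down() do when it returns True.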
> Some quick-and-dirty (but precise enough) definitions:
>
> Available: any request received by a well-behaved node of the system
> will receive a (correct!) response (within some bounded amount of
> time)

To realize this as much as possible, we have to have one real master node
which holds the true data and metadata coming directly from clients.

> Consistent: All nodes in the system agree on the status of a
> particular piece of state. (Eg, that an object is at version 12345 and
> not 12344.)

Inside this node, every single device must be redundant. This is to keep
the distributed system simple.

> Partition-tolerant: the system continues to function correctly in the
> presence of message loss between some set of nodes.

It's almost impossible to provide a 100% guarantee of partition
tolerance, because there is always a latency difference between:
  CPU and RAM
  RAM and HDD
  ...
  the network between each endpoint
  the protocol itself (TCP/IP)
  ...
  the 512-byte limitation for async
  ...
Etc., etc...

To keep the system as simple as possible and still provide 99.999...%
data consistency, including metadata, in a distributed system, it's
necessary to have one real master node. That node does not need to serve
all data permanently, but it must be responsible for ensuring that none
of the distributed nodes hold inconsistent data. And it's not necessary
to use really, really expensive hardware for the master node.

> As I understand it:
>
> Ceph as a cluster always provides "Consistency". (or else you found a
> bug)
>
> If a ceph cluster is operating it will always provide acknowledgment
> (it may block) to the client if the operation has succeeded
> or failed hence provides "availability".
>
> This is the part you're missing: blocking a request is not allowed
> under the CAP theorem's definition of availability. If a PG might have
> been updated by a set of nodes which are now partitioned away, we
> can't respond to the client request (despite it being a valid,
> well-behaved request) and so the system is not big-A Available.
>
> Now, we are little-a available for *other* kinds of work. The cluster
> keeps going and will process requests for all the state which it knows
> it is authoritative for. But we do not satisfy the availability
> criteria of the CAP theorem. This is part of the wide design space
> which CAP does not demonstrate is impossible.
>
>> if a ceph cluster is partitioned, only one partition will continue
>> operation, hence you cannot consider the system "partition" tolerant
>> as multiple parts of the system cannot operate when partitioned.
>
> Nope, that's not what partition-tolerant means in the context of the
> CAP theorem. This shortcut — in which we treat one side of the
> partition as a misbehaving node — is pretty common, but what you're
> citing here is actually associated with the "Available" side of the
> spectrum.
>
> The proof of the CAP theorem is relatively elegant and can be
> summarized as: if a node disappears/is partitioned, you must either
> ignore any updates it handled (sacrificing consistency) or refuse to
> answer until the node reappears (sacrificing availability). No
> combination of clever data replication or distribution can eliminate
> the choice a system has to make when it goes to look at data and
> discovers the data isn't there.
>
> Now, that discussion of Ceph's classification under the CAP theorem
> obviously leaves out lots of stuff: Ceph is small-a available in that
> it takes a lot more than one disk failure to render data inaccessible!
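To check that I understand the proof sketch and the big-A point above: when a replica may have missed updates from a peer that is now unreachable, it either refuses to answer (the consistent choice, which is what blocking costs) or answers with whatever it has, possibly stale. Here is a toy sketch of that choice; the Replica class, the "mode" flag and the version numbers are hypothetical illustrations of my own, not Ceph's actual I/O path.

class Unavailable(Exception):
    """Raised instead of answering: the consistent-but-blocking choice."""

class Replica:
    def __init__(self, mode="CP"):
        self.mode = mode           # "CP": refuse when unsure; "AP": answer anyway
        self.store = {}            # key -> (version, value) held locally
        self.partitioned = False   # True while a peer with possibly newer data is unreachable

    def write_local(self, key, version, value):
        self.store[key] = (version, value)

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            # We cannot prove our copy is the latest (the peer may hold 12345
            # while we only hold 12344), so we refuse: not big-A Available.
            raise Unavailable("cannot confirm latest version of %r" % key)
        # The AP choice: return what we have, which may be stale.
        return self.store.get(key)

r = Replica(mode="CP")
r.write_local("obj", 12344, "old")
r.partitioned = True   # the node that acknowledged version 12345 is cut off
try:
    r.read("obj")
except Unavailable as e:
    print("blocked:", e)

No amount of clever replication removes that branch; it only changes how often the partitioned condition is true.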
> Much has been made of this design space, with many vendors and
> developers pretending that this means they've somehow beaten CAP. But
> when faced with requests for data that are not currently accessible,
> Ceph chooses to block and remain Consistent over making up a value and
> remaining Available; every distributed system must make that choice
> one way or another. Most CP systems endeavor to remain available for
> as long as possible; AP systems expend varying amounts of effort to be
> consistent up until they're faced with either blocking or making up a
> value.
>
> -Greg
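A closing note for my article, so I don't misrepresent the spectrum Greg describes: the CP-versus-AP lean can be made concrete with quorum sizes. This is generic quorum arithmetic under my own naming (not a Ceph mechanism and not part of the CAP proof): with N copies, read quorum R and write quorum W, picking R + W > N means every read overlaps every write, which preserves consistency but blocks more easily; smaller quorums keep answering through more failures at the cost of possibly stale reads.

def read_write_quorums_overlap(n, r, w):
    # Every read quorum intersects every write quorum iff r + w > n.
    return r + w > n

def tolerated_unreachable(n, quorum):
    # How many replicas may be unreachable while a quorum can still form.
    return n - quorum

N = 3
print(read_write_quorums_overlap(N, r=2, w=2))  # True  -> leans CP
print(read_write_quorums_overlap(N, r=1, w=1))  # False -> leans AP, stale reads possible
print(tolerated_unreachable(N, quorum=2))       # 1 replica may be down and we still answer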