Hey folks,

This is a report of the discussion regarding DHT2 that took place with Jeff Darcy and Shyam in Bangalore during February. Many of you have already been following (or are aware, to some extent, of) what's planned for DHT2, but there's no consolidated document that explains the design in detail; just pieces of information from various talks and presentations. This report will be followed up with the design doc, so hang in for that.

[Short report of the discussion and what to expect in the design doc..]

DHT2 server component (MDS)
---------------------------

The DHT2 client would act as a forwarder/router to the server component. It's the server component that drives an operation. Operations may be compound in nature - local and/or remote. The "originator" server may forward (sub)operations to another server in the cluster, depending on the type of the (original) operation. The server component also takes care of serializing operations where required, to ensure correctness and resilience of concurrent operations.

For crash consistency, DHT2 would first log the operation it's about to perform in a write-ahead log (WAL, or journal) and then perform the operation(s) on the store. WAL records are marked completed once the operations are durable on the store. Pending operations are replayed on server restart. (A rough sketch of this record lifecycle appears further below.)

Cluster Map
-----------

This is a versioned "view" of the current state of a cluster. State here refers to the nodes (or probably sub-volumes?) which form the cluster, along with their operational state (up, down) and weightage. The cluster map is used to distribute objects to a set of nodes. Every entity in the cluster (clients, servers) keeps a copy of the cluster map and consults it whenever required (e.g., during distribution).

Cluster map versions are monotonically increasing; an epoch number is best suited for such a versioning scheme. A master copy of the map is maintained by GlusterD (in etcd). Operations performed by clients (and servers) carry the epoch number of their cached version of the cluster map. Servers use this information to validate the freshness of the sender's cluster map. (A small sketch of this check appears further below.)

Write Ahead Log
---------------

DHT2 would make use of a journal to ensure crash consistency. It's tempting to reuse the journaling translator developed for NSR (FDL: Full Data Logging xlator), but doing so would require FDL to handle special cases for DHT2. Furthermore, there are plans to redesign quota to rely on journals (journaled quota). Also, designing a system with such tight coupling makes it hard to switch to alternate implementations (e.g., server-side AFR instead of NSR). Therefore, implementing the journal as a regular file that's treated like any other file by all layers below would:

 + provide the DHT2 server component more control (discard, replay, etc.) over the journal
 + enable the MDS to use NSR or AFR (server side) without any modifications to the journaling part; NSR/AFR would treat the journal as any other file and keep it replicated and consistent
 - restrict the ability to take advantage of storing the (DHT2) journal on faster storage (SSD)

Sharding
--------

Introduce the notion of block pointers for inodes: block pointers, rather than individual files/objects, are what DHT2 distributes. This changes the translator API extensively. Try to leverage the existing shard implementation and see if the concept of block pointers can be used in place of treating each shard as a separate file; treating each shard as a separate file bloats up the amount of tracking (on the MDS) needed per file shard. (A sketch of the idea follows below.)
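To make the cluster map epoch check concrete, here's a minimal sketch in C. The names (dht2_clustermap, dht2_req, dht2_validate_epoch) are made up for illustration and are not part of any existing DHT2 code; the only point is that every operation carries the sender's cached epoch and the receiving server compares it against its own copy of the map.

    /* Hypothetical sketch of the epoch check; names are illustrative only. */
    #include <stdint.h>

    struct dht2_clustermap {
        uint64_t epoch;          /* monotonically increasing map version */
        /* ... node/subvolume states and weights ... */
    };

    struct dht2_req {
        uint64_t map_epoch;      /* epoch of the sender's cached cluster map */
        /* ... operation payload ... */
    };

    /*
     * Return 0 if the sender's view of the cluster is at least as new as
     * ours, -1 if the sender must refresh its map before we act on the op.
     */
    int
    dht2_validate_epoch(const struct dht2_clustermap *local,
                        const struct dht2_req *req)
    {
        if (req->map_epoch < local->epoch)
            return -1;           /* stale map on the sender's side */
        return 0;
    }

What the server does on a mismatch (reject the operation, or fetch the newer map from GlusterD/etcd and retry) is one of the things to spell out in the design doc.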
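Similarly, a rough sketch of the WAL record lifecycle from the MDS section above. The record layout and function name here are assumptions for illustration; the actual on-disk format (FDL-derived or otherwise) is exactly what the Write Ahead Log section leaves open.

    /* Illustrative WAL record; not the FDL or DHT2 on-disk format. */
    #include <stddef.h>
    #include <stdint.h>

    enum wal_state {
        WAL_PENDING  = 0,        /* logged; operation not yet durable on the store */
        WAL_COMPLETE = 1         /* operation durable; record can be discarded */
    };

    struct wal_record {
        uint64_t txn_id;         /* transaction identifier */
        uint32_t op;             /* operation code (create, rename, ...) */
        uint32_t state;          /* WAL_PENDING or WAL_COMPLETE */
        /* serialized operation arguments would follow */
    };

    /*
     * Intended sequence per operation:
     *   1. append a WAL_PENDING record and fsync the journal
     *   2. apply the operation(s) to the store
     *   3. mark the record WAL_COMPLETE (or discard it)
     */

    /* On restart, replay whatever a crash left in the WAL_PENDING state. */
    void
    wal_replay(struct wal_record *records, size_t count)
    {
        for (size_t i = 0; i < count; i++) {
            if (records[i].state != WAL_PENDING)
                continue;
            /* re-apply records[i].op to the store, then mark the record
             * WAL_COMPLETE so it is not replayed again */
        }
    }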
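And for the Sharding section, an illustrative-only picture of what block pointers on an inode could look like. The field names, the direct-block count, and the offset-to-block mapping are assumptions made up for this sketch, not a proposed layout.

    /* Made-up inode layout to illustrate block pointers. */
    #include <stddef.h>
    #include <stdint.h>

    #define DHT2_DIRECT_BLOCKS 16    /* arbitrary count, just for this sketch */

    struct dht2_block_ptr {
        uint64_t subvol_id;          /* which subvolume holds the block */
        uint64_t block_id;           /* identity of the block on that subvolume */
    };

    struct dht2_inode {
        uint64_t size;
        uint32_t block_size;
        struct dht2_block_ptr blocks[DHT2_DIRECT_BLOCKS];
        /* indirect pointers could extend this for larger files */
    };

    /* Map a file offset to the block pointer that covers it. */
    struct dht2_block_ptr *
    dht2_block_for_offset(struct dht2_inode *inode, uint64_t offset)
    {
        uint64_t idx = offset / inode->block_size;

        if (idx >= DHT2_DIRECT_BLOCKS)
            return NULL;             /* would need indirect blocks */
        return &inode->blocks[idx];
    }

The contrast with the current shard translator is that per-file tracking stays bounded by the pointer table on the inode instead of one separately tracked file per shard.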
Size on MDS
-----------

https://review.gerrithub.io/#/c/253517/

Leader Election
---------------

It's important for the server-side DHT2 component to know, in a replicated MDS setup, whether a brick is acting as the leader. As of now this information is part of NSR, but it needs to be carved out as a separate translator on the server. [More on this in the design document]

Server Graph
------------

DHT2 MDSes "interact" with each other (a non-replicated MDC would typically have the client translator loaded on the server to "talk" to the other (N - 1) MDS nodes). When the MDC is replicated (NSR, for example), then:

 1) 'N - 1' NSR client component(s) are loaded on the server to talk to other replicas (when a (sub)operation needs to be performed on a non-local / non-replica-group node), where N == number of distribute subvolumes
 2) 'N' NSR client component(s) are loaded on the client for high availability of the distributed MDS.

--
As usual, comments are more than welcome. If things are still unclear or you want to read more, then hang in for the design doc.

Thanks,
Venky