Hi Krish,

I've written this quite fast and have probably missed some important details, but I hope it contains the essential points needed to understand how it works.

I'll try to write a high-level overview of the execution flow of all fops, and then I'll detail what happens in each step for the readv/writev fops.

Each fop entry point is ec_gf_<fop>(), which immediately calls ec_<fop>(). There an ec_fop_data_t structure is created. This structure controls the life cycle of the fop execution. Just after this structure has been created, ec_manager() is called, where the state machine begins execution at state EC_STATE_INIT.

Each fop specifies a function, called ec_manager_<fop>(), that manages what to do in each state. However, since many actions are the same for all fops, this function only has to process the special cases for the given fop. Common state transitions and other shared work are done in ec_default_manager().

All state transitions are controlled by the completion of the suboperations initiated in a given state. This means that if a fop calls another fop or STACK_WIND() during the processing of one state (for example to lock an inode or to send the request to the next xlator), the next state won't be processed until this other operation has finished. This is controlled by ec_wait(), which checks if there are other operations and returns an error code indicating the result of that operation, or -1 if there are pending operations. In the latter case the state machine "sleeps" until that operation finishes.

Following is the meaning of each state and the possible transitions between them. In this description it is assumed that nothing fails. If something fails in any state, the next state will have its sign changed. For example, if there is an error during the operations done in EC_STATE_PREOP, once they finish, the next state will be -EC_STATE_DISPATCH instead of EC_STATE_DISPATCH.

EC_STATE_INIT: The fop has a chance to modify its input arguments or prepare anything needed to process it.
From EC_STATE_INIT, the next state depends on the flags the fop has been given. If EC_FLAG_LOCK is present, the next state will be EC_STATE_LOCK, otherwise it will be EC_STATE_DISPATCH.

EC_STATE_LOCK: In this state the fop locks the entry or inode using the entrylk/inodelk fops (functions ec_entrylk() and ec_inodelk()) before processing the fop. This is done to synchronize execution between bricks. Which entries or inodes are locked depends on the flags given to the fop. If EC_FLAG_PREOP is present, the next state will be EC_STATE_PREOP, otherwise it will be EC_STATE_DISPATCH.

EC_STATE_PREOP: In this state, a lookup on the inode is made to get the version and real size information. This data is needed to identify corrupted bricks and to determine the real size of regular files before modifying them. The next state is always EC_STATE_DISPATCH.

EC_STATE_DISPATCH: In this state, the fop determines how many and which bricks will be involved in the operation and initiates the execution through ec_dispatch_(all|min|one)(). These functions basically select a subset of the alive bricks without detected errors. Once selected, ec_wind_<fop>() is called to send the request to the underlying xlators using STACK_WIND().

When a brick answers with STACK_UNWIND through ec_<fop>_cbk(), the answer is stored in an ec_cbk_data_t structure and ec_combine() is called to try to find other answers that are compatible with it (this basically means that their return codes are equal and their basic metadata, like xdata, iatt, ..., is identical). When two answers can be combined, they form a group of answers. All answers are stored until the execution of the fop completes, even if they do not match any other answer.

Every time an answer is processed, ec_complete() is called to decrement the number of pending winds. When this counter reaches 0, it means that all bricks have answered.
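
The answer bookkeeping described above can be sketched roughly as follows. This is only an illustration and not the actual ec code: answer_t, fop_t, fop_start(), fop_answer() and fop_all_answered() are hypothetical names, and the single "version" field is an assumption standing in for the metadata (xdata, iatt, ...) that the real comparison looks at.

```c
#include <assert.h>

#define MAX_ANSWERS 16

/* Hypothetical, much simplified stand-in for ec_cbk_data_t. */
typedef struct {
    int op_ret;
    int op_errno;
    unsigned long version; /* assumption: stands in for iatt/xdata */
    int group;             /* index of the group this answer belongs to */
} answer_t;

/* Hypothetical, much simplified stand-in for ec_fop_data_t. */
typedef struct {
    answer_t answers[MAX_ANSWERS];
    int group_size[MAX_ANSWERS]; /* number of answers in each group */
    int stored;                  /* answers received so far */
    int groups;                  /* groups formed so far */
    int pending;                 /* winds still waiting for an answer */
} fop_t;

/* Two answers are compatible when return codes and metadata match. */
static int compatible(const answer_t *a, const answer_t *b)
{
    return a->op_ret == b->op_ret && a->op_errno == b->op_errno &&
           a->version == b->version;
}

/* Prepare the bookkeeping for a fop that has been wound to 'winds' bricks. */
void fop_start(fop_t *fop, int winds)
{
    assert(winds > 0 && winds <= MAX_ANSWERS);
    fop->stored = 0;
    fop->groups = 0;
    fop->pending = winds;
}

/* Store one answer, add it to a compatible group (or open a new one) and
 * decrement the pending counter. Returns the size of the group joined. */
int fop_answer(fop_t *fop, int op_ret, int op_errno, unsigned long version)
{
    answer_t *ans;
    int i;

    assert(fop->pending > 0 && fop->stored < MAX_ANSWERS);
    ans = &fop->answers[fop->stored];
    ans->op_ret = op_ret;
    ans->op_errno = op_errno;
    ans->version = version;
    ans->group = -1;

    for (i = 0; i < fop->stored; i++) {
        if (compatible(&fop->answers[i], ans)) {
            ans->group = fop->answers[i].group;
            break;
        }
    }
    if (ans->group < 0) {
        ans->group = fop->groups++;
        fop->group_size[ans->group] = 0;
    }
    fop->group_size[ans->group]++;
    fop->stored++;
    fop->pending--;

    return fop->group_size[ans->group];
}

/* Non-zero once every brick has answered. */
int fop_all_answered(const fop_t *fop)
{
    return fop->pending == 0;
}
```

In this sketch every answer is kept, even when it forms a group of one, which mirrors the point that answers are stored until the fop completes.
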
Once all bricks have answered, if no group with enough answers (the expected number of answers) has been found, the code checks whether there is any other group with at least a minimum number of answers, and if that is the case, that group is taken as the good answer. If no group satisfies the condition, an EIO error is reported. ec_report() is then called.

When ec_combine() determines that a group has enough answers, it calls ec_report() to tell the fop's state machine that the processing of the fop can continue. At this point, the next state is set to EC_STATE_REBUILD.

EC_STATE_REBUILD: In this state fops can do any additional processing of the received answers or initiate other tasks needed to complete the answer. Once the data is ready to be propagated to the calling xlator, the next state is set to EC_STATE_REPORT.

EC_STATE_REPORT: In this state the callback of the fop is called, passing to it the arguments corresponding to the best answer received from all bricks. For normal fops coming from ec_gf_<fop>() this only means a call to STACK_UNWIND(). In other cases, like subfops initiated by other ec fops or self-heal operations, this callback can be any other function that will continue the processing of the parent operation. Once the fop has been reported to the caller, the state machine waits until all remaining winds have finished (this can happen in some circumstances while processing locking fops). When all pending winds have finished, the next state is EC_STATE_COMPLETED.

EC_STATE_COMPLETED: In this state all processing for the fop has finished, and it only determines whether a postop operation must be executed by looking at the flag EC_FLAG_POSTOP. If this flag is set, the next state is EC_STATE_POSTOP, otherwise it jumps to state EC_STATE_UNLOCK.

EC_STATE_POSTOP: In this state a call to ec_update_version() is made to increment the version number of the file and update the real size if needed. This is done using the ec_(f)xattrop() fops.
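
The forward transitions of this walk-through, together with the sign convention for failures, can be condensed into a single function. This is a minimal sketch under stated assumptions, not the real ec_default_manager(): the ST_* and FLAG_* names and their numeric values are hypothetical stand-ins for the EC_STATE_* and EC_FLAG_* constants, ST_END is an invented marker for "state machine finished", and how negative (error) states are processed is not covered here.

```c
#include <assert.h>

/* Hypothetical stand-ins for the EC_STATE_* constants; values are made up. */
enum {
    ST_END = 0, /* invented marker: the state machine has finished */
    ST_INIT,
    ST_LOCK,
    ST_PREOP,
    ST_DISPATCH,
    ST_REBUILD,
    ST_REPORT,
    ST_COMPLETED,
    ST_POSTOP,
    ST_UNLOCK
};

/* Hypothetical stand-ins for EC_FLAG_LOCK, EC_FLAG_PREOP, EC_FLAG_POSTOP. */
enum {
    FLAG_LOCK = 1,
    FLAG_PREOP = 2,
    FLAG_POSTOP = 4
};

/* Return the state that follows 'state' for a fop carrying 'flags'.
 * If the suboperations of the current state failed, the next state is
 * returned with its sign changed. Only positive input states are handled. */
int next_state(int state, unsigned flags, int failed)
{
    int next;

    assert(state > 0);
    switch (state) {
    case ST_INIT:
        next = (flags & FLAG_LOCK) ? ST_LOCK : ST_DISPATCH;
        break;
    case ST_LOCK:
        next = (flags & FLAG_PREOP) ? ST_PREOP : ST_DISPATCH;
        break;
    case ST_PREOP:
        next = ST_DISPATCH;
        break;
    case ST_DISPATCH:
        next = ST_REBUILD;
        break;
    case ST_REBUILD:
        next = ST_REPORT;
        break;
    case ST_REPORT:
        next = ST_COMPLETED;
        break;
    case ST_COMPLETED:
        next = (flags & FLAG_POSTOP) ? ST_POSTOP : ST_UNLOCK;
        break;
    case ST_POSTOP:
        next = ST_UNLOCK;
        break;
    default: /* ST_UNLOCK: nothing left to do */
        next = ST_END;
        break;
    }

    return failed ? -next : next;
}
```

With these made-up values, a failure during the preop step makes next_state(ST_PREOP, flags, 1) return -ST_DISPATCH, matching the EC_STATE_PREOP example given for the sign convention.
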
Once the version update completes, the next state is EC_STATE_UNLOCK.

EC_STATE_UNLOCK: In this state, any lock acquired during the EC_STATE_LOCK state is released. This finishes the state machine of the fop and releases all allocated resources.

The sequence of calls for a readv operation is the following:

ec_gf_readv()
  ec_readv()
    ec_fop_data_t(): create ec_fop_data_t structure
    ec_manager(): initiate state machine

ec_manager_readv() and ec_default_manager() are called for each state. In each state, readv does the following (states not specified mean default processing as described earlier):

EC_STATE_INIT: The offset and size of the read are aligned to the block size and transformed to valid values for the bricks.

EC_STATE_REBUILD: ec_readv_rebuild() is called to combine the fragments read from all bricks into a single data block using the erasure code decoding function ec_method_merge().

The writev operation is a bit more complex and needs to create an additional state:

ec_gf_writev()
  ec_writev()
    ec_fop_data_t(): create ec_fop_data_t structure
    ec_manager(): initiate state machine

ec_manager_writev() and ec_default_manager() are called for each state. In each state, writev does the following:

EC_STATE_INIT: ec_writev_init() is called to align the offset and size to the block size. It also creates an aligned contiguous buffer with the contents of the data to write, which will be needed to encode it.

EC_STATE_DISPATCH: Before starting the dispatch of the write fop to the underlying xlators using STACK_WIND(), some write operations need to read some fragments of data before and/or after the specified offset and size of the write operation. This is needed when a write is not aligned to the block size. This is done through ec_writev_start(). This function initiates a readv subfop just before the offset if the offset is not aligned, and another readv after offset+size if the end of the write is not aligned.
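
To make the alignment arithmetic concrete, here is a minimal sketch of how the aligned range and the two extra reads could be computed. It is not taken from ec_writev_init() or ec_writev_start(): write_span_t and align_write() are hypothetical names, and "block" is an assumed block size passed in by the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result of aligning a write to the block size. */
typedef struct {
    uint64_t aligned_offset; /* write offset rounded down to a block */
    uint64_t aligned_size;   /* size of the block-aligned range */
    uint64_t head;           /* bytes to read just before the offset */
    uint64_t tail;           /* bytes to read just after offset + size */
} write_span_t;

/* Compute the block-aligned range that covers [offset, offset + size)
 * and how much existing data is missing at both ends of that range. */
write_span_t align_write(uint64_t offset, uint64_t size, uint64_t block)
{
    write_span_t span;
    uint64_t end;

    assert(block > 0);
    end = offset + size;

    /* Distance from the previous block boundary to the write offset. */
    span.head = offset % block;
    span.aligned_offset = offset - span.head;

    /* Distance from the end of the write to the next block boundary. */
    span.tail = (block - end % block) % block;

    span.aligned_size = span.head + size + span.tail;

    return span;
}
```

For example, with a block of 16 bytes, a write of 20 bytes at offset 10 gives head = 10 and tail = 2, so the aligned range starts at offset 0 and is 32 bytes long. A non-zero head or tail corresponds to the cases where a readv subfop is needed before or after the written data.
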
Once the readv subfops have finished, the state machine goes to EC_STATE_WRITE_START, which simply does the normal processing of EC_STATE_DISPATCH.

EC_STATE_DISPATCH: In this state, ec_wind_writev() is called for each subvolume. This function computes the erasure code encoded data that will be sent to the brick using the ec_method_split() function.

EC_STATE_REBUILD: In this state, the fop calculates the correct return code for the write operation and computes the resulting size of the file.

Hope this helps...

Xavi

On Monday 09 June 2014 08:15:34 Krishnan Parthasarathi wrote:
> Hi Xavi,
>
> Following the code walk through and discussion surrounding
> erasure coding translator's implementation on #gluster-meeting,
> I wanted to ask a few questions that would make things clearer
> and help speed up the review. I am CC'ing gluster-devel in a hope
> that some of these questions might have popped in others' head
> as well.
>
> While learning a translator I try to identify the different internal stages
> that a FOP goes through while 'inside' a xlator (ie, before a STACK_WIND or
> STACK_UNWIND transfer the control to the child/parent xlator).
>
> Additionally, it helps to understand the points in processing of a FOP,
> the sequence of functions lead it to flow to the child(ren) xlators
> and the sequence of functions that lead it into the xlator (via callbacks).
>
> With that context, it would help if you listed the sequence of functions,
> including the state machine functions which 'guide' the FOP through various
> sub-operations, in the following cases.
>
> - When a inode modification call (say writev) enters cluster/ec.
> - When a readv call enters cluster/ec
>
> This could be done by attaching gdb to the mount process, but what I am
> looking for is your notes/insights that would help us appreciate
> the design/intent better. It would also help us to notice this pattern
> in other FOPs implemented in cluster/ec.
>
> cheers,
> Krish

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel