> 3. Volume start fails with a message "volume start: test-vol: failed:
> Commit failed on 00000000-0000-0000-0000-000000000000. Please check log
> file for details."

The 'zero' uuid displayed in the error message is due to a naive
implementation of the procedure wrapping the RPC. This can easily be
solved by remembering the uuid of the peer to which the RPC is
initiated, so that it can be logged on failure independent of whether
the peer is reachable (a rough sketch follows at the end of this mail).

> 4. gluster volume status now shows the volume as started although the
> previous transaction failed.
>
> In this case, since the local commit op succeeded, the changes to
> volinfo were made, but op_ret was non-zero because the remote commit op
> failed on the other node (due to that node going down at the same
> point in time).
>
> I was thinking of moving the local commit op code after the remote
> commit ops and then overriding op_ret and op_errstr with the local
> commit op's result. I know this fix can't solve the entire
> inconsistency issue here, as the current design doesn't have an UNDO
> framework, but with it we can at least show a correct message in the
> CLI.

IIUC, you intend to avoid misrepresenting the status of a command
executed on the cluster. For this, we need to treat any transaction
that succeeded on a majority of the nodes as successful, and fail the
transaction otherwise. This must be reflected in the status reported to
the CLI (a rough sketch of such a majority rule is at the end of this
mail).

We need to improve our transaction framework to roll back changes on
nodes where the command succeeded when it did not succeed on a
majority, and to 'retry' (until success) the command on nodes where it
failed when it did succeed on a majority. Until the cluster agrees on
all the transactions whose statuses were reported to the 'clients'
(CLI), the nodes in disagreement must operate in a degraded state, i.e.
they must not have a say in the success of transactions in the
meantime.

Reordering the commit-phase operations gains us nothing. If you want to
fix the error reported to the CLI, that can be done by handling errors
at the originator node (where the CLI was run locally), so that the
'exit' status reported for the command reflects its outcome across the
cluster.

~KP
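A minimal sketch of the peer-uuid bookkeeping suggested above, assuming
hypothetical names (peer_ctx_t, dispatch_commit_rpc, commit_cbk) rather
than glusterd's actual RPC plumbing -- the only point is that the
target's uuid is copied before the call goes out and logged on failure:

/* Illustrative only -- not glusterd code. */
#include <stdio.h>
#include <stdlib.h>
#include <uuid/uuid.h>          /* libuuid: uuid_copy(), uuid_unparse() */

typedef struct {
        uuid_t peer_uuid;       /* remembered at dispatch time */
        char   peer_host[64];
} peer_ctx_t;

/* Hypothetical completion callback: ctx is the cookie saved at dispatch. */
static void
commit_cbk (peer_ctx_t *ctx, int op_ret)
{
        if (op_ret != 0) {
                char buf[37];

                uuid_unparse (ctx->peer_uuid, buf);
                /* The uuid logged here was saved before the RPC went out,
                 * so it is correct even if the peer is now unreachable. */
                fprintf (stderr, "Commit failed on %s (%s).\n",
                         buf, ctx->peer_host);
        }
        free (ctx);
}

/* Hypothetical dispatch: remember the peer before initiating the RPC. */
static int
dispatch_commit_rpc (const uuid_t peer_uuid, const char *host)
{
        peer_ctx_t *ctx = calloc (1, sizeof (*ctx));

        if (!ctx)
                return -1;
        uuid_copy (ctx->peer_uuid, peer_uuid);
        snprintf (ctx->peer_host, sizeof (ctx->peer_host), "%s", host);
        /* ... submit the RPC here, passing ctx as the callback cookie ... */
        return 0;
}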
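And a rough sketch of the majority rule, again with hypothetical names
(node_result_t, txn_decide) -- it shows only the decision, not the
rollback/retry machinery or the degraded-state bookkeeping that would
have to accompany it:

/* Illustrative only -- not glusterd's transaction framework. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
        bool committed;         /* did the commit op succeed on this node? */
} node_result_t;

typedef enum {
        TXN_SUCCESS,            /* report success; 'retry' on nodes that failed   */
        TXN_FAILURE             /* report failure; roll back on nodes that passed */
} txn_status_t;

/* Decide the transaction outcome from the per-node commit results,
 * counting the originator's local commit as just another node. */
static txn_status_t
txn_decide (const node_result_t *results, size_t n_nodes)
{
        size_t ok = 0;

        for (size_t i = 0; i < n_nodes; i++)
                if (results[i].committed)
                        ok++;

        /* A strict majority of all nodes must have committed. */
        return (2 * ok > n_nodes) ? TXN_SUCCESS : TXN_FAILURE;
}

The status returned here is what the originator would hand back to the
CLI, independent of whether its own local commit happened to succeed.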