On Thu, Mar 21, 2013 at 1:07 AM, Yan, Zheng <zheng.z.yan@xxxxxxxxx> wrote: > On 03/21/2013 02:31 AM, Greg Farnum wrote: >> On Tuesday, March 19, 2013 at 11:49 PM, Yan, Zheng wrote: >>> On 03/20/2013 02:15 PM, Sage Weil wrote: >>>> On Wed, 20 Mar 2013, Yan, Zheng wrote: >>>>> On 03/20/2013 07:09 AM, Greg Farnum wrote: >>>>>> Hmm, this is definitely narrowing the race (probably enough to never hit it), but it's not actually eliminating it (if the restart happens after 4 billion requests?). More importantly this kind of symptom makes me worry that we might be papering over more serious issues with colliding states in the Table on restart. >>>>>> I don't have the MDSTable semantics in my head so I'll need to look into this later unless somebody else volunteers to do so? >>>>> >>>>> >>>>> >>>>> Not just 4 billion requests, MDS restart has several stage, mdsmap epoch >>>>> increases for each stage. I don't think there are any more colliding >>>>> states in the table. The table client/server use two phase commit. it's >>>>> similar to client request that involves multiple MDS. the reqid is >>>>> analogy to client request id. The difference is client request ID is >>>>> unique because new client always get an unique session id. >>>> >>>> >>>> >>>> Each time a tid is consumed (at least for an update) it is journaled in >>>> the EMetaBlob::table_tids list, right? So we could actually take a max >>>> from journal replay and pick up where we left off? That seems like the >>>> cleanest. >>>> >>>> I'm not too worried about 2^32 tids, I guess, but it would be nicer to >>>> avoid that possibility. >>> >>> >>> >>> Can we re-use the client request ID as table client request ID ? >>> >>> Regards >>> Yan, Zheng >> >> Not sure what you're referring to here — do you mean the ID of the filesystem client request which prompted the update? I don't think that would work as client requests actually require two parts to be unique (the client GUID and the request seq number), and I'm pretty sure a single client request can spawn multiple Table updates. >> > > You are right, client request ID does not work. > >> As I look over this more, it sure looks to me as if the effect of the code we have (when non-broken) is to rollback every non-committed request by an MDS which restarted — the only time it can handle the TableServer's "agree" with a different response is if the MDS was incorrectly marked out by the map. Am I parsing this correctly, Sage? Given that, and without having looked at the code more broadly, I think we want to add some sort of implicit or explicit handshake letting each of them know if the MDS actually disappeared. We use the process/address nonce to accomplish this in other places… >> -Greg >> > > The table server sends 'agree' message to table client after a 'prepare entry' is safely logged. The table server re-sends 'agree' message in two cases, one is the table client restarts, another is the table server itself restarts. > The purpose of re-sending 'agree' message is to check if the table client still wants to keep the update preparation. (The table client might crash before submitting the update). The purpose of reqid is associate table update > preparation request with the server's 'agree' reply message. The problem here is that the table client does not make sure reqid unique between restarts. If you feel 2^32 reqids are still enough, set the reqid to a randomized 64bit > value should be safe enough. Right. I'd like to somehow mark those reqid's so that we can tell when they come from a different incarnation of the MDS TableClient daemon. One way is via some piece of random data that will probably distinguish them, although if we have something which we can know is different that would be preferable. I think we can work something out of the startup session data each MDS does with the monitors, but I'm not sure I can check any time soon; I have a number of other things to get to now that I've gotten through (the first round on) this series. Thanks for all the patches, by the way. :) -Greg -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html