On Wed, 3 Jun 2015, Wang, Zhiqiang wrote:
> I ran into the 'op not idempotent' problem during the testing today.
> There is one bug in the previous fix. In that fix, we copy the reqids in
> the final step of 'fill_in_copy_get'. If the object is deleted, since
> the 'copy get' op is a read op, it returns earlier with ENOENT in do_op.
> No reqids will be copied during promotion in this case. This again leads
> to the 'op not idempotent' problem. We need a 'smart' way to detect the
> op is a 'copy get' op (looping the ops vector doesn't seem smart?) and
> copy the reqids in this case.

Hmm. I think the idea here is/was that that ENOENT would somehow include
the reqid list from PGLog::get_object_reqids().

I think the trick is getting it past the generic check in do_op:

  if (!op->may_write() &&
      !op->may_cache() &&
      (!obc->obs.exists ||
       ((m->get_snapid() != CEPH_SNAPDIR) &&
        obc->obs.oi.is_whiteout()))) {
    reply_ctx(ctx, -ENOENT);
    return;
  }

Maybe we mark these as cache operations so that may_cache is true?

Sam, what do you think?

sage

>
> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Tuesday, May 26, 2015 12:27 AM
> To: Wang, Zhiqiang
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: 'Racing read got wrong version' during proxy write testing
>
> On Mon, 25 May 2015, Wang, Zhiqiang wrote:
> > Hi all,
> >
> > I ran into a problem during the teuthology test of proxy write. It is
> > like this:
> >
> > - Client sends 3 writes and a read on the same object to base tier
> > - Set up cache tiering
> > - Client retries ops and sends the 3 writes and 1 read to the cache
> >   tier
> > - The 3 writes finished on the base tier, say with versions v1, v2 and
> >   v3
> > - Cache tier proxies the 1st write, and start to promote the object
> >   for the 2nd write, the 2nd and 3rd writes and the read are blocked
> > - The proxied 1st write finishes on the base tier with version v4, and
> >   returns to cache tier.
> >   But somehow the cache tier fails to send the
> >   reply due to socket failure injecting
> > - Client retries the writes and the read again, the writes are
> >   identified as dup ops
> > - The promotion finishes, it copies the pg_log entries from the base
> >   tier and put it in the cache tier's pg_log. This includes the 3 writes
> >   on the base tier and the proxied write
> > - The writes dispatches after the promotion, they are identified as
> >   completed dup ops. Cache tier replies these write ops with the version
> >   from the base tier (v1, v2 and v3)
> > - In the last, the read dispatches, it reads the version of the
> >   proxied write (v4) and replies to client
> > - Client complains that 'racing read got wrong version'
> >
> > In a previous discussion of the 'ops not idempotent' problem, we solved
> > it by copying the pg_log entries in the base tier to cache tier during
> > promotion. Seems like there is still a problem with this approach in
> > the above scenario. My first thought is that when proxying the write,
> > the cache tier should use the original reqid from the client. But
> > currently we don't have a way to pass the original reqid from cache to
> > base. Any ideas?
>
> I agree--I think the correct fix here is to make the proxied op be
> recognized as a dup. We can either do that by passing in an optional
> reqid to the Objecter, or extending the op somehow so that both reqids
> are listed. I think the first option will be cleaner, but I think we
> will also need to make sure the 'retry' count is preserved as (I think)
> we skip the dup check if retry==0. And we probably want to preserve the
> behavior that a given (reqid, retry) only exists once in the system.
>
> This probably means adding more optional args to Objecter::read()...?
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
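
For reference, the generic check in do_op quoted at the top of this thread can be modeled in isolation to see why a 'copy get' on a deleted object returns ENOENT before any reqids are copied, and why setting may_cache would let it through. This is only a rough sketch: OpFlags, ObjectState and generic_enoent_check are hypothetical stand-ins invented for illustration, not Ceph's actual classes or functions, and the real do_op does much more around this check.

```cpp
#include <cerrno>

// Hypothetical stand-in for the op's flags (not Ceph's OpRequest).
struct OpFlags {
  bool may_write;
  bool may_cache;
};

// Hypothetical stand-in for the object state (not Ceph's ObjectContext).
struct ObjectState {
  bool exists;
  bool is_whiteout;
};

// Mirrors the shape of the condition quoted from do_op: returns -ENOENT
// when the op would be answered early, or 0 when it would go on to be
// executed (where the reqids could be filled in).
constexpr int generic_enoent_check(const OpFlags& op,
                                   const ObjectState& obs,
                                   bool snapid_is_snapdir) {
  return (!op.may_write &&
          !op.may_cache &&
          (!obs.exists || (!snapid_is_snapdir && obs.is_whiteout)))
             ? -ENOENT
             : 0;
}

// A 'copy get' treated as a plain read, versus one marked as a cache op.
constexpr OpFlags copy_get_as_read{false, false};
constexpr OpFlags copy_get_as_cache_op{false, true};

// An object that has been deleted.
constexpr ObjectState deleted_object{false, false};
```

With these definitions, generic_enoent_check(copy_get_as_read, deleted_object, false) yields -ENOENT, while the same call with copy_get_as_cache_op yields 0, which is the effect the "mark these as cache operations" suggestion is after.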