Yeah, I think the right answer is to remove that check and restructure all of the existing reader ops which rely on it to return ENOENT from do_osd_ops. For copy_get, we instead return 0 and indicate in the structured payload that the object logically does not exist (along with the relevant log entries). -Sam ----- Original Message ----- From: "Samuel Just" <sjust@xxxxxxxxxx> To: "Zhiqiang Wang" <zhiqiang.wang@xxxxxxxxx> Cc: "David Zafman" <dzafman@xxxxxxxxxx>, "Sage Weil" <sweil@xxxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx Sent: Thursday, June 4, 2015 11:12:14 AM Subject: Re: 'Racing read got wrong version' during proxy write testing copy_get can happen outside of the context of a cache op, so we can't (shouldn't) just flag it as a cache op. I wonder whether it wouldn't be better to live without that check at all and let all of the ops in do_osd_ops individually return ENOENT as appropriate. -Sam ----- Original Message ----- From: "Zhiqiang Wang" <zhiqiang.wang@xxxxxxxxx> To: "David Zafman" <dzafman@xxxxxxxxxx>, "Sage Weil" <sweil@xxxxxxxxxx> Cc: ceph-devel@xxxxxxxxxxxxxxx Sent: Wednesday, June 3, 2015 7:08:35 PM Subject: RE: 'Racing read got wrong version' during proxy write testing Hi David, Proxy write hasn't been merge into master yet. It's not likely this is causing #11511. -----Original Message----- From: David Zafman [mailto:dzafman@xxxxxxxxxx] Sent: Thursday, June 4, 2015 9:46 AM To: Wang, Zhiqiang; Sage Weil Cc: ceph-devel@xxxxxxxxxxxxxxx Subject: Re: 'Racing read got wrong version' during proxy write testing I'm wonder if this issue could be the cause of #11511. Could a proxy write have raced with the fill_in_copy_get() so object_info_t size doesn't correspond with the size of the object in the filestore? David On 6/3/15 6:22 PM, Wang, Zhiqiang wrote: > Making the 'copy get' op to be a cache op seems like a good idea. > > -----Original Message----- > From: Sage Weil [mailto:sweil@xxxxxxxxxx] > Sent: Thursday, June 4, 2015 9:14 AM > To: Wang, Zhiqiang > Cc: ceph-devel@xxxxxxxxxxxxxxx > Subject: RE: 'Racing read got wrong version' during proxy write > testing > > On Wed, 3 Jun 2015, Wang, Zhiqiang wrote: >> I ran into the 'op not idempotent' problem during the testing today. >> There is one bug in the previous fix. In that fix, we copy the reqids >> in the final step of 'fill_in_copy_get'. If the object is deleted, >> since the 'copy get' op is a read op, it returns earlier with ENOENT in do_op. >> No reqids will be copied during promotion in this case. This again >> leads to the 'op not idempotent' problem. We need a 'smart' way to >> detect the op is a 'copy get' op (looping the ops vector doesn't seem >> smart?) and copy the reqids in this case. > Hmm. I think the idea here is/was that that ENOENT would somehow include the reqid list from PGLog::get_object_reqids(). > > I think teh trick is getting it past the generic check in do_op: > > if (!op->may_write() && > !op->may_cache() && > (!obc->obs.exists || > ((m->get_snapid() != CEPH_SNAPDIR) && > obc->obs.oi.is_whiteout()))) { > reply_ctx(ctx, -ENOENT); > return; > } > > Maybe we mark these as cache operations so that may_cache is true? > > Sam, what do you think? > > sage > > >> -----Original Message----- >> From: Sage Weil [mailto:sweil@xxxxxxxxxx] >> Sent: Tuesday, May 26, 2015 12:27 AM >> To: Wang, Zhiqiang >> Cc: ceph-devel@xxxxxxxxxxxxxxx >> Subject: Re: 'Racing read got wrong version' during proxy write >> testing >> >> On Mon, 25 May 2015, Wang, Zhiqiang wrote: >>> Hi all, >>> >>> I ran into a problem during the teuthology test of proxy write. It is like this: >>> >>> - Client sends 3 writes and a read on the same object to base tier >>> - Set up cache tiering >>> - Client retries ops and sends the 3 writes and 1 read to the cache >>> tier >>> - The 3 writes finished on the base tier, say with versions v1, v2 >>> and >>> v3 >>> - Cache tier proxies the 1st write, and start to promote the object >>> for the 2nd write, the 2nd and 3rd writes and the read are blocked >>> - The proxied 1st write finishes on the base tier with version v4, >>> and returns to cache tier. But somehow the cache tier fails to send >>> the reply due to socket failure injecting >>> - Client retries the writes and the read again, the writes are >>> identified as dup ops >>> - The promotion finishes, it copies the pg_log entries from the base >>> tier and put it in the cache tier's pg_log. This includes the 3 >>> writes on the base tier and the proxied write >>> - The writes dispatches after the promotion, they are identified as >>> completed dup ops. Cache tier replies these write ops with the >>> version from the base tier (v1, v2 and v3) >>> - In the last, the read dispatches, it reads the version of the >>> proxied write (v4) and replies to client >>> - Client complains that 'racing read got wrong version' >>> >>> In a previous discussion of the 'ops not idempotent' problem, we solved it by copying the pg_log entries in the base tier to cache tier during promotion. Seems like there is still a problem with this approach in the above scenario. My first thought is that when proxying the write, the cache tier should use the original reqid from the client. But currently we don't have a way to pass the original reqid from cache to base. Any ideas? >> I agree--I think the correct fix here is to make the proxied op be recognized as a dup. We can either do that by passing in an optional reqid to the Objecter, or extending the op somehow so that both reqids are listed. I think the first option will be cleaner, but I think we will also need to make sure the 'retry' count is preserved as (I think) we skip the dup check if retry==0. And we probably want to preserve the behavior that a given (reqid, retry) only exists once in the system. >> >> This probably means adding more optional args to Objecter::read()...? >> >> sage >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" >> in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo >> info at http://vger.kernel.org/majordomo-info.html >> >> > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" > in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo > info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html