Re: How does ceph preserve read/write consistency?

许雪寒 <xuxuehan@xxxxxx> · Fri, 10 Mar 2017 03:20:07 +0000

Thanks for your reply.

As the log shows, in our test a READ that arrived after a WRITE finished before that WRITE. I also read the source code. For writes, it seems that in ReplicatedPG::do_op the OSD_op_tp thread calls ReplicatedPG::get_rw_lock, which tries to acquire RWState::RWWRITE. If that fails, the op is put into the obc->rwstate.waiters queue and is requeued when the repop finishes. However, the OSD_op_tp thread doesn't wait for the repop; it moves on to the next op. Can this be the cause?
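
To make the mechanism I am describing concrete, here is a toy Python model of it (not Ceph code). The class and function names are made up for illustration, and the rule that only ops of the same kind can share the state is a simplifying assumption. It only shows a single non-blocking worker that parks an op in a waiters queue and moves on to the next op:

```python
from collections import deque


class RWState:
    """Toy per-object state with a waiters queue (hypothetical, simplified)."""

    def __init__(self):
        self.state = None        # None, "read" or "write"
        self.count = 0           # number of ops currently holding the state
        self.waiters = deque()   # ops parked until the state is released

    def try_get(self, kind):
        # Assumption: the state is free, or already held by the same kind.
        if self.state in (None, kind):
            self.state = kind
            self.count += 1
            return True
        return False

    def put(self):
        # Release one holder; the last holder hands back the parked ops.
        self.count -= 1
        if self.count > 0:
            return []
        self.state = None
        requeued = list(self.waiters)
        self.waiters.clear()
        return requeued


def simulate(ops):
    """Run (name, kind, obj) ops on one worker; return the start order."""
    queue, in_flight, started = deque(ops), deque(), []
    while queue or in_flight:
        if queue:
            op = queue.popleft()
            name, kind, obj = op
            if obj.try_get(kind):
                started.append(name)
                in_flight.append(obj)
            else:
                obj.waiters.append(op)   # park the op, do not block the worker
        else:
            # Queue drained: the oldest in-flight op completes and its
            # waiters go back to the front of the queue.
            obj = in_flight.popleft()
            queue.extendleft(reversed(obj.put()))
    return started
```

With ops `read-A`, `write-A`, `read-B` (A and B being two different objects), the model starts `read-B` before `write-A`, because `write-A` is parked until `read-A` releases the state.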

-----Original Message-----
From: Wei Jin [mailto:wjin.cn@xxxxxxxxx]
Sent: March 9, 2017 21:52
To: 许雪寒
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: How does ceph preserve read/write consistency?

On Thu, Mar 9, 2017 at 1:45 PM, 许雪寒 <xuxuehan@xxxxxx> wrote:
> Hi, everyone.

> As shown above, the WRITE req with tid 1312595 arrived at 18:58:27.439107 and the READ req with tid 6476 arrived at 18:59:55.030936. However, the latter finished at 19:00:20.333389, while the former finished its commit at 19:00:20.335061 and its filestore write at 19:00:25.202321. In these logs we also found that, between the start and finish of each req, there were many "dequeue_op" entries for that req. From reading the source code, this seems to be due to "RWState". Is that correct?
>
> Also, it seems that the OSD doesn't distinguish reqs from different clients, so is it possible that io reqs from the same client also finish in a different order than the one they were created in? Could this affect read/write consistency? For instance, could a read fail to see data that the same client wrote just before it?
>

IMO, it doesn't make sense for rados to distinguish reqs from different clients.
Clients or users should do that themselves.

However, for one specific client, ceph can and must guarantee the request order.

1) The ceph messenger (network layer) tracks in_seq and out_seq when receiving and sending messages.

2) Messages are dispatched or fast dispatched and then queued in ShardedOpWq in order.
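
As a rough illustration of point 1), here is a toy Python receiver that tracks an in_seq counter. This is not the messenger code. The class name and the exact policy (drop anything at or below in_seq, reject gaps) are assumptions made for the sketch:

```python
class ToyConnection:
    """Hypothetical receiver that delivers messages strictly in sequence."""

    def __init__(self):
        self.in_seq = 0       # highest sequence number delivered so far
        self.delivered = []   # payloads, in delivery order

    def receive(self, seq, payload):
        if seq <= self.in_seq:
            # Assumed policy: an already seen message (e.g. resent after a
            # reconnect) is dropped so it cannot be applied twice.
            return False
        if seq != self.in_seq + 1:
            # Assumed policy: a gap means a message was lost, so refuse it.
            raise ValueError(f"missing message: got {seq}, want {self.in_seq + 1}")
        self.in_seq = seq
        self.delivered.append(payload)
        return True
```

Under these assumptions a resent message 1 is ignored after message 2 has arrived, so the ops reach the dispatcher in the order the client sent them.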

If requests belong to different pgs, they may be processed concurrently; that's ok.

If requests belong to the same pg, they are queued in the same shard and processed in order because of the pg lock (for both reads and writes).
For consecutive writes, ops are queued in the ObjectStore in order because of the pg lock, and the ObjectStore has OpSequence to guarantee the order when applying ops to the page cache; that's ok.
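
A small Python sketch of the "same pg, same shard" idea, not the real ShardedOpWq. The class name, the integer pg ids and the modulo shard choice are assumptions for illustration:

```python
from collections import deque


class ToyShardedQueue:
    """Hypothetical sharded queue: one FIFO per shard, pg id picks the shard."""

    def __init__(self, num_shards):
        self.shards = [deque() for _ in range(num_shards)]

    def shard_of(self, pg_id):
        # Assumed mapping: the same pg id always lands on the same shard.
        return pg_id % len(self.shards)

    def enqueue(self, pg_id, op):
        self.shards[self.shard_of(pg_id)].append((pg_id, op))

    def drain(self, shard_index):
        # Each shard is consumed in FIFO order, so ops of one pg keep
        # their arrival order even if other pgs share the shard.
        shard = self.shards[shard_index]
        out = []
        while shard:
            out.append(shard.popleft())
        return out
```

Ops for pg 3 and pg 8 share a shard when there are 5 shards, and each pg's ops still come out in the order they were queued. Ops for a pg in another shard can be drained independently.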

With regard to 'read after write' on the same object, ceph must guarantee that the read gets the content that was just written. That's done by ondisk_read/write_lock in ObjectContext.
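
Roughly, the idea is that a read waits while a write to the same object has been queued but not yet applied. Here is a minimal Python sketch of that idea, not the actual ObjectContext code. The counters and method bodies are a simplified assumption built on threading.Condition:

```python
import threading


class ToyObjectContext:
    """Hypothetical model: reads block while unapplied writes exist."""

    def __init__(self):
        self.cond = threading.Condition()
        self.unstable_writes = 0   # writes queued but not yet applied
        self.readers = 0           # reads currently in progress

    def ondisk_write_lock(self):
        with self.cond:
            self.cond.wait_for(lambda: self.readers == 0)
            self.unstable_writes += 1

    def ondisk_write_unlock(self):
        with self.cond:
            self.unstable_writes -= 1
            self.cond.notify_all()

    def ondisk_read_lock(self):
        with self.cond:
            self.cond.wait_for(lambda: self.unstable_writes == 0)
            self.readers += 1

    def ondisk_read_unlock(self):
        with self.cond:
            self.readers -= 1
            self.cond.notify_all()


def demo():
    """Return (reader_was_blocked, values_seen_by_reader)."""
    obc, data, seen = ToyObjectContext(), {"value": "old"}, []
    obc.ondisk_write_lock()            # write queued, not applied yet

    def reader():
        obc.ondisk_read_lock()
        seen.append(data["value"])
        obc.ondisk_read_unlock()

    t = threading.Thread(target=reader)
    t.start()
    t.join(timeout=0.1)                # reader cannot finish yet
    blocked = t.is_alive()
    data["value"] = "new"              # write is applied
    obc.ondisk_write_unlock()          # reader may now proceed
    t.join()
    return blocked, seen
```

In this model the reader stays blocked until the write is applied and unlocked, so it sees the new value rather than the old one.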

> We are testing the hammer version, 0.94.5. Please help us, thank you :-)
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com