Re: How does ceph preserve read/write consistency?

Wei Jin <wjin.cn@xxxxxxxxx> · Thu, 9 Mar 2017 21:51:49 +0800

On Thu, Mar 9, 2017 at 1:45 PM, 许雪寒 <xuxuehan@xxxxxx> wrote:
> Hi, everyone.

> As shown above, the WRITE req with tid 1312595 arrived at 18:58:27.439107 and the READ req with tid 6476 arrived at 18:59:55.030936. However, the latter finished at 19:00:20.333389, while the former finished commit at 19:00:20.335061 and filestore write at 19:00:25.202321. In these logs, we also found many "dequeue_op" entries for each req between its start and finish. From reading the source code, this seems to be due to "RWState"; is that correct?
>
> Also, it seems that the OSD doesn't distinguish reqs from different clients, so is it possible that IO reqs from the same client also finish in a different order from the one they were created in? Could this affect read/write consistency? For instance, could a read fail to see data that the same client wrote just before it?
>

IMO, it doesn't make sense for RADOS to order requests across
different clients.
Clients or users should coordinate that themselves.

However, for one specific client, Ceph can and must guarantee the
request order.

1) The Ceph messenger (network layer) tracks in_seq and out_seq when
receiving and sending messages.

2) Messages are dispatched or fast dispatched and then queued in
ShardedOpWQ in order.

If requests belong to different PGs, they may be processed
concurrently; that's OK.

If requests belong to the same PG, they are queued in the same
shard and processed in order under the PG lock (both reads and
writes).
For consecutive writes, ops are queued in the ObjectStore in order
under the PG lock, and the ObjectStore uses an OpSequencer to keep
that order when applying ops to the page cache, so that's OK too.
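
To illustrate the same-PG case, here is a toy Python model, not Ceph
code. The ShardedQueue class, the shard count, and the "pg_id modulo
shard count" mapping are all assumptions made for the example; the
only point is that ops for one PG share a FIFO, so they keep their
arrival order.

```python
from collections import deque

NUM_SHARDS = 4  # hypothetical shard count, chosen for the example


class ShardedQueue:
    """Toy model: one FIFO per shard; the PG id selects the shard."""

    def __init__(self, num_shards=NUM_SHARDS):
        self.shards = [deque() for _ in range(num_shards)]

    def enqueue(self, pg_id, op):
        # Ops for the same PG always land in the same shard.
        self.shards[pg_id % len(self.shards)].append((pg_id, op))

    def drain(self, shard_index):
        # A shard is drained in FIFO order, so per-PG order is preserved.
        shard = self.shards[shard_index]
        done = []
        while shard:
            done.append(shard.popleft())
        return done


q = ShardedQueue()
q.enqueue(1, "write A")
q.enqueue(5, "write B")  # 5 % 4 == 1, so it shares a shard with PG 1
q.enqueue(1, "read A")
q.enqueue(2, "write C")  # different shard, could be drained concurrently

shard1 = q.drain(1)
pg1_ops = [op for pg, op in shard1 if pg == 1]
shard2 = q.drain(2)
```

In this model the read on PG 1 can never be dequeued before the write
on PG 1 that was queued ahead of it, while PG 2's op sits in another
shard and does not wait for either.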

With regard to 'read after write' on the same object, Ceph must
guarantee that the read returns the content just written. That's done
by ondisk_read_lock/ondisk_write_lock in ObjectContext.
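
A rough sketch of that idea in Python, heavily simplified and not the
actual ObjectContext code: the ObjectState class and its method names
are made up for the example. A write is counted as "unstable" from the
moment it is queued until it has been applied, and a read on the same
object waits until no unstable writes remain.

```python
import threading
import time


class ObjectState:
    """Toy per-object state: readers wait for in-flight writes."""

    def __init__(self):
        self.cond = threading.Condition()
        self.unstable_writes = 0
        self.data = b""

    def start_write(self):
        # Called when the write is queued, before it is applied.
        with self.cond:
            self.unstable_writes += 1

    def finish_write(self, payload):
        # Called once the write has been applied to the backend.
        with self.cond:
            self.data = payload
            self.unstable_writes -= 1
            self.cond.notify_all()

    def read(self):
        # Block until every queued write on this object is applied.
        with self.cond:
            self.cond.wait_for(lambda: self.unstable_writes == 0)
            return self.data


def apply_write(obj, payload, delay):
    time.sleep(delay)  # stands in for a slow backend apply
    obj.finish_write(payload)


obj = ObjectState()
obj.start_write()  # the write is queued before the read arrives
writer = threading.Thread(target=apply_write, args=(obj, b"new", 0.05))
writer.start()
result = obj.read()  # waits for the write instead of returning b""
writer.join()
```

Without the wait in read(), the reader here would return the old empty
content because the write is still in flight when the read starts.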

> We are testing hammer version, 0.94.5.  Please help us, thank you:-)
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com