Re: How does rbd preserve the consistency of WRITE requests that span across multiple objects?

Jason Dillaman <jdillama@xxxxxxxxxx> · Wed, 24 May 2017 11:04:39 -0400

Just like a regular block device, re-orders are permitted between
write barriers/flushes. For example, if you had an HDD with 512-byte
sectors and attempted to write 4K, there is no guarantee what the
disk would look like if you had a crash mid-write or if you
concurrently issued an overlapping write. The correct way for your
application to behave (regardless of whether it uses RBD, HDDs, or SSDs)
is to wait for the first write to complete before issuing the
overlapping write.
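
To make that last point concrete, here is a minimal Python sketch. It is
not librbd or kernel code: FakeDisk and its aio_write are hypothetical
stand-ins for a device that persists a request one 512-byte sector at a
time, in no particular order. It only illustrates that waiting for the
first write before issuing the overlapping one gives a predictable
result, while issuing both at once does not.

```python
import asyncio
import random

SECTOR = 512


class FakeDisk:
    """In-memory stand-in for a block device (hypothetical, not a real API)."""

    def __init__(self, size):
        self.data = bytearray(size)

    async def aio_write(self, offset, buf):
        # Persist the request sector by sector in arbitrary order, yielding
        # between sectors so that other in-flight writes can interleave.
        chunks = [(offset + i, buf[i:i + SECTOR])
                  for i in range(0, len(buf), SECTOR)]
        random.shuffle(chunks)
        for off, chunk in chunks:
            await asyncio.sleep(0)
            self.data[off:off + len(chunk)] = chunk


async def ordered_overlapping_writes(disk):
    # Wait for the first write to complete before issuing the second.
    await disk.aio_write(0, b"X" * 4096)
    await disk.aio_write(0, b"Y" * 4096)
    return bytes(disk.data[:4096])


async def concurrent_overlapping_writes(disk):
    # Both writes are in flight at once: each sector ends up with
    # whichever write happened to touch it last.
    await asyncio.gather(
        disk.aio_write(0, b"X" * 4096),
        disk.aio_write(0, b"Y" * 4096),
    )
    return bytes(disk.data[:4096])


def run_ordered():
    return asyncio.run(ordered_overlapping_writes(FakeDisk(4096)))


def run_concurrent():
    return asyncio.run(concurrent_overlapping_writes(FakeDisk(4096)))
```

run_ordered() always ends with the second write's data in all eight
sectors. run_concurrent() may return any per-sector mix of the two
buffers, which is the case the application has to avoid.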

On Tue, May 23, 2017 at 11:29 PM, 许雪寒 <xuxuehan@xxxxxx> wrote:
> Hi, thanks for the explanation:-)
>
> On the other hand, I wonder if the following scenario could happen:
>
> A program in a virtual machine that uses "libaio" to access a file
> continuously submits "write" requests to the underlying file system,
> which translates them into rbd requests. Say an rbd "aio_write" X
> wants to write to an area that spans objects A and B. According to my
> understanding of the rbd source code, librbd would split this write
> request into two rados Ops, each corresponding to a single object.
> After these two rados Ops have been sent to the OSD and before they
> have finished, another rbd "aio_write" request Y, which wants to write
> to the same area as X, arrives and is sent to the OSD in the same way
> as X. Due to possible reordering, Y.B could be done before X.B while
> Y.A is done after X.A, which could lead to an unexpected result.
>
> Is this possible?
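
The scenario in the question can be enumerated with a small Python
sketch. This is a toy model, not Ceph code: it assumes, purely for
illustration, that the four per-object ops of two overlapping in-flight
writes may complete in any order and that the last op to complete on an
object determines that object's content. The function names are made up
for this example.

```python
from itertools import permutations

# Each op is (write request, object it touches).
OPS = [("X", "A"), ("X", "B"), ("Y", "A"), ("Y", "B")]


def apply_in_completion_order(order):
    """Return which write's data each object holds after the ops complete."""
    objects = {}
    for writer, obj in order:
        objects[obj] = writer  # the last completed op on an object wins
    return objects


def all_final_states():
    """Enumerate every (A, B) outcome over all completion orders."""
    states = set()
    for order in permutations(OPS):
        result = apply_in_completion_order(order)
        states.add((result["A"], result["B"]))
    return states


# The order described in the question: Y.B is done before X.B,
# while Y.A is done after X.A.
scenario = [("X", "A"), ("Y", "B"), ("X", "B"), ("Y", "A")]
scenario_result = apply_in_completion_order(scenario)
```

Under that assumption, object A ends up with Y's data and object B with
X's data, so the image holds a mix of the two writes. Issuing Y only
after X has completed removes every such interleaving.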
>
>
> Date: Fri, 10 Mar 2017 19:27:00 +0000
> From: Gregory Farnum <gfarnum@xxxxxxxxxx>
> To: Wei Jin <wjin.cn@xxxxxxxxx>,        "ceph-users@xxxxxxxxxxxxxx"
>         <ceph-users@xxxxxxxxxxxxxx>, 许雪寒 <xuxuehan@xxxxxx>
> Subject: Re:  ??: How does ceph preserve read/write
>         consistency?
> Message-ID:
>         <CAJ4mKGYP1OkAGYCgv=y5CsBmVaKBqh+NGzTPS45pyWawLUtQVA@xxxxxxxxxxxxxx>
> Content-Type: text/plain; charset="utf-8"
>
> On Thu, Mar 9, 2017 at 7:20 PM 许雪寒 <xuxuehan@xxxxxx> wrote:
>
>> Thanks for your reply.
>>
>> As the log shows, in our test a READ that came after a WRITE finished
>> before that WRITE.
>
>
> This is where you've gone astray. Any storage system is perfectly free to
> reorder simultaneous requests -- defined as those whose submit-reply time
> overlaps. So you submitted write W, then submitted read R, then got a
> response to R before W. That's allowed, and preventing it is actually
> impossible in general. In the specific case you've outlined, we *could* try
> to prevent it, but doing so is pretty ludicrously expensive and, since the
> "reorder" can happen anyway, doesn't provide any benefit.
> So we don't try. :)
>
> That said, obviously we *do* provide strict ordering across write
> boundaries: a read submitted after a write completed will always see the
> results of that write.
> -Greg
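
Greg's definition of "simultaneous" and the guarantee that follows it
can be written down as two small predicates. This is only a sketch in
Python: the Request type is hypothetical and the timestamps are
illustrative numbers, not values taken from the logs in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    submitted: float  # when the client submitted the request
    replied: float    # when the client received the reply


def simultaneous(a, b):
    """True if the submit-reply intervals of a and b overlap."""
    return a.submitted < b.replied and b.submitted < a.replied


def must_observe(write, read):
    """True if the read was submitted after the write had completed,
    in which case the read has to see the results of that write."""
    return read.submitted >= write.replied


# Illustrative timeline (made-up numbers, in seconds).
W = Request("W", submitted=0.0, replied=10.0)
R_early = Request("R_early", submitted=5.0, replied=9.0)
R_late = Request("R_late", submitted=11.0, replied=12.0)
```

Here R_early overlaps W, so it may be answered before W and may or may
not see W's data. R_late was submitted after W's reply, so it has to
see it.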
>
>> And I read the source code. It seems that, for writes, in the
>> ReplicatedPG::do_op method, the thread in OSD_op_tp calls the
>> ReplicatedPG::get_rw_lock method, which tries to get RWState::RWWRITE. If it
>> fails, the op is put into the obc->rwstate.waiters queue and requeued
>> when the repop finishes; however, the OSD_op_tp thread doesn't wait for the
>> repop and moves on to the next op. Can this be the cause?
>>
>> -----Original Message-----
>> From: Wei Jin [mailto:wjin.cn@xxxxxxxxx]
>> Sent: March 9, 2017 21:52
>> To: 许雪寒
>> Cc: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re:  How does ceph preserve read/write consistency?
>>
>> On Thu, Mar 9, 2017 at 1:45 PM, 许雪寒 <xuxuehan@xxxxxx> wrote:
>> > Hi, everyone.
>>
>> > As shown above, the WRITE req with tid 1312595 arrived at 18:58:27.439107
>> > and the READ req with tid 6476 arrived at 18:59:55.030936; however, the
>> > latter finished at 19:00:20.333389, while the former finished commit at
>> > 19:00:20.335061 and filestore write at 19:00:25.202321. In these logs,
>> > we found that between the start and finish of each req there were many
>> > "dequeue_op" entries for that req. We read the source code, and it seems
>> > that this is due to "RWState". Is that correct?
>> >
>> > Also, it seems that the OSD doesn't distinguish reqs from different
>> > clients, so is it possible that io reqs from the same client also finish
>> > in a different order than they were created in? Could this affect
>> > read/write consistency? For instance, could a read fail to see the data
>> > that was written by the same client just before it?
>> >
>>
>> IMO, it doesn't make sense for rados to distinguish reqs from different
>> clients.
>> Clients or users should do that themselves.
>>
>> However, for one specific client, ceph can and must guarantee the
>> request order.
>>
>> 1) The ceph messenger (network layer) has in_seq and out_seq when receiving
>> and sending messages.
>>
>> 2) Messages are dispatched or fast dispatched and then queued in
>> ShardedOpWq in order.
>>
>> If requests belong to different pgs, they may be processed concurrently;
>> that's ok.
>>
>> If requests belong to the same pg, they are queued in the same shard
>> and processed in order due to the pg lock (both reads and writes).
>> For consecutive writes, ops are queued in the ObjectStore in order due to
>> the pg lock, and the ObjectStore has OpSequence to guarantee the order when
>> applying ops to the page cache; that's ok.
>>
>> With regard to 'read after write' to the same object, ceph must guarantee
>> that the read gets the correct written content. That's done by
>> ondisk_read/write_lock in ObjectContext.
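
The per-pg ordering argument above can be pictured with a toy sharded
queue. This Python sketch is a simplification and not the OSD's actual
ShardedOpWq: the class, the "pg_id modulo shard count" mapping, and the
op names are all assumptions made for illustration.

```python
from collections import deque


class ToyShardedQueue:
    """Toy model: ops for the same pg always land in the same FIFO shard."""

    def __init__(self, num_shards):
        self.shards = [deque() for _ in range(num_shards)]

    def enqueue(self, pg_id, op):
        # Assumed mapping for this sketch: pg id modulo the shard count.
        self.shards[pg_id % len(self.shards)].append((pg_id, op))

    def drain(self, shard_index):
        # A shard is processed strictly first in, first out.
        shard = self.shards[shard_index]
        processed = []
        while shard:
            processed.append(shard.popleft())
        return processed


def ops_for_pg(processed, pg_id):
    """Filter a shard's processed ops down to a single pg, keeping order."""
    return [op for pg, op in processed if pg == pg_id]


queue = ToyShardedQueue(num_shards=2)
queue.enqueue(4, "write-1")
queue.enqueue(7, "read-1")
queue.enqueue(4, "read-2")
queue.enqueue(6, "write-3")

shard0 = queue.drain(0)  # holds pg 4 and pg 6
shard1 = queue.drain(1)  # holds pg 7
```

In this model, ops for pg 4 come out in the order they were queued
("write-1" then "read-2"), while pg 7 sits in another shard and could
be processed concurrently.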
>>
>>
>> > We are testing the hammer version, 0.94.5. Please help us, thank you :-)
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@xxxxxxxxxxxxxx
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>

-- 
Jason