A userspace application should issue fsync or fdatasync calls where appropriate.

On Wed, May 24, 2017 at 10:15 PM, 许雪寒 <xuxuehan@xxxxxx> wrote:
> Thanks for your reply :-)
>
> I've got your point. By the way, suppose an application opens a file WITHOUT setting O_DIRECT or O_SYNC and then sequentially issues two overlapping glibc write operations to the underlying file system. As far as I understand the Linux file system, those writes might not have been written to the disk when the "write" call returns, so how does the file system ensure that the result of those two writes is as expected? Does it merge the two operations, or synchronously issue the writes to the disk? If the latter, does the file system insert some other operation, like an I/O barrier, between the two writes so that the underlying storage system is aware of the case?
>
> -----Original Message-----
> From: Jason Dillaman [mailto:jdillama@xxxxxxxxxx]
> Sent: May 24, 2017 23:05
> To: 许雪寒
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: How does rbd preserve the consistency of WRITE requests that span across multiple objects?
>
> Just like a regular block device, re-orders are permitted between write barriers/flushes. For example, if I had a HDD with 512 byte sectors and I attempted to write 4K, there is no guarantee what the disk will look like if you had a crash mid-write or if you concurrently issued an overlapping write. The correct way your application should behave (regardless of using RBD or HDDs or SSDs) would be to wait for the first write to complete before issuing the overlapping write.
>
> On Tue, May 23, 2017 at 11:29 PM, 许雪寒 <xuxuehan@xxxxxx> wrote:
>> Hi, thanks for the explanation :-)
>>
>> On the other hand, I wonder if the following scenario could happen:
>>
>> A program in a virtual machine that uses "libaio" to access a file continuously submits "write" requests to the underlying file system, which translates the requests into rbd requests.
>> Say an rbd "aio_write" X wants to write to an area that spans objects A and B. According to my understanding of the rbd source code, librbd would split this write request into two rados Ops, each corresponding to a single object. After these two rados Ops have been sent to the OSD and before they have finished, another rbd "aio_write" request Y, which wants to write to the same area as the previous one, arrives and is sent to the OSD in the same way as X. Due to possible reordering, Y.B could be done before X.B while Y.A is done after X.A, which could lead to an unexpected result.
>>
>> Is this possible?
>>
>>
>> Date: Fri, 10 Mar 2017 19:27:00 +0000
>> From: Gregory Farnum <gfarnum@xxxxxxxxxx>
>> To: Wei Jin <wjin.cn@xxxxxxxxx>, "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>, 许雪寒 <xuxuehan@xxxxxx>
>> Subject: Re: ??: How does ceph preserve read/write consistency?
>> Message-ID: <CAJ4mKGYP1OkAGYCgv=y5CsBmVaKBqh+NGzTPS45pyWawLUtQVA@xxxxxxxxxxxxxx>
>> Content-Type: text/plain; charset="utf-8"
>>
>> On Thu, Mar 9, 2017 at 7:20 PM 许雪寒 <xuxuehan@xxxxxx> wrote:
>>
>>> Thanks for your reply.
>>>
>>> As the log shows, in our test, a READ that came after a WRITE did
>>> finish before that WRITE.
>>
>>
>> This is where you've gone astray. Any storage system is perfectly free
>> to reorder simultaneous requests -- defined as those whose
>> submit-reply time overlaps. So you submitted write W, then submitted
>> read R, then got a response to R before W. That's allowed, and
>> preventing it is actually impossible in general. In the specific case
>> you've outlined, we *could* try to prevent it, but doing so is pretty
>> ludicrously expensive and, since the "reorder" can happen anyway,
>> doesn't provide any benefit.
>> So we don't try. :)
>>
>> That said, obviously we *do* provide strict ordering across write
>> boundaries: a read submitted after a write completed will always see
>> the results of that write.
>> -Greg
>>
>>> And I read the source code. It seems that, for writes, in the
>>> ReplicatedPG::do_op method, the thread in OSD_op_tp calls the
>>> ReplicatedPG::get_rw_lock method, which tries to get RWState::RWWRITE.
>>> If it fails, the op is put into the obc->rwstate.waiters queue and
>>> requeued when the repop finishes; however, the OSD_op_tp thread
>>> doesn't wait for the repop and tries to get the next op. Can this be the cause?
>>>
>>> -----Original Message-----
>>> From: Wei Jin [mailto:wjin.cn@xxxxxxxxx]
>>> Sent: March 9, 2017 21:52
>>> To: 许雪寒
>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>> Subject: Re: How does ceph preserve read/write consistency?
>>>
>>> On Thu, Mar 9, 2017 at 1:45 PM, 许雪寒 <xuxuehan@xxxxxx> wrote:
>>> > Hi, everyone.
>>> >
>>> > As shown above, the WRITE req with tid 1312595 arrived at
>>> > 18:58:27.439107 and the READ req with tid 6476 arrived at
>>> > 18:59:55.030936; however, the latter finished at 19:00:20.333389
>>> > while the former finished commit at 19:00:20.335061 and filestore
>>> > write at 19:00:25.202321. In these logs, we found that between the
>>> > start and finish of each req there were a lot of "dequeue_op"
>>> > entries for that req. We read the source code, and it seems that
>>> > this is due to "RWState". Is that correct?
>>> >
>>> > Also, it seems that the OSD won't distinguish reqs from different
>>> > clients, so is it possible that io reqs from the same client also
>>> > finish in a different order than the one they were created in? Could
>>> > this affect read/write consistency? For instance, could a read fail
>>> > to see data that the same client wrote just before it?
>>> >
>>>
>>> IMO, it doesn't make sense for rados to distinguish reqs from
>>> different clients.
>>> Clients or users should do that themselves.
>>>
>>> However, for one specific client, ceph can and must guarantee the
>>> request order.
>>>
>>> 1) ceph messenger (network layer) has in_seq and out_seq when
>>> receiving and sending messages
>>>
>>> 2) messages will be dispatched or fast dispatched and then be queued
>>> in ShardedOpWq in order.
>>>
>>> If requests belong to different pgs, they may be processed
>>> concurrently, that's ok.
>>>
>>> If requests belong to the same pg, they will be queued in the same
>>> shard and will be processed in order due to the pg lock (both read and write).
>>> For continuous writes, ops will be queued in ObjectStore in order due
>>> to the pg lock, and ObjectStore has OpSequence to guarantee the order when
>>> applying ops to the page cache, that's ok.
>>>
>>> With regard to 'read after write' to the same object, ceph must
>>> guarantee that the read gets the correct write content. That's done by
>>> ondisk_read/write_lock in ObjectContext.
>>>
>>>
>>> > We are testing hammer version, 0.94.5. Please help us, thank
>>> > you :-)
>>> > _______________________________________________
>>> > ceph-users mailing list
>>> > ceph-users@xxxxxxxxxxxxxx
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason

--
Jason
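
For readers following the thread, here is a minimal sketch of the pattern recommended in the replies: issue the first write, wait for it to complete, and only then issue the overlapping write, calling fsync where durability matters. It is written in Python around the standard os.pwrite and os.fsync wrappers. The helper names (pwrite_all, write_then_overwrite, demo) are hypothetical and not part of Ceph, librbd, or glibc, and placing an fsync after each write is an assumption about the application's crash-consistency needs rather than something the thread prescribes.

```python
import os
import tempfile


def pwrite_all(fd, data, offset):
    """Write all of `data` at `offset`, retrying on short writes."""
    view = memoryview(data)
    while view:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written


def write_then_overwrite(path, first, second, second_offset):
    """Apply two overlapping writes in a well-defined order.

    Hypothetical helper: the second write is only issued after the
    first call has returned, and each write is followed by fsync so
    that it is flushed to stable storage before the next step.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        pwrite_all(fd, first, 0)               # first write completes here
        os.fsync(fd)                           # flush it (assumed requirement)
        pwrite_all(fd, second, second_offset)  # overlapping write, issued afterwards
        os.fsync(fd)
    finally:
        os.close(fd)


def demo():
    """Write 8 bytes of 'A', then overwrite bytes 2..5 with 'B'."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        write_then_overwrite(path, b"A" * 8, b"B" * 4, 2)
        with open(path, "rb") as f:
            return f.read()


if __name__ == "__main__":
    print(demo())
```

Because the second write is issued only after the first has returned, the final contents do not depend on any reordering below the application. The fsync calls address durability across a crash, which is a separate concern from ordering.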