Re: write completion strategy (ack/commit questions)

On Wed, 23 Mar 2011, Brian Chrisman wrote:
> Regarding section 5.3 of this paper (section titled 'Data Safety')
> http://www.usenix.org/events/osdi06/tech/full_papers/weil/weil_html/
> 
> I've had some discussions recently about what kind of data consistency
> we can achieve.
> From discussions on this list/IRC, it's clear that right now ceph,
> with a reasonable crushmap, requires an on-disk write to each of two
> nodes (either a journal or fs write is acceptable) before returning a
> write completion.
> The referred section suggests that there may be a way to have a client
> notified of a write-to-cache-completion instead when the written data
> is cached on two nodes (but not yet committed to disk) via the 'ack'.
> 
> If these acks aren't exposed/used at the ceph layer, are they exposed
> via librbd, libceph, or librados?

They are exposed by librados explicitly.

At the fs layer (libceph, kernel client), the ack is used only when 
multiple clients are r/w on the same file and i/o goes synchronous.  In 
that case, a write(2) will return when we get the ack and the write is 
serialized.  (A subsequent fsync(2) on the fd won't return until we get 
the commit.)  If the OSD fails before we get the commit, we will resubmit 
the original write to the new/recovered OSD with the version number we got 
previously so that the writes will be reapplied in order.  This gives you 
less-horrible shared write performance while preserving the POSIX 
write ordering semantics (*).
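
To make the ack/commit split concrete, here is a toy Python model of the behaviour described above. It is only a sketch, not Ceph code: ToyOSD, ToyClient, and all of their methods are hypothetical names made up for illustration, and the replacement OSD is assumed to start empty, which real recovery does not do. The point is just that write() returns on the ack, fsync() waits for the commit, and uncommitted writes are replayed with their original version numbers so they are reapplied in order.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOSD:
    """Hypothetical stand-in for an OSD: 'cache' is lost on a crash, 'disk' is not."""
    cache: dict = field(default_factory=dict)  # version -> data, applied in memory
    disk: dict = field(default_factory=dict)   # version -> data, durable
    next_version: int = 1

    def apply(self, data, version=None):
        # A fresh write gets the next version; a replayed write keeps its old one.
        if version is None:
            version = self.next_version
        self.next_version = max(self.next_version, version + 1)
        self.cache[version] = data
        return version  # returning here models the 'ack'

    def commit(self):
        # Flush the cache to disk; the returned versions model 'commit' messages.
        self.disk.update(self.cache)
        return sorted(self.cache)


class ToyClient:
    """Hypothetical client in the synchronous shared-write mode."""

    def __init__(self, osd):
        self.osd = osd
        self.unsafe = {}  # version -> data: acked but not yet committed

    def write(self, data):
        version = self.osd.apply(data)  # return as soon as we have the ack
        self.unsafe[version] = data     # keep the data until the commit arrives
        return version

    def fsync(self):
        # Do not return until every outstanding write has been committed.
        for version in self.osd.commit():
            self.unsafe.pop(version, None)

    def osd_failed(self, new_osd):
        # Resubmit uncommitted writes, in version order, with the versions
        # we were given in the original acks.
        self.osd = new_osd
        for version in sorted(self.unsafe):
            new_osd.apply(self.unsafe[version], version)


client = ToyClient(ToyOSD())
client.write(b"first")
client.write(b"second")

replacement = ToyOSD()          # the original OSD died before committing
client.osd_failed(replacement)  # replay both writes in their original order
client.fsync()
print(replacement.disk)
```

The client has to hold on to the data for every acked-but-uncommitted write, which is the price of returning early.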

For writeback, the fs always waits for the commit.

Does that answer your question?  If you're interested in relaxing safety 
for _all_ writes, that's something that could be added to the protocol, 
but I wouldn't recommend it.  :)

sage


(*) There is one residual issue here when a write spans an object 
boundary; two racing overlapping writes can get ordered differently on the 
two different objects.  Either locking or a multi-object transaction 
mechanism in rados is needed to resolve this. 
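
A small Python sketch of that residual issue, under made-up assumptions: a 4-byte object size and hypothetical helpers split_by_object() and apply_in_order(), neither of which is a Ceph function. Each object applies the writes it receives in its own arrival order, so two overlapping writes that straddle a boundary can end up interleaved.

```python
OBJECT_SIZE = 4  # hypothetical, tiny object size in bytes for illustration


def split_by_object(offset, data):
    """Split a file-level write into (object_index, offset_in_object, bytes) pieces."""
    pieces = []
    while data:
        index, within = divmod(offset, OBJECT_SIZE)
        chunk = data[:OBJECT_SIZE - within]
        pieces.append((index, within, chunk))
        offset += len(chunk)
        data = data[len(chunk):]
    return pieces


def apply_in_order(ops):
    """Apply per-object ops in the given arrival order; return the file contents."""
    objects = {}
    for index, within, chunk in ops:
        buf = objects.setdefault(index, bytearray(OBJECT_SIZE))
        buf[within:within + len(chunk)] = chunk
    return b"".join(bytes(objects[i]) for i in sorted(objects))


# Two clients write the same 4-byte range, which straddles objects 0 and 1.
a0, a1 = split_by_object(2, b"AAAA")
b0, b1 = split_by_object(2, b"BBBB")

# Object 0 happens to see A then B; object 1 happens to see B then A.
mixed = apply_in_order([a0, b0, b1, a1])
print(mixed[2:6])  # b'BBAA': neither write "won" as a whole
```

Each object is individually consistent here; it is only the file-level view across both objects that ends up with a result no serial ordering of the two writes would produce.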

sage