On Fri, 11 Jun 2010, Peter Niemayer wrote:
> On 06/11/2010 06:40 PM, Sage Weil wrote:
> > The btrfs isn't required for consistency if the writeahead journal is
> > enabled (which it is by default).  However, at the moment the code that
> > controls trimming the journal assumes ext3 data=ordered fsync semantics
> > (fsync flushes the entire journal and all prior writes).  This needs a
> > little bit of work to do the right thing with ext4 and xfs.
> >
> > So: I would stick with btrfs or ext3 for now if you want recovery to work
> > reliably!
>
> The recovery you are referring to, here, is that an operation required...
>
> a) after an outage that involved many/all redundant OSDs
> b) after a physical failure of one underlying storage device
> c) after every disconnect/reconnect of Ceph nodes

After an OSD node crash.

The challenge is keeping the contents of the osd data dir in a fully
consistent state.  The writeahead journal lets us do that, but it needs to
know when previous operations have fully committed to disk so it can trim.
Currently there's a simple fsync() in there to do that, but something
trickier is required for ext4 and xfs.  A per-mount sync(2) type operation
would be ideal.

sage
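
For illustration, the dependency between "applied writes are on disk" and
"the journal may be trimmed" can be sketched as a toy journal in Python.
This is not Ceph's code: the class ToyJournal, its methods, and the
one-file-per-operation layout are all hypothetical. The sketch uses
os.sync(), which flushes every mounted filesystem, as a heavy-handed
stand-in for the per-mount sync operation wished for above.

```python
import json
import os


class ToyJournal:
    """Toy write-ahead journal (hypothetical; not Ceph's implementation)."""

    def __init__(self, data_dir, journal_path):
        self.data_dir = data_dir
        self.journal_path = journal_path
        self.pending = []  # journaled ops not yet known to be durable in data_dir
        os.makedirs(data_dir, exist_ok=True)

    def submit(self, name, payload):
        # Step 1: record the operation in the journal and make the record durable.
        entry = {"name": name, "payload": payload}
        with open(self.journal_path, "a") as j:
            j.write(json.dumps(entry) + "\n")
            j.flush()
            os.fsync(j.fileno())
        self.pending.append(entry)
        # Step 2: apply it to the data dir without waiting for it to reach disk.
        with open(os.path.join(self.data_dir, name), "w") as f:
            f.write(payload)

    def commit_and_trim(self):
        # The journal may only be trimmed once every applied write is on disk.
        # An fsync() of a single file is not assumed to cover the other files,
        # so flush everything (assumption: stand-in for a per-mount sync).
        os.sync()
        trimmed = len(self.pending)
        self.pending.clear()
        with open(self.journal_path, "w") as j:  # truncate the journal
            j.flush()
            os.fsync(j.fileno())
        return trimmed

    def replay(self):
        # After a crash: reapply every operation still present in the journal.
        if not os.path.exists(self.journal_path):
            return 0
        count = 0
        with open(self.journal_path) as j:
            for line in j:
                if not line.strip():
                    continue
                entry = json.loads(line)
                with open(os.path.join(self.data_dir, entry["name"]), "w") as f:
                    f.write(entry["payload"])
                count += 1
        return count
```

If the journal were trimmed before the sync completed, a crash could leave
the data dir missing an operation that replay() can no longer restore,
which is why the trim has to wait for the commit point.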