Re: SSD Journal

> On 28 Jan 2016, at 23:19, Lionel Bouton <lionel-subscription@xxxxxxxxxxx> wrote:
> 
> On 28/01/2016 22:32, Jan Schermer wrote:
>> P.S. I feel very strongly that this whole concept is fundamentally broken. We already have a journal for the filesystem which is time-proven, well behaved and above all fast. Instead there's this reinvented wheel which supposedly does it better in userspace while not really avoiding the filesystem journal either. It would maybe make sense if the OSD were storing the data on a block device directly, avoiding the filesystem altogether. But it would still do the same bloody thing, and (no disrespect) ext4 does this better than Ceph ever will.
>> 
> 
> Hmm, I've seen this discussed previously, but I'm not sure the fs journal could be used as a Ceph journal.
> 
> First, BTRFS doesn't have a journal per se, so you would not be able to use an xfs or ext4 journal on another device with a data=journal setup to make write bursts/random writes fast. And I won't go back to XFS or test ext4... I've detected too much silent corruption by hardware with BTRFS to trust our data to any filesystem not using CRC on reads (and in our particular case the compression and speed are additional bonuses).

ZFS takes care of all those concerns... Most people are quite happy with ext2/3/4, oblivious to the fact that they are losing bits here and there... and the world still spins the same :-)
I personally believe the task of not corrupting data doesn't belong in the filesystem layer but should rather be handled by the RAID array, mdraid, RBD... ZFS does it because it handles those tasks too.
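
(For illustration only - a toy Python sketch of what "CRC on reads" buys you at whatever layer ends up implementing it. The names are made up and this has nothing to do with the real BTRFS/ZFS on-disk formats.)

    import zlib

    # Toy block store: keep a CRC32 next to each block at write time and
    # verify it on every read, so a silent bit flip by the hardware comes
    # back as an error instead of being returned as valid data.
    blocks = {}  # block_id -> (crc, data)

    def write_block(block_id, data):
        blocks[block_id] = (zlib.crc32(data), bytes(data))

    def read_block(block_id):
        crc, data = blocks[block_id]
        if zlib.crc32(data) != crc:
            raise IOError("silent corruption detected on block %r" % block_id)
        return data

    write_block(1, b"some payload")
    assert read_block(1) == b"some payload"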

> 
> Second, I'm not familiar with Ceph internals, but OSDs must make sure that their PGs are synced, so I was under the impression that the OSD content for a PG on the filesystem should always be guaranteed to be on all the other active OSDs *or* in their journals (so you wouldn't apply journal content unless the other journals have already committed the same content). If you remove the journals there's no intermediate on-disk "buffer" that can be used to guarantee such a thing: one OSD will always have data that isn't guaranteed to be on disk on the others. As I understand it, you could say that this is some form of 2-phase commit.
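
If I read you right, the scheme is roughly this (a made-up Python sketch, nowhere near actual Ceph code, just the shape of the idea): the write goes to every replica's journal first, and is applied to the filestore only once all journals hold it.

    class JournaledOSD:
        def __init__(self, name):
            self.name = name
            self.journal = []    # phase 1: durable intent log
            self.filestore = {}  # phase 2: applied, final location

        def commit_to_journal(self, oid, data):
            self.journal.append((oid, data))
            return True          # "I hold this durably and can replay it"

        def apply(self, oid, data):
            self.filestore[oid] = data

    def replicated_write(osds, oid, data):
        # Phase 1: every replica commits the write to its journal.
        if not all(osd.commit_to_journal(oid, data) for osd in osds):
            return False
        # Phase 2: safe to apply everywhere, since any peer can replay
        # the same content from its own journal after a crash.
        for osd in osds:
            osd.apply(oid, data)
        return True  # acknowledge to the client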

You can simply commit the data (to the filestore), and it would in fact be faster.
The client gets the write acknowledged only when all the OSDs have the data - that doesn't change in this scenario. If one OSD gets ahead of the others and commits something the other OSDs don't before the whole cluster goes down, it doesn't hurt anything - the write was never acknowledged, so it's the client that has to replay it if it cares, _NOT_ the OSDs.
The problem still exists, it just gets shifted elsewhere. But the client (the guest filesystem) already handles this.
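
To make that concrete, another made-up Python sketch (again, not Ceph code): replicas commit straight to the filestore, the ack goes out only when all of them have the data, and anything unacknowledged stays on the client's replay list - so one replica being "ahead" at crash time is harmless.

    class Replica:
        def __init__(self):
            self.filestore = {}

        def commit(self, oid, data):
            self.filestore[oid] = data   # straight to the filestore, no journal
            return True

    class Client:
        def __init__(self, replicas):
            self.replicas = replicas
            self.unacked = {}            # writes to replay if no ack arrives

        def write(self, oid, data):
            self.unacked[oid] = data
            if all(r.commit(oid, data) for r in self.replicas):
                del self.unacked[oid]    # acked: durable on every replica
                return True
            return False                 # not acked: the client will replay it

        def replay(self):
            # After a crash, resend everything that was never acknowledged.
            for oid, data in list(self.unacked.items()):
                self.write(oid, data)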

> 
> I may be mistaken: there are structures in the filestore that *may* take on this role, but I'm not sure what their exact use is: the <pg_num>_TEMP dirs, the omap and meta dirs. My guess is that they serve other purposes: it would make sense to use the journals for this because the data is already there, and the commit/apply coherency barriers seem both trivial and efficient to use.
> 
> That's not to say that the journals are the only way to maintain the needed coherency, just that they might be used to do so because, once they are there, this is a trivial extension of their use.
> 

In the context of the cloud, more and more people realize that clinging to things like "durability" and "consistency" is out of fashion. I think the future will take a different turn... I can't say I agree with that, though - I'm usually the one fixing those screw-ups afterwards.


> Lionel

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



