Re: SSD Journal

On 28/01/2016 22:32, Jan Schermer wrote:
P.S. I feel very strongly that this whole concept is fundamentally broken. We already have a journal for the filesystem which is time-proven, well behaved and above all fast. Instead there's this reinvented wheel which supposedly does it better in userspace while not really avoiding the filesystem journal either. It would maybe make sense if the OSD were storing the data on a block device directly, avoiding the filesystem altogether. But it would still do the same bloody thing and (no disrespect) ext4 does this better than Ceph ever will.


Hmm, I've seen this discussed before, but I'm not sure the filesystem journal could be used as a Ceph journal.

First, BTRFS doesn't have a journal per se, so unlike with XFS or ext4 you could not put the filesystem journal on another device in a data=journal style setup to make write bursts and random writes fast. And I won't go back to XFS or test ext4: I've detected too much silent hardware corruption with BTRFS to trust our data to any filesystem that doesn't verify checksums on reads (and in our particular case the compression and speed are additional bonuses).
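To illustrate what "CRC on reads" buys you, here is a minimal sketch, not how BTRFS is implemented on disk. The ChecksummedStore class and its corrupt() helper are hypothetical names made up for this example, and zlib.crc32 is only a stand-in for whatever checksum the filesystem really uses.

```python
import zlib


class ChecksummedStore:
    """Toy in-memory block store that keeps a CRC32 next to each block.

    Hypothetical illustration only: it mimics the idea of verifying a
    checksum on every read, not any real filesystem layout.
    """

    def __init__(self):
        self._blocks = {}  # block id -> (bytearray data, stored crc)

    def write(self, block_id, data):
        # Store the checksum computed at write time alongside the data.
        self._blocks[block_id] = (bytearray(data), zlib.crc32(data))

    def read(self, block_id):
        data, stored_crc = self._blocks[block_id]
        # Recompute on read: a mismatch means the data changed behind our back.
        if zlib.crc32(bytes(data)) != stored_crc:
            raise IOError("checksum mismatch on block %r" % (block_id,))
        return bytes(data)

    def corrupt(self, block_id, offset):
        # Simulate silent hardware corruption: flip one bit, leave the CRC alone.
        self._blocks[block_id][0][offset] ^= 0x01


store = ChecksummedStore()
store.write("blk0", b"important data")
store.corrupt("blk0", 3)
try:
    store.read("blk0")
    detected = False
except IOError:
    detected = True
```

Without the check in read(), the flipped bit would be handed back to the application as if it were valid data.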

Second, I'm not familiar with Ceph internals, but OSDs must make sure that their PGs are in sync, so I was under the impression that the OSD content for a PG on the filesystem should always be guaranteed to be on all the other active OSDs *or* in their journals (so you wouldn't apply journal content unless the other journals have already committed the same content). If you remove the journals, there's no intermediate on-disk "buffer" that can be used to guarantee such a thing: one OSD will always have data that isn't guaranteed to be on disk on the others. As I understand it, you could call this some form of 2-phase commit.
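To make that impression concrete, here is a toy sketch of the ordering I have in mind. It models my understanding only, not actual Ceph code: ToyOSD, replicated_write, invariant_holds and the fail_commit flag are all hypothetical names invented for the illustration.

```python
class ToyOSD:
    """Hypothetical OSD with a journal and a backing 'filesystem' store."""

    def __init__(self, name):
        self.name = name
        self.journal = []         # committed entries not yet applied
        self.store = {}           # content applied to the filesystem
        self.fail_commit = False  # set to True to simulate a failed commit

    def journal_commit(self, txid, key, value):
        if self.fail_commit:
            return False
        self.journal.append((txid, key, value))
        return True  # acknowledge that the entry is in the journal

    def apply(self, txid):
        # Move matching journal entries into the filesystem store.
        for entry in list(self.journal):
            if entry[0] == txid:
                self.store[entry[1]] = entry[2]
                self.journal.remove(entry)


def replicated_write(osds, txid, key, value):
    # Phase 1: every OSD must commit the entry to its journal.
    acks = [osd.journal_commit(txid, key, value) for osd in osds]
    if not all(acks):
        return False  # nobody applies; filesystem content is unchanged
    # Phase 2: only now is the entry applied to each filesystem.
    for osd in osds:
        osd.apply(txid)
    return True


def invariant_holds(osds):
    # Whatever one OSD has applied must be in every other OSD's
    # store or journal.
    for osd in osds:
        for key, value in osd.store.items():
            for other in osds:
                in_store = other.store.get(key) == value
                in_journal = any(e[1] == key and e[2] == value
                                 for e in other.journal)
                if not (in_store or in_journal):
                    return False
    return True
```

If one OSD fails to commit, replicated_write returns False and no filesystem content changes, which is the guarantee I assume the journals provide.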

I may be mistaken: there are structures in the filestore that *may* take on this role, but I'm not sure what their exact use is: the <pg_num>_TEMP dirs and the omap and meta dirs. My guess is that they serve other purposes: it would make sense to use the journals for this because the data is already there, and the commit/apply coherency barriers seem both trivial and efficient to use.

That's not to say that the journals are the only way to maintain the needed coherency, just that they might be used to do so because, once they are there, this is a trivial extension of their use.

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
