ceph-osd builds a transactional interface on top of the usual POSIX
operations so that we can do things like atomically perform an object
write and update the OSD metadata. The current implementation requires
our own journal and some metadata ordering (which is provided by the
backing filesystem's own journal) to implement our own atomic operations.
It's true that in some cases you might be able to get away with having
the client replay the operation (which we do anyway for other reasons),
but that wouldn't be enough to ensure consistency of the filesystem's own
internal structures. It also wouldn't be enough to ensure that the OSD's
internal structures remain consistent in the case of a crash. Also, if
the client is unavailable to do the replay, you'd have a problem. In
summary, it's actually really hard to detect partial/corrupted writes
after a crash without journaling of some form.
-Sam

On Wed, Aug 21, 2013 at 6:03 PM, Mike Lowe <j.michael.lowe@xxxxxxxxx> wrote:
> Let me make a simpler case: to do ACID (https://en.wikipedia.org/wiki/ACID),
> which are all properties you want in a filesystem or a database, you need a
> journal. You need a journaled filesystem to make the object store's file
> operations safe. You need a journal in ceph to make sure the object
> operations are safe. Flipped bits are a separate problem that may be aided
> by journaling, but the primary objective of a journal is to make guarantees
> about concurrent operations and interrupted operations. There isn't a
> person on this list who hasn't had an OSD die; without a journal, starting
> that OSD up again and getting it usable would be impractical.
>
> On Aug 21, 2013, at 8:00 PM, Johannes Klarenbeek
> <Johannes.Klarenbeek@xxxxxxx> wrote:
>
> I think you are missing the distinction between metadata journaling and data
> journaling. In most cases a journaling filesystem is one that journals
> its own metadata, but your data is on its own.
> Consider the case where you have a replication level of two, the OSD
> filesystems have journaling disabled, and you append a block to a file
> (which is an object in terms of ceph) but only one commits the change in
> file size to disk. Later you scrub and discover a discrepancy in object
> sizes; with a replication level of 2 there is no way to authoritatively
> say which one is correct just based on what's in ceph. This is a similar
> scenario to a btrfs bug that caused me to lose data with ceph. Journaling
> your metadata is the absolute minimum level of assurance you need to make
> a transactional system like ceph work.
>
> Hey Mike :)
>
> I get your point. However, isn’t it then possible to authoritatively say
> which one is the correct one in the case of 3 OSDs?
> Or is the replication level a configuration setting that tells the cluster
> that the object needs to be replicated 3 times?
> In both cases, data scrubbing chooses the majority of the identical
> replicated objects in order to know which one is authoritative.
>
> But I also believe (!) that each object has a checksum, and each PG too, so
> that it should be easy to find the corrupted object on any of the OSDs.
> How else would scrubbing find corrupted sectors? Especially when I think
> about 2TB SATA disks being hit by cosmic rays that flip a bit somewhere.
> It happens more often with big cheap TB disks, but that doesn’t mean the
> corrupted sector is a bad sector (as in not usable anymore). Journaling is
> not going to help anyone with this.
> Therefore I believe (again) that the data scrubber must have a mechanism to
> detect these types of corruption even in a 2 OSD setup by means of
> checksums (or better, with a hashed checksum id).
>
> Also, aren’t there 2 types of transactions: one for writing and one for
> replicating?
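
To make the replica-count question concrete, here is a minimal Python
sketch of the two ideas raised above: picking the copy held by a strict
majority, and checking copies against a digest recorded at write time.
It is an illustration only and not how Ceph's scrub is implemented; the
function names and the use of SHA-256 are assumptions made for the
example.

```python
import hashlib
from collections import Counter
from typing import List, Optional


def digest(data: bytes) -> str:
    """Content fingerprint of one replica (SHA-256 chosen only for illustration)."""
    return hashlib.sha256(data).hexdigest()


def pick_by_majority(replicas: List[bytes]) -> Optional[bytes]:
    """Return the content held by a strict majority of replicas, or None."""
    counts = Counter(digest(r) for r in replicas)
    winner, votes = counts.most_common(1)[0]
    if votes * 2 <= len(replicas):
        return None  # tie or no majority: the replicas alone cannot decide
    return next(r for r in replicas if digest(r) == winner)


def pick_by_recorded_digest(replicas: List[bytes], recorded: str) -> Optional[bytes]:
    """Return the first replica matching a digest recorded at write time, if any."""
    return next((r for r in replicas if digest(r) == recorded), None)


old = b"block-1"
new = b"block-1" + b"block-2"  # the append that only one OSD committed

two_way = pick_by_majority([old, new])          # 1 vs 1: undecidable
three_way = pick_by_majority([old, new, new])   # 2 vs 1: majority exists
with_digest = pick_by_recorded_digest([old, new], digest(new))
```

With two copies that disagree, majority voting returns nothing, which is
the situation Mike describes. A recorded digest can break the tie, but
that digest is itself metadata that has to be updated atomically with the
data, which is where journaling comes back in.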
>
> On Aug 21, 2013, at 4:23 PM, Johannes Klarenbeek
> <Johannes.Klarenbeek@xxxxxxx> wrote:
>
> Dear ceph-users,
>
> I read a lot of documentation today about ceph architecture and Linux file
> system benchmarks in particular, and I could not help noticing something
> that I would like to clear up for myself. Take into account that it has been
> a while since I actually touched Linux, but I did some programming on php2b12
> and apache back in the days, so I’m not a complete newbie. The real question
> is below if you do not like reading the rest ;)
>
> What I have come to understand about file systems for OSDs is that in
> theory btrfs is the file system of choice. However, due to its young age
> it’s not considered stable yet. Therefore EXT4 but preferably XFS is used in
> most cases. It seems that most people choose these file systems because of
> their journaling feature, and XFS for its additional attribute storage, which
> has a 64kb limit which should be sufficient for most operations.
>
> But when you look at file system benchmarks btrfs is really, really slow.
> Then comes XFS, then EXT4, but EXT2 really dwarfs all other throughput
> results. On journaling systems (like XFS, EXT4 and btrfs) disabling
> journaling actually helps throughput as well. Sometimes more than 2 times
> for write actions.
>
> The preferred configuration for OSDs is one OSD per disk. Each object is
> striped among all Object Storage Daemons in a cluster. So if I would take
> one disk for the cluster and check its data, chances are slim that I will
> find a complete object there (a non-striped, full object I mean).
>
> When a client issues an object write (I assume a full object/file write in
> this case) it is the client’s responsibility to stripe it among the object
> storage daemons. When a stripe is successfully stored by the daemon an ACK
> signal is sent to (?) the client and all participating OSDs.
> When all participating OSDs for the object have completed, the client
> assumes all is well and returns control to the application.
>
> If I’m not mistaken, then journaling is meant for the rare occasions that a
> hardware failure will occur and the data is corrupted. Ceph does this too in
> another way of course. But ceph should be able to notice when a block/stripe
> is correct or not. In the rare occasion that a node is failing while doing a
> write, an ACK signal is not sent to the caller and therefore the client can
> resend the block/stripe to another OSD. Therefore I fail to see the purpose
> of this extra journaling feature.
>
> Also ceph schedules a data scrubbing process every day (or however it is
> configured) that should be able to tackle bad sectors or other errors on the
> file system and accordingly repair them on the same daemon or flag the whole
> block as bad. Since everything is replicated the block is still in the
> storage cluster so no harm is done.
>
> In a normal/single file system I truly see the value of journaling and the
> potential for btrfs (although it’s still very slow). However in a system
> like ceph, journaling seems to me more like a paranoid super fail-safe.
>
> Did anyone experiment with file systems with journaling disabled, and how
> did they perform?
>
> Regards,
> Johannes
>
> __________ Information from ESET Endpoint Antivirus, virus signature
> database version 8713 (20130821) __________
>
> This message was checked by ESET Endpoint Antivirus.
>
> http://www.eset.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
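
Sam's point at the top of the thread, that a journal is what lets an
object write and its metadata update happen atomically, can be sketched
in a few lines of Python. This is a toy under stated assumptions and not
ceph-osd's implementation: the ToyStore class, its file names and the
crash_after_data switch are all hypothetical, and a real journal would
also carry lengths and checksums to detect a torn journal entry.

```python
import json
import os
import tempfile


class ToyStore:
    """Hypothetical store: object data and its metadata live in separate files."""

    def __init__(self, root):
        self.root = root
        self.journal = os.path.join(root, "journal")

    def _write(self, name, text):
        with open(os.path.join(self.root, name), "w") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to stable storage

    def read(self, name):
        try:
            with open(os.path.join(self.root, name)) as f:
                return f.read()
        except FileNotFoundError:
            return None

    def transact(self, data, meta, crash_after_data=False):
        # 1. Make the whole transaction durable in the journal first.
        self._write("journal", json.dumps({"data": data, "meta": meta}))
        # 2. Apply it piece by piece; a crash here leaves a partial state.
        self._write("object", data)
        if crash_after_data:
            raise RuntimeError("simulated crash")
        self._write("meta", meta)
        # 3. Both pieces are applied, so the journal entry can go.
        os.remove(self.journal)

    def recover(self):
        """On startup, replay a leftover journal entry (replay is idempotent)."""
        entry = self.read("journal")
        if entry is None:
            return
        try:
            txn = json.loads(entry)
        except ValueError:
            # Unreadable entry: the crash hit while journaling, before
            # anything was applied, so the transaction is simply dropped.
            os.remove(self.journal)
            return
        self._write("object", txn["data"])
        self._write("meta", txn["meta"])
        os.remove(self.journal)


root = tempfile.mkdtemp()
store = ToyStore(root)
store.transact("v1", "size=2")
try:
    store.transact("v2-longer", "size=9", crash_after_data=True)
except RuntimeError:
    pass
torn = (store.read("object"), store.read("meta"))   # new data, stale metadata
ToyStore(root).recover()                            # "restart" and replay
after = (store.read("object"), store.read("meta"))
```

Between the crash and the replay, the object and its metadata disagree,
and nothing in those two files says which one is stale. The journal entry
is the only record of what the complete transaction was, so a restart can
finish it without waiting for a client to resend anything.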