Re: Storage, File Systems and Data Scrubbing

Johannes Klarenbeek <Johannes.Klarenbeek@xxxxxxx> · Thu, 22 Aug 2013 00:00:31 +0000

I think you are missing the distinction between metadata journaling and data journaling. In most cases a journaling filesystem is one that journals its own metadata, but your data is on its own. Consider the case where you have a replication level of two, the OSD filesystems have journaling disabled, and you append a block to a file (which is an object in terms of Ceph), but only one OSD commits the change in file size to disk. Later you scrub and discover a discrepancy in object sizes. With a replication level of 2 there is no way to authoritatively say which one is correct just based on what's in Ceph. This is a similar scenario to a btrfs bug that caused me to lose data with Ceph. Journaling your metadata is the absolute minimum level of assurance you need to make a transactional system like Ceph work.

Hey Mike :)

I get your point. However, isn’t it then possible to authoritatively say which one is the correct one in the case of 3 OSDs?
Or is the replication level a configuration setting that tells the cluster that the object needs to be replicated 3 times?

In both cases, data scrubbing chooses the majority among the identical replicas of an object in order to know which one is authoritative.
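To make the majority idea concrete, here is a minimal Python sketch. It is not how Ceph's scrubber is implemented; the function name `pick_authoritative` and the OSD names are made up for illustration, and it only compares the object size each replica reports.

```python
from collections import Counter
from typing import Dict, Optional


def pick_authoritative(replica_sizes: Dict[str, int]) -> Optional[int]:
    """Return the size reported by a strict majority of replicas, or None."""
    if not replica_sizes:
        return None
    counts = Counter(replica_sizes.values())
    size, votes = counts.most_common(1)[0]
    # A strict majority is required; a 1-1 split between two replicas has none.
    if votes * 2 > len(replica_sizes):
        return size
    return None


# Two replicas that disagree: nothing tells us which one is right.
print(pick_authoritative({"osd.0": 4096, "osd.1": 8192}))  # None
# Three replicas, one of which missed the append: the majority wins.
print(pick_authoritative({"osd.0": 8192, "osd.1": 4096, "osd.2": 8192}))  # 8192
```

With two replicas a disagreement is always a tie, which is the situation Mike describes. A third replica only helps as long as no more than one copy is wrong.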

But I also believe (!) that each object has a checksum, and each PG too, so that it should be easy to find the corrupted object on any of the OSDs.
How else would scrubbing find corrupted sectors? Especially when I think about 2TB SATA disks being hit by cosmic rays that flip a bit somewhere.

It happens more often with big, cheap terabyte disks, but that doesn’t mean the corrupted sector is a bad sector (as in, not usable anymore). Journaling is not going to help anyone with this.
Therefore I believe (again) that the data scrubber must have a mechanism to detect these types of corruption even in a 2-OSD setup by means of checksums (or better, with a hashed checksum id).
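The checksum idea can be sketched as follows, assuming a digest is recorded when the object is written and is itself stored reliably. This is an illustration only: the helper names `digest` and `scrub` are hypothetical, SHA-256 is an arbitrary choice, and nothing here reflects Ceph's actual on-disk format.

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> str:
    """Content hash used as the object's checksum in this sketch."""
    return hashlib.sha256(data).hexdigest()


def scrub(replicas: Dict[str, bytes], expected: str) -> List[str]:
    """Return the names of replicas whose content no longer matches `expected`."""
    return sorted(name for name, data in replicas.items() if digest(data) != expected)


original = b"object payload"
expected = digest(original)          # recorded when the object was written

flipped = bytearray(original)
flipped[0] ^= 0x01                   # simulate a single flipped bit on disk
replicas = {"osd.0": original, "osd.1": bytes(flipped)}

print(scrub(replicas, expected))     # ['osd.1']
```

With a trusted checksum, two replicas are enough to tell which copy is damaged. The sketch does depend on the recorded checksum being up to date, so it does not cover the case where the metadata update itself never reached the disk.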

Also, aren’t there 2 types of transactions: one for writing and one for replicating?

On Aug 21, 2013, at 4:23 PM, Johannes Klarenbeek <Johannes.Klarenbeek@xxxxxxx> wrote:

Dear ceph-users,

I read a lot of documentation today about Ceph architecture and Linux file system benchmarks in particular, and I could not help noticing something that I would like to clear up for myself. Take into account that it has been a while since I actually touched Linux, but I did some programming on php2b12 and Apache back in the day, so I’m not a complete newbie. The real question is below if you do not like reading the rest ;)

What I have come to understand about file systems for OSDs is that in theory btrfs is the file system of choice. However, due to its young age it’s not considered stable yet. Therefore EXT4, but preferably XFS, is used in most cases. It seems that most people choose these systems because of their journaling feature, and XFS for its extended attribute storage, which has a 64 KB limit that should be sufficient for most operations.

But when you look at file system benchmarks, btrfs is really, really slow. Then comes XFS, then EXT4, but EXT2 really dwarfs all other throughput results. On journaling systems (like XFS, EXT4 and btrfs), disabling journaling actually helps throughput as well, sometimes by more than a factor of 2 for writes.

The preferred configuration for OSDs is one OSD per disk. Each object is striped among all Object Storage Daemons in a cluster. So if I were to take one disk from the cluster and check its data, chances are slim that I would find a complete object there (a non-striped, full object I mean).

When a client issues an object write (I assume a full object/file write in this case), it is the client’s responsibility to stripe it among the object storage daemons. When a stripe is successfully stored by the daemon, an ACK signal is sent to (?) the client and all participating OSDs. When all participating OSDs for the object have completed, the client assumes all is well and returns control to the application.

If I’m not mistaken, journaling is meant for the rare occasions when a hardware failure occurs and the data is corrupted. Ceph does this too, in another way of course. But Ceph should be able to notice whether a block/stripe is correct or not. In the rare event that a node fails while doing a write, no ACK signal is sent to the caller, and therefore the client can resend the block/stripe to another OSD. Therefore I fail to see the purpose of this extra journaling feature.
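The ACK-and-resend behaviour described above can be sketched roughly like this. It models only the understanding stated in this message, not Ceph's real write path; `FakeOsd` and `write_with_retry` are hypothetical names and not part of any Ceph API.

```python
from typing import List


class FakeOsd:
    """Hypothetical stand-in for a storage daemon; not a Ceph API."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.stored: List[bytes] = []

    def write(self, block: bytes) -> bool:
        """Store the block and return True (the ACK), or return False on failure."""
        if not self.healthy:
            return False
        self.stored.append(block)
        return True


def write_with_retry(block: bytes, candidates: List[FakeOsd]) -> str:
    """Try each candidate in order; return the name of the first that ACKs."""
    for osd in candidates:
        if osd.write(block):
            return osd.name
    raise IOError("no OSD acknowledged the write")


osds = [FakeOsd("osd.0", healthy=False), FakeOsd("osd.1")]
print(write_with_retry(b"stripe-0", osds))   # osd.1
```

The sketch only handles a daemon that fails before it acknowledges. It says nothing about a write that was acknowledged but had not fully reached the disk when the node went down.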

Also, Ceph schedules a data scrubbing process every day (or however it is configured) that should be able to tackle bad sectors or other errors on the file system and accordingly repair them on the same daemon or flag the whole block as bad. Since everything is replicated, the block is still in the storage cluster, so no harm is done.

In a normal/single file system I truly see the value of journaling and the potential of btrfs (although it’s still very slow). However, in a system like Ceph, journaling seems to me more like a paranoid super fail-safe.

Has anyone experimented with file systems that have journaling disabled, and how did they perform?

Regards,

Johannes

__________ Information from ESET Endpoint Antivirus, virus signature database version 8713 (20130821) __________

The message was checked by ESET Endpoint Antivirus.

http://www.eset.com

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
