Re: write cache disabling recommendations for journal and storage disks ?

Sage Weil <sage@xxxxxxxxxxx> · Tue, 22 May 2012 08:17:56 -0700 (PDT)

On Tue, 22 May 2012, Alexandre DERUMIER wrote:
> Thanks Sage,
> 
> Yes, newer kernels haven't needed the barrier option since 2.6.37, if I 
> remember correctly (support for REQ_FLUSH/FUA).
> 
> 
> Just to be sure:
> 
> If the client does an fsync or fdatasync, does the write go only to the 
> journal, and then get flushed to disk after 30 seconds?
> 
> Or does it force the write to be committed to disk?

Oh, are you talking about the *ceph* client doing an fsync?  In that case, 
it waits for a COMMIT from the osd, which happens when all replicas have 
written to the journal (or fs, whichever is durable first).
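
As a rough illustration of that rule (a toy model only, not Ceph code; the
`PendingWrite` class and the replica names are made up for this sketch), the
commit goes back to the client only once every replica has reported the
write as durable:

```python
from dataclasses import dataclass, field


@dataclass
class PendingWrite:
    """Hypothetical tracker for one client write awaiting a commit."""
    replicas: frozenset                      # replicas that must persist the write
    durable: set = field(default_factory=set)

    def mark_durable(self, replica: str) -> bool:
        """Record that `replica` has the write on stable storage
        (journal or fs, whichever came first). Returns True once all
        replicas have reported, i.e. when the commit may be sent."""
        if replica not in self.replicas:
            raise ValueError(f"unknown replica: {replica}")
        self.durable.add(replica)
        return self.durable == set(self.replicas)


write = PendingWrite(frozenset({"osd.0", "osd.1"}))
print(write.mark_durable("osd.0"))  # one replica still outstanding
print(write.mark_durable("osd.1"))  # all replicas durable, commit can go out
```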

The OSD itself is calling fdatasync() on the journal file.
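
For reference, the general write-then-fdatasync pattern looks roughly like
the sketch below. This is only an illustration in Python on Linux, not the
OSD's actual journal code; `append_durably` and the file layout are
assumptions made for the example. The point is that the caller gets control
back (and could acknowledge the write) only after fdatasync() has returned.

```python
import os


def append_durably(path: str, record: bytes) -> None:
    """Append `record` to `path` and return only after fdatasync() succeeds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(record)
        while view:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the file data to stable storage. On kernels
        # with working flush support this also flushes the disk write cache.
        os.fdatasync(fd)
    finally:
        os.close(fd)
```

Note that a newly created file also needs its directory entry made durable
(an fsync on the parent directory), which this sketch leaves out.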

> Does the journal fake the fsync? (With ZFS, this is the nocacheflush=1 
> system variable.)

I'm not sure what you mean by "faking" the fsync...

sage

> 
> 
> ----- Original message ----- 
> 
> From: "Sage Weil" <sage@xxxxxxxxxxx> 
> To: "Alexandre DERUMIER" <aderumier@xxxxxxxxx> 
> Cc: ceph-devel@xxxxxxxxxxxxxxx 
> Sent: Tuesday, 22 May 2012 16:41:50 
> Subject: Re: write cache disabling recommendations for journal and storage disks ? 
> 
> On Tue, 22 May 2012, Alexandre DERUMIER wrote: 
> > Hi, I have some questions about disabling write cache 
> > " 
> > http://ceph.com/docs/master/config-cluster/file-system-recommendations/ 
> > 
> > Ceph aims for data safety, which means that when the application receives notice that data was written to the disk, that data was actually written to the disk. For old kernels (<2.6.33), disable the write cache if the journal is on a raw disk. Newer kernels should work fine. 
> > 
> > Use hdparm to disable write caching on the hard disk: 
> > 
> > hdparm -W 0 /dev/hda 
> > " 
> 
> To clarify: on newer kernels, calling fsync() or fdatasync() flushes the 
> disk's write cache, so this isn't something you need to worry about at 
> all. 
> 
> > Cache on journal disk: 
> > 
> > What happens on a power failure if data is still in the cache of the 
> > journal disk (SSD with or without a supercapacitor), so the write has 
> > been acked but not really written to disk? 
> 
> The ack is only sent if the client requests it, and normally the client 
> does not. Which means the client didn't get the ack, and will resend the 
> request to the other replicas once the failed OSD is marked down. 
> 
> > Cache on storage disks: 
> > What happens on a power failure if the write is committed to the journal 
> > but is still in the cache of the storage disks and not yet on the platters? 
> 
> On newer kernels, the file system is careful to flush the disk cache any 
> time durability matters (e.g., during a journal commit); there is no need 
> to disable it on that disk. If the write is durable in the journal, it 
> will be applied to the fs on ceph-osd restart. 
> 
> > Maybe the best way is to disable the write cache on both (journal and 
> > storage disks)? 
> 
> If you have an old kernel, disable it on the journal, and (if you're using 
> ext3) mount with -o barrier=1. On newer kernels, I believe barrier=1 is 
> (finally) the default. 
> 
> sage 
> 