On Fri, 22 Jun 2012, Alexandre DERUMIER wrote:
> Hi Sage,
> thanks for your response.
>
> >>If you turn off the journal completely, you will see bursty write commits
> >>from the perspective of the client, because the OSD is periodically doing
> >>a sync or snapshot and only acking the writes then.
> >>If you enable the journal, the OSD will reply with a commit as soon as the
> >>write is stable in the journal. That's one reason why it is there--file
> >>system commits are heavyweight and slow.
>
> Yes of course, I don't want to deactivate the journal; using a journal on a
> fast SSD or NVRAM is the right way.
>
> >>If we left the file system to its own devices and did a sync every 10
> >>seconds, the disk would sit idle while a bunch of dirty data accumulated
> >>in cache, and then the sync/snapshot would take a really long time. This
> >>is horribly inefficient (the disk is idle half the time), and useless (the
> >>delayed write behavior makes sense for local workloads, but not servers
> >>where there is a client on the other end batching its writes). To prevent
> >>this, 'filestore flusher' will prod the kernel to flush out any written
> >>data to the disk quickly. Then, when we get around to doing the
> >>sync/snapshot it is pretty quick, because only fs metadata and
> >>just-written data needs to be flushed.
>
> Hmm, I disagree.
>
> If you flush quickly, it works fine with a sequential write workload.
>
> But if you have a lot of random writes with 4k blocks, for example, you are
> going to have a lot of disk seeks. The way zfs or netapp san storage
> works, they take random writes in a fast journal and then flush them
> sequentially every 20s to slow storage.

Oh, I see what you're getting at. Yes, that is not ideal for small random
writes. There is a branch in ceph.git called wip-flushmin that just sets a
minimum write size for the flush that will probably do a decent job of
dealing with this: small writes won't get flushed, large ones will.
Picking the right value will depend on how expensive seeks are for your
storage system.

You'll want to cherry-pick just the top commit on top of whatever it is
you're running...

sage

>
> To compare with zfs or netapp, I can achieve around 20000 io/s on random
> 4K writes with 4GB nvram and 10 x 7200 rpm disks.
>
> With ceph, I'm at around 2000 io/s with the same config (3 nodes with
> 10 x 7200 rpm disks, 2x replication), so around the real disk io limit
> without any write cache.
>
> So for now, I think I'm going to use SSDs for my OSDs; I have an 80% random
> write workload (no seeks, so constant random writes are no problem).
>
> BTW: maybe the wiki is wrong
> http://ceph.com/wiki/OSD_journal
> section Motivation
> "Enterprise products like NetApp filers "cheat" by journaling all writes to NVRAM and then taking their time to flush things out to disk efficiently. This gives you very low-latency writes _and_ efficient disk IO at the expense of hardware."
>
> This is why I thought ceph worked like this.
>
> Thanks again,
>
> -Alexandre
>
> ----- Original Message -----
>
> From: "Sage Weil" <sage@xxxxxxxxxxx>
> To: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>
> Cc: ceph-devel@xxxxxxxxxxxxxxx, "Mark Nelson" <mark.nelson@xxxxxxxxxxx>, "Stefan Priebe" <s.priebe@xxxxxxxxxxxx>
> Sent: Thursday, 21 June 2012 18:03:45
> Subject: Re: filestore flusher = false , correct my problem of constant write (need info on this parameter)
>
> Hi Alexandre,
>
> [Sorry I didn't follow up earlier; I didn't understand your question.]
>
> If you turn off the journal completely, you will see bursty write commits
> from the perspective of the client, because the OSD is periodically doing
> a sync or snapshot and only acking the writes then.
>
> If you enable the journal, the OSD will reply with a commit as soon as the
> write is stable in the journal. That's one reason why it is there--file
> system commits are heavyweight and slow.
>
> If we left the file system to its own devices and did a sync every 10
> seconds, the disk would sit idle while a bunch of dirty data accumulated
> in cache, and then the sync/snapshot would take a really long time. This
> is horribly inefficient (the disk is idle half the time), and useless (the
> delayed write behavior makes sense for local workloads, but not servers
> where there is a client on the other end batching its writes). To prevent
> this, 'filestore flusher' will prod the kernel to flush out any written
> data to the disk quickly. Then, when we get around to doing the
> sync/snapshot it is pretty quick, because only fs metadata and
> just-written data needs to be flushed.
>
> So: the behavior you're seeing is normal, and good.
>
> Did I understand your confusion correctly?
>
> Thanks!
> sage
>
>
> On Wed, 20 Jun 2012, Alexandre DERUMIER wrote:
>
> > Hi,
> > I have tried to disable filestore flusher
> >
> > filestore flusher = false
> > filestore max sync interval = 30
> > filestore min sync interval = 29
> >
> > in osd config.
> >
> > Now I see a correct sync every 30s when doing rados bench
> >
> > rados -p pool3 bench 60 write -t 16
> >
> > seekwatcher movie:
> >
> > before
> > ------
> > http://odisoweb1.odiso.net/seqwrite-radosbench-flusherenable.mpg
> >
> > after
> > -----
> > http://odisoweb1.odiso.net/seqwrite-radosbench-flusherdisable.mpg
> >
> > Shouldn't this be the normal behaviour? What exactly is filestore flusher vs syncfs?
> >
> > This seems to work fine with rados bench,
> > but when I launch a benchmark with fio from my guest vm, I again see constant writes.
> > (I'll try to debug that today)
> >
> > My target is to be able to handle small random writes and write them out every 30s.
> >
> > Regards,
> >
> > Alexandre
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> Alexandre Derumier
> Systems and Network Engineer
>
> Phone: 03 20 68 88 85
> Fax: 03 20 68 90 88
>
> 45 Bvd du Général Leclerc 59100 Roubaix
> 12 rue Marivaux 75002 Paris
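
For readers following the thread, the "minimum write size for the flush" idea and the "real disk io limit" estimate can be sketched in a few lines of Python. This is a toy model only, not Ceph code: the class MinSizeFlusher, the parameter min_flush_bytes, the 64 KiB threshold and the helper per_disk_write_iops are all hypothetical names and values chosen for illustration (the thread does not name the option that wip-flushmin adds), and the offset-sorted batch at sync time is an assumption about how deferred writes might reach the disk.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Write = Tuple[int, int]  # (offset_bytes, length_bytes)


@dataclass
class MinSizeFlusher:
    """Toy model of a size-thresholded flusher. Illustration only, not Ceph code."""

    # Hypothetical knob: the thread does not name the option added by wip-flushmin.
    min_flush_bytes: int
    flushed: List[Write] = field(default_factory=list)
    deferred: List[Write] = field(default_factory=list)

    def on_write(self, offset: int, length: int) -> str:
        """Flush large writes right away; leave small ones dirty in cache."""
        if length >= self.min_flush_bytes:
            self.flushed.append((offset, length))
            return "flush"
        self.deferred.append((offset, length))
        return "defer"

    def on_sync(self) -> List[Write]:
        """Periodic sync: hand back the deferred writes as one offset-sorted batch."""
        batch = sorted(self.deferred)
        self.deferred.clear()
        return batch


def per_disk_write_iops(client_iops: float, replicas: int, disks: int) -> float:
    """Average write ops per disk, ignoring journal traffic and caching."""
    return client_iops * replicas / disks


if __name__ == "__main__":
    flusher = MinSizeFlusher(min_flush_bytes=64 * 1024)  # threshold is an assumed value
    print(flusher.on_write(8192, 4096))        # small random write
    print(flusher.on_write(1 << 20, 4 << 20))  # large sequential write
    print(flusher.on_write(0, 4096))           # another small random write
    print(flusher.on_sync())                   # small writes leave together at sync
    print(round(per_disk_write_iops(2000, 2, 30)))
```

With the figures quoted in the thread (2000 client io/s, 2x replication, 30 disks), the helper gives roughly 133 writes per second per disk, which is in the range commonly cited for a 7200 rpm drive doing random I/O, so it is at least consistent with the "around real disk io limit" remark. The right threshold, as noted above, depends on how expensive seeks are on the storage in question.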