FYI

---------- Forwarded message ----------
From: David Casier <david.casier@xxxxxxxx>
Date: 2015-12-01 21:32 GMT+01:00
Subject: Re: Fwd: [newstore (again)] how disable double write WAL
To: Sage Weil <sage@xxxxxxxxxxxx>
Cc: Sébastien VALSEMEY <sebastien.valsemey@xxxxxxxx>, Vish Maram-SSI <vishwanath.m@xxxxxxxxxxxxxxx>, Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>, Benoît LORIOT <benoit.loriot@xxxxxxxx>, pascal.billery-schneider@xxxxxxxxxxx

Hi Sage,

With a standard disk (4 to 6 TB) and a small flash drive, it is easy to create an ext4 filesystem with its metadata on flash.

Example with sdg1 on flash and sdb on HDD:

    size_of() {
        # Size of the block device in 512-byte sectors
        blockdev --getsz "$1"
    }

    mkdmsetup() {
        _ssd=/dev/$1
        _hdd=/dev/$2
        _size_of_ssd=$(size_of "$_ssd")
        # Two-line device-mapper table: the SSD first, the HDD appended after it
        echo "0 $_size_of_ssd linear $_ssd 0
    $_size_of_ssd $(size_of "$_hdd") linear $_hdd 0" | dmsetup create "dm-${1}-${2}"
    }

    mkdmsetup sdg1 sdb

    mkfs.ext4 \
        -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode \
        -E packed_meta_blocks=1,lazy_itable_init=0 \
        -G 32768 -I 128 -i $((1024*512)) \
        /dev/mapper/dm-sdg1-sdb

With that, all metadata blocks are on the SSD.
If the omap is also on SSD, there is almost no metadata left on the HDD.

The consequence: Ceph performance (with a hack on FileStore to run without a journal and with direct I/O) is almost the same as the performance of the HDD itself.
With a cache tier, it's very cool!

That is why we are working on a hybrid HDD/flash approach on ARM or Intel.

With NewStore, it is much more difficult to control the I/O profile, because RocksDB embeds its own intelligence.

In the (near) future, we will create a portal to present our hardware solution under the CERN OHL license.

(My non-fluency in English explains the latency of my answers)

2015-11-24 21:42 GMT+01:00 Sage Weil <sage@xxxxxxxxxxxx>:
>
> On Tue, 24 Nov 2015, Sébastien VALSEMEY wrote:
> > Hello Vish,
> >
> > Please excuse the delay in my answer.
> > Following the conversation you had with my colleague David, here are
> > some more details about our work:
> >
> > We are working on FileStore / NewStore optimizations by studying how we
> > could free ourselves from using the journal.
> >
> > It is very important to work with SSDs, but it is also mandatory to
> > combine them with regular magnetic platter disks. This is why we are
> > combining metadata storage on flash with data storage on disk.
>
> This is pretty common, and something we will support natively with
> newstore.
>
> > Our main goal is to have control over performance, which is quite
> > difficult with NewStore and needs fundamental hacks with FileStore.
>
> Can you clarify what you mean by "quite difficult with NewStore"?
>
> FWIW, the latest bleeding edge code is currently at
> github.com/liewegas/wip-bluestore.
>
> sage
>
>
> > Is Samsung working on ARM boards with embedded flash and a SATA port, in
> > order to allow us to work on a hybrid approach? What is your line of
> > work with Ceph?
> >
> > How can we work together?
> >
> > Regards,
> > Sébastien
> >
> > > Begin forwarded message:
> > >
> > > From: David Casier <david.casier@xxxxxxxx>
> > > Date: 12 October 2015 20:52:26 UTC+2
> > > To: Sage Weil <sage@xxxxxxxxxxxx>, Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>
> > > Cc: Sébastien VALSEMEY <sebastien.valsemey@xxxxxxxx>, benoit.loriot@xxxxxxxx, Denis Saget <geodni@xxxxxxxxx>, "luc.petetin" <luc.petetin@xxxxxxxx>
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Ok,
> > > Great.
> > >
> > > With these settings:
> > > //
> > > newstore_max_dir_size = 4096
> > > newstore_sync_io = true
> > > newstore_sync_transaction = true
> > > newstore_sync_submit_transaction = true
> > > newstore_sync_wal_apply = true
> > > newstore_overlay_max = 0
> > > //
> > >
> > > and direct I/O in the benchmark tool (fio),
> > >
> > > I see that the HDD is 100% loaded and that there is no transfer from /db to /fragments after stopping the benchmark: great!
> > >
> > > But when I launch a benchmark with random 256k blocks, I see random blocks between 32k and 256k on the HDD. Any idea?
> > >
> > > Throughput to the HDD is about 8 MBps when it could be higher with larger blocks (~30 MBps),
> > > and 70 MBps without fsync (hard drive cache disabled).
> > >
> > > Other questions:
> > > newstore_sync_io -> true = fsync immediately, false = fsync later (thread fsync_wq)?
> > > newstore_sync_transaction -> true = sync in DB?
> > > newstore_sync_submit_transaction -> if false then kv_queue (only if newstore_sync_transaction=false)?
> > > newstore_sync_wal_apply = true -> if false then WAL later (thread wal_wq)?
> > >
> > > Is that right?
> > >
> > > A way to use a cache with battery (sync DB and no sync on data)?
> > >
> > > Thanks for everything!
> > >
> > > On 10/12/2015 03:01 PM, Sage Weil wrote:
> > >> On Mon, 12 Oct 2015, David Casier wrote:
> > >>> Hello everybody,
> > >>> Is a fragment stored in RocksDB before being written to "/fragments"?
> > >>> I separated "/db" and "/fragments", but during the benchmark everything is
> > >>> written to "/db".
> > >>> I changed the "newstore_sync_*" options without success.
> > >>>
> > >>> Is there any way to write all metadata in "/db" and all data in "/fragments"?
> > >> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> > >> But if you are overwriting an existing object, doing write-ahead logging
> > >> is usually unavoidable because we need to make the update atomic (and the
> > >> underlying posix fs doesn't provide that).
> > >> The wip-newstore-frags branch
> > >> mitigates this somewhat for larger writes by limiting fragment size, but
> > >> for small IOs this is pretty much always going to be the case. For small
> > >> IOs, though, putting things in db/ is generally better since we can
> > >> combine many small ios into a single (rocksdb) journal/wal write. And
> > >> often leave them there (via the 'overlay' behavior).
> > >>
> > >> sage
> > >>
> > >
> > > --
> > > ________________________________________________________
> > >
> > > Regards,
> > >
> > > David CASIER
> > > DCConsulting SARL
> > >
> > > 4 Trait d'Union
> > > 77127 LIEUSAINT
> > >
> > > Direct line: 01 75 98 53 85
> > > Email: david.casier@aevoo.fr
> > > ________________________________________________________
> > > Begin forwarded message:
> > >
> > > From: David Casier <david.casier@xxxxxxxx>
> > > Date: 2 November 2015 20:02:37 UTC+1
> > > To: "Vish (Vishwanath) Maram-SSI" <vishwanath.m@xxxxxxxxxxxxxxx>
> > > Cc: benoit LORIOT <benoit.loriot@xxxxxxxx>, Sébastien VALSEMEY <sebastien.valsemey@xxxxxxxx>
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Hi Vish,
> > > In FileStore, data and metadata are stored in files, with filesystem xattrs and omap.
> > > NewStore works with RocksDB.
> > > RocksDB has a lot of configuration options, but not all of them are implemented.
> > >
> > > The best way, for me, is not to use the logs and to rely on a secure cache (for example an 845DC SSD).
> > > I don't think it is necessary to defer I/O when the metadata is well optimised.
> > >
> > > The problem with RocksDB is that it is not possible to control the I/O block size.
> > >
> > > We will resume work on NewStore soon.
> > >
> > > On 10/29/2015 05:30 PM, Vish (Vishwanath) Maram-SSI wrote:
> > >> Thanks David for the reply.
> > >>
> > >> Yeah, we just wanted to know how different it is from FileStore and how we can contribute to it. My motive is first to understand the design of NewStore and find the performance loopholes so that we can try looking into them.
> > >>
> > >> It would be helpful if you could share your ideas for using NewStore and your configuration. What plans do you have for contributions? That would help us understand and see if we can work together.
> > >>
> > >> Thanks,
> > >> -Vish
> > >>
> > >> From: David Casier [mailto:david.casier@xxxxxxxx]
> > >> Sent: Thursday, October 29, 2015 4:41 AM
> > >> To: Vish (Vishwanath) Maram-SSI
> > >> Cc: benoit LORIOT; Sébastien VALSEMEY
> > >> Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >>
> > >> Hi Vish,
> > >> It's OK.
> > >>
> > >> We have tested a lot of different configurations with NewStore.
> > >>
> > >> What is your goal with ?
> > >>
> > >> On 10/28/2015 11:02 PM, Vish (Vishwanath) Maram-SSI wrote:
> > >> Hi David,
> > >>
> > >> Sorry for sending you the mail directly.
> > >>
> > >> This is Vishwanath Maram from Samsung. I have started to play around with NewStore and am observing some issues when running fio.
> > >>
> > >> Can you please share the Ceph configuration file that you used to run the I/Os with fio?
> > >>
> > >> Thanks,
> > >> -Vish
> > >
> > > Begin forwarded message:
> > >
> > > From: Sage Weil <sage@xxxxxxxxxxxx>
> > > Date: 12 October 2015 21:33:52 UTC+2
> > > To: David Casier <david.casier@xxxxxxxx>
> > > Cc: Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>, Sébastien VALSEMEY <sebastien.valsemey@xxxxxxxx>, benoit.loriot@xxxxxxxx, Denis Saget <geodni@xxxxxxxxx>, "luc.petetin" <luc.petetin@xxxxxxxx>
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Hi David-
> > >
> > > On Mon, 12 Oct 2015, David Casier wrote:
> > >> Ok,
> > >> Great.
> > >>
> > >> With these settings:
> > >> //
> > >> newstore_max_dir_size = 4096
> > >> newstore_sync_io = true
> > >> newstore_sync_transaction = true
> > >> newstore_sync_submit_transaction = true
> > >
> > > Is this a hard disk? Those settings probably don't make sense since it
> > > does every IO synchronously, blocking the submitting IO path...
> > >
> > >> newstore_sync_wal_apply = true
> > >> newstore_overlay_max = 0
> > >> //
> > >>
> > >> and direct I/O in the benchmark tool (fio),
> > >>
> > >> I see that the HDD is 100% loaded and that there is no transfer from /db to
> > >> /fragments after stopping the benchmark: great!
> > >>
> > >> But when I launch a benchmark with random 256k blocks, I see random blocks
> > >> between 32k and 256k on the HDD. Any idea?
> > >
> > > Random IOs have to be write ahead logged in rocksdb, which has its own IO
> > > pattern. Since you made everything sync above I think it'll depend on
> > > how many osd threads get batched together at a time.. maybe. Those
> > > settings aren't something I've really tested, and probably only make
> > > sense with very fast NVMe devices.
> > >
> > >> Throughput to the HDD is about 8 MBps when it could be higher with larger
> > >> blocks (~30 MBps),
> > >> and 70 MBps without fsync (hard drive cache disabled).
> > >>
> > >> Other questions:
> > >> newstore_sync_io -> true = fsync immediately, false = fsync later (thread
> > >> fsync_wq)?
> > >
> > > yes
> > >
> > >> newstore_sync_transaction -> true = sync in DB?
> > >
> > > synchronously do the rocksdb commit too
> > >
> > >> newstore_sync_submit_transaction -> if false then kv_queue (only if
> > >> newstore_sync_transaction=false)?
> > >
> > > yeah.. there is an annoying rocksdb behavior that makes an async
> > > transaction submit block if a sync one is in progress, so this queues them
> > > up and explicitly batches them.
> > >
> > >> newstore_sync_wal_apply = true -> if false then WAL later (thread wal_wq)?
> > >
> > > the txn commit completion threads can do the wal work synchronously.. this
> > > is only a good idea if it's doing aio (which it generally is).
> > >
> > >> Is that right?
> > >>
> > >> A way to use a cache with battery (sync DB and no sync on data)?
> > >
> > > ?
> > > s
> > >
> > >>
> > >> Thanks for everything!
> > >
> > > Begin forwarded message:
> > >
> > > From: David Casier <david.casier@xxxxxxxx>
> > > Date: 29 October 2015 12:41:22 UTC+1
> > > To: "Vish (Vishwanath) Maram-SSI" <vishwanath.m@xxxxxxxxxxxxxxx>
> > > Cc: benoit LORIOT <benoit.loriot@xxxxxxxx>, Sébastien VALSEMEY <sebastien.valsemey@xxxxxxxx>
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Hi Vish,
> > > It's OK.
> > >
> > > We have tested a lot of different configurations with NewStore.
> > >
> > > What is your goal with ?
> > >
> > > On 10/28/2015 11:02 PM, Vish (Vishwanath) Maram-SSI wrote:
> > >> Hi David,
> > >>
> > >> Sorry for sending you the mail directly.
> > >>
> > >> This is Vishwanath Maram from Samsung. I have started to play around with NewStore and am observing some issues when running fio.
> > >>
> > >> Can you please share the Ceph configuration file that you used to run the I/Os with fio?
> > >>
> > >> Thanks,
> > >> -Vish
> > >
> > > Begin forwarded message:
> > >
> > > From: "Vish (Vishwanath) Maram-SSI" <vishwanath.m@xxxxxxxxxxxxxxx>
> > > Date: 29 October 2015 17:30:56 UTC+1
> > > To: David Casier <david.casier@xxxxxxxx>
> > > Cc: benoit LORIOT <benoit.loriot@xxxxxxxx>, Sébastien VALSEMEY <sebastien.valsemey@xxxxxxxx>
> > > Subject: RE: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Thanks David for the reply.
> > >
> > > Yeah, we just wanted to know how different it is from FileStore and how we can contribute to it. My motive is first to understand the design of NewStore and find the performance loopholes so that we can try looking into them.
> > >
> > > It would be helpful if you could share your ideas for using NewStore and your configuration. What plans do you have for contributions? That would help us understand and see if we can work together.
> > >
> > > Thanks,
> > > -Vish
> > >
> > > Begin forwarded message:
> > >
> > > From: David Casier <david.casier@xxxxxxxx>
> > > Date: 14 October 2015 22:03:38 UTC+2
> > > To: Sébastien VALSEMEY <sebastien.valsemey@xxxxxxxx>, benoit.loriot@xxxxxxxx
> > > Cc: Denis Saget <geodni@xxxxxxxxx>, "luc.petetin" <luc.petetin@xxxxxxxx>
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Good evening, gentlemen,
> > > I have just lived through my first real Ceph fire.
> > > Loic Dachary backed me up well on this one.
> > >
> > > I can tell you one thing: however well we think we have mastered the product, it is during an incident that we realise how many factors have to be known by heart.
> > > So, no panic: I really take tonight's experience as a success and as an excellent boost.
> > >
> > > Explanation:
> > > - LI played a bit too much with the crushmap (I will go into the finer technical details another day)
> > > - Update and restart of the OSDs
> > > - The OSDs no longer knew where the data was
> > > - Rebuilt the crushmap by hand, and off it went.
> > >
> > > Nothing very serious in itself, and a big plus (++++) for our image at LI (I would have lost another 1 to 2 hours without Loic; we did not get sidetracked).
> > >
> > > Conclusion:
> > > We are going to work together on stress tests, a bit like Red Hat validations: one platform, I break it, you repair it.
> > > You will have as much time as you need to find the problem (I have sometimes spent a few days on certain things).
> > >
> > > Goals:
> > > - Master a checklist of verifications to perform
> > > - Replay it every week if there are many mistakes
> > > - Every month if there are a few mistakes
> > > - Every 3 months once it is well mastered
> > > - ...
> > >
> > > We have to be at the top of our game, and certain things must become reflexes (checking the crushmap, knowing how to find the data without the processes, ...).
> > > Especially since the customer needs to be reassured in case of an incident (or not).
> > >
> > > And frankly, Ceph is really fascinating!
> >
> >

--
________________________________________________________

Regards,

David CASIER

3B Rue Taylor, CS20004
75481 PARIS Cedex 10 Paris

Direct line: 01 75 98 53 85
Email: david.casier@xxxxxxxx
________________________________________________________
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html