Re: [crimson] bluestore in an alien world

On Thu, Aug 1, 2019 at 3:44 PM kefu chai <tchaikov@xxxxxxxxx> wrote:
>
> On Mon, Jul 29, 2019 at 9:19 PM kefu chai <tchaikov@xxxxxxxxx> wrote:
> >
> > On Sat, Jul 27, 2019 at 5:26 AM Sam Just <sjust@xxxxxxxxxx> wrote:
> > >
> > > > - to run bluestore in the same process as crimson-osd, but
> > > > allocate some dedicated threads (and CPU cores) to it. we could use
> > > > ceph::thread::ThreadPool for this purpose. in this scheme, we would
> > > > have 3 ConfigProxy backends:
> > > >   1. the classic ConfigProxy used by the classic OSD and other daemons
> > > > and command line utilities. this ConfigProxy normally resides in a
> > > > global CephContext.
> > > >   2. the ceph::common::ConfigProxy used solely by the crimson OSD. it
> > > > is rewritten using seastar and is a sharded service. normally we just
> > > > access the config proxy directly in crimson, like
> > > > 'local_conf().get_val<uint64_t>("name")', instead of using something
> > > > like 'cct->_conf.get_val<uint64_t>("name")'.
> > > >   3. the ConfigProxy used by bluestore living in the alien world. its
> > > > interface will be exactly the same as the classic one, but it will
> > > > call into its crimson counterpart using the `seastar::alien::submit()`
> > > > call.
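> > > >
> > > >   roughly, the alien-side get_val() could be implemented like this
> > > > (an untested sketch; alien_get_val() is a made-up helper, it assumes
> > > > seastar's pre-instance alien::submit_to() API, which returns a
> > > > std::future a non-reactor thread can block on, and local_conf() is
> > > > the crimson accessor mentioned above):
> > > >
> > > >   #include <future>
> > > >   #include <string>
> > > >   #include <seastar/core/alien.hh>
> > > >   #include <seastar/core/future.hh>
> > > >   #include "crimson/common/config_proxy.h"  // for local_conf()
> > > >
> > > >   // called from an alien (posix) thread; it blocks that thread, not
> > > >   // the reactor, until shard 0 has produced the value.
> > > >   template <typename T>
> > > >   T alien_get_val(const std::string& key) {
> > > >     std::future<T> fut = seastar::alien::submit_to(0, [key] {
> > > >       return seastar::make_ready_future<T>(
> > > >           local_conf().get_val<T>(key));
> > > >     });
> > > >     return fut.get();
> > > >   }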
> > >
> > > I'm not sure this is quite right.  I think that the seastar config
> > > would have a reference over to the alien config machinery in order to
> > > inject config changes and do the initial setup, but the alien side
> > > needn't have a reference to the crimson one.
> >
> > i was thinking about the implementation of ConfigProxy::get_val<>().
> > but yeah, if we 1) have a separate copy of ConfigValues on the alien
> > side, 2) let the alien side work in passive mode, and 3) use
> > ThreadPool::submit() to inject config changes into the alien side's
> > ConfigProxy, that'd be a lot easier.
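> >
> > something along these lines, maybe (just a sketch; alien_conf() is a
> > made-up accessor for the alien side's own copy of the config, and the
> > ThreadPool is the ceph::thread::ThreadPool mentioned above):
> >
> >   #include <string>
> >   #include <seastar/core/future.hh>
> >   #include "common/config_proxy.h"       // the classic ConfigProxy
> >   #include "crimson/thread/ThreadPool.h" // ceph::thread::ThreadPool
> >
> >   ConfigProxy& alien_conf();  // hypothetical alien-side accessor
> >
> >   // runs on the seastar side whenever a config option changes; the
> >   // alien side stays passive and only ever consumes the update.
> >   seastar::future<> inject_config_change(ceph::thread::ThreadPool& tp,
> >                                          const std::string& key,
> >                                          const std::string& val) {
> >     return tp.submit([key, val] {
> >       // executed on one of the alien threads
> >       alien_conf().set_val(key, val);
> >     });
> >   }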
> >
> > >
> > > >   in addition to the WITH_SEASTAR macro, we can introduce yet another
> > > > macro allowing us to call into the facilities offered by
> > > > crimson-common. and we can use an inline namespace to differentiate
> > > > the 2nd implementation from the 3rd, as they will need to be
> > > > co-located in the same process, and without using different names
> > > > we'd violate the ODR.
> > > > - to hide bluestore in a library which links against the ceph-common
> > > > library, so that libbluestore won't expose any ceph-common symbols to
> > > > crimson-osd. but we need to figure out how to maintain the internal
> > > > state of ceph-common, as it is not quite self-contained, in the sense
> > > > that it needs to access the logging, config and other facilities
> > > > offered by crimson-osd.
> > >
> > > The library option seems promising to me if we go this direction.  It
> > > can even export an interface which is entirely agnostic of the config
> > > machinery (maybe take a serialized representation of the config
> > > values?) and write to a different log file at first.
> >
> > yeah, probably we just need a "keyhole" for updating the alien side's
> > config settings. this option is actually a variant of the previous
> > one; the only difference is that we need to use different namespaces
> > to differentiate the symbols in bluestore from those in
> > crimson-common.
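> >
> > for instance (a sketch only; the namespace tag and the keyhole
> > function are made-up names):
> >
> >   #include <cstddef>
> >
> >   // in the crimson-common build of the shared headers:
> >   namespace ceph::common {
> >   inline namespace crimson_v1 {   // hypothetical tag
> >   struct ConfigProxy {
> >     // seastar-based, sharded implementation
> >   };
> >   } // inline namespace crimson_v1
> >   } // namespace ceph::common
> >
> >   // the classic headers that libbluestore is built against keep the
> >   // plain ConfigProxy. both sides still spell the name the same way
> >   // in source, but the inline namespace is part of the mangled
> >   // symbol, so the two implementations can be linked into one
> >   // crimson-osd binary without violating the ODR.
> >
> >   // the "keyhole" libbluestore would export for config updates,
> >   // agnostic of either implementation (e.g. a serialized blob of
> >   // config values, as sam suggested):
> >   extern "C" void bluestore_set_config(const char* blob, size_t len);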
> >
> > >
> > > > - to port rocksdb to seastar: to be specific, this approach will use
> > > > seastar's green threads to implement the Mutex, CondVar and Thread in
> > > > rocksdb, and implement all blocking calls using seastar's
> > > > counterparts. if this approach proves to be workable, the next
> > > > problem would be to upstream this change. and in the long run, the
> > > > rocksdb-backed bluestore will be replaced by seastore, if seastore is
> > > > capable of supporting relatively slow devices as well.
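> > > >
> > > >   for example, rocksdb's port::Mutex and port::CondVar could be
> > > > backed by seastar primitives along these lines (an untested sketch,
> > > > only usable from inside a seastar::thread, where future::get() can
> > > > suspend the green thread instead of blocking the reactor):
> > > >
> > > >   #include <seastar/core/semaphore.hh>
> > > >   #include <seastar/core/condition-variable.hh>
> > > >
> > > >   class Mutex {
> > > >     seastar::semaphore _sem{1};
> > > >    public:
> > > >     void Lock()   { _sem.wait(1).get(); }  // may suspend this fiber
> > > >     void Unlock() { _sem.signal(1); }
> > > >   };
> > > >
> > > >   class CondVar {
> > > >     seastar::condition_variable _cv;
> > > >    public:
> > > >     // rocksdb's port::CondVar holds a Mutex*; simplified to a
> > > >     // parameter here
> > > >     void Wait(Mutex& mu) {
> > > >       mu.Unlock();
> > > >       _cv.wait().get();                    // suspend until signalled
> > > >       mu.Lock();
> > > >     }
> > > >     void Signal()    { _cv.signal(); }
> > > >     void SignalAll() { _cv.broadcast(); }
> > > >   };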
> > >
> > > I've started to look at your rocksdb port.  It does look like the
> > > parts we'd need to adapt are appropriately factored out in rocksdb,
> > > and I bet we'd get interest from upstream.  We might want to take
> > > their temperature sooner rather than later?  We'd also have to perform
> >
> > good idea! will do so early tomorrow!
> >
> > > essentially the same refactor in bluestore in order to break the
> > > bluestore logic apart from the IO/blocking/locking portions.  I guess
> > > this exists in some form with the BlockDevice interface, but we'll
> > > also have to introduce something like rocksdb's lock replacement.
> > > This path would get us a much more cooperative (and probably more
> > > performant, particularly on high-density hosts) bluestore in the
> > > long run, so it might be worth the work.
> >
> > thanks. your insights are inspiring!
>
> the `env_seastar_test` test passed, so it kinda works. and i also
> wrote a post at https://www.facebook.com/groups/rocksdb.dev/ to get
> opinions from the upstream community.
>


just a quick update.

i was testing the seastar port of rocksdb; its performance does not
look promising compared with that of classic rocksdb:

db_bench --benchmarks="fillseq"
  seastar rocksdb:
    fillseq      :     390.297 micros/op 2562 ops/sec;    0.3 MB/s
  classic rocksdb:
    fillseq      :      30.836 micros/op 32429 ops/sec;    3.6 MB/s

i will try to understand this discrepancy. if it turns out to be a
dead end, we will have to focus on one of the options above, unless we
have a concrete plan for seastore.

> >
> > >
> > > > - seastore: a completely rewritten object store backend targeting fast
> > > > NVMe devices. but it will take longer to get there.
> > >
> > > I think we're going to do this no matter what.  I think the
> > > alien/bluestore choice is about how we want to test crimson prior to
> > > developing seastore, and possibly about handling devices inappropriate
> > > for seastore?
> >
> > that's also my impression. the way i see it, that's just because we
> > haven't started scoping it or done a low-level design yet.
> >
> > > -Sam
> >
> >
> >
> > --
> > Regards
> > Kefu Chai
>
>
>
> --
> Regards
> Kefu Chai



-- 
Regards
Kefu Chai