Re: Generalizing ceph-disk for various osd backends

On Wed, 19 Aug 2015, Varada Kari wrote:
> Hi all,
> 
> This is regarding generalizing ceph-disk to work with different OSD backends such as FileStore, KeyValueStore, and NewStore.
> All of these object store implementations have different needs for the disks used to hold data and metadata.
> In one of the pull requests, Sage suggested the requirements ceph-disk should satisfy in order to handle all of the backends well. Based on the current implementations of the supported object store backends, the requirements ceph-disk is expected to meet are listed below.
> 
> FileStore:
> 1. Needs a partition/disk for the file system
> 2. Needs a partition/disk for the journal
> 3. Optionally, the omap (LevelDB/RocksDB) could be placed on a separate partition, depending on whether the backing medium is an HDD or SSD.
> 
> NewStore:
> 1. Needs a file system on a disk/partition
> 2. Optionally needs a file system for the journal, depending on the backend DB used (LevelDB/RocksDB ...)
> 3. Optionally needs a file system on a faster medium to hold the data for the warm levels.
> 
> KeyValueStore:
> 1. Needs a small partition with a file system to hold the OSD's metadata.
> 2. Needs a partition/disk to hold data. Some backends need a file system; some can work off a raw partition/disk.
> 3. Optionally may need a partition to hold the cache or journal.
> 
> Please add any details I may have missed.
> 
> Ideally, ceph-disk should make its decisions based on input given by the user, either through the conf file or through options passed to ceph-disk in a manual deployment.  In the FileStore case, the user's inputs could include the kind of file system to create, the file system size, the device to create it on, and so on.
> Similarly for KeyValueStore, the backend can work on a raw partition or disk; otherwise it would need a file system.
> 
> Quoting Sage again here:
> Alternatively, we could say that it's the admin's job to express to ceph-disk what kind of OSD it should create (backend type, secondary fs's or partitions, etc.) instead of inferring that from the environment. In that case, we could
> * make a generic way to specify which backend to use in the osd_data dir
> * make sure all secondary devices or file systems are symlinked from the osd_data dir, the way the journal is today. This could be done in a backend-specific way, e.g., FileStore wants the journal link (to a bdev), NewStore wants a db_wal link (to a small + fast fs), etc.
> * we could create uuid types for each secondary device type. A raw block dev would work just like ceph-disk activate-journal. A new uuid would be for secondary fs's, which would mount and then trigger ceph-disk activate-slave-fs DEV or similar.
> * ceph-disk activate[-] can ensure that *all* symlinks in the data dir resolve to real things (all devices or secondary fs's are mounted) before starting ceph-osd.
> 
> I will make the changes once we agree on the requirements and 
> implementation specifics. Please correct me if I have understood anything wrong.

I think the trick here is to figure out how to describe these 
requirements.  I think it ought to be some structured thing ceph-osd 
can spit out for a given backend that says what it needs.  For example, 
for filestore,

{
  "data": {
     "type": "fs",
     "min_size": 10485760,
     "max_size": 1000000000000000, # whatever
     "preferred_size": 100000000000000000,
     "required": true
  },
  "journal": {
     "type": "block",
     "min_size": 10485760,
     "max_size": 104857600,
     "preferred_size": 40960000,
     "required": false,
     "preferred": true
  }
}

Then ceph-disk can be fed the devices to use based on those names. e.g.,

 ceph-disk prepare objectstore=filestore data=/dev/sda journal=/dev/sdb
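
As a rough sketch (hypothetical code, nothing that exists in ceph-disk 
today; the argument and slot names just follow the examples in this mail), 
ceph-disk could parse those key=value arguments and sanity-check them 
against the backend's requirement description before touching any disks:

def parse_prepare_args(argv):
    # ['objectstore=filestore', 'data=/dev/sda', ...] -> dict
    args = {}
    for arg in argv:
        key, sep, value = arg.partition('=')
        if not sep:
            raise ValueError('expected key=value, got %r' % arg)
        args[key] = value
    return args

def check_against_requirements(requirements, args):
    # requirements is the decoded per-backend description shown above
    for name, req in requirements.items():
        if req.get('required') and name not in args and 'data' not in args:
            raise ValueError('backend requires a %r device' % name)
    unknown = set(args) - set(requirements) - {'objectstore'}
    if unknown:
        raise ValueError('unknown device names: %s' % ', '.join(sorted(unknown)))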

Or for your KV backend,

{
  "data": {
     "type": "fs",
     "min_size": 10485760,
     "max_size": 10485760,
     "preferred_size": 10485760,
     "required": true
  },
  "kvdata": {
     "type": "block",
     "min_size": 10485760,
     "max_size": 1000000000000000, # whatever
     "preferred_size": 100000000000000000,
     "required": true
  },
  "journal": {
     "type": "block",
     "min_size": 10485760,
     "max_size": 104857600,
     "preferred_size": 40960000,
     "required": false,
     "preferred": false
  }
}

 ceph-disk prepare objectstore=keyvaluestore data=/dev/sda kvdata=/dev/sda journal=/dev/sdb

The ceph-disk logic would create partitions on the given devices as 
needed, trying for the preferred size but doing what it needs to 
make it fit.  If something is required/preferred but not specified (e.g., 
with filestore's journal) it'll use the same device as the other stuff, so 
that the filestore case could simplify to

 ceph-disk prepare objectstore=filestore data=/dev/sda

or whatever.
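
To make that sizing rule concrete, here is a minimal sketch (again a 
hypothetical helper, not existing ceph-disk code): aim for preferred_size, 
clamp to min/max and the space actually free on the device, and colocate 
required/preferred slots that weren't given their own device onto the data 
device:

def plan_partitions(requirements, devices, free_bytes):
    # requirements: the per-backend description (as in the JSON above)
    # devices: slot name -> device path, taken from the command line
    # free_bytes: device path -> unallocated bytes on that device
    plan = []
    # Carve out the small fixed-size slots first so the large data slot
    # can soak up whatever space is left on its device.
    for name, req in sorted(requirements.items(),
                            key=lambda item: item[1]['preferred_size']):
        dev = devices.get(name)
        if dev is None:
            if not (req.get('required') or req.get('preferred')):
                continue            # optional and not asked for: skip it
            dev = devices['data']   # e.g. filestore's journal, as above
        size = min(req['preferred_size'], req['max_size'], free_bytes[dev])
        if size < req['min_size']:
            raise RuntimeError('%s does not fit on %s' % (name, dev))
        free_bytes[dev] -= size
        plan.append((name, dev, size))
    return plan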

Would something like this be general enough to capture the possibilities 
and still do everything we need it to?

sage



