Gotcha. As a side note, that setting is only used by ceph-disk as ceph-volume does not create partitions for the WAL or DB. You need to create those partitions manually if using anything other than a whole block device when creating OSDs with ceph-volume.
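
Since the DB partitions have to be created by hand in that case, it helps to work out their sizes first. Below is a minimal sketch, assuming the rough 10GB-per-1TB estimate quoted further down in this thread. The function names and the example disk sizes are illustrative only and not part of any Ceph tool.

```python
def db_size_gb(osd_size_tb, gb_per_tb=10):
    """Estimated block.db size in GB for one OSD, using a GB-per-TB ratio."""
    return osd_size_tb * gb_per_tb


def ssd_budget_gb(osd_sizes_tb, gb_per_tb=10):
    """Total SSD space in GB needed for the DB partitions of several OSDs."""
    return sum(db_size_gb(tb, gb_per_tb) for tb in osd_sizes_tb)


if __name__ == "__main__":
    print(db_size_gb(4))           # a 4TB OSD, the example from the thread
    print(ssd_budget_gb([8] * 5))  # five 8TB OSDs sharing one SSD
```

The ratio is a parameter because the estimate is only a rule of thumb; workloads with very many small files may need more.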
On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit <casparsmit@xxxxxxxxxxx> wrote:
David,

Yes, I know; I use 20GB partitions for 2TB disks as journal. It was just to inform other people that Ceph's default of 1GB is pretty low.

Now that I read my own sentence, it indeed looks as if I was using 1GB partitions. Sorry for the confusion.

Caspar

2018-02-27 14:11 GMT+01:00 David Turner <drakonstein@xxxxxxxxx>:

If you're only using a 1GB DB partition, there is a very real possibility it's already 100% full. The safe estimate for DB size seems to be 10GB/1TB, so for a 4TB OSD a 40GB DB should work for most use cases (except loads and loads of small files). There are a few threads that mention how to check how much of your DB partition is in use. Once it's full, it spills over to the HDD.

On Tue, Feb 27, 2018, 6:19 AM Caspar Smit <casparsmit@xxxxxxxxxxx> wrote:

2018-02-26 23:01 GMT+01:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:

On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit <casparsmit@xxxxxxxxxxx> wrote:

2018-02-24 7:10 GMT+01:00 David Turner <drakonstein@xxxxxxxxx>:

Caspar, it looks like your idea should work. Worst case scenario seems like the OSD wouldn't start, you'd put the old SSD back in and go back to the idea of weighting them to 0, backfilling, then recreating the OSDs. Definitely worth a try in my opinion, and I'd love to hear your experience after.

Hi David,

First of all, thank you for ALL your answers on this ML. You're really putting a lot of effort into answering many questions asked here, and very often they contain invaluable information.

To follow up on this post, I went out and built a very small (proxmox) cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL SSD.

And it worked!

Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based OSD's)

Here's what I did on 1 node:

1) ceph osd set noout
2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
3) ddrescue -f -n -vv <old SSD dev> <new SSD dev> /root/clone-db.log
4) removed the old SSD physically from the node
5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
6) ceph osd unset noout

I assume that once the ddrescue step is finished, a 'partprobe' or something similar is triggered and udev finds the DB partitions on the new SSD and starts the OSD's again (kind of what happens during hotplug).

So it is probably better to clone the SSD in another (non-ceph) system to not trigger any udev events.

I also tested a reboot after this and everything still worked.

The old SSD was 120GB and the new one is 256GB (cloning took around 4 minutes).

Delta of data was very low because it was a test cluster.

All in all, the OSD's in question were 'down' for only 5 minutes (so I stayed within the default 10-minute mon_osd_down_out_interval and didn't actually need to set noout :)

I kicked off a brief discussion about this with some of the BlueStore guys and they're aware of the problem with migrating across SSDs, but so far it's just a Trello card: https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db

They do confirm you should be okay with dd'ing things across, assuming symlinks get set up correctly as David noted.

Great that it is on the radar to address. This method feels hacky.

I've got some other bad news, though: BlueStore has internal metadata about the size of the block device it's using, so if you copy it onto a larger block device, it will not actually make use of the additional space. :(

-Greg

Yes, I was well aware of that, no problem. The reason was that the smaller SSD sizes are simply not being made anymore or have been discontinued by the manufacturer.

Would be nice though if the DB size could be resized in the future; the default 1GB DB size seems very small to me.

Caspar

Kind regards,
Caspar

Nico, it is not possible to change the WAL or DB size, location, etc. after OSD creation. If you want to change the configuration of the OSD after creation, you have to remove it from the cluster and recreate it. There is no functionality similar to how you could move, recreate, etc. filesystem OSD journals. I think this might be on the radar as a feature, but I don't know for certain. I definitely consider it to be a regression of bluestore.

On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <nico.schottelius@xxxxxxxxxxx> wrote:
A very interesting question, and I would add the follow-up question:
Is there an easy way to add an external DB/WAL device to an existing
OSD?
I suspect that it might be something along the lines of:
- stop osd
- create a link in ...ceph/osd/ceph-XX/block.db to the target device
- (maybe run some kind of osd mkfs ?)
- start osd
Has anyone done this so far, or does anyone have recommendations on how to do it?
Which also makes me wonder: what is actually the format of WAL and
BlockDB in bluestore? Is there any documentation available about it?
Best,
Nico
Caspar Smit <casparsmit@xxxxxxxxxxx> writes:
> Hi All,
>
> What would be the proper way to preventively replace a DB/WAL SSD (when it
> is nearing its DWPD/TBW limit and has not failed yet)?
>
> It hosts DB partitions for 5 OSD's
>
> Maybe something like:
>
> 1) ceph osd reweight 0 the 5 OSD's
> 2) let backfilling complete
> 3) destroy/remove the 5 OSD's
> 4) replace SSD
> 5) create 5 new OSD's with separate DB partition on new SSD
>
> When these 5 OSD's are big HDD's (8TB) a LOT of data has to be moved so i
> thought maybe the following would work:
>
> 1) ceph osd set noout
> 2) stop the 5 OSD's (systemctl stop)
> 3) 'dd' the old SSD to a new SSD of same or bigger size
> 4) remove the old SSD
> 5) start the 5 OSD's (systemctl start)
> 6) let backfilling/recovery complete (only delta data between OSD stop and
> now)
> 7) ceph osd unset noout
>
> Would this be a viable method to replace a DB SSD? Any udev/serial nr/uuid
> stuff preventing this from working?
>
> Or is there another 'less hacky' way to replace a DB SSD without moving too
> much data?
>
> Kind regards,
> Caspar
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
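
The clone-based replacement proposed in the quoted message can be sketched as a dry-run plan that only builds the command list and executes nothing. This is a rough illustration, not a tested procedure. It assumes the OSDs are managed by systemd units named ceph-osd@<id>, reuses the ddrescue invocation reported above in this thread, and uses placeholder device names. The function name clone_plan is hypothetical.

```python
def clone_plan(osd_ids, old_dev, new_dev, log="/root/clone-db.log"):
    """Return the shell commands for the clone-based SSD swap, in order.

    Nothing is executed; the list is meant to be reviewed by a human first.
    """
    plan = ["ceph osd set noout"]
    # Stop every OSD whose DB/WAL lives on the old SSD.
    # Assumption: systemd units are named ceph-osd@<id>.
    plan += [f"systemctl stop ceph-osd@{i}" for i in osd_ids]
    # Clone the whole device, as done with ddrescue in this thread.
    plan.append(f"ddrescue -f -n -vv {old_dev} {new_dev} {log}")
    plan.append("# physically remove the old SSD here")
    plan += [f"systemctl start ceph-osd@{i}" for i in osd_ids]
    plan.append("# wait for recovery to finish (check with: ceph -s)")
    plan.append("ceph osd unset noout")
    return plan


if __name__ == "__main__":
    # /dev/sdX and /dev/sdY are placeholders for the old and new SSD.
    for line in clone_plan(range(5), "/dev/sdX", "/dev/sdY"):
        print(line)
```

Printing the plan instead of running it keeps the destructive ddrescue step under manual control, which seems prudent given that udev may restart the OSDs as soon as the new partitions appear.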
--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch