Re: is it possible to remove the db+wal from an external device (nvme)

Hi,

I'm curious - how did you tell that the separate WAL+DB volume was slowing
things down? I assume you did some benchmarking - is there any chance you'd
be willing to share results? (Or anybody else that's been in a similar
situation).

What sorts of devices are you using for the WAL+DB, versus the data disks?

We're using NAND SSDs, with Optanes for the WAL+DB, and on some systems I
am seeing slower than expected behaviour - I need to dive deeper into it.
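
For context, the kind of quick comparison I have in mind is something along
these lines (just a sketch - "testpool" is a throwaway pool created for the
test), run once with the external WAL+DB in place and once after migrating
it back to the data device:

rados bench -p testpool 60 write -b 4096 -t 16 --no-cleanup
rados bench -p testpool 60 rand -t 16
rados -p testpool cleanup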

In my case, I was running with 4 or 2 OSDs per Optane volume:

https://www.reddit.com/r/ceph/comments/k2lef1/how_many_waldb_partitions_can_you_run_per_optane/

but I couldn't seem to get the results I'd expected - so curious what
people are seeing in the real world - and of course, we might need to
follow the steps here to remove them as well.

Thanks,
Victor

On Thu, 30 Sept 2021 at 16:10, Eugen Block <eblock@xxxxxx> wrote:

> Yes, I believe for you it should work without containers, although I
> haven't tried the migrate command in a non-containerized cluster yet.
> But I believe this is a general issue for containerized clusters with
> regard to maintenance. I haven't checked yet if there are existing
> tracker issues for this, but maybe it would be worth creating one?
>
>
> Quoting "Szabo, Istvan (Agoda)" <Istvan.Szabo@xxxxxxxxx>:
>
> > Actually I don't have a containerized deployment, mine is a normal one. So
> > lvm migrate should work.
> >
> > Istvan Szabo
> > Senior Infrastructure Engineer
> > ---------------------------------------------------
> > Agoda Services Co., Ltd.
> > e: istvan.szabo@xxxxxxxxx
> > ---------------------------------------------------
> >
> > -----Original Message-----
> > From: Eugen Block <eblock@xxxxxx>
> > Sent: Wednesday, September 29, 2021 8:49 PM
> > To: 胡 玮文 <huww98@xxxxxxxxxxx>
> > Cc: Igor Fedotov <ifedotov@xxxxxxx>; Szabo, Istvan (Agoda)
> > <Istvan.Szabo@xxxxxxxxx>; ceph-users@xxxxxxx
> > Subject: Re: is it possible to remove the db+wal from an external
> > device (nvme)
> >
> > That's what I did and pasted the results in my previous comments.
> >
> >
> > Quoting 胡 玮文 <huww98@xxxxxxxxxxx>:
> >
> >> Yes. And the “cephadm shell” command does not depend on the running
> >> daemon; it will start a new container. So I think it is perfectly fine
> >> to stop the OSD first, then run the “cephadm shell” command, and run
> >> ceph-volume in the new shell.
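> >>
> >> Roughly something like this (just a sketch - the cluster fsid, OSD id
> >> and VG/LV names below are placeholders):
> >>
> >> systemctl stop ceph-<cluster-fsid>@osd.0.service
> >> cephadm shell -n osd.0
> >> # then, inside the container:
> >> ceph-volume lvm migrate --osd-id 0 --osd-fsid <osd-fsid> --from db --target <data-vg>/<data-lv>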
> >>
> >> From: Eugen Block<mailto:eblock@xxxxxx>
> >> Sent: September 29, 2021 21:40
> >> To: 胡 玮文<mailto:huww98@xxxxxxxxxxx>
> >> Cc: Igor Fedotov<mailto:ifedotov@xxxxxxx>; Szabo, Istvan
> >> (Agoda)<mailto:Istvan.Szabo@xxxxxxxxx>;
> >> ceph-users@xxxxxxx<mailto:ceph-users@xxxxxxx>
> >> Subject: Re: is it possible to remove the db+wal from an external device
> >> (nvme)
> >>
> >> The OSD has to be stopped in order to migrate DB/WAL, it can't be done
> >> live. ceph-volume requires a lock on the device.
> >>
> >>
> >> Quoting 胡 玮文 <huww98@xxxxxxxxxxx>:
> >>
> >>> I’ve not tried it, but how about:
> >>>
> >>> cephadm shell -n osd.0
> >>>
> >>> then run “ceph-volume” commands in the newly opened shell. The
> >>> directory structure seems fine.
> >>>
> >>> $ sudo cephadm shell -n osd.0
> >>> Inferring fsid e88d509a-f6fc-11ea-b25d-a0423f3ac864
> >>> Inferring config
> >>> /var/lib/ceph/e88d509a-f6fc-11ea-b25d-a0423f3ac864/osd.0/config
> >>> Using recent ceph image
> >>> cr.example.com/infra/ceph@sha256:8a0f6f285edcd6488e2c91d3f9fa43534d37d7a9b37db1e0ff6691aae6466530
> >>> root@host0:/# ll /var/lib/ceph/osd/ceph-0/
> >>> total 68
> >>> drwx------ 2 ceph ceph 4096 Sep 20 04:15 ./
> >>> drwxr-x--- 1 ceph ceph 4096 Sep 29 13:32 ../
> >>> lrwxrwxrwx 1 ceph ceph   24 Sep 20 04:15 block -> /dev/ceph-hdd/osd.0.data
> >>> lrwxrwxrwx 1 ceph ceph   23 Sep 20 04:15 block.db ->
> >>> /dev/ubuntu-vg/osd.0.db
> >>> -rw------- 1 ceph ceph   37 Sep 20 04:15 ceph_fsid
> >>> -rw------- 1 ceph ceph  387 Jun 21 13:24 config
> >>> -rw------- 1 ceph ceph   37 Sep 20 04:15 fsid
> >>> -rw------- 1 ceph ceph   55 Sep 20 04:15 keyring
> >>> -rw------- 1 ceph ceph    6 Sep 20 04:15 ready
> >>> -rw------- 1 ceph ceph    3 Apr  2 01:46 require_osd_release
> >>> -rw------- 1 ceph ceph   10 Sep 20 04:15 type
> >>> -rw------- 1 ceph ceph   38 Sep 17 14:26 unit.configured
> >>> -rw------- 1 ceph ceph   48 Nov  9  2020 unit.created
> >>> -rw------- 1 ceph ceph   35 Sep 17 14:26 unit.image
> >>> -rw------- 1 ceph ceph  306 Sep 17 14:26 unit.meta
> >>> -rw------- 1 ceph ceph 1317 Sep 17 14:26 unit.poststop
> >>> -rw------- 1 ceph ceph 3021 Sep 17 14:26 unit.run
> >>> -rw------- 1 ceph ceph  142 Sep 17 14:26 unit.stop
> >>> -rw------- 1 ceph ceph    2 Sep 20 04:15 whoami
> >>>
> >>> From: Eugen Block<mailto:eblock@xxxxxx>
> >>> Sent: September 29, 2021 21:29
> >>> To: Igor Fedotov<mailto:ifedotov@xxxxxxx>
> >>> Cc: 胡 玮文<mailto:huww98@xxxxxxxxxxx>; Szabo, Istvan
> >>> (Agoda)<mailto:Istvan.Szabo@xxxxxxxxx>;
> >>> ceph-users@xxxxxxx<mailto:ceph-users@xxxxxxx>
> >>> Subject: Re: Re: Re: [ceph-users] Re: is it possible to
> >>> remove the db+wal from an external device (nvme)
> >>>
> >>> Hi Igor,
> >>>
> >>> thanks for your input. I haven't done this in a prod env yet either,
> >>> still playing around in a virtual lab env.
> >>> I tried the symlink suggestion but it's not that easy, because the
> >>> layout underneath the ceph directory is different from what ceph-volume
> >>> expects. These are the services underneath:
> >>>
> >>> ses7-host1:~ # ll /var/lib/ceph/152fd738-01bc-11ec-a7fd-fa163e672db2/
> >>> insgesamt 48
> >>> drwx------ 3 root       root   4096 16. Sep 16:11 alertmanager.ses7-host1
> >>> drwx------ 3 ceph       ceph   4096 29. Sep 09:03 crash
> >>> drwx------ 2 ceph       ceph   4096 16. Sep 16:39 crash.ses7-host1
> >>> drwx------ 4 messagebus lp     4096 16. Sep 16:23 grafana.ses7-host1
> >>> drw-rw---- 2 root       root   4096 24. Aug 10:00 home
> >>> drwx------ 2 ceph       ceph   4096 16. Sep 16:37 mgr.ses7-host1.wmgyit
> >>> drwx------ 3 ceph       ceph   4096 16. Sep 16:37 mon.ses7-host1
> >>> drwx------ 2 nobody     nobody 4096 16. Sep 16:37 node-exporter.ses7-host1
> >>> drwx------ 2 ceph       ceph   4096 29. Sep 08:43 osd.0
> >>> drwx------ 2 ceph       ceph   4096 29. Sep 15:11 osd.1
> >>> drwx------ 4 root       root   4096 16. Sep 16:12 prometheus.ses7-host1
> >>>
> >>>
> >>> While the directory in a non-containerized deployment looks like this:
> >>>
> >>> nautilus:~ # ll /var/lib/ceph/osd/ceph-0/
> >>> insgesamt 24
> >>> lrwxrwxrwx 1 ceph ceph 93 29. Sep 12:21 block ->
> >>> /dev/ceph-a6d78a29-637f-494b-a839-76251fcff67e/osd-block-39340a48-54b3-4689-9896-f54d005c535d
> >>> -rw------- 1 ceph ceph 37 29. Sep 12:21 ceph_fsid
> >>> -rw------- 1 ceph ceph 37 29. Sep 12:21 fsid
> >>> -rw------- 1 ceph ceph 55 29. Sep 12:21 keyring
> >>> -rw------- 1 ceph ceph  6 29. Sep 12:21 ready
> >>> -rw------- 1 ceph ceph 10 29. Sep 12:21 type
> >>> -rw------- 1 ceph ceph  2 29. Sep 12:21 whoami
> >>>
> >>>
> >>> But even if I create the symlink to the osd directory it fails
> >>> because I only have ceph-volume within the containers where the
> >>> symlink is not visible to cephadm.
> >>>
> >>>
> >>> ses7-host1:~ # ll /var/lib/ceph/osd/ceph-1
> >>> lrwxrwxrwx 1 root root 57 29. Sep 15:08 /var/lib/ceph/osd/ceph-1 ->
> >>> /var/lib/ceph/152fd738-01bc-11ec-a7fd-fa163e672db2/osd.1/
> >>>
> >>> ses7-host1:~ # cephadm ceph-volume lvm migrate --osd-id 1 --osd-fsid
> >>> b4c772aa-07f8-483d-ae58-0ab97b8d0cc4 --from db --target
> >>> ceph-b1ddff4b-95e8-4b91-b451-a3ea35d16ec0/osd-block-b4c772aa-07f8-483d-ae58-0ab97b8d0cc4
> >>> Inferring fsid 152fd738-01bc-11ec-a7fd-fa163e672db2
> >>> [...]
> >>> /usr/bin/podman: stderr --> Migrate to existing, Source:
> >>> ['--devs-source', '/var/lib/ceph/osd/ceph-1/block.db'] Target:
> >>> /var/lib/ceph/osd/ceph-1/block
> >>> /usr/bin/podman: stderr  stdout: inferring bluefs devices from
> >>> bluestore path
> >>> /usr/bin/podman: stderr  stderr: can't migrate
> >>> /var/lib/ceph/osd/ceph-1/block.db, not a valid bluefs volume
> >>> /usr/bin/podman: stderr --> Failed to migrate device, error code:1
> >>> /usr/bin/podman: stderr --> Undoing lv tag set
> >>> /usr/bin/podman: stderr Failed to migrate to :
> >>> ceph-b1ddff4b-95e8-4b91-b451-a3ea35d16ec0/osd-block-b4c772aa-07f8-483d-ae58-0ab97b8d0cc4
> >>> Traceback (most recent call last):
> >>>    File "/usr/sbin/cephadm", line 6225, in <module>
> >>>      r = args.func()
> >>>    File "/usr/sbin/cephadm", line 1363, in _infer_fsid
> >>>      return func()
> >>>    File "/usr/sbin/cephadm", line 1422, in _infer_image
> >>>      return func()
> >>>    File "/usr/sbin/cephadm", line 3687, in command_ceph_volume
> >>>      out, err, code = call_throws(c.run_cmd(),
> >>> verbosity=CallVerbosity.VERBOSE)
> >>>    File "/usr/sbin/cephadm", line 1101, in call_throws
> >>>      raise RuntimeError('Failed command: %s' % ' '.join(command))
> >>> [...]
> >>>
> >>>
> >>> I could install the ceph-osd package (which ceph-volume is packaged
> >>> in), but it's not available by default (as you see, this is a SES 7
> >>> environment).
> >>>
> >>> I'm not sure what the design is here, it feels like the ceph-volume
> >>> migrate command is not applicable to containers yet.
> >>>
> >>> Regards,
> >>> Eugen
> >>>
> >>>
> >>> Quoting Igor Fedotov <ifedotov@xxxxxxx>:
> >>>
> >>>> Hi Eugen,
> >>>>
> >>>> indeed this looks like an issue related to containerized deployment,
> >>>> "ceph-volume lvm migrate" expects osd folder to be under
> >>>> /var/lib/ceph/osd:
> >>>>
> >>>>> stderr: 2021-09-29T06:56:24.787+0000 7fde05b96180 -1
> >>>>> bluestore(/var/lib/ceph/osd/ceph-1) _lock_fsid failed to lock
> >>>>> /var/lib/ceph/osd/ceph-1/fsid (is another ceph-osd still
> >>>>> running?)(11) Resource temporarily unavailable
> >>>>
> >>>> As a workaround you might want to try to create a symlink to your
> >>>> actual location before issuing the migrate command:
> >>>> /var/lib/ceph/osd ->
> >>>> /var/lib/ceph/152fd738-01bc-11ec-a7fd-fa163e672db2/
> >>>>
> >>>> A more complicated (and more general, IMO) way would be to run the
> >>>> migrate command from within a container deployed similarly (i.e.
> >>>> with all the proper subfolder mappings) to the ceph-osd one. Just
> >>>> speculating - I'm not a big expert in containers and have never tried
> >>>> that with a properly deployed production cluster...
> >>>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Igor
> >>>>
> >>>> On 9/29/2021 10:07 AM, Eugen Block wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I just tried with 'ceph-volume lvm migrate' in Octopus but it
> >>>>> doesn't really work. I'm not sure if I'm missing something here,
> >>>>> but I believe it's again the already discussed containers issue. To
> >>>>> be able to run the command for an OSD the OSD has to be offline,
> >>>>> but then you don't have access to the block.db because the path is
> >>>>> different from outside the container:
> >>>>>
> >>>>> ---snip---
> >>>>> [ceph: root@host1 /]# ceph-volume lvm migrate --osd-id 1 --osd-fsid
> >>>>> b4c772aa-07f8-483d-ae58-0ab97b8d0cc4 --from db --target
> >>>>> ceph-b1ddff4b-95e8-4b91-b451-a3ea35d16ec0/osd-block-b4c772aa-07f8-483d-ae58-0ab97b8d0cc4
> >>>>> --> Migrate to existing, Source:
> >>>>> ['--devs-source', '/var/lib/ceph/osd/ceph-1/block.db']
> >>>>> Target:
> >>>>> /var/lib/ceph/osd/ceph-1/block
> >>>>>  stdout: inferring bluefs devices from bluestore path
> >>>>>  stderr:
> >>>>> /home/abuild/rpmbuild/BUILD/ceph-15.2.14-84-gb6e5642e260/src/os/bluestore/BlueStore.cc:
> >>>>> In function 'int BlueStore::_mount_for_bluefs()' thread 7fde05b96180
> >>>>> time 2021-09-29T06:56:24.790161+0000
> >>>>>  stderr:
> >>>>> /home/abuild/rpmbuild/BUILD/ceph-15.2.14-84-gb6e5642e260/src/os/bluestore/BlueStore.cc:
> >>>>> 6876: FAILED ceph_assert(r == 0)
> >>>>>  stderr: 2021-09-29T06:56:24.787+0000 7fde05b96180 -1
> >>>>> bluestore(/var/lib/ceph/osd/ceph-1) _lock_fsid failed to lock
> >>>>> /var/lib/ceph/osd/ceph-1/fsid (is another ceph-osd still
> >>>>> running?)(11) Resource temporarily unavailable
> >>>>>
> >>>>>
> >>>>> # path outside
> >>>>> host1:~ # ll
> >>>>> /var/lib/ceph/152fd738-01bc-11ec-a7fd-fa163e672db2/osd.1/
> >>>>> insgesamt 60
> >>>>> lrwxrwxrwx 1 ceph ceph   93 29. Sep 08:43 block ->
> >>>>> /dev/ceph-b1ddff4b-95e8-4b91-b451-a3ea35d16ec0/osd-block-b4c772aa-07f8-483d-ae58-0ab97b8d0cc4
> >>>>> lrwxrwxrwx 1 ceph ceph   90 29. Sep 08:43 block.db ->
> >>>>> /dev/ceph-6f1b8f49-daf2-4631-a2ef-12e9452b01ea/osd-db-69b11aa0-af96-443e-8f03-5afa5272131f
> >>>>> ---snip---
> >>>>>
> >>>>>
> >>>>> But if I shut down the OSD I can't access the block and block.db
> >>>>> devices. I'm not even sure how this is supposed to work with
> >>>>> cephadm. Maybe I'm misunderstanding, though. Or is there a way to
> >>>>> provide the offline block.db path to 'ceph-volume lvm migrate'?
> >>>>>
> >>>>>
> >>>>>
> >>>>> Quoting 胡 玮文 <huww98@xxxxxxxxxxx>:
> >>>>>
> >>>>>> You may need to use `ceph-volume lvm migrate’ [1] instead of
> >>>>>> ceph-bluestore-tool. If I recall correctly, this is a pretty new
> >>>>>> feature; I’m not sure whether it is available in your version.
> >>>>>>
> >>>>>> If you use ceph-bluestore-tool, then you need to modify the LVM
> >>>>>> tags manually. Please refer to the previous threads, e.g. [2] and
> >>>>>> some more.
> >>>>>>
> >>>>>> [1]: https://docs.ceph.com/en/latest/man/8/ceph-volume/#migrate
> >>>>>> [2]:
> >>>>>> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/VX23NQ66P3PPEX36T3PYYMHPLBSFLMYA/#JLNDFGXR4ZLY27DHD3RJTTZEDHRZJO4Q
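> >>>>>>
> >>>>>> The manual tag editing is roughly along these lines (only a sketch -
> >>>>>> the VG/LV names and tag values are placeholders, check your own with
> >>>>>> lvs first):
> >>>>>>
> >>>>>> lvs -o lv_name,lv_tags <data-vg>/<data-lv>
> >>>>>> lvchange --deltag "ceph.db_device=/dev/<db-vg>/<db-lv>" <data-vg>/<data-lv>
> >>>>>> lvchange --deltag "ceph.db_uuid=<old-db-uuid>" <data-vg>/<data-lv>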
> >>>>>>
> >>>>>> From: Szabo, Istvan (Agoda)<mailto:Istvan.Szabo@xxxxxxxxx>
> >>>>>> Sent: September 28, 2021 18:20
> >>>>>> To: Eugen Block<mailto:eblock@xxxxxx>;
> >>>>>> ceph-users@xxxxxxx<mailto:ceph-users@xxxxxxx>
> >>>>>> Subject: Re: is it possible to remove the db+wal from an
> >>>>>> external device (nvme)
> >>>>>>
> >>>>>> Gave it a try, and all 3 osds eventually failed :/ Not sure
> >>>>>> what went wrong.
> >>>>>>
> >>>>>> I did the normal maintenance things - ceph osd set noout, ceph osd set
> >>>>>> norebalance - stopped the osd and ran this command:
> >>>>>> ceph-bluestore-tool bluefs-bdev-migrate --dev-target
> >>>>>> /var/lib/ceph/osd/ceph-0/block --devs-source
> >>>>>> /var/lib/ceph/osd/ceph-8/block.db --path /var/lib/ceph/osd/ceph-8/
> >>>>>> Output:
> >>>>>> device removed:1 /var/lib/ceph/osd/ceph-8/block.db
> >>>>>> device added: 1 /dev/dm-2
> >>>>>>
> >>>>>> When I tried to start it I got this in the log:
> >>>>>> osd.8 0 OSD:init: unable to mount object store
> >>>>>>  ** ERROR: osd init failed: (13) Permission denied
> >>>>>> set uid:gid to 167:167 (ceph:ceph)
> >>>>>> ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable), process ceph-osd, pid 1512261
> >>>>>> pidfile_write: ignore empty --pid-file
> >>>>>>
> >>>>>> From the other 2 osds the block.db was removed and I could start them
> >>>>>> back. I've zapped the db drive to remove it from the device completely,
> >>>>>> and after a machine restart none of these 2 osds came back - I guess
> >>>>>> they are missing the db device.
> >>>>>>
> >>>>>> Are there any steps missing?
> >>>>>> 1. Noout + norebalance
> >>>>>> 2. Stop the osd
> >>>>>> 3. Migrate the block.db to the block with the above command
> >>>>>> 4. Do the same on the other osds that share the db device I want to remove
> >>>>>> 5. Zap the db device
> >>>>>> 6. Start the osds back
> >>>>>>
> >>>>>> Istvan Szabo
> >>>>>> Senior Infrastructure Engineer
> >>>>>> ---------------------------------------------------
> >>>>>> Agoda Services Co., Ltd.
> >>>>>> e: istvan.szabo@xxxxxxxxx
> >>>>>> ---------------------------------------------------
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Eugen Block <eblock@xxxxxx>
> >>>>>> Sent: Monday, September 27, 2021 7:42 PM
> >>>>>> To: ceph-users@xxxxxxx
> >>>>>> Subject:  Re: is it possible to remove the db+wal from
> >>>>>> an external device (nvme)
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I think 'ceph-bluestore-tool bluefs-bdev-migrate' could be of use
> >>>>>> here. I haven't tried it in a production environment yet, only in
> >>>>>> virtual labs.
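> >>>>>>
> >>>>>> In the lab it was roughly this, with the OSD stopped first (only a
> >>>>>> sketch - the osd id and paths are examples, adjust to your layout):
> >>>>>>
> >>>>>> ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-8 \
> >>>>>>   --devs-source /var/lib/ceph/osd/ceph-8/block.db \
> >>>>>>   --dev-target /var/lib/ceph/osd/ceph-8/block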
> >>>>>>
> >>>>>> Regards,
> >>>>>> Eugen
> >>>>>>
> >>>>>>
> >>>>>> Quoting "Szabo, Istvan (Agoda)" <Istvan.Szabo@xxxxxxxxx>:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Seems like in our config the nvme device used as wal+db in front of
> >>>>>>> the ssds is slowing down the ssd osds.
> >>>>>>> I'd like to avoid rebuilding all the osds - is there a way to migrate
> >>>>>>> the wal+db back to the "slower device" without reinstalling?
> >>>>>>>
> >>>>>>> Ty
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



