Re: Fwd: Re: Ceph osd will not start.

Thanks David
We will investigate the bugs as per your suggestion, and will then look at
testing with the custom image.

Appreciate it.

On Sat, May 29, 2021, 4:11 PM David Orman <ormandj@xxxxxxxxxxxx> wrote:

> You may be running into the same issue we ran into (make sure to read
> the first issue, there are a few mingled in there), for which we
> submitted a patch:
>
> https://tracker.ceph.com/issues/50526
> https://github.com/alfredodeza/remoto/issues/62
>
> If you're brave (YMMV, test in non-prod first), we pushed an image with
> the issue we encountered fixed as described above:
> https://hub.docker.com/repository/docker/ormandj/ceph/tags?page=1 . We
> 'upgraded' to this when we encountered the mgr hanging on us after
> updating Ceph to v16 and hitting this issue, using: "ceph orch
> upgrade start --image docker.io/ormandj/ceph:v16.2.3-mgrfix". I've not
> tried to bootstrap a new cluster with a custom image, and I don't know
> when 16.2.4 will be released with this change (hopefully) integrated,
> as remoto accepted the patch upstream.
>
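> For what it's worth, if you did want to bootstrap a fresh cluster with a
> custom image, my understanding is that cephadm's global --image flag
> should cover it; a rough, untested sketch (substitute your own monitor
> IP for the placeholder):
>
> cephadm --image docker.io/ormandj/ceph:v16.2.3-mgrfix bootstrap \
>     --mon-ip <mon-ip>
>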
> I'm not sure if this is your exact issue; check the bug reports and see
> whether the lock/behavior matches. If so, this may help you out. The
> only change in that image is that remoto patch overlaid on the default
> 16.2.3 image.
>
> On Fri, May 28, 2021 at 1:15 PM Marco Pizzolo <marcopizzolo@xxxxxxxxx>
> wrote:
> >
> > Peter,
> >
> > We're seeing the same issues as you are. We have 2 new hosts (Intel(R)
> > Xeon(R) Gold 6248R CPU @ 3.00GHz w/ 48 cores, 384GB RAM, and 60x 10TB SED
> > drives), and we have tried both 15.2.13 and 16.2.4.
> >
> > Cephadm does NOT properly deploy and activate OSDs on Ubuntu 20.04.2 with
> > Docker.
> >
> > It seems to be a bug in cephadm and a product regression, as we have 4
> > near-identical nodes on CentOS running Nautilus (240x 10TB SED drives)
> > and had no problems.
> >
> > FWIW we had no luck yet with one-by-one OSD daemon additions through ceph
> > orch either.  We also reproduced the issue easily in a virtual lab using
> > small virtual disks on a single ceph VM with 1 mon.
> >
> > We are now looking into whether we can get past this with a manual
> > buildout.
> >
> > If you, or anyone, has hit the same stumbling block and gotten past it, I
> > would really appreciate some guidance.
> >
> > Thanks,
> > Marco
> >
> > On Thu, May 27, 2021 at 2:23 PM Peter Childs <pchilds@xxxxxxx> wrote:
> >
> > > In the end it looks like I might be able to get the node up to about 30
> > > OSDs before it stops creating any more.
> > >
> > > Or rather, it formats the disks but freezes up starting the daemons.
> > >
> > > I suspect I'm missing something I can tune to get it working better.
> > >
> > > Error messages might help, but I've yet to spot anything.
> > >
> > > Peter.
> > >
> > > On Wed, 26 May 2021, 10:57 Eugen Block, <eblock@xxxxxx> wrote:
> > >
> > > > > If I add the osd daemons one at a time with
> > > > >
> > > > > ceph orch daemon add osd drywood12:/dev/sda
> > > > >
> > > > > It does actually work,
> > > >
> > > > Great!
> > > >
> > > > > I suspect what's happening is that when my rule for creating OSDs
> > > > > runs and creates them all at once, it ties up the orchestrator and
> > > > > overloads cephadm, and it can't cope.
> > > >
> > > > It's possible, I guess.
> > > >
> > > > > I suspect what I might need to do, at least to work around the
> > > > > issue, is set "limit:" and bring it up until it stops working.
> > > >
> > > > It's worth a try, yes, although the docs state you should try to
> > > > avoid it. It's possible that it doesn't work properly; in that case,
> > > > create a bug report. ;-)
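> > > >
> > > > For reference, a rough sketch of what that might look like (untested
> > > > here; it just mirrors the spec you posted and adds the "limit" device
> > > > filter under data_devices):
> > > >
> > > > service_type: osd
> > > > service_name: osd.drywood-disks
> > > > placement:
> > > >   host_pattern: 'drywood*'
> > > > spec:
> > > >   data_devices:
> > > >     size: "7TB:"
> > > >     limit: 10
> > > >   objectstore: bluestore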
> > > >
> > > > > I did work out how to get ceph-volume to nearly work manually.
> > > > >
> > > > > cephadm shell
> > > > > ceph auth get client.bootstrap-osd -o
> > > > > /var/lib/ceph/bootstrap-osd/ceph.keyring
> > > > > ceph-volume lvm create --data /dev/sda --dmcrypt
> > > > >
> > > > > but given I've now got "add osd" to work, I suspect I just need to
> > > > > fine-tune my OSD creation rules so they don't try to create too many
> > > > > OSDs on the same node at the same time.
> > > >
> > > > I agree, no need to do it manually if there is an automated way,
> > > > especially if you're trying to bring up dozens of OSDs.
> > > >
> > > >
> > > > Zitat von Peter Childs <pchilds@xxxxxxx>:
> > > >
> > > > > After a bit of messing around, I managed to get it somewhat working.
> > > > >
> > > > > If I add the osd daemons one at a time with
> > > > >
> > > > > ceph orch daemon add osd drywood12:/dev/sda
> > > > >
> > > > > It does actually work,
> > > > >
> > > > > I suspect what's happening is that when my rule for creating OSDs
> > > > > runs and creates them all at once, it ties up the orchestrator and
> > > > > overloads cephadm, and it can't cope.
> > > > >
> > > > > service_type: osd
> > > > > service_name: osd.drywood-disks
> > > > > placement:
> > > > >   host_pattern: 'drywood*'
> > > > > spec:
> > > > >   data_devices:
> > > > >     size: "7TB:"
> > > > >   objectstore: bluestore
> > > > >
> > > > > I suspect what I might need to do, at least to work around the
> > > > > issue, is set "limit:" and bring it up until it stops working.
> > > > >
> > > > > I did work out how to get ceph-volume to nearly work manually.
> > > > >
> > > > > cephadm shell
> > > > > ceph auth get client.bootstrap-osd -o
> > > > > /var/lib/ceph/bootstrap-osd/ceph.keyring
> > > > > ceph-volume lvm create --data /dev/sda --dmcrypt
> > > > >
> > > > > but given I've now got "add osd" to work, I suspect I just need to
> > > > > fine-tune my OSD creation rules so they don't try to create too many
> > > > > OSDs on the same node at the same time.
> > > > >
> > > > >
> > > > >
> > > > > On Wed, 26 May 2021 at 08:25, Eugen Block <eblock@xxxxxx> wrote:
> > > > >
> > > > >> Hi,
> > > > >>
> > > > >> I believe your current issue is due to a missing keyring for
> > > > >> client.bootstrap-osd on the OSD node. But even after fixing that
> > > > >> you probably still won't be able to deploy an OSD manually with
> > > > >> ceph-volume because 'ceph-volume activate' is not supported with
> > > > >> cephadm [1]. I just tried that in a virtual environment; it fails
> > > > >> when activating the systemd unit:
> > > > >>
> > > > >> ---snip---
> > > > >> [2021-05-26 06:47:16,677][ceph_volume.process][INFO  ] Running
> > > > >> command: /usr/bin/systemctl enable
> > > > >> ceph-volume@lvm-8-1a8fc8ae-8f4c-4f91-b044-d5636bb52456
> > > > >> [2021-05-26 06:47:16,692][ceph_volume.process][INFO  ] stderr
> Failed
> > > > >> to connect to bus: No such file or directory
> > > > >> [2021-05-26 06:47:16,693][ceph_volume.devices.lvm.create][ERROR ]
> lvm
> > > > >> activate was unable to complete, while creating the OSD
> > > > >> Traceback (most recent call last):
> > > > >>    File
> > > > >>
> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/create.py",
> > > > >> line 32, in create
> > > > >>      Activate([]).activate(args)
> > > > >>    File
> "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py",
> > > > >> line 16, in is_root
> > > > >>      return func(*a, **kw)
> > > > >>    File
> > > > >>
> > > "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py",
> > > > >> line
> > > > >> 294, in activate
> > > > >>      activate_bluestore(lvs, args.no_systemd)
> > > > >>    File
> > > > >>
> > > "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py",
> > > > >> line
> > > > >> 214, in activate_bluestore
> > > > >>      systemctl.enable_volume(osd_id, osd_fsid, 'lvm')
> > > > >>    File
> > > > >>
> "/usr/lib/python3.6/site-packages/ceph_volume/systemd/systemctl.py",
> > > > >> line 82, in enable_volume
> > > > >>      return enable(volume_unit % (device_type, id_, fsid))
> > > > >>    File
> > > > >>
> "/usr/lib/python3.6/site-packages/ceph_volume/systemd/systemctl.py",
> > > > >> line 22, in enable
> > > > >>      process.run(['systemctl', 'enable', unit])
> > > > >>    File "/usr/lib/python3.6/site-packages/ceph_volume/process.py",
> > > > >> line 153, in run
> > > > >>      raise RuntimeError(msg)
> > > > >> RuntimeError: command returned non-zero exit status: 1
> > > > >> [2021-05-26 06:47:16,694][ceph_volume.devices.lvm.create][INFO  ]
> will
> > > > >> rollback OSD ID creation
> > > > >> [2021-05-26 06:47:16,697][ceph_volume.process][INFO  ] Running
> > > > >> command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd
> > > > >> --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new
> osd.8
> > > > >> --yes-i-really-mean-it
> > > > >> [2021-05-26 06:47:17,597][ceph_volume.process][INFO  ] stderr
> purged
> > > > osd.8
> > > > >> ---snip---
> > > > >>
> > > > >> There's a workaround described in [2], but it's not really an option
> > > > >> for dozens of OSDs. I think your best approach is to get cephadm to
> > > > >> activate the OSDs for you.
> > > > >> You wrote you didn't find any helpful error messages, but did cephadm
> > > > >> even try to deploy OSDs? What does your OSD spec file look like? Did
> > > > >> you explicitly run 'ceph orch apply osd -i specfile.yml'? This should
> > > > >> trigger cephadm, and you should see at least some output like this:
> > > > >>
> > > > >> Mai 26 08:21:48 pacific1 conmon[31446]:
> 2021-05-26T06:21:48.466+0000
> > > > >> 7effc15ff700  0 log_channel(cephadm) log [INF] : Applying service
> > > > >> osd.ssd-hdd-mix on host pacific2...
> > > > >> Mai 26 08:21:49 pacific1 conmon[31009]: cephadm
> > > > >> 2021-05-26T06:21:48.469611+0000 mgr.pacific1.whndiw (mgr.14166)
> 1646 :
> > > > >> cephadm [INF] Applying service osd.ssd-hdd-mix on host pacific2...
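> > > > >>
> > > > >> If you don't see anything like that, a couple of quick checks (just
> > > > >> a sketch; exact output can vary by version): 'ceph orch ls osd
> > > > >> --export' to confirm the spec was actually registered, and 'ceph log
> > > > >> last 50 info cephadm' to see recent cephadm activity.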
> > > > >>
> > > > >> Regards,
> > > > >> Eugen
> > > > >>
> > > > >> [1] https://tracker.ceph.com/issues/49159
> > > > >> [2] https://tracker.ceph.com/issues/46691
> > > > >>
> > > > >>
> > > > >> Zitat von Peter Childs <pchilds@xxxxxxx>:
> > > > >>
> > > > >> > Not sure what I'm doing wrong; I suspect it's the way I'm running
> > > > >> > ceph-volume.
> > > > >> >
> > > > >> > root@drywood12:~# cephadm ceph-volume lvm create --data
> /dev/sda
> > > > >> --dmcrypt
> > > > >> > Inferring fsid 1518c8e0-bbe4-11eb-9772-001e67dc85ea
> > > > >> > Using recent ceph image ceph/ceph@sha256
> > > > >> >
> :54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949
> > > > >> > /usr/bin/docker: Running command: /usr/bin/ceph-authtool
> > > > --gen-print-key
> > > > >> > /usr/bin/docker: Running command: /usr/bin/ceph-authtool
> > > > --gen-print-key
> > > > >> > /usr/bin/docker: -->  RuntimeError: No valid ceph configuration
> file
> > > > was
> > > > >> > loaded.
> > > > >> > Traceback (most recent call last):
> > > > >> >   File "/usr/sbin/cephadm", line 8029, in <module>
> > > > >> >     main()
> > > > >> >   File "/usr/sbin/cephadm", line 8017, in main
> > > > >> >     r = ctx.func(ctx)
> > > > >> >   File "/usr/sbin/cephadm", line 1678, in _infer_fsid
> > > > >> >     return func(ctx)
> > > > >> >   File "/usr/sbin/cephadm", line 1738, in _infer_image
> > > > >> >     return func(ctx)
> > > > >> >   File "/usr/sbin/cephadm", line 4514, in command_ceph_volume
> > > > >> >     out, err, code = call_throws(ctx, c.run_cmd(),
> > > > verbosity=verbosity)
> > > > >> >   File "/usr/sbin/cephadm", line 1464, in call_throws
> > > > >> >     raise RuntimeError('Failed command: %s' % ' '.join(command))
> > > > >> > RuntimeError: Failed command: /usr/bin/docker run --rm
> --ipc=host
> > > > >> > --net=host --entrypoint /usr/sbin/ceph-volume --privileged
> > > > >> --group-add=disk
> > > > >> > --init -e CONTAINER_IMAGE=ceph/ceph@sha256
> > > :54e95ae1e11404157d7b329d0t
> > > > >> >
> > > > >> > root@drywood12:~# cephadm shell
> > > > >> > Inferring fsid 1518c8e0-bbe4-11eb-9772-001e67dc85ea
> > > > >> > Inferring config
> > > > >> >
> > > >
> /var/lib/ceph/1518c8e0-bbe4-11eb-9772-001e67dc85ea/mon.drywood12/config
> > > > >> > Using recent ceph image ceph/ceph@sha256
> > > > >> >
> :54e95ae1e11404157d7b329d0bef866ebbb214b195a009e87aae4eba9d282949
> > > > >> > root@drywood12:/# ceph-volume lvm create --data /dev/sda
> --dmcrypt
> > > > >> > Running command: /usr/bin/ceph-authtool --gen-print-key
> > > > >> > Running command: /usr/bin/ceph-authtool --gen-print-key
> > > > >> > Running command: /usr/bin/ceph --cluster ceph --name
> > > > client.bootstrap-osd
> > > > >> > --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new
> > > > >> > 70054a5c-c176-463a-a0ac-b44c5db0987c
> > > > >> >  stderr: 2021-05-25T07:46:18.188+0000 7fdef8f0d700 -1 auth:
> unable
> > > to
> > > > >> find
> > > > >> > a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No
> such
> > > > file
> > > > >> or
> > > > >> > directory
> > > > >> >  stderr: 2021-05-25T07:46:18.188+0000 7fdef8f0d700 -1
> > > > >> > AuthRegistry(0x7fdef405b378) no keyring found at
> > > > >> > /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
> > > > >> >  stderr: 2021-05-25T07:46:18.188+0000 7fdef8f0d700 -1 auth:
> unable
> > > to
> > > > >> find
> > > > >> > a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No
> such
> > > > file
> > > > >> or
> > > > >> > directory
> > > > >> >  stderr: 2021-05-25T07:46:18.188+0000 7fdef8f0d700 -1
> > > > >> > AuthRegistry(0x7fdef405ef20) no keyring found at
> > > > >> > /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
> > > > >> >  stderr: 2021-05-25T07:46:18.188+0000 7fdef8f0d700 -1 auth:
> unable
> > > to
> > > > >> find
> > > > >> > a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No
> such
> > > > file
> > > > >> or
> > > > >> > directory
> > > > >> >  stderr: 2021-05-25T07:46:18.188+0000 7fdef8f0d700 -1
> > > > >> > AuthRegistry(0x7fdef8f0bea0) no keyring found at
> > > > >> > /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
> > > > >> >  stderr: 2021-05-25T07:46:18.188+0000 7fdef2d9d700 -1
> > > > monclient(hunting):
> > > > >> > handle_auth_bad_method server allowed_methods [2] but i only
> support
> > > > [1]
> > > > >> >  stderr: 2021-05-25T07:46:18.188+0000 7fdef259c700 -1
> > > > monclient(hunting):
> > > > >> > handle_auth_bad_method server allowed_methods [2] but i only
> support
> > > > [1]
> > > > >> >  stderr: 2021-05-25T07:46:18.188+0000 7fdef1d9b700 -1
> > > > monclient(hunting):
> > > > >> > handle_auth_bad_method server allowed_methods [2] but i only
> support
> > > > [1]
> > > > >> >  stderr: 2021-05-25T07:46:18.188+0000 7fdef8f0d700 -1 monclient:
> > > > >> > authenticate NOTE: no keyring found; disabled cephx
> authentication
> > > > >> >  stderr: [errno 13] RADOS permission denied (error connecting
> to the
> > > > >> > cluster)
> > > > >> > -->  RuntimeError: Unable to create a new OSD id
> > > > >> > root@drywood12:/# lsblk /dev/sda
> > > > >> > NAME MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
> > > > >> > sda    8:0    0  7.3T  0 disk
> > > > >> >
> > > > >> > As far as I can see, cephadm gets a little further than this, as
> > > > >> > the disks have LVM volumes on them; it's just that the OSD daemons
> > > > >> > are not created or started.
> > > > >> > So maybe I'm invoking ceph-volume incorrectly.
> > > > >> >
> > > > >> >
> > > > >> > On Tue, 25 May 2021 at 06:57, Peter Childs <pchilds@xxxxxxx>
> wrote:
> > > > >> >
> > > > >> >>
> > > > >> >>
> > > > >> >> On Mon, 24 May 2021, 21:08 Marc, <Marc@xxxxxxxxxxxxxxxxx>
> wrote:
> > > > >> >>
> > > > >> >>> >
> > > > >> >>> > I'm attempting to use cephadm and Pacific, currently on Debian
> > > > >> >>> > buster, mostly because centos7 ain't supported any more and
> > > > >> >>> > centos8 ain't supported by some of my hardware.
> > > > >> >>>
> > > > >> >>> Who says centos7 is not supported any more? Afaik centos7/el7 is
> > > > >> >>> being supported until its EOL in 2024. By then maybe a good
> > > > >> >>> alternative for el8/stream will have surfaced.
> > > > >> >>>
> > > > >> >>
> > > > >> >> Not supported by Ceph Pacific; it's our OS of choice otherwise.
> > > > >> >>
> > > > >> >> My testing says the versions of podman, docker, and python3
> > > > >> >> available there do not work with Pacific.
> > > > >> >>
> > > > >> >> Given I've needed to upgrade docker on buster, can we please have a
> > > > >> >> list of versions that work with cephadm, and maybe even have
> > > > >> >> cephadm say "no, please upgrade" unless you're running the right
> > > > >> >> version or better.
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >>> > Anyway, I have a few nodes with 59x 7.2TB disks, but for some
> > > > >> >>> > reason the OSD daemons don't start; the disks get formatted and
> > > > >> >>> > the OSDs are created, but the daemons never come up.
> > > > >> >>>
> > > > >> >>> what if you try with
> > > > >> >>> ceph-volume lvm create --data /dev/sdi --dmcrypt ?
> > > > >> >>>
> > > > >> >>
> > > > >> >> I'll have a go.
> > > > >> >>
> > > > >> >>
> > > > >> >>> > They are probably the wrong spec for Ceph (48GB of memory and
> > > > >> >>> > only 4 cores)
> > > > >> >>>
> > > > >> >>> You can always start with just configuring a few disks per node.
> > > > >> >>> That should always work.
> > > > >> >>>
> > > > >> >>
> > > > >> >> That was my thought too.
> > > > >> >>
> > > > >> >> Thanks
> > > > >> >>
> > > > >> >> Peter
> > > > >> >>
> > > > >> >>
> > > > >> >>> > but I was expecting them to start and be either dirt slow or
> > > > >> >>> > crash later. Anyway, I've got up to 30 of them, so I was hoping
> > > > >> >>> > to get at least 6PB of raw storage out of them.
> > > > >> >>> >
> > > > >> >>> > As yet I've not spotted any helpful error messages.
> > > > >> >>> >
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


