Hi,

Thank you Sebastian, creating the folder /usr/lib/sysctl.d fixed the bug! So it's a Debian-specific bug.
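For anyone else who hits this: the workaround boils down to creating that directory on every Debian host so cephadm can write its sysctl conf, along the lines of the sketch below. The host names ceph01..ceph03 are the ones from this thread, and ssh/root access is only an assumption about the setup, so adapt as needed:

    # create the directory cephadm wants to write its sysctl conf into,
    # on every Debian host in the cluster
    for h in ceph01 ceph02 ceph03; do
        ssh root@"$h" 'mkdir -p /usr/lib/sysctl.d'
    done

With the directory in place, the OSD deploy no longer fails writing the sysctl conf (the FileNotFoundError quoted below).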
'Jof

On Thu, Sep 2, 2021 at 10:52, Sebastian Wagner <sewagner@xxxxxxxxxx> wrote:

> Can you verify that the `/usr/lib/sysctl.d/` folder exists on your
> Debian machines?
>
> On 01.09.21 at 15:19, Alcatraz wrote:
> > Sebastian,
> >
> > I appreciate all your help. I actually (out of desperation) spun up
> > another cluster, same specs, just using Ubuntu 18.04 rather than
> > Debian 10. All the OSDs were recognized, and all went up/in without
> > issue.
> >
> > Thanks
> >
> > On 9/1/21 06:15, Sebastian Wagner wrote:
> >> On 30.08.21 at 17:39, Alcatraz wrote:
> >>> Sebastian,
> >>>
> >>> Thanks for responding! And of course.
> >>>
> >>> 1. ceph orch ls --service-type osd --format yaml
> >>>
> >>> Output:
> >>>
> >>> service_type: osd
> >>> service_id: all-available-devices
> >>> service_name: osd.all-available-devices
> >>> placement:
> >>>   host_pattern: '*'
> >>> unmanaged: true
> >>> spec:
> >>>   data_devices:
> >>>     all: true
> >>>   filter_logic: AND
> >>>   objectstore: bluestore
> >>> status:
> >>>   created: '2021-08-30T13:57:51.000178Z'
> >>>   last_refresh: '2021-08-30T15:24:10.534710Z'
> >>>   running: 0
> >>>   size: 6
> >>> events:
> >>> - 2021-08-30T03:48:01.652108Z service:osd.all-available-devices
> >>>   [INFO] "service was created"
> >>> - 2021-08-30T03:49:00.267808Z service:osd.all-available-devices
> >>>   [ERROR] "Failed to apply: cephadm exited with an error code: 1,
> >>>   stderr: Non-zero exit code 1 from /usr/bin/docker container
> >>>   inspect --format {{.State.Status}}
> >>>   ceph-d1405594-0944-11ec-8ebc-f23c92edc936-osd.0
> >>>   /usr/bin/docker: stdout
> >>>   /usr/bin/docker: stderr Error: No such container:
> >>>   ceph-d1405594-0944-11ec-8ebc-f23c92edc936-osd.0
> >>>   Deploy daemon osd.0 ...
> >>>   Traceback (most recent call last):
> >>>     File "/var/lib/ceph/d1405594-0944-11ec-8ebc-f23c92edc936/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 8230, in <module>
> >>>       main()
> >>>     File "/var/lib/ceph/d1405594-0944-11ec-8ebc-f23c92edc936/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 8218, in main
> >>>       r = ctx.func(ctx)
> >>>     File "/var/lib/ceph/d1405594-0944-11ec-8ebc-f23c92edc936/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 1759, in _default_image
> >>>       return func(ctx)
> >>>     File "/var/lib/ceph/d1405594-0944-11ec-8ebc-f23c92edc936/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 4326, in command_deploy
> >>>       ports=daemon_ports)
> >>>     File "/var/lib/ceph/d1405594-0944-11ec-8ebc-f23c92edc936/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 2632, in deploy_daemon
> >>>       c, osd_fsid=osd_fsid, ports=ports)
> >>>     File "/var/lib/ceph/d1405594-0944-11ec-8ebc-f23c92edc936/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 2801, in deploy_daemon_units
> >>>       install_sysctl(ctx, fsid, daemon_type)
> >>>     File "/var/lib/ceph/d1405594-0944-11ec-8ebc-f23c92edc936/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 2963, in install_sysctl
> >>>       _write(conf, lines)
> >>>     File "/var/lib/ceph/d1405594-0944-11ec-8ebc-f23c92edc936/cephadm.d4237e4639c108308fe13147b1c08af93c3d5724d9ff21ae797eb4b78fea3931", line 2948, in _write
> >>>       with open(conf, 'w') as f:
> >>>   FileNotFoundError: [Errno 2] No such file or directory:
> >>>   '/usr/lib/sysctl.d/90-ceph-d1405594-0944-11ec-8ebc-f23c92edc936-osd.conf'"
> >>
> >> https://tracker.ceph.com/issues/52481
> >>
> >>> - 2021-08-30T03:49:08.356762Z service:osd.all-available-devices
> >>>   [ERROR] "Failed to apply: auth get failed: failed to find osd.0
> >>>   in keyring retval: -2"
> >>> - 2021-08-30T03:52:34.100977Z service:osd.all-available-devices
> >>>   [ERROR] "Failed to apply: auth get failed: failed to find osd.3
> >>>   in keyring retval: -2"
> >>> - 2021-08-30T03:52:42.260439Z service:osd.all-available-devices
> >>>   [ERROR] "Failed to apply: auth get failed: failed to find osd.6
> >>>   in keyring retval: -2"
> >>
> >> Will be fixed by https://github.com/ceph/ceph/pull/42989
> >>
> >>> 2. ceph orch ps --daemon-type osd --format yaml
> >>>
> >>> Output: ...snip...
> >>>
> >>> 3. ceph auth add osd.0 osd 'allow *' mon 'allow rwx' -i
> >>> /var/lib/ceph/d1405594-0944-11ec-8ebc-f23c92edc936/osd.0/keyring
> >>>
> >>> I verified the
> >>> /var/lib/ceph/d1405594-0944-11ec-8ebc-f23c92edc936/osd.0/keyring
> >>> file does exist.
> >>>
> >>> Output:
> >>>
> >>> Error EINVAL: caps cannot be specified both in keyring and in command
> >>
> >> You only need to create the keyring; you don't need to store the
> >> keyring anywhere. I'd still suggest somehow creating the keyring,
> >> but I haven't seen this particular error before.
> >>
> >> hth
> >>
> >> Sebastian
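That EINVAL is presumably because the keyring cephadm wrote under /var/lib/ceph/<fsid>/osd.0/ already carries caps, so specifying caps on the command line as well gets rejected. Untested here, but something along these lines should avoid the clash (the keyring path is the one from above):

    # let the caps come from the keyring file itself
    ceph auth add osd.0 -i /var/lib/ceph/d1405594-0944-11ec-8ebc-f23c92edc936/osd.0/keyring

    # or import the entry wholesale, key and caps included
    ceph auth import -i /var/lib/ceph/d1405594-0944-11ec-8ebc-f23c92edc936/osd.0/keyring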
> >>
> >>> Thanks
> >>>
> >>> On 8/30/21 10:28, Sebastian Wagner wrote:
> >>>> Could you run
> >>>>
> >>>> 1. ceph orch ls --service-type osd --format yaml
> >>>>
> >>>> 2. ceph orch ps --daemon-type osd --format yaml
> >>>>
> >>>> 3. try running the `ceph auth add` call from
> >>>> https://docs.ceph.com/en/mimic/rados/operations/add-or-rm-osds/#adding-an-osd-manual
> >>>>
> >>>> On 30.08.21 at 14:49, Alcatraz wrote:
> >>>>> Hello all,
> >>>>>
> >>>>> Running into some issues trying to build a virtual PoC for Ceph.
> >>>>> Went to my cloud provider of choice and spun up some nodes. I have
> >>>>> three identical hosts consisting of:
> >>>>>
> >>>>> Debian 10
> >>>>> 8 CPU cores
> >>>>> 16 GB RAM
> >>>>> 1x 315 GB boot drive
> >>>>> 3x 400 GB data drives
> >>>>>
> >>>>> After deploying Ceph (v16.2.5) using cephadm, adding hosts, and
> >>>>> logging into the dashboard, Ceph showed 9 OSDs, 0 up, 9 in. I
> >>>>> thought perhaps it just needed some time to bring up the OSDs, so
> >>>>> I left it running overnight.
> >>>>>
> >>>>> This morning I checked, and the Ceph dashboard shows 9 OSDs, 0
> >>>>> up, 6 in, 3 out. I find this odd, as it hasn't been touched since
> >>>>> it was deployed. Ceph health shows "HEALTH_OK", and `ceph osd
> >>>>> tree` outputs:
> >>>>>
> >>>>> ID  CLASS  WEIGHT  TYPE NAME     STATUS  REWEIGHT  PRI-AFF
> >>>>> -1              0  root default
> >>>>>  0              0      osd.0      down         0  1.00000
> >>>>>  1              0      osd.1      down         0  1.00000
> >>>>>  2              0      osd.2      down         0  1.00000
> >>>>>  3              0      osd.3      down   1.00000  1.00000
> >>>>>  4              0      osd.4      down   1.00000  1.00000
> >>>>>  5              0      osd.5      down   1.00000  1.00000
> >>>>>  6              0      osd.6      down   1.00000  1.00000
> >>>>>  7              0      osd.7      down   1.00000  1.00000
> >>>>>  8              0      osd.8      down   1.00000  1.00000
> >>>>>
> >>>>> and if I run `ls /var/run/ceph` the only thing it outputs is
> >>>>> "d1405594-0944-11ec-8ebc-f23c92edc936" (sans quotes), which I
> >>>>> assume is the cluster ID? So of course, if I run `ceph daemon
> >>>>> osd.8 help` for example, it just returns:
> >>>>>
> >>>>> Can't get admin socket path: unable to get conf option
> >>>>> admin_socket for osd: b"error parsing 'osd': expected string of
> >>>>> the form TYPE.ID, valid types are: auth, mon, osd, mds, mgr,
> >>>>> client\n"
> >>>>>
> >>>>> If I look at the log within the Ceph dashboard, no errors or
> >>>>> warnings appear. Will Ceph not work on virtual hardware? Is there
> >>>>> something I need to do to bring up the OSDs?
> >>>>>
> >>>>> Just as I was about to send this email I went to check the logs,
> >>>>> and they show the following (traceback omitted for length):
> >>>>>
> >>>>> 8/30/21 7:44:15 AM [ERR] Failed to apply osd.all-available-devices
> >>>>> spec
> >>>>> DriveGroupSpec(name=all-available-devices->placement=PlacementSpec(host_pattern='*'),
> >>>>> service_id='all-available-devices', service_type='osd',
> >>>>> data_devices=DeviceSelection(all=True), osd_id_claims={},
> >>>>> unmanaged=False, filter_logic='AND', preview_only=False): auth get
> >>>>> failed: failed to find osd.6 in keyring retval: -2
> >>>>>
> >>>>> 8/30/21 7:45:19 AM [ERR] executing create_from_spec_one(([('ceph01',
> >>>>> <ceph.deployment.drive_selection.selector.DriveSelection object at
> >>>>> 0x7f63a930bf98>), ('ceph02',
> >>>>> <ceph.deployment.drive_selection.selector.DriveSelection object at
> >>>>> 0x7f63a81ac8d0>), ('ceph03',
> >>>>> <ceph.deployment.drive_selection.selector.DriveSelection object at
> >>>>> 0x7f63a930b0b8>)],)) failed.
> >>>>>
> >>>>> and similar for the other OSDs. I'm not sure why it's complaining
> >>>>> about auth, because in order to even add the hosts to the cluster
> >>>>> I had to copy the ceph public key to the hosts to begin with.
> >>>>>
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx