Re: OSDs remain not in after update to v17

Alexandre Becholey <alex@xxxxxxxxxxx> · Mon, 17 Apr 2023 09:06:15 +0000

Hi,

Thank you all for your help, I was able to fix the issue (in a dirty way, but it worked). Here is a quick summary of the steps:

- create a CentOS 8 Stream VM (I took the cloudimg from https://cloud.centos.org/centos/8-stream/x86_64/images/), to match what the container is using
- `git clone https://github.com/ceph/ceph` and check out tag `v17.2.6`
- backport the patch from PR #49199 (edit `src/mon/OSDMonitor.cc`)
- build and install following the instructions in the `README.md`
- create a tar archive of `/usr/local/bin/ceph-mon`, `/usr/local/lib64/ceph/` and `/lib64/libfmt.so.6` and copy it from the VM to the host
- extract the archive and copy the files from the host to the mon container
- docker commit the mon container to a new image
- modify the unit files `/var/lib/ceph/<fsid>/mon.id/unit.{run,image}` to use the new image
- modify the unit file `/var/lib/ceph/<fsid>/mon.id/unit.run` to use `/usr/local/bin/ceph-mon` instead of `/usr/bin/ceph-mon`
- clear the blocklist
- restart the containers
- just to be sure, I gradually increased the release name from octopus to pacific to quincy when issuing the `ceph osd require-osd-release` command. After a deep-scrub, all pgs are active+clean
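
For anyone wanting to reproduce the host-side part of the steps above, here is a rough sketch rather than the exact commands that were run. The fsid, mon name, `SRC` directory and `NEW_IMAGE` tag are placeholders; `OLD_IMAGE` assumes the unit files reference the image by the tag shown in `docker ps` (they may use a digest instead); and the `run` helper and `DRY_RUN` switch are only there so that the script prints the commands by default instead of executing them. The `ceph` commands are assumed to be issued from somewhere with a working client, e.g. inside `cephadm shell`.

```shell
#!/usr/bin/env bash
# Host-side sketch of the steps above. Every value below is a placeholder or
# an assumption; check each printed command before running it for real.
set -eu

FSID="${FSID:-00000000-0000-0000-0000-000000000000}"  # placeholder cluster fsid
MON_ID="${MON_ID:-mon-id}"                            # placeholder monitor name
OLD_IMAGE="${OLD_IMAGE:-quay.io/ceph/ceph:v17}"       # image string assumed to be in the unit files
NEW_IMAGE="${NEW_IMAGE:-local/ceph-mon:patched}"      # hypothetical tag for the committed image
SRC="${SRC:-./patched}"                               # assumed directory where the archive was extracted
DRY_RUN="${DRY_RUN:-1}"                               # 1 = only print the commands

MON_CONTAINER="ceph-${FSID}-mon-${MON_ID}"
MON_DIR="/var/lib/ceph/${FSID}/mon.${MON_ID}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Copy the rebuilt monitor and its libraries into the running mon container,
# then freeze the result as a new image.
patch_mon_container() {
  run docker cp "${SRC}/usr/local/bin/ceph-mon" "${MON_CONTAINER}:/usr/local/bin/ceph-mon"
  run docker cp "${SRC}/usr/local/lib64/ceph" "${MON_CONTAINER}:/usr/local/lib64/"
  run docker cp "${SRC}/lib64/libfmt.so.6" "${MON_CONTAINER}:/lib64/libfmt.so.6"
  run docker commit "${MON_CONTAINER}" "${NEW_IMAGE}"
}

# Point the unit files at the new image and binary. `sed -i.orig` leaves the
# untouched originals next to the edited files as unit.image.orig / unit.run.orig.
switch_unit_files() {
  run sed -i.orig -e "s|${OLD_IMAGE}|${NEW_IMAGE}|g" "${MON_DIR}/unit.image"
  run sed -i.orig -e "s|${OLD_IMAGE}|${NEW_IMAGE}|g" \
      -e "s|/usr/bin/ceph-mon|/usr/local/bin/ceph-mon|g" "${MON_DIR}/unit.run"
}

# Clear the blocklist, restart the cluster's containers, then raise the
# required OSD release one step at a time.
finish_upgrade() {
  run ceph osd blocklist clear
  run systemctl restart "ceph-${FSID}.target"
  for release in octopus pacific quincy; do
    run ceph osd require-osd-release "$release"
  done
}

patch_mon_container
switch_unit_files
finish_upgrade
```

With the default `DRY_RUN=1` this only prints the command list, which makes it easy to check paths and names first. Setting `DRY_RUN=0` executes the commands as root on the host.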

`cephadm shell` seems to automatically pick the latest image. If it fails to start, you might need to specify the official one with `--image`.
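
As an illustration, the command line could be put together like this. The image tag is an assumption taken from the `docker ps` output in the original report, and `cephadm_shell_cmd` is just a hypothetical helper that prints the command so it can be checked before use. Note that `--image` is a global `cephadm` option and goes before the `shell` subcommand.

```shell
# Assumption: the official image is the tag seen in `docker ps`; adjust as needed.
OFFICIAL_IMAGE="${OFFICIAL_IMAGE:-quay.io/ceph/ceph:v17}"

# Print a cephadm invocation that runs the given command inside a shell
# container started from the official image.
cephadm_shell_cmd() {
  printf 'cephadm --image %s shell -- %s\n' "$OFFICIAL_IMAGE" "$*"
}

cephadm_shell_cmd ceph -s
```

The printed line can then be copied and run as root on the host.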

Once done, restore the original unit files and restart the mon container.

Kind regards,
Alexandre

------- Original Message -------
On Sunday, April 16th, 2023 at 11:03 AM, Konstantin Shalygin <k0ste@xxxxxxxx> wrote:

>
>
> Hi,
>
> This PR is for the main branch and has not been backported to any other branch so far.
>
>
> k
> Sent from my iPhone
>
> > On 15 Apr 2023, at 21:00, Alexandre Becholey alex@xxxxxxxxxxx wrote:
> >
> > Hi,
> >
> > Thank you for your answer, yes this seems to be exactly my issue. The pull request related to the issue is this one: https://github.com/ceph/ceph/pull/49199 and it is not (yet?) merged into the Quincy release. Hopefully this will happen before the next major release, because I cannot run any `ceph orch` command as they hang.
> >
> > Kind regards,
> > Alexandre
> >
> > ------- Original Message -------
> >
> > > On Saturday, April 15th, 2023 at 6:26 PM, Ramin Najjarbashi ramin.najarbashi@xxxxxxxxx wrote:
> > >
> > > Hi
> > > I think the issue you are experiencing may be related to a bug that has been reported in the Ceph project. Specifically, the issue is documented in https://tracker.ceph.com/issues/58156, and a pull request has been submitted and merged in https://github.com/ceph/ceph/pull/44090.
> > >
> > > > On Fri, Apr 14, 2023 at 8:17 PM Alexandre Becholey alex@xxxxxxxxxxx wrote:
> > > >
> > > > Dear Ceph Users,
> > > >
> > > > I have a small ceph cluster for VMs on my local machine. It used to be installed with the system packages and I migrated it to docker following the documentation. It worked OK until I migrated from v16 to v17 a few months ago. Now the OSDs remain "not in" as shown in the status:
> > > >
> > > > > # ceph -s
> > > > >   cluster:
> > > > >     id:     abef2e91-cd07-4359-b457-f0f8dc753dfa
> > > > >     health: HEALTH_WARN
> > > > >             6 stray daemon(s) not managed by cephadm
> > > > >             1 stray host(s) with 6 daemon(s) not managed by cephadm
> > > > >             2 devices (4 osds) down
> > > > >             4 osds down
> > > > >             1 host (4 osds) down
> > > > >             1 root (4 osds) down
> > > > >             Reduced data availability: 129 pgs inactive
> > > > >
> > > > >   services:
> > > > >     mon: 1 daemons, quorum bjorn (age 8m)
> > > > >     mgr: bjorn(active, since 8m)
> > > > >     osd: 4 osds: 0 up (since 4w), 4 in (since 4w)
> > > > >
> > > > >   data:
> > > > >     pools:   2 pools, 129 pgs
> > > > >     objects: 0 objects, 0 B
> > > > >     usage:   1.8 TiB used, 1.8 TiB / 3.6 TiB avail
> > > > >     pgs:     100.000% pgs unknown
> > > > >              129 unknown
> > > >
> > > > I can see some network communication between the OSDs and the monitor and the OSDs are running:
> > > >
> > > > > # docker ps -a
> > > > > CONTAINER ID   IMAGE                   COMMAND                  CREATED         STATUS         PORTS   NAMES
> > > > > f8fbe8177a63   quay.io/ceph/ceph:v17   "/usr/bin/ceph-osd -…"   9 minutes ago   Up 9 minutes           ceph-abef2e91-cd07-4359-b457-f0f8dc753dfa-osd-2
> > > > > 6768ec871404   quay.io/ceph/ceph:v17   "/usr/bin/ceph-osd -…"   9 minutes ago   Up 9 minutes           ceph-abef2e91-cd07-4359-b457-f0f8dc753dfa-osd-1
> > > > > ff82f84504d5   quay.io/ceph/ceph:v17   "/usr/bin/ceph-osd -…"   9 minutes ago   Up 9 minutes           ceph-abef2e91-cd07-4359-b457-f0f8dc753dfa-osd-0
> > > > > 4c89e50ce974   quay.io/ceph/ceph:v17   "/usr/bin/ceph-osd -…"   9 minutes ago   Up 9 minutes           ceph-abef2e91-cd07-4359-b457-f0f8dc753dfa-osd-3
> > > > > fe0b6089edda   quay.io/ceph/ceph:v17   "/usr/bin/ceph-mon -…"   9 minutes ago   Up 9 minutes           ceph-abef2e91-cd07-4359-b457-f0f8dc753dfa-mon-bjorn
> > > > > f76ac9dcdd6d   quay.io/ceph/ceph:v17   "/usr/bin/ceph-mgr -…"   9 minutes ago   Up 9 minutes           ceph-abef2e91-cd07-4359-b457-f0f8dc753dfa-mgr-bjorn
> > > >
> > > > > However, when I try to use any `ceph orch` command, it hangs. I can also see some blocklist entries:
> > > >
> > > > # ceph osd blocklist ls
> > > > 10.99.0.13:6833/3770763474 2023-04-13T08:17:38.885128+0000
> > > > 10.99.0.13:6832/3770763474 2023-04-13T08:17:38.885128+0000
> > > > 10.99.0.13:0/2634718754 2023-04-13T08:17:38.885128+0000
> > > > 10.99.0.13:0/1103315748 2023-04-13T08:17:38.885128+0000
> > > > listed 4 entries
> > > >
> > > > The first two entries correspond to the manager process. `ceph osd blocked-by` does not show anything.
> > > >
> > > > I think I might have forgotten to set the `ceph osd require-osd-release ...` because 14 is written in `/var/lib/ceph/<ID>/osd.?/require_osd_release`. If I try to do it now, the monitor hits an abort:
> > > >
> > > > debug 0> 2023-04-12T08:43:27.788+0000 7f0fcf2aa700 -1 *** Caught signal (Aborted) **
> > > > in thread 7f0fcf2aa700 thread_name:ms_dispatch
> > > > ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
> > > > 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f0fd94bbcf0]
> > > > 2: gsignal()
> > > > 3: abort()
> > > > 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7f0fdb5124e3]
> > > > 5: /usr/lib64/ceph/libceph-common.so.2(+0x26a64f) [0x7f0fdb51264f]
> > > > 6: (OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basi
> > > > 7: (OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>)+0x38d) [0x562719cb127d]
> > > > 8: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x17b) [0x562719cb18cb]
> > > > 9: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x2ce) [0x562719c20ade]
> > > > 10: (Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x1ebb) [0x562719ab9f6b]
> > > > 11: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x9f2) [0x562719abe152]
> > > > 12: (Monitor::_ms_dispatch(Message*)+0x406) [0x562719abf066]
> > > > 13: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5d) [0x562719aef13d]
> > > > 14: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7f0fdb78e0e8]
> > > > 15: (DispatchQueue::entry()+0x50f) [0x7f0fdb78b52f]
> > > > 16: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f0fdb8543b1]
> > > > 17: /lib64/libpthread.so.0(+0x81ca) [0x7f0fd94b11ca]
> > > > 18: clone()
> > > >
> > > > Any ideas on what is going on?
> > > >
> > > > Many thanks,
> > > > Alexandre
> > > > _______________________________________________
> > > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx