Re: Getting started with cephadm

The fix was to upgrade podman; the issue went away on any containers that
were restarted, so I'll just do a rolling reboot to clear it everywhere.
I'll probably upgrade Ceph to 15.2.9 at the same time.
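
If I've read the cephadm docs correctly, the rolling upgrade itself should
just be something along these lines (a sketch from my notes, not yet tested
on this cluster):

  # kick off an orchestrator-managed rolling upgrade to 15.2.9
  ceph orch upgrade start --ceph-version 15.2.9
  # keep an eye on progress while it works through the daemons
  ceph orch upgrade status
  ceph -s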

It's a bit of a dev/proof-of-concept cluster at the moment, but it does mean
we're going to need to work out which distro to go with going forward and
which toolbox to use....

Everyone here is too used to Spectrum Scale or Lustre, and this is the first
time I've really even played with Ceph.

Peter.



On Sun, 28 Feb 2021, 18:13 David Orman, <ormandj@xxxxxxxxxxxx> wrote:

> Perhaps just swap out your hosts one at a time with a different
> distribution that's more current. We also use podman from the Kubic
> project instead of the OS-provided version. Just make sure to back up
> the package files when you install versions from there, as they wipe
> their repos of the old version when new versions come out, leaving you
> with little ability to roll back.
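
Noted on keeping the Kubic packages around. My reading of that, as a rough
sketch for an EL8-style host, would be to stash the exact RPMs at install
time; the destination path and the use of the dnf download plugin are my
assumptions, not something David spelled out:

  # 'download' comes from dnf-plugins-core; Kubic repo setup not shown here
  dnf install -y dnf-plugins-core
  # keep the exact podman RPMs (plus dependencies) for a later roll-back
  dnf download --resolve --destdir /root/podman-rpms podman
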
>
> That output you see is related to LVM, and will likely go away when
> you reboot. We see the same behavior even with Podman 3.0.1, but part
> of our setup process involves rebooting all hosts in order to ensure
> they behave properly. FWIW, this output doesn't impact anything
> negatively, other than being annoying on every ceph command.
>
> I suppose what I'm asking is what this: "start having problems
> starting the OSD up" means, specifically, from your initial email.
> What behavior do you see? What do logs show? Hopefully that will help
> pinpoint the root cause of your problems.
>
> On Sun, Feb 28, 2021 at 4:21 AM Peter Childs <pchilds@xxxxxxx> wrote:
> >
> > Currently I'm using the default podman that comes with CentOS 7 (podman
> > 1.6.4), which I fear is the issue.
> >
> > /bin/podman: stderr WARNING: The same type, major and minor should not
> be used for multiple devices.
> >
> > That looks to be part of the issue, and I've heard this is a known
> > problem in older versions of podman.
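
For the record, this is how I've been checking what's actually in play on
these hosts; nothing clever, and the grep is just my guess at the relevant
field name in 'podman info':

  podman --version
  rpm -q podman
  # confirm which storage driver the containers are using
  podman info | grep -i graphdriver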
> >
> > I can't see a RAM issue or a CPU issue; it looks like it's probably an
> > issue with podman mounting overlays, so maybe upgrading podman beyond the
> > version available with CentOS 7 is the first plan. Shame CentOS 8 is a
> > non-project now :(
> >
> > Peter.
> >
> > On Sat, 27 Feb 2021 at 19:37, David Orman <ormandj@xxxxxxxxxxxx> wrote:
> >>
> >> Podman is fine (preferably 3.0+). What were those variables set to
> >> before? With most recent distributions and kernels we've not noticed a
> >> problem with the defaults. Did you notice errors that led to you
> >> changing them? We have many clusters of 21 nodes with 24 HDDs each and
> >> multiple NVMes serving as WAL/DB; they were on 15.2.7 and prior, but now
> >> all are on 15.2.9, running in podman 3.0.1 (which fixes issues with the
> >> 2.2 series on upgrade). We have less RAM (128G) per node without issues.
> >>
> >> On the OSDs that will not start - what error(s) do you see? You can
> >> inspect the OSDs with "podman logs <id>" if they've started inside of
> >> podman but just aren't joining the cluster; if they haven't, then
> >> looking at the systemctl status for the service or journalctl will
> >> normally give more insight. Hopefully the root cause of your problems
> >> can be identified so it can be addressed directly.
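
For anyone else hitting this, the concrete commands I've been using to
follow that advice look roughly like the below. The systemd unit naming is
my understanding of how cephadm lays its services out, so substitute the
real fsid and OSD id:

  # list the ceph containers podman knows about on this host
  podman ps --format '{{.Names}}\t{{.Status}}'
  # logs from a given OSD container (name taken from the list above)
  podman logs <osd container name>
  # if the container never started, check systemd and the journal
  systemctl status 'ceph-<fsid>@osd.<id>.service'
  journalctl -u 'ceph-<fsid>@osd.<id>.service' --since '1 hour ago'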
> >>
> >> On Sat, Feb 27, 2021 at 11:34 AM Peter Childs <pchilds@xxxxxxx> wrote:
> >> >
> >> > I'm new to Ceph, and I've been trying to set up a new cluster with 16
> >> > computers, each with 30 disks plus 6 SSDs (and boot disks), 256G of
> >> > memory, and IB networking. (OK, it's currently 15 nodes, but never
> >> > mind.)
> >> >
> >> > When I take them over about 10 OSDs each, they start having problems
> >> > starting the OSDs up. I can normally fix this by rebooting them and it
> >> > will continue again for a while, and it is possible to get them up to
> >> > the full complement with a bit of poking around. (Once it's working
> >> > it's fine, unless you start adding services or moving the OSDs
> >> > around.)
> >> >
> >> > Is there anything I can change to make it a bit more stable?
> >> >
> >> > I've already set
> >> >
> >> > fs.aio-max-nr = 1048576
> >> > kernel.pid_max = 4194303
> >> > fs.file-max = 500000
> >> >
> >> > which made it a bit better, but I feel it could be even better.
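
Side note on those sysctls, mostly for my own records: I've been persisting
them in a drop-in file and reloading, rather than setting them by hand. The
file name below is just my convention, assuming a stock sysctl setup:

  # /etc/sysctl.d/90-ceph-osd.conf
  fs.aio-max-nr = 1048576
  kernel.pid_max = 4194303
  fs.file-max = 500000

  # apply without a reboot
  sysctl --system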
> >> >
> >> > I'm currently trying to upgrade to 15.2.9 from the default cephadm
> >> > version of Octopus. The upgrade is going very, very slowly. I'm
> >> > currently using podman, if that helps; I'm not sure if Docker would be
> >> > better? (I've mainly used Singularity when I've handled containers
> >> > before.)
> >> >
> >> > Thanks in advance
> >> >
> >> > Peter Childs
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


