Re: systemd status

On Wed, 29 Jul 2015, Alex Elsayed wrote:
> Sage Weil wrote:
> 
> > On Wed, 29 Jul 2015, Alex Elsayed wrote:
> >> Travis Rhoden wrote:
> >> 
> >> > On Tue, Jul 28, 2015 at 12:13 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> >> >> Hey,
> >> >>
> >> >> I've finally had some time to play with the systemd integration branch
> >> >> on
> >> >> fedora 22.  It's in wip-systemd and my current list of issues
> >> >> includes:
> >> >>
> >> >> - after mon creation ceph-create-keys isn't run automagically
> >> >>   - Personally I kind of hate how it was always run on mon startup
> >> >>     and not just during cluster creation, so I wouldn't mind *so*
> >> >>     much if this became an explicit step, maybe triggered by
> >> >>     ceph-deploy, after mon create.
> >> > 
> >> > I would be happy to see this become an explicit step as well.  We
> >> > could make it conditional such that ceph-deploy only runs it if we are
> >> > dealing with systemd, but I think re-running ceph-create-keys is
> >> > always safe.  It just aborts if
> >> > /etc/ceph/{cluster}.client.admin.keyring is already present.
> >> 
> >> Another option is to have the ceph-mon@.service have a Wants= and After=
> >> on ceph-create-keys@.service, which has a
> >> ConditionPathExists=!/path/to/key/from/templated/%I
> >> 
> >> With that, it would only run ceph-create-keys if the keys do not exist
> >> already - otherwise, it'd be skipped-as-successful.
> > 
> > This sounds promising!
> > 
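For concreteness, a minimal sketch of that wiring -- unit names, paths,
and flags here are my guesses, not final, and it hardcodes the default
cluster name:

    # ceph-create-keys@.service, instanced by mon id
    [Unit]
    Description=Create Ceph client.admin key (mon %i)
    # Skipped-as-successful once the admin keyring exists.
    ConditionPathExists=!/etc/ceph/ceph.client.admin.keyring

    [Service]
    Type=oneshot
    ExecStart=/usr/sbin/ceph-create-keys --id %i

    # added to ceph-mon@.service:
    [Unit]
    Wants=ceph-create-keys@%i.service
    # NB: ceph-create-keys polls the mon to fetch the key, so this
    # After= may need to point the other way (keys unit After= the mon)
    # in practice.
    After=ceph-create-keys@%i.service
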
> >> >> - udev's attempt to trigger ceph-disk isn't working for me.  the
> >> >>   osd service gets started but the mount isn't present and it fails
> >> >>   to start.  I'm a systemd noob and haven't sorted out how to get
> >> >>   udev to log something meaningful to debug it.  Perhaps we should
> >> >>   merge in the udev + systemd revamp patches here too...
> >> 
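(For debugging that, the usual knobs are something like:

    udevadm control --log-priority=debug    # udevd logs to the journal
    journalctl -u systemd-udevd -f          # watch events as they fire
    udevadm test $(udevadm info -q path -n /dev/sdb1)   # dry-run rules

with /dev/sdb1 standing in for the OSD partition.)
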
> >> Personally, my opinion is that ceph-disk is doing too many things at
> >> once, and thus fits very poorly into the systemd architecture...
> >> 
> >> I mean, it tries to partition, format, mount, introspect the filesystem
> >> inside, and move the mount, depending on what the initial state was.
> > 
> > There is a series from David Disseldorp[1] that fixes much of this, by
> > doing most of these steps in short-lived systemd tasks (instead of a
> > complicated slow ceph-disk invocation directly from udev, which breaks
> > udev).
> > 
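As I understand it the rough shape is: have udev just tag the device
and pull in a short-lived templated unit, instead of running ceph-disk
directly via RUN+= (udev kills long-running children, which is part of
what breaks).  Illustrative only -- these aren't David's actual rules:

    ACTION=="add", SUBSYSTEM=="block", \
      ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-062c0ceff05d", \
      TAG+="systemd", ENV{SYSTEMD_WANTS}+="ceph-disk-activate@%k.service"

(That GUID is the Ceph OSD GPT partition type; ceph-disk-activate@.service
is a made-up name for a oneshot unit doing the activation outside udev.)
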
> >> Now, part of the issue is that the final mountpoint depends on data
> >> inside the filesystem - OSD id, etc. To me, that seems... mildly absurd
> >> at least.
> >> 
> >> If the _mountpoint_ was only dependent on the partuuid, and the ceph OSD
> >> self-identified from the contents of the path it's passed, that would
> >> simplify things immensely IMO when it comes to systemd integration
> >> because the mount logic wouldn't need any hokey double-mounting, and
> >> could likely use the systemd mount machinery much more easily - thus
> >> avoiding race issues like the above.
> > 
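Something like this, as a sketch (uuid and paths made up) -- a plain
mount unit that needs no knowledge of the filesystem contents:

    # var-ceph-<escaped-uuid>.mount; the unit file name has to match
    # the escaped Where= path, see systemd.mount(5) / systemd-escape -p
    [Unit]
    Description=Ceph OSD data by partuuid

    [Mount]
    What=/dev/disk/by-partuuid/11111111-2222-3333-4444-555555555555
    Where=/var/ceph/11111111-2222-3333-4444-555555555555
    Type=xfs
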
> > Hmm.  Well, we could name the mount point with the uuid and symlink the
> > osd id to that.  We could also do something sneaky like embed the osd id
> > in the least significant bits of the uuid, but that throws away a lot of
> > entropy and doesn't capture the cluster name (which also needs to be known
> > before mount).
> 
> Does it?
> 
> If the mount point is (say) /var/ceph/$UUID, and ceph-osd can take a
> --datadir parameter from which it _reads_ the cluster and ID if they
> aren't passed on the command line, I think that'd resolve the issue
> rather tidily _without_ requiring that be known prior to mount.
> 
> And if I understand correctly, that data is _already in there_ for ceph-disk 
> to mount it in the "final location" - it's just shuffling around who reads 
> it.

So, we could do this.  It would mean either futzing with the ceph-osd 
config variables so that they take a $uuid substitution (passed at 
startup) -or- having ceph-disk set up a symlink from the current 
/var/lib/ceph/osd/$cluster-$id location (instead of doing the bind mount 
it currently does).
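
At activation time the symlink variant would boil down to something
like this (sketch only, default cluster name assumed):

    dev=/dev/sdb1   # the osd partition (placeholder)
    uuid=$(blkid -o value -s PARTUUID "$dev")
    mkdir -p "/var/ceph/$uuid"
    mount "/dev/disk/by-partuuid/$uuid" "/var/ceph/$uuid"
    id=$(cat "/var/ceph/$uuid/whoami")   # osd id lives in the data dir
    ln -sfn "/var/ceph/$uuid" "/var/lib/ceph/osd/ceph-$id"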

But, it'll come at some cost to operators, who won't be able to tell 
from 'df' or 'mount' output which OSD each mount belongs to... they'll 
have to poke around in each directory instead.
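
E.g., mapping them back would be something like

    for d in /var/ceph/*; do echo "$d -> osd.$(cat "$d/whoami")"; done

which works, but isn't as glanceable as 'df'.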

> > If the mounting and binding to the final location is done in a systemd job
> > identified by the uuid, it seems like systemd would effectively handle the
> > mutual exclusion and avoid races?
> 
> What I object to is the idea of a "final location" that depends on the 
> contents of the filesystem - it's bass-ackwards IMO.

It's unusual, but I think it can be made to work reliably.
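
The mutual exclusion piece, at least, systemd should give us for free
if activation is a oneshot instanced on the uuid -- two udev events for
the same partition map to the same unit instance, which can't be
started twice concurrently.  Roughly (unit name made up):

    # ceph-osd-setup@.service, instanced by partuuid
    [Unit]
    Description=Set up Ceph OSD on partuuid %i

    [Service]
    Type=oneshot
    ExecStart=/usr/sbin/ceph-disk activate /dev/disk/by-partuuid/%i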

Are there any other opinions here?

sage