On Wed, Apr 4, 2018 at 1:16 PM, Vasu Kulkarni <vakulkar@xxxxxxxxxx> wrote:
> On Wed, Apr 4, 2018 at 12:54 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> We identified several under-tested components in the Ceph project.
>> Several of these consisted of tests that simply weren’t written:
>> NFS-Ganesha has light testing in RGW, but none with CephFS; Samba’s
>> testing is very light.
>>
>> Significantly more interesting is that none of the
>> installers/orchestrators/normal process management (Ansible or DeepSea
>> with systemd; containers under Kubernetes) are currently tested in
>> teuthology. Changing that is a big desire for most of the integrators,
>> but is a large project covering both the internal implementation and
>> testing tasks. Right now, teuthology directly invokes Ceph processes
>> via ssh and relies on that for control, for checking state (i.e., the
>> process is still running), and for easy logging of issues, and that
>> has spilled over into important “task” modules such as the thrasher
>> and cluster managers. There were rumors of individual efforts that
>> might have been started to enable testing of a normal deployment, but
>> nobody in the room knew for sure.
>> PROBLEM TOPIC: support testing orchestration frameworks and the normal
>> init system in teuthology
>
> Correcting the ceph-ansible testing part:
>
> We have been running ceph-ansible/ceph-deploy testing for quite some
> time, and it does systemd testing internally. There is also a systemd
> task in smoke that explicitly tests the processes for correctness.
>
> a) http://pulpito.ceph.com/?suite=ceph-ansible
> b) In smoke: https://github.com/ceph/ceph/blob/master/qa/tasks/systemd.py
> http://pulpito.ceph.com/teuthology-2018-04-04_07:02:02-smoke-master-testing-basic-ovh/2352423
> http://pulpito.ceph.com/teuthology-2018-04-04_07:02:02-smoke-master-testing-basic-ovh/2352436
>
> But definitely more work needs to be done to integrate better with
> thrashers, and I am hopeful we will fix
> this issue soon, at least for some suites: http://tracker.ceph.com/issues/23488

Ah right, we do have those. The issue is that it's a single suite that
isn't integrated with our other testing to make sure it functions more
broadly and with the more task-specific stuff. So, we make sure OSDs
turn on initially, but we don't validate their behavior (or that of the
systemd units!) under more stressful scenarios like repeated failures.

This is more of an issue than you'd expect at first. We've had problems
with, I think, both upstart and systemd in the past where they
continually restart a damaged OSD that crashes on invalid state,
because getting from the restart to the crash takes longer than their
timeout periods!
-Greg
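
To make the restart-loop problem concrete, here is a rough sketch of a start-rate limit, modeled loosely on systemd's StartLimitBurst= and StartLimitIntervalSec= settings. The function name, its parameters, and all the numbers are made up for illustration; they are not taken from Ceph's unit files or from teuthology, and a real init system has more moving parts than this.

```python
from collections import deque


def restarts_until_limit(time_to_crash, restart_delay, burst, interval,
                         max_starts=1000):
    """Simulate a service manager that refuses further starts once `burst`
    starts have already happened within the last `interval` seconds.

    Hypothetical model: the daemon always crashes `time_to_crash` seconds
    after starting and is restarted `restart_delay` seconds later.

    Returns the number of starts performed before the limit trips, or
    None if the limit never trips within `max_starts` attempts.
    """
    recent_starts = deque()  # timestamps of starts still inside the window
    now = 0.0
    for attempt in range(1, max_starts + 1):
        # Forget starts that have aged out of the rate-limit window.
        while recent_starts and now - recent_starts[0] >= interval:
            recent_starts.popleft()
        if len(recent_starts) >= burst:
            return attempt - 1  # this attempt is refused
        recent_starts.append(now)
        # The daemon runs, hits the bad state, crashes, and is restarted.
        now += time_to_crash + restart_delay
    return None


if __name__ == "__main__":
    # Fast crash: the limit trips and the daemon is left stopped.
    print(restarts_until_limit(0.5, 0.1, burst=5, interval=10.0))
    # Slow crash (e.g. a long startup before reaching the invalid state):
    # starts never accumulate inside the window, so it restarts forever.
    print(restarts_until_limit(30.0, 0.1, burst=5, interval=10.0))
```

In this model an OSD that needs 30 seconds to reach its crash never accumulates enough starts inside a 10-second window, so the limit never fires. That is the kind of behavior a one-time "did it start" check would not catch.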