How up to date is your VM environment? We saw something very similar last year with Linux VMs running newish kernels. It turned out that newer kernels supported a new feature of the vmxnet3 adapter that had a bug in ESXi. The fix was released sometime last year in ESXi 6.5 U1; alternatively, a workaround was to set an option in the VM config.

https://kb.vmware.com/s/article/2151480

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Youzhong Yang
Sent: 21 January 2018 19:50
To: Brad Hubbard <bhubbard@xxxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: Ubuntu 17.10 or Debian 9.3 + Luminous = random OS hang ?

As someone suggested, I installed the linux-generic-hwe-16.04 package on Ubuntu 16.04 to get the 17.10 kernel, and then rebooted all VMs. Here is what I observed:

- ceph monitor node froze upon reboot; in another case it froze after a few minutes
- ceph OSD hosts froze easily
- ceph admin node (which runs no ceph service, only ceph-deploy) never freezes
- ceph rgw nodes and ceph mgr: so far so good

Here are two images I captured:

On Sat, Jan 20, 2018 at 7:03 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:

On Fri, Jan 19, 2018 at 11:54 PM, Youzhong Yang <youzhong@xxxxxxxxx> wrote:
> I don't think it's a hardware issue. All the hosts are VMs. By the way, using
> the same set of VMware hypervisors, I switched back to Ubuntu 16.04 last
> night; so far so good, no freeze.

Too little information to make any sort of assessment I'm afraid but, at this stage, this doesn't sound like a ceph issue.

>
> On Fri, Jan 19, 2018 at 8:50 AM, Daniel Baumann <daniel.baumann@xxxxxx>
> wrote:
>>
>> Hi,
>>
>> On 01/19/18 14:46, Youzhong Yang wrote:
>> > Just wondering if anyone has seen the same issue, or it's just me.
>>
>> we're using debian with our own backported kernels and ceph, works rock
>> solid.
>>
>> what you're describing sounds more like hardware issues to me. if you
>> don't fully "trust"/have confidence in your hardware (and your logs
>> don't reveal anything), I'd recommend running some burn-in tests
>> (memtest, cpuburn, etc.) on them for 24 hours/machine to rule out
>> cpu/ram/etc. issues.
>>
>> Regards,
>> Daniel
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
--
Cheers,
Brad