After digging a lot, I have found that the IB cards and the switch may go
into a ``bad'' state after a load spike on the host, so I limited all
potentially CPU-hungry processes via cgroups. That had no effect at all:
the spikes happen at almost the same time as the OSDs on the corresponding
host get ``wrongly marked'' down for a couple of seconds. By observing
manually, I have confirmed that the OSDs go crazy first, eating all cores
at 100% sy (i.e. scheduler or fs issues); then the card, starved of time
for its interrupts, starts dropping packets, and so on. This can be
reproduced only under heavy workload on the fast cluster; a slow one with
similar software versions will crawl but does not produce such lockups.
The lockups may go away on their own or may hang around for tens of
minutes; I am not sure what that depends on. Both nodes whose logs are
linked below contain one monitor and one OSD, but the lockups do happen on
two-OSD nodes as well. The Ceph instances do not share block devices in my
setup (except the two-OSD nodes, which use the same SSD for the journal;
but since the problem is reproducible on a mon+osd pair with completely
separate storage, that does not seem to be the exact cause). For the
meantime, I may suggest to myself moving away from XFS to see whether the
lockups remain. The issue started with the late 3.6 series and 0.55+ and
remains in 3.7.1 and 0.56.1. Should I move to ext4 immediately, or try a
3.8-rc with a couple of XFS fixes first?

http://xdel.ru/downloads/ceph-log/osd-lockup-1-14-25-12.875107.log.gz
http://xdel.ru/downloads/ceph-log/osd-lockup-2-14-33-16.741603.log.gz

Timestamps in the filenames were added for easier lookup; the osdmap
marked the OSDs down a couple of heartbeats after those marks.
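For reference, the cap was roughly along these lines - a minimal sketch
only, assuming cgroup v1 with the cpu controller mounted at
/sys/fs/cgroup/cpu; the group name and quota values are illustrative
rather than the exact ones:

    # cap the group at roughly four cores (quota/period = 4; values illustrative)
    mkdir -p /sys/fs/cgroup/cpu/ceph-throttle
    echo 100000 > /sys/fs/cgroup/cpu/ceph-throttle/cpu.cfs_period_us
    echo 400000 > /sys/fs/cgroup/cpu/ceph-throttle/cpu.cfs_quota_us
    # move every ceph-osd process (all of its threads) into the group
    for pid in $(pgrep ceph-osd); do
        echo "$pid" > /sys/fs/cgroup/cpu/ceph-throttle/cgroup.procs
    done

As said above, this made no difference, which fits with the OSDs
themselves being the first thing to spin.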
On Mon, Dec 31, 2012 at 1:16 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
> On Sun, Dec 30, 2012 at 10:56 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>> Sorry for the delay. A quick look at the log doesn't show anything
>> obvious... Can you elaborate on how you caused the hang?
>> -Sam
>>
>
> I am sorry for all this noise; the issue was almost certainly triggered
> by a bug in the Infiniband switch firmware, because a per-port reset
> was able to solve the ``wrong mark'' problem - at least, it has not
> shown up for a week now. The problem took almost two days to resolve -
> all possible connectivity tests showed no timeouts or drops that could
> cause the wrong marks. Finally, I started playing with TCP settings and
> found that ipv4.tcp_low_latency raised the probability of a ``wrong
> mark'' event several-fold when enabled - so the set of possible causes
> quickly collapsed down to a media-only problem, and I fixed it soon
> after.
>
>> On Wed, Dec 19, 2012 at 3:53 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>> Please take a look at the log below; this is a slightly different
>>> bug - both osd processes on the node were stuck eating all available
>>> CPU until I killed them. This can be reproduced by doing parallel
>>> exports of different images from the same client IP, using either
>>> ``rbd export'' or API calls - after a couple of wrong ``downs'',
>>> osd.19 and osd.27 finally got stuck. What is more interesting,
>>> 10.5.0.33 holds the hungriest set of virtual machines, constantly
>>> eating four of its twenty-four HT cores, and this node fails almost
>>> always. The underlying fs is XFS, ceph version gf9d090e. Quite
>>> possibly my previous reports were about side effects of this problem.
>>>
>>> http://xdel.ru/downloads/ceph-log/osd-19_and_27_stuck.log.gz
>>>
>>> and timings for the monmap; the logs are from different hosts, so
>>> they may have a time shift of tens of milliseconds:
>>>
>>> http://xdel.ru/downloads/ceph-log/timings-crash-osd_19_and_27.txt
>>>
>>> Thanks!
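P.S. For anyone trying to reproduce the hang from the quoted message
above: the parallel export was roughly of the following shape. This is
only a sketch - the pool and image names are made up, any set of several
images exported concurrently from one client should match what is
described there:

    # several full-image exports running in parallel from a single client
    for img in vm-disk-01 vm-disk-02 vm-disk-03 vm-disk-04; do
        rbd -p rbd export "$img" /var/tmp/"$img".raw &
    done
    wait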