Re: Suicide

On Friday, April 15, 2011 at 3:38 PM, Zenon Panoussis wrote: 
> 
> On 04/15/2011 10:21 PM, Gregory Farnum wrote:
> 
> > Can you give me more details about how your cluster was behaving at 
> > that time? I can't quite match up what I'm seeing in the logs with 
> > your description -- looks like node02 ran out of disk space for 
> > logging at ~4:06:30 but I can't see from the logs why they were 
> > passing leadership back and forth...
> 
> I have a hunch what might have happened, but please keep in mind that
> I don't know ceph at all, so I'm much more likely to bark up a wrong
> tree than the right one.
> 
> node01 is a Fujitsu Siemens C620, P4 3GHz, 2 GB RAM retrofitted with a
> 40-euro 2,5" SATA disk. It's a small and silent little box perfect for
> testing stuff on, but not really a great performer. It runs the ceph
> monitor and the only client. It looks like this:
> 
> [root@node01 ~]# df -m
> Filesystem             1M-blocks  Used Available Use% Mounted on
> /dev/sda1                  15019  8267      5990  58% /
> tmpfs                        993     0       993   0% /dev/shm
> /dev/sda2                   7510  1308      5821  19% /var
> /dev/sda5                  15019   547     13710   4% /var/log
> /dev/mapper/sda6          261029  1408    246362   1% /mnt/osd
> 192.168.178.100:6789:/
>                           514984  2815    486009   1% /vol
> 
> where /mnt/osd is the raw ext4 store and /vol is the ceph-mount of it.
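
(For context, a two-node setup like this would typically be described by an
mkcephfs-style ceph.conf along the lines of the sketch below. The section names
and paths are illustrative guesses, not your actual config; I'm calling out the
journal in particular because it typically lives as a file on the same disk as
the object store unless you give it its own device:

    ; illustrative mkcephfs-style layout -- NOT the real config from this cluster
    [mon.0]
            host = node01
            mon addr = 192.168.178.100:6789
            mon data = /srv/mon0               ; hypothetical path
    [mds.node01]
            host = node01
    [mds.node02]
            host = node02
    [osd.0]
            host = node01
            osd data = /mnt/osd                ; the raw ext4 store above
            osd journal = /mnt/osd/journal     ; hypothetical; often on the same disk
    [osd.1]
            host = node02
            osd data = /mnt/osd                ; hypothetical; node02's own store

)
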
> 
> node02 is a Fujitsu Siemens RX-200 S2 with two dual-core 3GHz Xeon CPUs,
> 4 GB RAM and two SCSI 320 disks. If the amount of noise it makes is any
> measure of its performance, it's in the LINPACK top 10.
> 
> My CRUSH rules say min_size 2 for all data types, meant to force a full
> replication of everything to both nodes.
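
(For reference, a rule along those lines would look roughly like this in a
decompiled CRUSH map -- the rule and bucket names below are made up for
illustration, not taken from your map:

    rule data {
            ruleset 0
            type replicated
            min_size 2
            max_size 2
            step take root                     # illustrative bucket name
            step choose firstn 0 type host     # one host per replica...
            step choose firstn 1 type device   # ...then one device on each host
            step emit
    }

The two choose steps are what actually put the copies on different hosts; the
rule's min_size/max_size just bound the replica counts it applies to, and the
pool's replication level is what asks for two copies in the first place.)
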
> 
> Last night I was copying an archived news spool of text-only newsgroups
> to the ceph store with 'rsync -a spooldir node01:/vol/'. I have just over
> a million files there, most of them between 1 and 2 KB in size. I noticed
> a huge load on node01, top reporting a load of 15 to 25, but I let it grind
> on anyway, until rsync got stuck and I gave up and went to bed. I never
> checked the load on node02, but we can take it for granted that it was
> much lower.
> 
> Now, the cause of this tremendous load is a separate issue in itself, but
> my hunch is that it might have triggered the chain of events that led to
> journal corruption. At 04:28 last night I wrote the following to this list,
> but saved it as a draft instead of sending it because I wasn't sure I was
> interpreting the log correctly:
> 
> ===
> 2011-04-15 04:14:29.810679 pg v1605: 528 pgs: 528 active+clean; 452 MB data, 2011 MB used, 475 GB / 502 GB avail
> 2011-04-15 04:14:34.937142 log 2011-04-15 04:14:34.581267 mon0 192.168.178.100:6789/0 18 : [INF] osd0 192.168.178.100:6801/2615 failed (by osd1 192.168.178.101:6801/4630)
> 2011-04-15 04:14:35.118217 pg v1606: 528 pgs: 528 active+clean; 452 MB data, 2011 MB used, 475 GB / 502 GB avail
> 2011-04-15 04:14:39.899001 log 2011-04-15 04:14:39.645393 mon0 192.168.178.100:6789/0 19 : [INF] osd0 192.168.178.100:6801/2615 failed (by osd1 192.168.178.101:6801/4630)
> 2011-04-15 04:14:40.046747 pg v1607: 528 pgs: 528 active+clean; 453 MB data, 2011 MB used, 475 GB / 502 GB avail
> 
> While external host X is copying files to ceph node 0 which replicates on
> node 1, node 1 claims that node 0 is dead, which it isn't, and node 0
> believes it and goes belly up.
> ===
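
(Those "failed (by osd1 ...)" lines are the OSD heartbeat mechanism at work:
osd1 stopped getting heartbeat replies from osd0 within the grace window and
reported it to the monitor, which marked osd0 down. If it's load rather than
the network doing this, these are the knobs involved -- the option names are
long-standing, but treat the values below as from-memory defaults rather than
gospel:

    [osd]
            osd heartbeat interval = 6    ; how often OSDs ping their peers (seconds)
            osd heartbeat grace = 20      ; how long a peer can go silent before
                                          ; it gets reported as failed (seconds)

)
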
> 
> Reading a bit of http://ceph.newdream.net/weil-thesis.pdf today, I noticed
> this (page 127):
> 
>  The cluster initially elects a leader to serialize map updates and manage
>  consistency. Once elected, the leader begins by requesting the map epochs
>  stored by each monitor. Monitors have a fixed amount of time T (currently
>  two seconds) to respond to the probe and join the quorum. The leader ensures
>  that a majority of the monitors are active and that it has the most recent
>  map epoch (requesting incremental updates from other monitors as necessary),
>  and then begins distributing short-term leases to active monitors.
> 
> So I'm now wondering: what if node02, which I assume was alert and responsive,
> was querying node01, and node01, because of the huge load it was under,
> was not responding within 2 seconds? Would node02 then tell all the other nodes
> that node01 is dead, and would node01 actually believe it, even though the obituary
> was about itself and it knew very well that it was alive and struggling?

Well, you're confusing the MDS and the monitor here, but yes -- if there was a heavy load, that could still cause the active MDS role to pass back and forth, because the MDS starts taking too long to handle requests.
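
To put some detail on that mechanism (with the caveat that I'm quoting option names and defaults from memory -- check your version): the active MDS sends periodic beacons to the monitor, and if the monitor doesn't see one within the beacon grace period it marks that MDS laggy and hands the active role to the standby. A box that's too loaded to beacon on time will bounce the role around exactly like this. The relevant ceph.conf knobs:

    [mds]
            mds beacon interval = 4    ; how often the MDS beacons the monitor (seconds)
            mds beacon grace = 15      ; how long the monitor waits for a beacon before
                                       ; declaring the MDS laggy and failing over (seconds)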

> 
> If that is what happened, it would explain the passing of control back and
> forth. In my mind it still doesn't explain the journal corruption, but that
> might just mean that I need to do more reading.
> 
> As for the load, I'm copying the same news spool again today and I found
> that 'rsync -a --bwlimit=20 spooldir node01:/vol/' gives a steady load of
> 2-3 on node01 and just around 1 on node02. That 20 is in KB/s, a ridiculously
> low transfer speed, but it was the only way I found to space out ceph's write operations.
> 
> > ...or why nothing happened between 4:06 and 4:33.
> 
> I had half an eye on ceph while doing other unrelated things. I saw rsync
> hanging, checked some but not all of the logs, wrote the unsent message
> above, and then gave up and called it a day. 04:06 is probably the time
> when the log partition filled up on node02 and 04:33 is likely the time
> when I stopped ceph and shut down the machines. I think everything was hung
> during that time, or rather deadlocked.

Hmm, that timeline doesn't quite make sense -- node01 takes over the MDS duties at 4:33 and crashes, but then it starts up again at 4:50. But it's possible that node02 took over in the interval and we just don't see it because the log disk was full (I had erroneously thought that a filled disk would hang the daemon, but that turns out not to be the case). So I'd guess you shut everything down sometime after 5:08, and that would make sense.

Unfortunately, what we're really interested in is what caused the assert failure on node01 at 4:35, and the reasons for that aren't available in the logs we have. :(
This is the second time we've seen that assert, but we've not been able to reproduce it or figure out how the invariant it's checking got broken. If you like, we can come up with a hacky fix that should let your cluster come back up, but it's possible that you'd lose some data, and since this is a very rare condition, if it's not a big deal I'd just re-create your cluster.
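
If you do go the re-create route, on this vintage that's roughly: stop the daemons everywhere, clear out the old monitor/OSD data directories, re-run mkcephfs against your ceph.conf, and start things back up. Something like the sketch below, though please double-check the exact mkcephfs flags for your version before trusting it:

    # run from node01; assumes passwordless ssh to node02; illustrative only
    /etc/init.d/ceph -a stop              # stop mon/mds/osd on both nodes
    rm -rf /mnt/osd/*                     # wipe the old object store (repeat on node02)
    mkcephfs -a -c /etc/ceph/ceph.conf    # recreate the mon, osd and mds stores
    /etc/init.d/ceph -a start             # bring the (empty) cluster back up
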
-Greg




