On 04/15/2011 10:21 PM, Gregory Farnum wrote:
> Can you give me more details about how your cluster was behaving at
> that time? I can't quite match up what I'm seeing in the logs with
> your description -- looks like node02 ran out of disk space for
> logging at ~4:06:30 but I can't see from the logs why they were
> passing leadership back and forth...

I have a hunch what might have happened, but please keep in mind that I
don't know ceph at all, so I'm much more likely to bark up the wrong
tree than the right one.

node01 is a Fujitsu Siemens C620, P4 3 GHz, 2 GB RAM, retrofitted with a
40-euro 2.5" SATA disk. It's a small and silent little box, perfect for
testing stuff on, but not really a great performer. It runs the ceph
monitor and the only client. It looks like this:

[root@node01 ~]# df -m
Filesystem              1M-blocks  Used  Available  Use%  Mounted on
/dev/sda1                   15019  8267       5990   58%  /
tmpfs                         993     0        993    0%  /dev/shm
/dev/sda2                    7510  1308       5821   19%  /var
/dev/sda5                   15019   547      13710    4%  /var/log
/dev/mapper/sda6           261029  1408     246362    1%  /mnt/osd
192.168.178.100:6789:/     514984  2815     486009    1%  /vol

where /mnt/osd is the raw ext4 store and /vol is the ceph mount of it.

node02 is a Fujitsu Siemens RX-200 S2 with two dual-core 3 GHz Xeon
CPUs, 4 GB RAM and two SCSI 320 disks. If the amount of noise it makes
is any measure of its performance, it's in the LINPACK top 10.

My CRUSH rules say min_size 2 for all data types, meant to force a full
replication of everything to both nodes.

Last night I was copying an archived news spool of text-only newsgroups
to the ceph store with 'rsync -a spooldir node01:/vol/'. I have just
over a million files there, most of them between 1 and 2 KB in size. I
noticed a huge load on node01, top reporting a load of 15 to 25, but I
let it grind on anyway, until rsync got stuck and I gave up and went to
bed. I never checked the load on node02, but we can take it for granted
that it was much lower.

Now, the cause of this tremendous load is a separate issue in itself,
but my hunch is that it might have triggered the chain of events that
led to the journal corruption.

At 04:28 last night I wrote the following to this list, but saved it as
a draft instead of sending it because I wasn't sure I was interpreting
the log correctly:

===
2011-04-15 04:14:29.810679   pg v1605: 528 pgs: 528 active+clean; 452 MB data, 2011 MB used, 475 GB / 502 GB avail
2011-04-15 04:14:34.937142   log 2011-04-15 04:14:34.581267 mon0 192.168.178.100:6789/0 18 : [INF] osd0 192.168.178.100:6801/2615 failed (by osd1 192.168.178.101:6801/4630)
2011-04-15 04:14:35.118217   pg v1606: 528 pgs: 528 active+clean; 452 MB data, 2011 MB used, 475 GB / 502 GB avail
2011-04-15 04:14:39.899001   log 2011-04-15 04:14:39.645393 mon0 192.168.178.100:6789/0 19 : [INF] osd0 192.168.178.100:6801/2615 failed (by osd1 192.168.178.101:6801/4630)
2011-04-15 04:14:40.046747   pg v1607: 528 pgs: 528 active+clean; 453 MB data, 2011 MB used, 475 GB / 502 GB avail

While external host X is copying files to ceph node 0, which replicates
on node 1, node 1 claims that node 0 is dead, which it isn't, and node 0
believes it and goes belly up.
===

Reading a bit of http://ceph.newdream.net/weil-thesis.pdf today, I
noticed this (page 127):

  The cluster initially elects a leader to serialize map updates and
  manage consistency. Once elected, the leader begins by requesting the
  map epochs stored by each monitor. Monitors have a fixed amount of
  time T (currently two seconds) to respond to the probe and join the
  quorum. The leader ensures that a majority of the monitors are active
  and that it has the most recent map epoch (requesting incremental
  updates from other monitors as necessary), and then begins
  distributing short-term leases to active monitors.

So I'm thinking now: what if node02, which I assume was alert and
responsive, was querying node01, and node01, because of the huge load
it was under, was not responding within 2 seconds? Would node02 then
tell all the other nodes that node01 is dead, and would node01 actually
believe it, although the obituary was about itself and it knew very
well that it was alive and struggling?

If that is what happened, it would explain the passing of control back
and forth. In my mind it still doesn't explain the journal corruption,
but that might just mean that I need to do more reading.
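Just to make that hunch concrete to myself, here is a toy sketch in
plain Python -- nothing from the ceph source, the node names, latencies
and the probe() function are made up, and only the two-second window
comes from the passage quoted above -- of how a fixed probe timeout
turns "slow" into "dead":

PROBE_TIMEOUT = 2.0  # seconds -- the "fixed amount of time T" from the thesis

def probe(node, response_latency):
    """Verdict a prober would reach about 'node', given how long that
    node takes to answer (in seconds) under its current load."""
    if response_latency <= PROBE_TIMEOUT:
        return "%s responds in time, joins the quorum" % node
    # From the prober's point of view, a heavily loaded node and a dead
    # node look the same: neither answers within the window.
    return "%s gives no reply within %.0f s, marked failed" % (node, PROBE_TIMEOUT)

print(probe("node02", 0.05))  # idle box, answers immediately
print(probe("node01", 6.0))   # load 15-25, hypothetically needs seconds to answer

Obviously the real monitors and osds do something more involved than
this, but the basic trap looks the same to me.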
As for the load, I'm copying the same news spool again today, and I
found that 'rsync -a --bwlimit=20 spooldir node01:/vol/' gives a steady
load of 2-3 on node01 and just around 1 on node02. That 20 is KB/s, a
ridiculously low transfer speed, but the only way to space out ceph's
write operations.

> ...or why nothing happened between 4:06 and 4:33.

I had half an eye on ceph while doing other unrelated things. I saw
rsync hanging, checked some, but not all, of the logs, wrote the
unposted posting above, and then gave up and called it a day. 04:06 is
probably the time when the log partition filled up on node02, and 04:33
is likely the time when I stopped ceph and shut down the machines. I
think everything was hung during that time, or rather deadlocked.

Z
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html