Re: Suicide

Zenon Panoussis <oracle@xxxxxxxxxxxxxxx> · Sun, 17 Apr 2011 01:50:16 +0200

On 04/16/2011 02:00 AM, Gregory Farnum wrote:

> If you want to try to reproduce it for us in a useful fashion we'd 
> love that. 

I think I reproduced it, kind of. I got the hang and "error 5" on mount,
but not the journal corruption. This is the timeline:

ca 19:19: Restarted ceph, mounted it, started rsync. There were already
          a few hundred thousand files on the ceph mount, forcing rsync
          to do a read/compare for them all for about 20 minutes while
          ceph was producing gigabytes of useless logs.

ca 19:50: rsync begun copying new files at full speed.

ca 19:56: rsync hung; node01 load 6+, node02 load 0.15, osd partitions
          (i.e. the underlying ext4 partitions) stuck at differing sizes
          and not growing.

ca 20:01: rsync still hanging, node01 load 7.80, node02 load 0.08, osd
          partitions still different and not growing.

ca 20:07: rsync still hanging, node01 load 8.00, node02 load 0.00, osd
          partitions still different and not growing. Killed rsync:

feednews/time-00/40/05/7bb1-4696
        1697 100%    3.61kB/s    0:00:00 (xfer#3469, to-check=2515/339251)
feednews/time-00/40/05/7bb2-4696
        1691 100%    3.59kB/s    0:00:00 (xfer#3470, to-check=2514/339251)
feednews/time-00/40/05/7bb3-4696
        3476 100%    7.38kB/s    0:00:00 (xfer#3471, to-check=2513/339251)
feednews/time-00/40/05/7bb4-4696
^CKilled by signal 2.

ca 20:09: umount /ceph: device busy; lsof |grep /ceph stuck, not even
          responding to ^C.

ca 20:11: ca 20:09: umount /ceph OK, service stop -a ceph OK, osd partitions
          still at the same size as at 19:56.

ca 20:12: service ceph -a start OK, both osd partitions now started growing,
          mount mon:port /ceph gives error 5.

A couple of minutes later the osd partitions stabilized at the same size
and mounting ceph succeeded without problems.

I'm speculating that something deadlocked unrecoverably at 19:56, leaving
different data in the journals. Restarting ceph removed the deadlock and
allowed the journals to flush, restoring proper operation.

I no longer remember exactly what I did last time, but it's very likely
that I unmounted and stopped ceph before switching off the machines on
Thursday night, but kept restarting it on Friday when I couldn't mount it,
not giving it enough time to flush its journals between each failed mount
attempt and the next restart, until somehow the journals got corrupted.

You see above that the last file that appeared to be successfully transferred
by rsync this evening was feednews/time-00/40/05/7bb3-4696, which shows up
in the logs at 19:55:45.746146. However, the logs also say that another ~4000
files were transferred between rsync's apparent hang and my ^C. Those must be
the ones that ended up lingering in the journal. If they were written to the
journal but not to the OSD, I believe rsync should still have gotten a "write
OK" back from ceph, but it seems it didn't.

Anyway, trying to read and understand huge logs when you don't even know
what the log entries mean or what you're looking for, is very tiring and
not very efficient, so I'll call it a day for now.  I'd appreciate it if
you could give me some hints on what to look for, so I can go hunting again
in a more focused manner. In the meanwhile I've put the new logs at
http://www.provocation.net/tmp/logs_take_2.tar.bz2 for you.

Z

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html