On 09/07/2012 19:14, Samuel Just wrote:
> Can you restart the node that failed to complete the upgrade with
Well, it's a little bit complicated; I now run those nodes with XFS,
and I have long-running jobs on them right now, so I can't stop the
ceph cluster at the moment.
As I've kept the original broken btrfs volumes, I tried this morning
to run the old osds in parallel, using the $cluster variable. I've only
had partial success.
I tried using different ports for the mons, but ceph wants to use the old
mon map. I can edit it (epoch 1), but it seems to use 'latest' instead;
that format isn't compatible with monmaptool, and I don't know how to
inject the modified map into a non-running cluster.
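For what it's worth, ceph-mon has --extract-monmap and --inject-monmap
options that operate directly on a stopped mon's store, which avoids having
to touch the 'latest' blob by hand. A sketch, assuming a mon id of 'a' and
placeholder paths/addresses (none of these specifics come from this thread):

```shell
# The mon must be stopped first; these options act on its on-disk store.
# Mon id 'a', /tmp/monmap, and 10.0.0.1:6790 are placeholders.
ceph-mon -i a --extract-monmap /tmp/monmap      # dump the current map
monmaptool --print /tmp/monmap                  # inspect it
monmaptool --rm a /tmp/monmap                   # drop the old mon entry
monmaptool --add a 10.0.0.1:6790 /tmp/monmap    # re-add it on a new port
ceph-mon -i a --inject-monmap /tmp/monmap       # write the edited map back
```

The extracted map is in the regular monmap format, so monmaptool can edit it
even when the store's own 'latest' record can't be edited directly.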
Anyway, the osd seems to start fine, and I can reproduce the bug:
> debug filestore = 20
> debug osd = 20
I've put them in [global]; is that sufficient?
> and post the log after an hour or so of running? The upgrade process
> might legitimately take a while.
> -Sam
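Putting the debug lines in [global] does work, since every daemon reads that
section; scoping them under [osd] would limit the extra logging to the OSDs.
A minimal ceph.conf sketch of both placements:

```ini
[global]
        ; read by all daemons (mon, mds, osd)
        debug filestore = 20
        debug osd = 20

; alternatively, restrict the verbosity to OSD daemons only:
[osd]
        debug filestore = 20
        debug osd = 20
```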
After only 15 minutes of running, ceph-osd is consuming lots of CPU, and
an strace shows lots of pread calls.
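To quantify that, strace's summary mode gives per-syscall counts rather than
a raw stream; a sketch, assuming a single ceph-osd per host (pidof may return
several pids otherwise):

```shell
# Attach to a running ceph-osd and count syscalls instead of printing each one.
# Interrupt with Ctrl-C after a while to get the summary table.
strace -c -f -p "$(pidof ceph-osd)"
```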
Here is the log:
[..]
2012-07-10 11:33:29.560052 7f3e615ac780 0 filestore(/CEPH-PROD/data/osd.1) mount syncfs(2) syscall not support by glibc
2012-07-10 11:33:29.560062 7f3e615ac780 0 filestore(/CEPH-PROD/data/osd.1) mount no syncfs(2), but the btrfs SYNC ioctl will suffice
2012-07-10 11:33:29.560172 7f3e615ac780 -1 filestore(/CEPH-PROD/data/osd.1) FileStore::mount : stale version stamp detected: 2. Proceeding, do_update is set, performing disk format upgrade.
2012-07-10 11:33:29.560233 7f3e615ac780 0 filestore(/CEPH-PROD/data/osd.1) mount found snaps <3744666,3746725>
2012-07-10 11:33:29.560263 7f3e615ac780 10 filestore(/CEPH-PROD/data/osd.1) current/ seq was 3746725
2012-07-10 11:33:29.560267 7f3e615ac780 10 filestore(/CEPH-PROD/data/osd.1) most recent snap from <3744666,3746725> is 3746725
2012-07-10 11:33:29.560280 7f3e615ac780 10 filestore(/CEPH-PROD/data/osd.1) mount rolling back to consistent snap 3746725
2012-07-10 11:33:29.839281 7f3e615ac780 5 filestore(/CEPH-PROD/data/osd.1) mount op_seq is 3746725
... and nothing more.
I'll let it run for 3 hours. If another message shows up, I'll let
you know.
Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@xxxxxxxxxxxxxx