Cannot restart the OSD successfully after rebooting the machine

Dear All:

In my testing environment, we deployed a Ceph cluster with version 0.43 on kernel 3.2.0.
(We deployed it several months ago, so it is not the latest version.)
There are 5 MONs and 8 OSDs in the cluster: five servers for the monitors,
and two storage servers with 4 OSDs each.

We ran into a situation where we could not restart the OSD service successfully after rebooting one of the storage servers (which hosts 4 OSDs).
Let me describe the scenario in more detail.

1. One of the storage servers had a network problem, so we lost four OSDs in the cluster.

   When I ran 'ceph -s', I got a strange message like this (sorry, I did not copy the exact message at the time):

       -276/3108741 degraded  (the number was definitely negative, I am sure)
       8 osds: 4 up, 4 in

2. After fixing the broken network, I tried to restart the four OSDs on that server, but some of them failed to start.

3. I repeatedly executed 'service ceph start' on the storage server. After perhaps 10 attempts, all the OSDs finally came up,
   and 'ceph health' returned HEALTH_OK.
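For reference, the manual retry in step 3 could be scripted with a small helper like the one below. This is only a sketch: the 10-attempt limit and the 5-second delay are arbitrary values I picked for illustration, not something tuned for our cluster.

```shell
#!/bin/sh
# retry: run a command until it succeeds, up to a fixed number of attempts.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            echo "succeeded on attempt $attempt"
            return 0
        fi
        echo "attempt $attempt failed, retrying in ${delay}s..."
        sleep "$delay"
        attempt=$((attempt + 1))
    done
    echo "giving up after $max attempts"
    return 1
}

# Example (hypothetical invocation, matching what I did by hand):
#   retry 10 5 service ceph start
```

Of course this only papers over the crash below; each failed attempt still leaves an assert in the log.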

I would appreciate any comments on this situation, thanks! If you want the complete logs for all the OSDs, I can send them to you.

2012-06-07 14:42:36.482616 7f62a02547a0 ceph version 0.43 (commit:9fa8781c0147d66fcef7c2dd0e09cd3c69747d37), process ceph-osd, pid 7146
2012-06-07 14:42:36.510945 7f62a02547a0 filestore(/srv/disk0) mount FIEMAP ioctl is supported
2012-06-07 14:42:36.511002 7f62a02547a0 filestore(/srv/disk0) mount did NOT detect btrfs
2012-06-07 14:42:36.511372 7f62a02547a0 filestore(/srv/disk0) mount found snaps <>
2012-06-07 14:42:36.640990 7f62a02547a0 filestore(/srv/disk0) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2012-06-07 14:42:36.816868 7f62a02547a0 journal _open /srv/disk0.journal fd 16: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-06-07 14:42:36.848522 7f62a02547a0 journal read_entry 410750976 : seq 1115076 1278 bytes
2012-06-07 14:42:36.848582 7f62a02547a0 journal read_entry 410759168 : seq 1115077 1275 bytes
2012-06-07 14:42:36.848810 7f62a02547a0 journal read_entry 410767360 : seq 1115078 1275 bytes
2012-06-07 14:42:36.848835 7f62a02547a0 journal read_entry 410775552 : seq 1115079 1272 bytes
2012-06-07 14:42:36.848859 7f62a02547a0 journal read_entry 410783744 : seq 1115080 1281 bytes
2012-06-07 14:42:36.848872 7f62a02547a0 journal read_entry 410791936 : seq 1115081 1281 bytes
2012-06-07 14:42:36.849181 7f62a02547a0 journal read_entry 410800128 : seq 1115082 1275 bytes
2012-06-07 14:42:36.849207 7f62a02547a0 journal read_entry 410808320 : seq 1115083 1278 bytes
2012-06-07 14:42:36.849225 7f62a02547a0 journal read_entry 410816512 : seq 1115084 1281 bytes
2012-06-07 14:42:36.849239 7f62a02547a0 journal read_entry 410824704 : seq 1115085 1275 bytes
2012-06-07 14:42:36.849255 7f62a02547a0 journal read_entry 410832896 : seq 1115086 1281 bytes
2012-06-07 14:42:36.849267 7f62a02547a0 journal read_entry 410841088 : seq 1115087 1278 bytes
2012-06-07 14:42:36.849282 7f62a02547a0 journal read_entry 410849280 : seq 1115088 1275 bytes
2012-06-07 14:42:36.849293 7f62a02547a0 journal read_entry 410857472 : seq 1115089 1275 bytes
2012-06-07 14:42:36.849328 7f62a02547a0 journal _open /srv/disk0.journal fd 16: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-06-07 14:42:36.851593 7f62a02547a0 journal close /srv/disk0.journal
2012-06-07 14:42:36.852625 7f62a02547a0 filestore(/srv/disk0) mount FIEMAP ioctl is supported
2012-06-07 14:42:36.852642 7f62a02547a0 filestore(/srv/disk0) mount did NOT detect btrfs
2012-06-07 14:42:36.852695 7f62a02547a0 filestore(/srv/disk0) mount found snaps <>
2012-06-07 14:42:36.852714 7f62a02547a0 filestore(/srv/disk0) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2012-06-07 14:42:36.855399 7f62a02547a0 journal _open /srv/disk0.journal fd 24: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-06-07 14:42:36.855429 7f62a02547a0 journal read_entry 410750976 : seq 1115076 1278 bytes
2012-06-07 14:42:36.855450 7f62a02547a0 journal read_entry 410759168 : seq 1115077 1275 bytes
2012-06-07 14:42:36.855464 7f62a02547a0 journal read_entry 410767360 : seq 1115078 1275 bytes
2012-06-07 14:42:36.855476 7f62a02547a0 journal read_entry 410775552 : seq 1115079 1272 bytes
2012-06-07 14:42:36.855487 7f62a02547a0 journal read_entry 410783744 : seq 1115080 1281 bytes
2012-06-07 14:42:36.855501 7f62a02547a0 journal read_entry 410791936 : seq 1115081 1281 bytes
2012-06-07 14:42:36.855514 7f62a02547a0 journal read_entry 410800128 : seq 1115082 1275 bytes
2012-06-07 14:42:36.855525 7f62a02547a0 journal read_entry 410808320 : seq 1115083 1278 bytes
2012-06-07 14:42:36.855536 7f62a02547a0 journal read_entry 410816512 : seq 1115084 1281 bytes
2012-06-07 14:42:36.855547 7f62a02547a0 journal read_entry 410824704 : seq 1115085 1275 bytes
2012-06-07 14:42:36.855558 7f62a02547a0 journal read_entry 410832896 : seq 1115086 1281 bytes
2012-06-07 14:42:36.855569 7f62a02547a0 journal read_entry 410841088 : seq 1115087 1278 bytes
2012-06-07 14:42:36.855580 7f62a02547a0 journal read_entry 410849280 : seq 1115088 1275 bytes
2012-06-07 14:42:36.855591 7f62a02547a0 journal read_entry 410857472 : seq 1115089 1275 bytes
2012-06-07 14:42:36.855615 7f62a02547a0 journal _open /srv/disk0.journal fd 24: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-06-07 14:42:45.975088 7f6292f49700 osd.4 9 handle_osd_map fsid 520ab01f-cd87-4214-b6ce-d4f7c29da98a != 00000000-0000-0000-0000-000000000000
2012-06-07 14:42:47.104389 7f6292f49700 osd.4 9 handle_osd_map fsid 520ab01f-cd87-4214-b6ce-d4f7c29da98a != 00000000-0000-0000-0000-000000000000
2012-06-07 14:42:50.416604 7f6292f49700 osd.4 9 handle_osd_map fsid 520ab01f-cd87-4214-b6ce-d4f7c29da98a != 00000000-0000-0000-0000-000000000000
2012-06-07 14:42:51.566632 7f6292f49700 osd.4 9 handle_osd_map fsid 520ab01f-cd87-4214-b6ce-d4f7c29da98a != 00000000-0000-0000-0000-000000000000
2012-06-07 14:42:53.026978 7f6292f49700 osd.4 9 handle_osd_map fsid 520ab01f-cd87-4214-b6ce-d4f7c29da98a != 00000000-0000-0000-0000-000000000000
2012-06-07 14:42:56.246388 7f6292f49700 osd.4 9 handle_osd_map fsid 520ab01f-cd87-4214-b6ce-d4f7c29da98a != 00000000-0000-0000-0000-000000000000
2012-06-07 14:42:57.563794 7f6292f49700 osd.4 9 handle_osd_map fsid 520ab01f-cd87-4214-b6ce-d4f7c29da98a != 00000000-0000-0000-0000-000000000000
2012-06-07 14:42:59.296518 7f6292f49700 osd.4 9 handle_osd_map fsid 520ab01f-cd87-4214-b6ce-d4f7c29da98a != 00000000-0000-0000-0000-000000000000
2012-06-07 14:43:00.571422 7f6292f49700 osd.4 9 handle_osd_map fsid 520ab01f-cd87-4214-b6ce-d4f7c29da98a != 00000000-0000-0000-0000-000000000000
2012-06-07 14:43:11.428808 7f628b11f700 -- 192.168.123.2:6800/7146 >> 192.168.123.3:0/552115952 pipe(0x4ff2c80 sd=41 pgs=0 cs=0 l=0).accept peer addr is really 192.168.123.3:0/552115952 (socket is 192.168.123.3:43234/0)
2012-06-07 14:43:25.204361 7f628fd41700 osd.4 225 pg[2.ac( v 46'5107 lc 9'5101 (9'4099,46'5107] n=165 ec=1 les/c 214/214 222/222/222) [4,1] r=0 lpr=225 lcod 0'0 mlcod 0'0 active+recovering m=1] watch: ctx->obc=0x5e5a000 cookie=13 oi.version=5102 ctx->at_version=225'5108
2012-06-07 14:43:25.204405 7f628fd41700 osd.4 225 pg[2.ac( v 46'5107 lc 9'5101 (9'4099,46'5107] n=165 ec=1 les/c 214/214 222/222/222) [4,1] r=0 lpr=225 lcod 0'0 mlcod 0'0 active+recovering m=1] watch: oi.user_version=215
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)' thread 7f629474c700 time 2012-06-07 14:43:25.232802
osd/ReplicatedPG.cc: 3380: FAILED assert(obc->unconnected_watchers.size() == 0)
ceph version 0.43 (commit:9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
1: /usr/bin/ceph-osd() [0x52f1e2]
2: (ReplicatedPG::_applied_recovered_object(ObjectStore::Transaction*, ReplicatedPG::ObjectContext*)+0x182) [0x52f692]
3: (Finisher::finisher_thread_entry()+0x190) [0x74a870]
4: (()+0x7efc) [0x7f629fc2befc]
5: (clone()+0x6d) [0x7f629e25c59d]
ceph version 0.43 (commit:9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
1: /usr/bin/ceph-osd() [0x52f1e2]
2: (ReplicatedPG::_applied_recovered_object(ObjectStore::Transaction*, ReplicatedPG::ObjectContext*)+0x182) [0x52f692]
3: (Finisher::finisher_thread_entry()+0x190) [0x74a870]
4: (()+0x7efc) [0x7f629fc2befc]
5: (clone()+0x6d) [0x7f629e25c59d]
*** Caught signal (Aborted) **
in thread 7f629474c700
ceph version 0.43 (commit:9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
1: /usr/bin/ceph-osd() [0x6e3866]
2: (()+0x10060) [0x7f629fc34060]
3: (gsignal()+0x35) [0x7f629e1af3a5]
4: (abort()+0x17b) [0x7f629e1b2b0b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f629ea6fd7d]
6: (()+0xb9f26) [0x7f629ea6df26]
7: (()+0xb9f53) [0x7f629ea6df53]
8: (()+0xba04e) [0x7f629ea6e04e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x200) [0x678bc0]
10: /usr/bin/ceph-osd() [0x52f1e2]
11: (ReplicatedPG::_applied_recovered_object(ObjectStore::Transaction*, ReplicatedPG::ObjectContext*)+0x182) [0x52f692]
12: (Finisher::finisher_thread_entry()+0x190) [0x74a870]
13: (()+0x7efc) [0x7f629fc2befc]
14: (clone()+0x6d) [0x7f629e25c59d]


