On Sun, 19 Jan 2014, Guang wrote: > Thanks Sage. > > I just captured part of the log (it was fast growing), the process did > not hang but I saw the same pattern repeatedly. Should I increase the > log level and send over email (it constantly reproduced)? Sure! A representative fragment of the repeating fragment shoudl be enough. s > > Thanks, > Guang > > On Jan 18, 2014, at 12:05 AM, Sage Weil <sage@xxxxxxxxxxx> wrote: > > > On Fri, 17 Jan 2014, Guang wrote: > >> Thanks Sage. > >> > >> I further narrow down the problem to #any command using paxos service would hang#, following are details: > >> > >> 1. I am able to run ceph status / osd dump, etc., however, the result are out of date (though I stopped all OSDs, it does not reflect in ceph status report). > >> > >> -bash-4.1$ sudo ceph -s > >> cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 > >> health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near full osd(s); 57/751 in osds are down; noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set > >> monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 123278, quorum 0,1,2 osd151,osd152,osd153 > >> osdmap e134893: 781 osds: 694 up, 751 in > >> pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%) > >> mdsmap e1: 0/0/1 up > >> > >> 2. If I run a command which uses paxos, the command will hang forever, this includes, ceph osd set noup (and also including those commands osd send to monitor when being started (create-or-add)). > >> > >> I attached the corresponding monitor log (it is like a bug). > > > > I see the osd set command coming through, but it arrives while paxos is > > converging and the log seems to end before the mon would normally process > > te delayed messages. Is there a reason why the log fragment you attached > > ends there, or did the process hang or something? > > > > Thanks- > > sage > > > >> I > >> > >> On Jan 17, 2014, at 1:35 AM, Sage Weil <sage@xxxxxxxxxxx> wrote: > >> > >>> Hi Guang, > >>> > >>> On Thu, 16 Jan 2014, Guang wrote: > >>>> I still have bad the luck to figure out what is the problem making authentication failure, so in order to get the cluster back, I tried: > >>>> 1. stop all daemons (mon & osd) > >>>> 2. change the configuration to disable cephx > >>>> 3. start mon daemons (3 in total) > >>>> 4. start osd daemon one by one > >>>> > >>>> After finishing step 3, the cluster can be reachable ('ceph -s' give results): > >>>> -bash-4.1$ sudo ceph -s > >>>> cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 > >>>> health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near full osd(s); 57/751 in osds are down; noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set > >>>> monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 106022, quorum 0,1,2 osd151,osd152,osd153 > >>>> osdmap e134893: 781 osds: 694 up, 751 in > >>>> pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%) > >>>> mdsmap e1: 0/0/1 up > >>>> (at this point, all OSDs should be down). > >>>> > >>>> When I tried to start OSD daemon, the starting script got hang, and the process hang is: > >>>> root 80497 80496 0 08:18 pts/0 00:00:00 python /usr/bin/ceph --name=osd.22 --keyring=/var/lib/ceph/osd/ceph-22/keyring osd crush create-or-move -- 22 0.40 root=default host=osd173 > >>>> > >>>> When I strace the starting script, I got the following traces (process 75873 is the above process), it failed with futex and then do a infinite loop: > >>>> select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout) > >>>> Any idea what might trigger this? > >>> > >>> It is hard to tell from the strace what is going on from this. Do you see > >>> that the OSDs are booting in ceph.log (or ceph -w output)? If not, I > >>> would look at the osd daemon log for clues. You may need to turn up > >>> debugging to see (ceph daemon osd.NNN config set debug_osd 20 to adjust > >>> the level on the running daemon). > >>> > >>> If they are booting, it is mostly a matter of letting it recover and come > >>> up. We have seen patterns where configuration or network issues have let > >>> the system bury itself under a series of osdmap updates. If you see that > >>> in the log when you turn up debugging, or see the osds going up and down > >>> when you try to bring the cluster up, that could be what is going on. A > >>> strategy that has worked there is to let all the osds catch up on their > >>> maps before trying to peer and join the cluster. To do that, 'ceph osd > >>> set noup' (which prevents the osds from joining), wait for the ceph-osd > >>> processes to stop chewing on maps (watch the cpu utilization in top), and > >>> once they are all ready 'ceph osd unset noup' and let them join and peer > >>> all at once. > >>> > >>> sage > >>> > >>>> > >>>> ======= STRACE (PARTIAL) ========== > >>>> [pid 75873] futex(0xf707a0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> > >>>> [pid 75878] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f5da6529000 > >>>> [pid 75878] munmap(0x7f5da6529000, 28143616) = 0 > >>>> [pid 75878] munmap(0x7f5dac000000, 38965248) = 0 > >>>> [pid 75878] mprotect(0x7f5da8000000, 135168, PROT_READ|PROT_WRITE) = 0 > >>>> [pid 75878] futex(0xf707a0, FUTEX_WAKE_PRIVATE, 1) = 1 > >>>> [pid 75873] <... futex resumed> ) = 0 > >>>> [pid 75873] futex(0xdd3cb0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> > >>>> [pid 75878] futex(0xdd3cb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> > >>>> [pid 75873] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) > >>>> [pid 75878] <... futex resumed> ) = 0 > >>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...> > >>>> [pid 75878] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0 > >>>> [pid 75878] mmap(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f5dadb28000 > >>>> [pid 75878] mprotect(0x7f5dadb28000, 4096, PROT_NONE) = 0 > >>>> [ omit some entries?] > >>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout) > >>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout) > >>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout) > >>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout) > >>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout) > >>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout) > >>>> [pid 75873] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout) > >>>> > >>>> > >>>> Thanks, > >>>> Guang > >>>> > >>>> On Jan 15, 2014, at 5:54 AM, Guang <yguang11@xxxxxxxxxxx> wrote: > >>>> > >>>>> Thanks Sage. > >>>>> > >>>>> -bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok mon_status > >>>>> { "name": "osd151", > >>>>> "rank": 2, > >>>>> "state": "electing", > >>>>> "election_epoch": 85469, > >>>>> "quorum": [], > >>>>> "outside_quorum": [], > >>>>> "extra_probe_peers": [], > >>>>> "sync_provider": [], > >>>>> "monmap": { "epoch": 1, > >>>>> "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2", > >>>>> "modified": "0.000000", > >>>>> "created": "0.000000", > >>>>> "mons": [ > >>>>> { "rank": 0, > >>>>> "name": "osd152", > >>>>> "addr": "10.193.207.130:6789\/0"}, > >>>>> { "rank": 1, > >>>>> "name": "osd153", > >>>>> "addr": "10.193.207.131:6789\/0"}, > >>>>> { "rank": 2, > >>>>> "name": "osd151", > >>>>> "addr": "10.194.0.68:6789\/0"}]}} > >>>>> > >>>>> And: > >>>>> > >>>>> -bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok quorum_status > >>>>> { "election_epoch": 85480, > >>>>> "quorum": [ > >>>>> 0, > >>>>> 1, > >>>>> 2], > >>>>> "quorum_names": [ > >>>>> "osd151", > >>>>> "osd152", > >>>>> "osd153"], > >>>>> "quorum_leader_name": "osd152", > >>>>> "monmap": { "epoch": 1, > >>>>> "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2", > >>>>> "modified": "0.000000", > >>>>> "created": "0.000000", > >>>>> "mons": [ > >>>>> { "rank": 0, > >>>>> "name": "osd152", > >>>>> "addr": "10.193.207.130:6789\/0"}, > >>>>> { "rank": 1, > >>>>> "name": "osd153", > >>>>> "addr": "10.193.207.131:6789\/0"}, > >>>>> { "rank": 2, > >>>>> "name": "osd151", > >>>>> "addr": "10.194.0.68:6789\/0"}]}} > >>>>> > >>>>> > >>>>> The election has been finished with leader selected from the above status. > >>>>> > >>>>> Thanks, > >>>>> Guang > >>>>> > >>>>> On Jan 14, 2014, at 10:55 PM, Sage Weil <sage@xxxxxxxxxxx> wrote: > >>>>> > >>>>>> On Tue, 14 Jan 2014, GuangYang wrote: > >>>>>>> Hi ceph-users and ceph-devel, > >>>>>>> I came across an issue after restarting monitors of the cluster, that authentication fails which prevents running any ceph command. > >>>>>>> > >>>>>>> After we did some maintenance work, I restart OSD, however, I found that the OSD would not join the cluster automatically after being restarted, though TCP dump showed it had already sent messenger to monitor telling add me into the cluster. > >>>>>>> > >>>>>>> So that I suspected there might be some issues of monitor and I restarted monitor one by one (3 in total), however, after restarting monitors, all ceph command would fail saying authentication timeout? > >>>>>>> > >>>>>>> 2014-01-14 12:00:30.499397 7fc7f195e700 0 monclient(hunting): authenticate timed out after 300 > >>>>>>> 2014-01-14 12:00:30.499440 7fc7f195e700 0 librados: client.admin authentication error (110) Connection timed out > >>>>>>> Error connecting to cluster: Error > >>>>>>> > >>>>>>> Any idea why such error happened (restarting OSD would result in the same error)? > >>>>>>> > >>>>>>> I am thinking the authentication information is persisted in mon local disk and is there a chance those data got corrupted? > >>>>>> > >>>>>> That sounds unlikely, but you're right that the core problem is with the > >>>>>> mons. What does > >>>>>> > >>>>>> ceph daemon mon.`hostname` mon_status > >>>>>> > >>>>>> say? Perhaps they are not forming a quorum and that is what is preventing > >>>>>> authentication. > >>>>>> > >>>>>> sage > >>>>>> -- > >>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > >>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx > >>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html > >>>>> > >>>> > >>>> > >>> > >> > >> > > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html