Thanks!  I knew about noscrub, but I didn't realize that the flapping would cancel a scrub in progress.  So the scrub doesn't appear to be the reason it wasn't recovering.

After a flap, it goes into:

2014-04-02 14:11:09.776810 mon.0 [INF] pgmap v5323181: 2592 pgs: 2591 active+clean, 1 active+recovery_wait; 15066 GB data, 30527 GB used, 29060 GB / 59588 GB avail; 1/36666878 objects degraded (0.000%); 0 B/s, 11 keys/s, 2 objects/s recovering

It stays in that state until the OSD gets kicked out again.

The problem is that the flapping OSD is spamming its logs with:

2014-04-02 14:12:01.242425 7f344a97d700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f3447977700' had timed out after 15

None of the other OSDs are logging that.  Is there anything I can do to repair the heartbeat map on osd.11?
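One thing I'm considering, unless someone has a better idea, is giving osd.11's op thread more headroom so it stops tripping the heartbeat warnings (and eventually the suicide assert) while it works through whatever it's stuck on.  Roughly the following -- the values are guesses on my part, and I realize this only masks the symptom if the disk itself is the real problem:

    # in ceph.conf on that host, then restart the daemon
    [osd.11]
        osd op thread timeout = 60
        osd op thread suicide timeout = 600

    # or injected at runtime, if the daemon will still answer
    ceph tell osd.11 injectargs '--osd_op_thread_timeout 60 --osd_op_thread_suicide_timeout 600'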
In case it helps, here are the osd.11 logs after a daemon restart:

2014-04-02 14:10:58.267556 7f3467ff6780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-osd, pid 7791
2014-04-02 14:10:58.269782 7f3467ff6780 1 filestore(/var/lib/ceph/osd/ceph-11) mount detected xfs
2014-04-02 14:10:58.269789 7f3467ff6780 1 filestore(/var/lib/ceph/osd/ceph-11) disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs
2014-04-02 14:10:58.306112 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is supported and appears to work
2014-04-02 14:10:58.306135 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2014-04-02 14:10:58.308070 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2014-04-02 14:10:58.357102 7f3467ff6780 0 filestore(/var/lib/ceph/osd/ceph-11) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2014-04-02 14:10:58.360837 7f3467ff6780 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2014-04-02 14:10:58.360851 7f3467ff6780 1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 20: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-04-02 14:10:58.422842 7f3467ff6780 1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 20: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-04-02 14:10:58.423241 7f3467ff6780 1 journal close /var/lib/ceph/osd/ceph-11/journal
2014-04-02 14:10:58.424433 7f3467ff6780 1 filestore(/var/lib/ceph/osd/ceph-11) mount detected xfs
2014-04-02 14:10:58.442963 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is supported and appears to work
2014-04-02 14:10:58.442974 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2014-04-02 14:10:58.445144 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2014-04-02 14:10:58.451977 7f3467ff6780 0 filestore(/var/lib/ceph/osd/ceph-11) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2014-04-02 14:10:58.454481 7f3467ff6780 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2014-04-02 14:10:58.454495 7f3467ff6780 1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 21: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-04-02 14:10:58.465211 7f3467ff6780 1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 21: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-04-02 14:10:58.466825 7f3467ff6780 0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2014-04-02 14:10:58.468745 7f3467ff6780 0 osd.11 11688 crush map has features 1073741824, adjusting msgr requires for clients
2014-04-02 14:10:58.468756 7f3467ff6780 0 osd.11 11688 crush map has features 1073741824, adjusting msgr requires for osds
2014-04-02 14:11:07.822045 7f343de58700 0 -- 10.194.0.7:6800/7791 >> 10.194.0.7:6822/14075 pipe(0x1c96e000 sd=177 :6800 s=0 pgs=0 cs=0 l=0 c=0x1b7e3000).accept connect_seq 0 vs existing 0 state connecting
2014-04-02 14:11:07.822182 7f343f973700 0 -- 10.194.0.7:6800/7791 >> 10.194.0.7:6806/26942 pipe(0x1c96e280 sd=82 :6800 s=0 pgs=0 cs=0 l=0 c=0x1b7e3160).accept connect_seq 0 vs existing 0 state connecting
2014-04-02 14:11:20.333163 7f344a97d700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f3447977700' had timed out after 15
<snip repeats>
2014-04-02 14:13:35.310407 7f344a97d700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f344a97d700 time 2014-04-02 14:13:35.308718
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x107) [0x89df87]
 2: (ceph::HeartbeatMap::is_healthy()+0xa7) [0x89e937]
 3: (OSD::handle_osd_ping(MOSDPing*)+0x85b) [0x60f79b]
 4: (OSD::heartbeat_dispatch(Message*)+0x4d3) [0x610a23]
 5: (DispatchQueue::entry()+0x549) [0x9f4aa9]
 6: (DispatchQueue::DispatchThread::entry()+0xd) [0x92ffdd]
 7: (()+0x7e9a) [0x7f346721ae9a]
 8: (clone()+0x6d) [0x7f3465cbe3fd]

All of the other OSDs are spamming:

2014-04-02 14:15:45.275858 7f4eb1e35700 -1 osd.7 11697 heartbeat_check: no reply from osd.11 since back 2014-04-02 14:13:56.927261 front 2014-04-02 14:13:56.927261 (cutoff 2014-04-02 14:15:25.275855)
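Next time it comes back up, I'm going to try to catch what the op thread is actually blocked on before it hits the suicide timeout again.  Roughly this, assuming the default admin socket path, and with sdX standing in for whatever spindle backs osd.11:

    # what the OSD thinks it's working on right now, and what recently took a long time
    ceph --admin-daemon /var/run/ceph/ceph-osd.11.asok dump_ops_in_flight
    ceph --admin-daemon /var/run/ceph/ceph-osd.11.asok dump_historic_ops

    # is the underlying disk keeping up, or throwing errors?
    iostat -x 5
    smartctl -a /dev/sdX    # sdX is a placeholder for osd.11's data disk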
Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis at centraldesktop.com

Central Desktop. Work together in ways you never thought possible.

On 4/2/14 13:38, Sage Weil wrote:
> On Wed, 2 Apr 2014, Craig Lewis wrote:
>> Is there any way to cancel a scrub on a PG?
>>
>> I have an OSD that's recovering, and there's a single PG left waiting:
>>
>> 2014-04-02 13:15:39.868994 mon.0 [INF] pgmap v5322756: 2592 pgs: 2589 active+clean, 1 active+recovery_wait, 2 active+clean+scrubbing+deep; 15066 GB data, 30527 GB used, 29061 GB / 59588 GB avail; 1/36666878 objects degraded (0.000%)
>>
>> The PG that is in recovery_wait is on the same OSD that is being deep scrubbed.  I don't have journals on SSD, so recovery and scrubbing are heavily throttled.  I want to cancel the scrub so the recovery can complete.  I'll manually restart the deep scrub when it's done.
>>
>> Normally I'd just wait, but this OSD is flapping.  It keeps getting kicked out of the cluster for being unresponsive.  I'm hoping that if I cancel the scrub, it will allow the recovery to complete and the OSD will stop flapping.
>
> You can 'ceph osd set noscrub' to prevent a new scrub from starting.
> Next time it flaps, the scrub won't restart.  The only way to cancel an
> in-progress scrub is to force a peering event, usually by manually marking
> the osd down (ceph osd down N).
>
> sage