Hello Guys,
the issue still exists :(
If we run "ceph pg deep-scrub 0.223", nearly all VMs stop for a while
(blocked requests).
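(Just to show what I mean by blocked requests - while the scrub runs we simply watch them with the usual commands, e.g.:
# ceph -w | grep -i "slow request"
# ceph health detail | grep -i blocked
The exact counts and OSD numbers vary from run to run, so I am only listing the commands here, not the output.)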
- we already replaced the OSDs (SAS disks - journal on NVMe)
- removed OSDs so that the acting set for pg 0.223 changed
- checked the filesystem on the acting OSDs
- changed the tunables back from jewel to default
- changed the tunables from default back to jewel again
- ran a deep-scrub on all OSDs (ceph osd deep-scrub osd.<id>, see the loop sketch below) -
only when a deep-scrub on pg 0.223 runs do we get blocked requests
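(The deep-scrub over all OSDs was done with a simple loop over the OSD ids, roughly like this:
# for id in $(ceph osd ls); do ceph osd deep-scrub osd.$id; done
"ceph osd ls" just prints the numeric ids of all OSDs in the cluster.)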
The deep-scrub on pg 0.223 always takes 13-15 minutes to finish; it does not
matter which OSDs are in the acting set for this pg.
So I have no idea what could be causing this.
As long as "ceph osd set nodeep-scrub" is in effect - so that no deep-scrub runs on
0.223 - the cluster is fine!
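For completeness, the workaround I am using right now is just the global flag:
# ceph osd set nodeep-scrub     <- stop all new deep-scrubs
# ceph osd unset nodeep-scrub   <- allow them again
While it is set, it shows up in the "flags" line of "ceph -s".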
Could this be a bug?
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
Kernel: 4.4.0-31-generic #50-Ubuntu
Any ideas?
- Mehmet
On 2016-08-02 17:57, c wrote:
On 2016-08-02 13:30, c wrote:
Hello Guys,
this time without the original acting set (osd.4, 16 and 28). The issue
still exists...
[...]
For the record, this ONLY happens with this PG and no others that share
the same OSDs, right?
Yes, right.
[...]
When doing the deep-scrub, monitor (atop, etc.) all 3 nodes and see if a
particular OSD (HDD) stands out, as I would expect it to.
Now I logged all disks via atop every 2 seconds while the deep-scrub
was running (atop -w osdXX_atop 2).
As you expected, all disks were 100% busy - with a constant 150 MB/s
(osd.4), 130 MB/s (osd.28) and 170 MB/s (osd.16)...
- osd.4 (/dev/sdf) http://slexy.org/view/s21emd2u6j [1]
- osd.16 (/dev/sdm): http://slexy.org/view/s20vukWz5E [2]
- osd.28 (/dev/sdh): http://slexy.org/view/s20YX0lzZY [3]
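(Side note: since these were recorded with "atop -w osdXX_atop 2", the raw files can also be replayed with
# atop -r osdXX_atop
and stepped through sample by sample with 't', in case the text pastes are hard to read.)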
[...]
But what is causing this? A deep-scrub on all other disks - same
model, ordered at the same time - does not seem to have this issue.
[...]
Next week, I will do the following:
1.1 Remove osd.4 completely from Ceph - again (the current primary
for PG 0.223)
osd.4 is now removed completely.
The primary for PG 0.223 is now osd.9:
# ceph pg map 0.223
osdmap e8671 pg 0.223 (0.223) -> up [9,16,28] acting [9,16,28]
1.2 xfs_repair -n /dev/sdf1 (osd.4): to check for possible errors
xfs_repair did not find/show any errors
1.3 ceph pg deep-scrub 0.223
- Logged with "ceph tell osd.4,16,28 injectargs '--debug_osd 5/5'"
Because osd.9 is now the primary for this PG, I have set debug_osd on it
too:
ceph tell osd.9 injectargs "--debug_osd 5/5"
and ran the deep-scrub on 0.223 (and again nearly all of my VMs stopped
working for a while)
Start @ 15:33:27
End @ 15:48:31
The "ceph.log"
- http://slexy.org/view/s2WbdApDLz
The related log files (OSDs 9, 16 and 28) and the atop logs for these
OSDs:
LogFile - osd.9 (/dev/sdk)
- ceph-osd.9.log: http://slexy.org/view/s2kXeLMQyw
- atop Log: http://slexy.org/view/s21wJG2qr8
LogFile - osd.16 (/dev/sdh)
- ceph-osd.16.log: http://slexy.org/view/s20D6WhD4d
- atop Log: http://slexy.org/view/s2iMjer8rC
LogFile - osd.28 (/dev/sdm)
- ceph-osd.28.log: http://slexy.org/view/s21dmXoEo7
- atop log: http://slexy.org/view/s2gJqzu3uG
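(Note: after collecting these logs the debug level can be turned back down again, e.g. with
# ceph tell osd.* injectargs '--debug_osd 0/5'
0/5 being, as far as I know, the default for debug_osd.)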
2.1 Remove osd.16 completely from Ceph
osd.16 is now removed completely - replaced with osd.17 within
the acting set.
# ceph pg map 0.223
osdmap e9017 pg 0.223 (0.223) -> up [9,17,28] acting [9,17,28]
2.2 xfs_repair -n /dev/sdh1 (osd.16)
xfs_repair did not find/show any errors
2.3 ceph pg deep-scrub 0.223
- Logged with "ceph tell osd.9,17,28 injectargs '--debug_osd 5/5'"
and ran the deep-scrub on 0.223 (and again nearly all of my VMs stopped
working for a while)
Start @ 2016-08-02 10:02:44
End @ 2016-08-02 10:17:22
The "Ceph.log": http://slexy.org/view/s2ED5LvuV2
LogFile - osd.9 (/dev/sdk)
- ceph-osd.9.log: http://slexy.org/view/s21z9JmwSu
- atop Log: http://slexy.org/view/s20XjFZFEL
LogFile - osd.17 (/dev/sdi)
- ceph-osd.17.log: http://slexy.org/view/s202fpcZS9
- atop Log: http://slexy.org/view/s2TxeR1JSz
LogFile - osd.28 (/dev/sdm)
- ceph-osd.28.log: http://slexy.org/view/s2eCUyC7xV
- atop log: http://slexy.org/view/s21AfebBqK
3.1 Remove osd.28 completely from Ceph
Now osd.28 is also removed completely from Ceph - replaced with
osd.23 in the acting set.
# ceph pg map 0.223
osdmap e9363 pg 0.223 (0.223) -> up [9,17,23] acting [9,17,23]
3.2 xfs_repair -n /dev/sdm1 (osd.28)
As expected, xfs_repair did not find/show any errors
3.3 ceph pg deep-scrub 0.223
- Logged with "ceph tell osd.9,17,23 injectargs '--debug_osd 5/5'"
... again nearly all of my VMs stop working for a while...
Now all "original" OSDs (4, 16, 28) that were in the acting set when I
wrote my first email to this mailing list have been removed. But the
issue still exists with different OSDs (9, 17, 23) as the acting set,
while the questionable PG 0.223 is still the same!
Suspecting that the "tunables" could be the cause, I have now changed
them back to "default" via "ceph osd crush tunables default".
This will take a while... then I will run "ceph pg deep-scrub 0.223"
again (without OSDs 4, 16, 28)...
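(Which tunables profile is currently active can be checked at any time with
# ceph osd crush show-tunables
which dumps the individual tunable values.)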
Really, I do not know what's going on here.
Ceph finished recovering after the change to "default" tunables, but the
issue still exists! :*(
The acting set has changed again
# ceph pg map 0.223
osdmap e11230 pg 0.223 (0.223) -> up [9,11,20] acting [9,11,20]
But when I start "ceph pg deep-scrub 0.223", again nearly all of my
VMs stop working for a while!
Does anyone have an idea where I should look to find the cause
of this?
It seems that every time it is the primary OSD of the acting set of PG
0.223 (*4*,16,28; *9*,17,23 or *9*,11,20) that leads to "currently waiting
for subops from 9,X", and the deep-scrub always takes nearly 15 minutes
to finish.
My output from " ceph pg 0.223 query "
- http://slexy.org/view/s21d6qUqnV
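If it helps, the ops the primary is stuck on can also be inspected via its admin socket while the deep-scrub runs, e.g. on the node hosting osd.9:
# ceph daemon osd.9 dump_ops_in_flight
# ceph daemon osd.9 dump_historic_ops
I can capture that output during the next run if it would be useful.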
Mehmet
For the record: although nearly all disks are busy, I have no
slow/blocked requests, and I have been watching the log files for nearly 20
minutes now...
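(I am watching them with something like
# tail -f /var/log/ceph/ceph.log | grep -iE "slow|blocked"
on one of the monitor nodes, plus the individual OSD logs on the three hosts.)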
Your help is really appreciated!
- Mehmet
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com