Re: ONE pg deep-scrub blocks cluster

Mehmet <ceph@xxxxxxxxxx> · Mon, 29 Aug 2016 10:23:14 +0200

Hey JC,

thank you very much! - My answers inline :)

Am 2016-08-26 19:26, schrieb LOPEZ Jean-Charles:
Hi Mehmet,

what is interesting in the PG stats is that the PG contains around
700+ objects and you said that you are using RBD only in your cluster
if IIRC. With the default RBD order (4MB objects) this would consume
around 2.8GB and the PG stats show there is over 100GB of data.

Right, we actualy using this cluster only for KVM Disk-Images - accessed 
through librbd (using proxmox).
That is realy strange, how can that happen? :

So here are some extra questions so I can try to imagine better into
more details what could go wrong:
- Have you changed the object allocation size for some RBDs in your
environment or you always use the default RBD Image order?

we did not changed the allocation. All default - except setting tunables 
to "jewel" after setting up our ceph cluster.

- Do you snapshot your RBDs?

- If the answer is yes, is there on particular set of RBDs you
snapshot more often and keep the snapshot for a longer amount of time?

Yes (via proxmox), but actualy only before doing Updates within the VMs. 
Normaly we delete snapshots
after a few days. There are no snapshots actualy - the last one was 
taken perhaps a month ago and is deleted.

I’d recommend that you inspect the pg 0.223 head directory to find out
which RBD Image objects it is hosting. Issue a ls command on
/var/lib/ceph/osd/ceph-9/current/0.0223__head and identify unique RBD
Image header. Match those header against the header you can find for
each RBD Image in your cluster (you can check that with an rbd info
<imagename> command and by looking at the RBD Prefix output. This will
let you know for sure which RBD Images use that particular PG and then
check the snapshot settings and inspect the order setting for all of
them.

That was a very good point so i had this time a closer look on the 
objects within

- /var/lib/ceph/osd/ceph-9/current/0.223_head/DIR_3/DIR_2/DIR_2

and found this:

root@osdserver1:/var/lib/ceph/osd/ceph-9/current/0.223_head/DIR_3/DIR_2/DIR_2# 
ls -lah | grep -v "4.0M"
total 101G
drwxr-xr-x 2 ceph ceph  16K Aug 29 09:38 .
drwxr-xr-x 6 ceph ceph  24K Jul 31 01:04 ..
-rw-r--r-- 1 ceph ceph  64K Jul 31 00:44 
benchmark\udata\uinkscope.domain.tld\u20940\uobject3394__head_4E4D8223__0
-rw-r--r-- 1 ceph ceph  64K Jul 31 00:44 
benchmark\udata\uinkscope.domain.tld\u22557\uobject2956__head_B9E34223__0
-rw-r--r-- 1 ceph ceph  64K Jul 31 00:44 
benchmark\udata\uinkscope.domain.tld\u22580\uobject214__head_329AE223__0
-rw-r--r-- 1 ceph ceph  64K Jul 31 00:44 
benchmark\udata\uinkscope.domain.tld\u22580\uobject588__head_31751223__0
-rw-r--r-- 1 ceph ceph  64K Jul 31 00:44 
benchmark\udata\uinkscope.domain.tld\u25893\uobject2770__head_7A2F2223__0
-rw-r--r-- 1 ceph ceph  64K Aug  5 13:52 
benchmark\udata\uinkscope.domain.tld\u3982\uobject200__head_9AA23223__0
-rw-r--r-- 1 ceph ceph    0 Jul 31 00:31 __head_00000223__0
-rw-r--r-- 1 ceph ceph 100G Jul 31 01:04 vm-101-disk-2__head_383C3223__0

*vm-101-disk-2__head_383C3223__0* seem wrong here and exists also on the 
replica osd for this pg.

What you can try doing is to change the following settings on all the
OSDs that host this particular PG and see if it makes things better

[osd]
osd_scrub_begin_hour = {Hour scrubbing starts}     # These 2
parameters will let you define the best window for scrubbing and hence
reduce impact on the VMs at a crucial moment of the day. Default to 0.
osd_scrub_end_hour = {Hour scrubbing stops}        # Values are
expressed in 24 hour format for both. Begin=3 End=4 scrubbing will
only take place between 3 and 4 in the morning. Default to 24.
osd_scrub_sleep = .1                                                #
Introduce a delay between scrub operation to give more disk time to
the OSD to perform client operations. Defaults to 0.0
osd_scrub_chunk_max = 5                                       #
maximum number of chunks the scrub will process in one go. Defaults to
25.
osd_deep_scrub_stride = 1048576                           # Read size
during scrubbing operations. The idea here is to do less chunks but
bigger sequential reads. Defaults to 512KB=524288.

Thanks for this suggenstions.
These are already on my notebook to setup after the cluster is running 
fine without any issues.

If your devices use coq as an elevator, you can add the following 2
lines in the section to lower the priority of the scrubbing read IOs
osd_disk_thread_ioprio_class = idle                          # Change
the coq priority for the scrub thread. Default is the same priority
class as the OSD
osd_disk_thread_ioprio_priority = 0                           # Change
the priority within the class to the lowest possible. Default is the
same priority as the OSD

Actualy they are on default (deadline), but it seems to be worth to 
change this to cfq.
We will have a closer look on this when the cluster is working without 
any issues.

Keep me posted on your tests and findings.
If this does fix the performance impact problem, I suggest you apply
those changes to the config file and push it to all the OSD nodes and
restart the OSDs in a gradual manner.

Of course!... and done with this eMail :)

Could be the mentionend file above the cause for the trouble?
Is it possible to delete this via "rm -f" or would this cause any other 
issues?

Best regards
- Mehmet

Regards
JC Lopez
S. Technical Instructor, Global Storage Consulting Practice
Red Hat, Inc.
jelopez@xxxxxxxxxx
+1 408-680-6959

On Aug 26, 2016, at 04:16, Mehmet <ceph@xxxxxxxxxx> wrote:

Hello JC,

as promised here is my

- ceph.conf (I have done a "diff" on all involved server - all using 
the same ceph.conf) = ceph_conf.txt
- ceph pg 0.223 query = ceph_pg_0223_query_20161236.txt
- ceph -s = ceph_s.txt
- ceph df = ceph_df.txt
- ceph osd df = ceph_osd_df.txt
- ceph osd dump | grep pool = ceph_osd_dump_pool.txt
- ceph osd crush rule dump = ceph_osd_crush_rule_dump.txt

as attached txt files.

I have done again a "ceph pg deep-scrub 0.223" before I have created 
the files above. The issue still exists today ~ 12:24 on ... :*(
The deep-scrub on this pg has taken ~14 minutes:

- 2016-08-26 12:24:01.463411 osd.9 172.16.0.11:6808/29391 1777 : 
cluster [INF] 0.223 deep-scrub starts
- 2016-08-26 12:38:07.201726 osd.9 172.16.0.11:6808/29391 2485 : 
cluster [INF] 0.223 deep-scrub ok

Ceph: version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
OS: Ubuntu 16.04 LTS (Linux osdserver1 4.4.0-31-generic #50-Ubuntu SMP 
Wed Jul 13 00:07:12 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux)

As a remark, assuming the size parameter of the rbd pool is set to 3, 
the number of PGs in your cluster should be higher

I know I could increase this to 2048 (with 30 OSDs). But perhaps we 
will create further Pools so I did not want to set this to high for 
this pool because it is not possible to decrease the pg for the pool.
Furthermore if I would change this now and the issue is gone, we would 
not know what the cause was... :)

When you need further informations, please do not hesitate to ask - I 
will provides this as soon as possible.
Please keep in mind that I have 1 additional disk on each OSD-Node (3 
disk in sum) which I can add to the cluster, so that the acting set 
for this pg could change.
I did removed this before to force other OSDs to become the acting set 
for pg 0.223.

Thank you, your help is very appreciated!

- Mehmet

Am 2016-08-25 13:58, schrieb ceph@xxxxxxxxxx:
Hey JC,
Thank you very much for your mail!
I will provide the Informations tomorrow when i am at work again.
Hope that we will find a solution :)
- Mehmet
Am 24. August 2016 16:58:58 MESZ, schrieb LOPEZ Jean-Charles
<jelopez@xxxxxxxxxx>:
Hi Mehmet,
I’m just seeing your message and read the thread going with it.
Can you please provide me with a copy of the ceph.conf file on the
MON and OSD side assuming it’s identical and if the ceph.conf file
is different on the client side (the VM side) can you please provide
me with a copy of it.
Can you also provide me as attached txt files with
output of your pg query of the pg 0.223?
output of ceph -s
output of ceph df
output of ceph osd df
output of ceph osd dump | grep pool
output of ceph osd crush rule dump
Thank you and I’ll see if I can get something to ease your pain.
As a remark, assuming the size parameter of the rbd pool is set to
3, the number of PGs in your cluster should be higher
If we manage to move forward and get it fixed, we will repost to the
mailing list the changes we made to your configuration.
Regards
JC
On Aug 24, 2016, at 06:41, Mehmet <ceph@xxxxxxxxxx> wrote:
Hello Guys,
the issue still exists :(
If we run a "ceph pg deep-scrub 0.223" nearly all VMs stop for a
while (blocked requests).
- we already replaced the OSDs (SAS Disks - journal on NVMe)
- Removed OSDs so that acting set for pg 0.223 has changed
- checked the filesystem on the acting OSDs
- changed the tunables back from jewel to default
- changed the tunables again to jewel from default
- done a deep-scrub on the hole OSDs (ceph osd deep-scrub osd.<id>)
- only when a deeph-scrub on pg 0.223 runs we get blocked requests
The deep-scrub on pg 0.223 took always 13-15 Min. to finish. It
does not matter which OSDs are in the acting set for this pg.
So, i dont have any ideas what coul d be the issue for this.
As long as "ceph osd set nodeep-scrub" is set - so that no
deep-scrub on 0.223 is running - the cluster is fine!
Could this be a bug?
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
Kernel: 4.4.0-31-generic #50-Ubuntu
Any ideas?
- Mehmet
Am 2016-08-02 17:57, schrieb c:
Am 2016-08-02 13:30, schrieb c:
Hello Guys,
this time without the original acting-set osd.4, 16 and 28. The
issue
still exists...
[...]
For the record, this ONLY happens with this PG and no others that
share
the same OSDs, right?
Yes, right.
[...]
When doing the deep-scrub, monitor (atop, etc) all 3 nodes and
see if a
particular OSD (HDD) stands out, as I would expect it to.
Now I logged all disks via atop each 2 seconds while the deep-scrub
was running ( atop -w osdXX_atop 2 ).
As you expected all disks was 100% busy - with constant 150MB
(osd.4), 130MB (osd.28) and 170MB (osd.16)...
- osd.4 (/dev/sdf) http://slexy.org/view/s21emd2u6j [1] [1]
- osd.16 (/dev/sdm): http://slexy.org/view/s20vukWz5E [2] [2]
- osd.28 (/dev/sdh): http://slexy.org/view/s20YX0lzZY [3] [3]
[...]
But what is causing this? A deep-scrub on all other disks - same
model and ordered at the same time - seems to not have this issue.
[...]
Next week, I will do this
1.1 Remove osd.4 completely from Ceph - again (the actual primary
for PG 0.223)
osd.4 is now removed completely.
The Primary PG is now on "osd.9"
# ceph pg map 0.223
osdmap e8671 pg 0.223 (0.223) -> up [9,16,28] acting [9,16,28]
1.2 xfs_repair -n /dev/sdf1 (osd.4): to see possible error
xfs_repair did not find/show any error
1.3 ceph pg deep-scrub 0.223
- Log with " ceph tell osd.4,16,28 injectargs "--debug_osd 5/5"
Because now osd.9 is the Primary PG i have set the debug_osd on this
too:
ceph tell osd.9 injectargs "--debug_osd 5/5"
and run the deep-scrub on 0.223 (and againg nearly all of my VMs stop
working for a while)
Start @ 15:33:27
End @ 15:48:31
The "ceph.log"
- http://slexy.org/view/s2WbdApDLz [5]
The related LogFiles (OSDs 9,16 and 28) and the LogFile via atop for
the osds
LogFile - osd.9 (/dev/sdk)
- ceph-osd.9.log: http://slexy.org/view/s2kXeLMQyw [6]
- atop Log: http://slexy.org/view/s21wJG2qr8 [7]
LogFile - osd.16 (/dev/sdh)
- ceph-osd.16.log: http://slexy.org/view/s20D6WhD4d [8]
- atop Log: http://slexy.org/view/s2iMjer8rC [9]
LogFile - osd.28 (/dev/sdm)
- ceph-osd.28.log: http://slexy.org/view/s21dmXoEo7 [10]
- atop log: http://slexy.org/view/s2gJqzu3uG [11]
2.1 Remove osd.16 completely from Ceph
osd.16 is now removed completely - now replaced with osd.17 witihin
the acting set.
# ceph pg map 0.223
osdmap e9017 pg 0.223 (0.223) -> up [9,17,28] acting [9,17,28]
2.2 xfs_repair -n /dev/sdh1
xfs_repair did not find/show any error
2.3 ceph pg deep-scrub 0.223
- Log with " ceph tell osd.9,17,28 injectargs "--debug_osd 5/5"
and run the deep-scrub on 0.223 (and againg nearly all of my VMs stop
working for a while)
Start @ 2016-08-02 10:02:44
End @ 2016-08-02 10:17:22
The "Ceph.log": http://slexy.org/view/s2ED5LvuV2 [12]
LogFile - osd.9 (/dev/sdk)
- ceph-osd.9.log: http://slexy.org/view/s21z9JmwSu [13]
- atop Log: http://slexy.org/view/s20XjFZFEL [14]
LogFile - osd.17 (/dev/sdi)
- ceph-osd.17.log: http://slexy.org/view/s202fpcZS9 [15]
- atop Log: http://slexy.org/view/s2TxeR1JSz [16]
LogFile - osd.28 (/dev/sdm)
- ceph-osd.28.log: http://slexy.org/view/s2eCUyC7xV [17]
- atop log: http://slexy.org/view/s21AfebBqK [18]
3.1 Remove osd.28 completely from Ceph
Now osd.28 is also removed completely from Ceph - now replaced with
osd.23
# ceph pg map 0.223
osdmap e9363 pg 0.223 (0.223) -> up [9,17,23] acting [9,17,23]
3.2 xfs_repair -n /dev/sdm1
As expected: xfs_repair did not find/show any error
3.3 ceph pg deep-scrub 0.223
- Log with " ceph tell osd.9,17,23 injectargs "--debug_osd 5/5"
... againg nearly all of my VMs stop working for a while...
Now are all "original" OSDs (4,16,28) removed which was in the
acting-set when i wrote my first eMail to this mailinglist. But the
issue still exists with different OSDs (9,17,23) as the acting-set
while the questionable PG 0.223 is still the same!
In suspicion that the "tunable" could be the cause, i have now
changed
this back to "default" via " ceph osd crush tunables default ".
This will take a whille... then i will do " ceph pg deep-scrub 0.223
"
again (without osds 4,16,28)...
Really, i do not know whats going on here.
Ceph finished its recovering to "default" tunables but the issue
still
exists!:*(
The acting set has changed again
# ceph pg map 0.223
osdmap e11230 pg 0.223 (0.223) -> up [9,11,20] acting [9,11,20]
But when i start " ceph pg deep-scrub 0.223 ", again nearly all of my
VMs stop working for a while!
Does any one have an idea where i should have a look to find the
cause for this?
It seems that everytime the Primary OSD from the acting set of PG
0.223 (*4*,16,28; *9*,17,23 or *9*,11,20) leads to "currently waiting
for subops from 9,X" and the deep-scrub takes always nearly 15
minutes
to finish.
My output from " ceph pg 0.223 query "
- http://slexy.org/view/s21d6qUqnV [19]
Mehmet
For the records: Although nearly all disks are busy i have no
slow/blocked requests and i am watching the logfiles for nearly 20
minutes now...
Your help is realy appreciated!
- Mehmet
-------------------------
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [4]
-------------------------
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [4]
-------------------------
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [4]
JC Lopez
S. Technical Instructor, Global Storage Consulting Practice
Red Hat, Inc.
jelopez@xxxxxxxxxx
+1 408-680-6959
Links:
------
[1] http://slexy.org/view/s21emd2u6j
[2] http://slexy.org/view/s20vukWz5E
[3] http://slexy.org/view/s20YX0lzZY
[4] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[5] http://slexy.org/view/s2WbdApDLz
[6] http://slexy.org/view/s2kXeLMQyw
[7] http://slexy.org/view/s21wJG2qr8
[8] http://slexy.org/view/s20D6WhD4d
[9] http://slexy.org/view/s2iMjer8rC
[10] http://slexy.org/view/s21dmXoEo7
[11] http://slexy.org/view/s2gJqzu3uG
[12] http://slexy.org/view/s2ED5LvuV2
[13] http://slexy.org/view/s21z9JmwSu
[14] http://slexy.org/view/s20XjFZFEL
[15] http://slexy.org/view/s202fpcZS9
[16] http://slexy.org/view/s2TxeR1JSz
[17] http://slexy.org/view/s2eCUyC7xV
[18] http://slexy.org/view/s21AfebBqK
[19] http://slexy.org/view/s21d6qUqnV
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
<ceph_conf.txt><ceph_df.txt><ceph_osd_crush_rule_dump.txt><ceph_osd_df.txt><ceph_osd_dump_pool.txt><ceph_pg_0223_query_20161236.txt><ceph_s.txt>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com