Re: ONE pg deep-scrub blocks cluster

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Mehmet,

see inline

Keep me posted

JC

> On Aug 29, 2016, at 01:23, Mehmet <ceph@xxxxxxxxxx> wrote:
> 
> Hey JC,
> 
> thank you very much! - My answers inline :)
> 
> Am 2016-08-26 19:26, schrieb LOPEZ Jean-Charles:
>> Hi Mehmet,
>> what is interesting in the PG stats is that the PG contains around
>> 700+ objects and you said that you are using RBD only in your cluster
>> if IIRC. With the default RBD order (4MB objects) this would consume
>> around 2.8GB and the PG stats show there is over 100GB of data.
> 
> Right, we actualy using this cluster only for KVM Disk-Images - accessed through librbd (using proxmox).
> That is realy strange, how can that happen? :
> 
>> So here are some extra questions so I can try to imagine better into
>> more details what could go wrong:
>> - Have you changed the object allocation size for some RBDs in your
>> environment or you always use the default RBD Image order?
> 
> we did not changed the allocation. All default - except setting tunables to "jewel" after setting up our ceph cluster.
> 
>> - Do you snapshot your RBDs?
> 
>> - If the answer is yes, is there on particular set of RBDs you
>> snapshot more often and keep the snapshot for a longer amount of time?
> 
> Yes (via proxmox), but actualy only before doing Updates within the VMs. Normaly we delete snapshots
> after a few days. There are no snapshots actualy - the last one was taken perhaps a month ago and is deleted.
> 
>> I’d recommend that you inspect the pg 0.223 head directory to find out
>> which RBD Image objects it is hosting. Issue a ls command on
>> /var/lib/ceph/osd/ceph-9/current/0.0223__head and identify unique RBD
>> Image header. Match those header against the header you can find for
>> each RBD Image in your cluster (you can check that with an rbd info
>> <imagename> command and by looking at the RBD Prefix output. This will
>> let you know for sure which RBD Images use that particular PG and then
>> check the snapshot settings and inspect the order setting for all of
>> them.
> 
> That was a very good point so i had this time a closer look on the objects within
> 
> - /var/lib/ceph/osd/ceph-9/current/0.223_head/DIR_3/DIR_2/DIR_2
> 
> and found this:
> 
> root@osdserver1:/var/lib/ceph/osd/ceph-9/current/0.223_head/DIR_3/DIR_2/DIR_2# ls -lah | grep -v "4.0M"
> total 101G
> drwxr-xr-x 2 ceph ceph  16K Aug 29 09:38 .
> drwxr-xr-x 6 ceph ceph  24K Jul 31 01:04 ..
> -rw-r--r-- 1 ceph ceph  64K Jul 31 00:44 benchmark\udata\uinkscope.domain.tld\u20940\uobject3394__head_4E4D8223__0
> -rw-r--r-- 1 ceph ceph  64K Jul 31 00:44 benchmark\udata\uinkscope.domain.tld\u22557\uobject2956__head_B9E34223__0
> -rw-r--r-- 1 ceph ceph  64K Jul 31 00:44 benchmark\udata\uinkscope.domain.tld\u22580\uobject214__head_329AE223__0
> -rw-r--r-- 1 ceph ceph  64K Jul 31 00:44 benchmark\udata\uinkscope.domain.tld\u22580\uobject588__head_31751223__0
> -rw-r--r-- 1 ceph ceph  64K Jul 31 00:44 benchmark\udata\uinkscope.domain.tld\u25893\uobject2770__head_7A2F2223__0
> -rw-r--r-- 1 ceph ceph  64K Aug  5 13:52 benchmark\udata\uinkscope.domain.tld\u3982\uobject200__head_9AA23223__0
> -rw-r--r-- 1 ceph ceph    0 Jul 31 00:31 __head_00000223__0
> -rw-r--r-- 1 ceph ceph 100G Jul 31 01:04 vm-101-disk-2__head_383C3223__0
Looks like inkscope has been doing some benching and left some data in here but it’s actually pretty insignificant (6 objects of 64K). 
> 
> *vm-101-disk-2__head_383C3223__0* seem wrong here and exists also on the replica osd for this pg.
This seems more to me like a pure radios object of 100GB that was uploaded to the cluster. From the name it could be a VM disk image that was uploaded as an object. If it was an RBD object, it’s size would be in the boundaries of an RBD objects (order 12=4K order 25=32MB).


> 
>> What you can try doing is to change the following settings on all the
>> OSDs that host this particular PG and see if it makes things better
> 
>> [osd]
>> osd_scrub_begin_hour = {Hour scrubbing starts}     # These 2
>> parameters will let you define the best window for scrubbing and hence
>> reduce impact on the VMs at a crucial moment of the day. Default to 0.
>> osd_scrub_end_hour = {Hour scrubbing stops}        # Values are
>> expressed in 24 hour format for both. Begin=3 End=4 scrubbing will
>> only take place between 3 and 4 in the morning. Default to 24.
>> osd_scrub_sleep = .1                                                #
>> Introduce a delay between scrub operation to give more disk time to
>> the OSD to perform client operations. Defaults to 0.0
>> osd_scrub_chunk_max = 5                                       #
>> maximum number of chunks the scrub will process in one go. Defaults to
>> 25.
>> osd_deep_scrub_stride = 1048576                           # Read size
>> during scrubbing operations. The idea here is to do less chunks but
>> bigger sequential reads. Defaults to 512KB=524288.
> 
> Thanks for this suggenstions.
> These are already on my notebook to setup after the cluster is running fine without any issues.
> 
>> If your devices use coq as an elevator, you can add the following 2
>> lines in the section to lower the priority of the scrubbing read IOs
>> osd_disk_thread_ioprio_class = idle                          # Change
>> the coq priority for the scrub thread. Default is the same priority
>> class as the OSD
>> osd_disk_thread_ioprio_priority = 0                           # Change
>> the priority within the class to the lowest possible. Default is the
>> same priority as the OSD
> 
> Actualy they are on default (deadline), but it seems to be worth to change this to cfq.
> We will have a closer look on this when the cluster is working without any issues.
> 
>> Keep me posted on your tests and findings.
>> If this does fix the performance impact problem, I suggest you apply
>> those changes to the config file and push it to all the OSD nodes and
>> restart the OSDs in a gradual manner.
> 
> Of course!... and done with this eMail :)
> 
> Could be the mentionend file above the cause for the trouble?
> Is it possible to delete this via "rm -f" or would this cause any other issues?

Verify that when you do a "rados -p rbd ls | grep vm-101-disk-2” command, you can see an object named vm-101-disk-2.
Verify if you have an RBD named this way “rbd -p rbd ls | grep vm-101-disk-2"

As I’m not familiar with proxmox so I’d suggest the following:
If yes to 1, for security, copy this file somewhere else and then to a rados -p rbd rm vm-101-disk-2.
If no to 1, for security, copy this file somewhere else and then to a rm -rf vm-101-disk-2__head_383C3223__0

Make sure all your PG copies show the same content and wait for the next scrub to see what is happening.

If anything goes wrong you will be able to upload an object with the exact same content from the file you copied.

Is proxmox using such huge objects for something to your knowledge (VM boot image or something else)? Can you search the proxmox mailing list and open tickets to verify.

And is this the cause of the long deep scrub? I do think so but I’m not in front of the cluster.


> 
> Best regards
> - Mehmet
> 
>> Regards
>> JC Lopez
>> S. Technical Instructor, Global Storage Consulting Practice
>> Red Hat, Inc.
>> jelopez@xxxxxxxxxx
>> +1 408-680-6959
>>> On Aug 26, 2016, at 04:16, Mehmet <ceph@xxxxxxxxxx> wrote:
>>> Hello JC,
>>> as promised here is my
>>> - ceph.conf (I have done a "diff" on all involved server - all using the same ceph.conf) = ceph_conf.txt
>>> - ceph pg 0.223 query = ceph_pg_0223_query_20161236.txt
>>> - ceph -s = ceph_s.txt
>>> - ceph df = ceph_df.txt
>>> - ceph osd df = ceph_osd_df.txt
>>> - ceph osd dump | grep pool = ceph_osd_dump_pool.txt
>>> - ceph osd crush rule dump = ceph_osd_crush_rule_dump.txt
>>> as attached txt files.
>>> I have done again a "ceph pg deep-scrub 0.223" before I have created the files above. The issue still exists today ~ 12:24 on ... :*(
>>> The deep-scrub on this pg has taken ~14 minutes:
>>> - 2016-08-26 12:24:01.463411 osd.9 172.16.0.11:6808/29391 1777 : cluster [INF] 0.223 deep-scrub starts
>>> - 2016-08-26 12:38:07.201726 osd.9 172.16.0.11:6808/29391 2485 : cluster [INF] 0.223 deep-scrub ok
>>> Ceph: version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>> OS: Ubuntu 16.04 LTS (Linux osdserver1 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux)
>>>> As a remark, assuming the size parameter of the rbd pool is set to 3, the number of PGs in your cluster should be higher
>>> I know I could increase this to 2048 (with 30 OSDs). But perhaps we will create further Pools so I did not want to set this to high for this pool because it is not possible to decrease the pg for the pool.
>>> Furthermore if I would change this now and the issue is gone, we would not know what the cause was... :)
>>> When you need further informations, please do not hesitate to ask - I will provides this as soon as possible.
>>> Please keep in mind that I have 1 additional disk on each OSD-Node (3 disk in sum) which I can add to the cluster, so that the acting set for this pg could change.
>>> I did removed this before to force other OSDs to become the acting set for pg 0.223.
>>> Thank you, your help is very appreciated!
>>> - Mehmet
>>> Am 2016-08-25 13:58, schrieb ceph@xxxxxxxxxx:
>>>> Hey JC,
>>>> Thank you very much for your mail!
>>>> I will provide the Informations tomorrow when i am at work again.
>>>> Hope that we will find a solution :)
>>>> - Mehmet
>>>> Am 24. August 2016 16:58:58 MESZ, schrieb LOPEZ Jean-Charles
>>>> <jelopez@xxxxxxxxxx>:
>>>>> Hi Mehmet,
>>>>> I’m just seeing your message and read the thread going with it.
>>>>> Can you please provide me with a copy of the ceph.conf file on the
>>>>> MON and OSD side assuming it’s identical and if the ceph.conf file
>>>>> is different on the client side (the VM side) can you please provide
>>>>> me with a copy of it.
>>>>> Can you also provide me as attached txt files with
>>>>> output of your pg query of the pg 0.223?
>>>>> output of ceph -s
>>>>> output of ceph df
>>>>> output of ceph osd df
>>>>> output of ceph osd dump | grep pool
>>>>> output of ceph osd crush rule dump
>>>>> Thank you and I’ll see if I can get something to ease your pain.
>>>>> As a remark, assuming the size parameter of the rbd pool is set to
>>>>> 3, the number of PGs in your cluster should be higher
>>>>> If we manage to move forward and get it fixed, we will repost to the
>>>>> mailing list the changes we made to your configuration.
>>>>> Regards
>>>>> JC
>>>>> On Aug 24, 2016, at 06:41, Mehmet <ceph@xxxxxxxxxx> wrote:
>>>>> Hello Guys,
>>>>> the issue still exists :(
>>>>> If we run a "ceph pg deep-scrub 0.223" nearly all VMs stop for a
>>>>> while (blocked requests).
>>>>> - we already replaced the OSDs (SAS Disks - journal on NVMe)
>>>>> - Removed OSDs so that acting set for pg 0.223 has changed
>>>>> - checked the filesystem on the acting OSDs
>>>>> - changed the tunables back from jewel to default
>>>>> - changed the tunables again to jewel from default
>>>>> - done a deep-scrub on the hole OSDs (ceph osd deep-scrub osd.<id>)
>>>>> - only when a deeph-scrub on pg 0.223 runs we get blocked requests
>>>>> The deep-scrub on pg 0.223 took always 13-15 Min. to finish. It
>>>>> does not matter which OSDs are in the acting set for this pg.
>>>>> So, i dont have any ideas what coul d be the issue for this.
>>>>> As long as "ceph osd set nodeep-scrub" is set - so that no
>>>>> deep-scrub on 0.223 is running - the cluster is fine!
>>>>> Could this be a bug?
>>>>> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>>>> Kernel: 4.4.0-31-generic #50-Ubuntu
>>>>> Any ideas?
>>>>> - Mehmet
>>>>> Am 2016-08-02 17:57, schrieb c:
>>>>> Am 2016-08-02 13:30, schrieb c:
>>>>> Hello Guys,
>>>>> this time without the original acting-set osd.4, 16 and 28. The
>>>>> issue
>>>>> still exists...
>>>>> [...]
>>>>> For the record, this ONLY happens with this PG and no others that
>>>>> share
>>>>> the same OSDs, right?
>>>>> Yes, right.
>>>> [...]
>>>>> When doing the deep-scrub, monitor (atop, etc) all 3 nodes and
>>>>> see if a
>>>>> particular OSD (HDD) stands out, as I would expect it to.
>>>>> Now I logged all disks via atop each 2 seconds while the deep-scrub
>>>>> was running ( atop -w osdXX_atop 2 ).
>>>>> As you expected all disks was 100% busy - with constant 150MB
>>>>> (osd.4), 130MB (osd.28) and 170MB (osd.16)...
>>>>> - osd.4 (/dev/sdf) http://slexy.org/view/s21emd2u6j [1] [1]
>>>>> - osd.16 (/dev/sdm): http://slexy.org/view/s20vukWz5E [2] [2]
>>>>> - osd.28 (/dev/sdh): http://slexy.org/view/s20YX0lzZY [3] [3]
>>>>> [...]
>>>>> But what is causing this? A deep-scrub on all other disks - same
>>>>> model and ordered at the same time - seems to not have this issue.
>>>> [...]
>>>>> Next week, I will do this
>>>>> 1.1 Remove osd.4 completely from Ceph - again (the actual primary
>>>>> for PG 0.223)
>>>> osd.4 is now removed completely.
>>>> The Primary PG is now on "osd.9"
>>>> # ceph pg map 0.223
>>>> osdmap e8671 pg 0.223 (0.223) -> up [9,16,28] acting [9,16,28]
>>>>> 1.2 xfs_repair -n /dev/sdf1 (osd.4): to see possible error
>>>> xfs_repair did not find/show any error
>>>>> 1.3 ceph pg deep-scrub 0.223
>>>>> - Log with " ceph tell osd.4,16,28 injectargs "--debug_osd 5/5"
>>>> Because now osd.9 is the Primary PG i have set the debug_osd on this
>>>> too:
>>>> ceph tell osd.9 injectargs "--debug_osd 5/5"
>>>> and run the deep-scrub on 0.223 (and againg nearly all of my VMs stop
>>>> working for a while)
>>>> Start @ 15:33:27
>>>> End @ 15:48:31
>>>> The "ceph.log"
>>>> - http://slexy.org/view/s2WbdApDLz [5]
>>>> The related LogFiles (OSDs 9,16 and 28) and the LogFile via atop for
>>>> the osds
>>>> LogFile - osd.9 (/dev/sdk)
>>>> - ceph-osd.9.log: http://slexy.org/view/s2kXeLMQyw [6]
>>>> - atop Log: http://slexy.org/view/s21wJG2qr8 [7]
>>>> LogFile - osd.16 (/dev/sdh)
>>>> - ceph-osd.16.log: http://slexy.org/view/s20D6WhD4d [8]
>>>> - atop Log: http://slexy.org/view/s2iMjer8rC [9]
>>>> LogFile - osd.28 (/dev/sdm)
>>>> - ceph-osd.28.log: http://slexy.org/view/s21dmXoEo7 [10]
>>>> - atop log: http://slexy.org/view/s2gJqzu3uG [11]
>>>>> 2.1 Remove osd.16 completely from Ceph
>>>> osd.16 is now removed completely - now replaced with osd.17 witihin
>>>> the acting set.
>>>> # ceph pg map 0.223
>>>> osdmap e9017 pg 0.223 (0.223) -> up [9,17,28] acting [9,17,28]
>>>>> 2.2 xfs_repair -n /dev/sdh1
>>>> xfs_repair did not find/show any error
>>>>>> 2.3 ceph pg deep-scrub 0.223
>>>>>> - Log with " ceph tell osd.9,17,28 injectargs "--debug_osd 5/5"
>>>> and run the deep-scrub on 0.223 (and againg nearly all of my VMs stop
>>>> working for a while)
>>>> Start @ 2016-08-02 10:02:44
>>>> End @ 2016-08-02 10:17:22
>>>> The "Ceph.log": http://slexy.org/view/s2ED5LvuV2 [12]
>>>> LogFile - osd.9 (/dev/sdk)
>>>> - ceph-osd.9.log: http://slexy.org/view/s21z9JmwSu [13]
>>>> - atop Log: http://slexy.org/view/s20XjFZFEL [14]
>>>> LogFile - osd.17 (/dev/sdi)
>>>> - ceph-osd.17.log: http://slexy.org/view/s202fpcZS9 [15]
>>>> - atop Log: http://slexy.org/view/s2TxeR1JSz [16]
>>>> LogFile - osd.28 (/dev/sdm)
>>>> - ceph-osd.28.log: http://slexy.org/view/s2eCUyC7xV [17]
>>>> - atop log: http://slexy.org/view/s21AfebBqK [18]
>>>>> 3.1 Remove osd.28 completely from Ceph
>>>> Now osd.28 is also removed completely from Ceph - now replaced with
>>>> osd.23
>>>> # ceph pg map 0.223
>>>> osdmap e9363 pg 0.223 (0.223) -> up [9,17,23] acting [9,17,23]
>>>>> 3.2 xfs_repair -n /dev/sdm1
>>>> As expected: xfs_repair did not find/show any error
>>>>> 3.3 ceph pg deep-scrub 0.223
>>>>> - Log with " ceph tell osd.9,17,23 injectargs "--debug_osd 5/5"
>>>> ... againg nearly all of my VMs stop working for a while...
>>>> Now are all "original" OSDs (4,16,28) removed which was in the
>>>> acting-set when i wrote my first eMail to this mailinglist. But the
>>>> issue still exists with different OSDs (9,17,23) as the acting-set
>>>> while the questionable PG 0.223 is still the same!
>>>> In suspicion that the "tunable" could be the cause, i have now
>>>> changed
>>>> this back to "default" via " ceph osd crush tunables default ".
>>>> This will take a whille... then i will do " ceph pg deep-scrub 0.223
>>>> "
>>>> again (without osds 4,16,28)...
>>>> Really, i do not know whats going on here.
>>>> Ceph finished its recovering to "default" tunables but the issue
>>>> still
>>>> exists!:*(
>>>> The acting set has changed again
>>>> # ceph pg map 0.223
>>>> osdmap e11230 pg 0.223 (0.223) -> up [9,11,20] acting [9,11,20]
>>>> But when i start " ceph pg deep-scrub 0.223 ", again nearly all of my
>>>> VMs stop working for a while!
>>>> Does any one have an idea where i should have a look to find the
>>>> cause for this?
>>>> It seems that everytime the Primary OSD from the acting set of PG
>>>> 0.223 (*4*,16,28; *9*,17,23 or *9*,11,20) leads to "currently waiting
>>>> for subops from 9,X" and the deep-scrub takes always nearly 15
>>>> minutes
>>>> to finish.
>>>> My output from " ceph pg 0.223 query "
>>>> - http://slexy.org/view/s21d6qUqnV [19]
>>>> Mehmet
>>>>> For the records: Although nearly all disks are busy i have no
>>>>> slow/blocked requests and i am watching the logfiles for nearly 20
>>>>> minutes now...
>>>>> Your help is realy appreciated!
>>>>> - Mehmet
>>>>> -------------------------
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [4]
>>>> -------------------------
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [4]
>>>> -------------------------
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [4]
>>>> JC Lopez
>>>> S. Technical Instructor, Global Storage Consulting Practice
>>>> Red Hat, Inc.
>>>> jelopez@xxxxxxxxxx
>>>> +1 408-680-6959
>>>> Links:
>>>> ------
>>>> [1] http://slexy.org/view/s21emd2u6j
>>>> [2] http://slexy.org/view/s20vukWz5E
>>>> [3] http://slexy.org/view/s20YX0lzZY
>>>> [4] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> [5] http://slexy.org/view/s2WbdApDLz
>>>> [6] http://slexy.org/view/s2kXeLMQyw
>>>> [7] http://slexy.org/view/s21wJG2qr8
>>>> [8] http://slexy.org/view/s20D6WhD4d
>>>> [9] http://slexy.org/view/s2iMjer8rC
>>>> [10] http://slexy.org/view/s21dmXoEo7
>>>> [11] http://slexy.org/view/s2gJqzu3uG
>>>> [12] http://slexy.org/view/s2ED5LvuV2
>>>> [13] http://slexy.org/view/s21z9JmwSu
>>>> [14] http://slexy.org/view/s20XjFZFEL
>>>> [15] http://slexy.org/view/s202fpcZS9
>>>> [16] http://slexy.org/view/s2TxeR1JSz
>>>> [17] http://slexy.org/view/s2eCUyC7xV
>>>> [18] http://slexy.org/view/s21AfebBqK
>>>> [19] http://slexy.org/view/s21d6qUqnV
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> <ceph_conf.txt><ceph_df.txt><ceph_osd_crush_rule_dump.txt><ceph_osd_df.txt><ceph_osd_dump_pool.txt><ceph_pg_0223_query_20161236.txt><ceph_s.txt>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux