Hello Paul,

Thanks for your help. The reason I ran this in my test/dev environment is to prepare for my production cluster.

If I set nodown, what will happen when clients read/write to an OSD that would previously have been marked down? How can I avoid problems there, and is there any documentation I can refer to? Thanks!
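For reference, a minimal sketch of how the nodown flag could be wrapped around this kind of test (the surrounding workflow is an assumption on my part, not something stated in the thread). The main caveat: while nodown is set, an OSD that is genuinely unresponsive is still reported as up, so client reads/writes that map to it can block until the OSD catches up or the flag is cleared.

  # keep the monitors from marking slow OSDs down during the test
  ceph osd set nodown

  # ... run the pgp_num change and let recovery finish ...

  # re-enable normal down-marking afterwards
  ceph osd unset nodown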
From: Paul Emmerich
Date: 2019-06-26 19:31
CC: ceph-users
Subject: Re: osd be marked down when recovering

Looks like it's overloaded and runs into a timeout. For a test/dev environment: try to set the nodown flag for this experiment if you just want to ignore these timeouts completely.

Paul

--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

Hi, all:

I started a Ceph cluster on one machine in development mode to estimate how long recovery takes after increasing pgp_num. All daemons run on that single machine.

CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Memory: 377 GB
OS: CentOS Linux release 7.6.1810
Ceph version: hammer

I built Ceph according to http://docs.ceph.com/docs/hammer/dev/quick_guide/. ceph -s shows:

    cluster 15ec2f3f-86e5-46bc-bf98-4b35841ee6a5
     health HEALTH_WARN
            pool rbd pg_num 512 > pgp_num 256
     monmap e1: 1 mons at {a=172.30.250.25:6789/0}
            election epoch 2, quorum 0 a
     osdmap e88: 30 osds: 30 up, 30 in
      pgmap v829: 512 pgs, 1 pools, 57812 MB data, 14454 objects
            5691 GB used, 27791 GB / 33483 GB avail
                 512 active+clean

and ceph osd tree [3].

Recovery started after I increased pgp_num. ceph -w reports that some OSDs are down, but the recovery keeps running. All OSD and mon configuration items are at their defaults [1].

Some of the messages from ceph -w [2] are below:

2019-06-26 15:03:21.839750 mon.0 [INF] pgmap v842: 512 pgs: 127 active+degraded, 84 activating+degraded, 256 active+clean, 45 active+recovering+degraded; 57812 MB data, 5714 GB used, 27769 GB / 33483 GB avail; 22200/43362 objects degraded (51.197%); 50789 kB/s, 12 objects/s recovering
2019-06-26 15:03:21.840884 mon.0 [INF] osd.1 172.30.250.25:6804/22500 failed (3 reports from 3 peers after 24.867116 >= grace 20.000000)
2019-06-26 15:03:21.841459 mon.0 [INF] osd.9 172.30.250.25:6836/25078 failed (3 reports from 3 peers after 24.867645 >= grace 20.000000)
2019-06-26 15:03:21.841709 mon.0 [INF] osd.0 172.30.250.25:6800/22260 failed (3 reports from 3 peers after 24.846423 >= grace 20.000000)
2019-06-26 15:03:21.842286 mon.0 [INF] osd.13 172.30.250.25:6852/26651 failed (3 reports from 3 peers after 24.846896 >= grace 20.000000)
2019-06-26 15:03:21.842607 mon.0 [INF] osd.5 172.30.250.25:6820/23661 failed (3 reports from 3 peers after 24.804869 >= grace 20.000000)
2019-06-26 15:03:21.842938 mon.0 [INF] osd.10 172.30.250.25:6840/25490 failed (3 reports from 3 peers after 24.805155 >= grace 20.000000)
2019-06-26 15:03:21.843134 mon.0 [INF] osd.12 172.30.250.25:6848/26277 failed (3 reports from 3 peers after 24.805329 >= grace 20.000000)
2019-06-26 15:03:21.843591 mon.0 [INF] osd.8 172.30.250.25:6832/24722 failed (3 reports from 3 peers after 24.805843 >= grace 20.000000)
2019-06-26 15:03:21.849664 mon.0 [INF] osd.21 172.30.250.25:6884/29762 failed (3 reports from 3 peers after 23.497080 >= grace 20.000000)
2019-06-26 15:03:21.862729 mon.0 [INF] osd.14 172.30.250.25:6856/27025 failed (3 reports from 3 peers after 23.510172 >= grace 20.000000)
2019-06-26 15:03:21.864222 mon.0 [INF] osdmap e91: 30 osds: 29 up, 30 in
2019-06-26 15:03:20.336758 osd.11 [WRN] map e91 wrongly marked me down
2019-06-26 15:03:23.408659 mon.0 [INF] pgmap v843: 512 pgs: 8 stale+activating+degraded, 8 stale+active+clean, 161 active+degraded, 2 stale+active+recovering+degraded, 33 activating+degraded, 248 active+clean, 45 active+recovering+degraded, 7 stale+active+degraded; 57812 MB data, 5730 GB used, 27752 GB / 33483 GB avail; 27317/43362 objects degraded (62.998%); 61309 kB/s, 14 objects/s recovering
2019-06-26 15:03:27.538229 mon.0 [INF] osd.18 172.30.250.25:6872/28632 failed (3 reports from 3 peers after 23.180489 >= grace 20.000000)
2019-06-26 15:03:27.539416 mon.0 [INF] osd.7 172.30.250.25:6828/24366 failed (3 reports from 3 peers after 21.900054 >= grace 20.000000)
2019-06-26 15:03:27.541831 mon.0 [INF] osdmap e92: 30 osds: 19 up, 30 in
2019-06-26 15:03:32.748179 mon.0 [INF] osdmap e93: 30 osds: 17 up, 30 in
2019-06-26 15:03:33.678682 mon.0 [INF] pgmap v845: 512 pgs: 17 stale+activating+degraded, 95 stale+active+clean, 55 active+degraded, 13 peering, 18 stale+active+recovering+degraded, 20 activating+degraded, 155 active+clean, 22 active+recovery_wait+degraded, 48 active+recovering+degraded, 69 stale+active+degraded; 57812 MB data, 5734 GB used, 27748 GB / 33483 GB avail; 26979/43362 objects degraded (62.218%); 11510 kB/s, 2 objects/s recovering
2019-06-26 15:03:33.775701 osd.1 [WRN] map e92 wrongly marked me down

Has anyone got any thoughts on what might have happened, or tips on how to dig further into this?
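For completeness, the step that kicks off this recovery is presumably the pgp_num bump implied by the HEALTH_WARN above (pool rbd pg_num 512 > pgp_num 256), and the "grace 20.000000" in the failure messages is the default osd_heartbeat_grace of 20 seconds. A sketch of both, where the exact values and the injectargs approach are my assumptions rather than anything quoted from the original mail:

  # presumed trigger: let placement catch up with the earlier pg_num split
  ceph osd pool set rbd pgp_num 512

  # assumption: temporarily raise the heartbeat grace at runtime so briefly
  # overloaded OSDs are not reported as failed; to take full effect it may
  # also need to be set on the monitor side / in ceph.conf
  ceph tell osd.* injectargs '--osd_heartbeat_grace 60'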
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com