Re: Production Ceph :: PG data lost : Cluster PG incomplete, inactive, unclean


 



Not sure whether this is relevant to your setup, but we saw OSDs flapping while rebalancing was going on with roughly ~150 TB of data in a 6-node cluster.

During root-cause analysis we saw continuous packet drops in dmesg, and presumably because of that the OSD heartbeat responses were lost. As a result, OSDs were being wrongly marked down/out.

The packet drops seem to be caused by hitting the nf_conntrack limit (65536 by default, I believe), and for some reason Ceph is exceeding that connection limit.

Forcing nf_conntrack and related modules not to load during boot solved our OSD flapping problem, but we are still unsure why we hit that connection limit.
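If anyone wants to check for the same condition, a rough sketch (assuming the conntrack module is loaded and the standard sysctl paths on your kernel):

# compare the number of tracked connections against the limit
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# the drops show up in the kernel log like this
dmesg | grep "nf_conntrack: table full, dropping packet"

# raising the limit is an alternative workaround to blacklisting the module (the value here is only an example)
sysctl -w net.netfilter.nf_conntrack_max=262144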

 

Thanks & Regards

Somnath

 

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Craig Lewis
Sent: Wednesday, April 01, 2015 5:09 PM
To: Karan Singh
Cc: ceph-users
Subject: Re: [ceph-users] Production Ceph :: PG data lost : Cluster PG incomplete, inactive, unclean

 

Both of those say they want to talk to osd.115.

 

I see from the recovery_state, past_intervals that you have flapping OSDs.  osd.140 will drop out, then come back.  osd.115 will drop out, then come back.  osd.80 will drop out, then come back.

 

So really, you need to solve the OSD flapping.  That will likely resolve the incomplete PGs.

 

Any idea why the OSDs are flapping?  Any errors in ceph-osd.140.log?
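A quick way to see the flapping from the logs (default paths; the exact message wording varies a bit between versions):

# on a monitor host, the cluster log shows OSDs being failed and booting again
grep -E "wrongly marked me down|boot|failed" /var/log/ceph/ceph.log | tail -50

# on the OSD host, missed heartbeats from peers show up like this
grep "heartbeat_check: no reply" /var/log/ceph/ceph-osd.140.log | tail -20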

 

 

 

The very long past_intervals looks like you might be hitting something I saw before.  I was having problems with the suicide timeout: the OSDs failed and restarted so many times that they couldn't apply all of the map changes before they hit the timeout.  Sage gave me some suggestions; give this a try: https://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg18862.html  That process solved my suicide timeouts, with one caveat: when I followed it, I filled up /var/log/ceph/ and the recovery failed, so I had to manually run each OSD in debugging mode until it completed the map update.  Aside from that, I followed the procedure.
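For that caveat: running a single OSD by hand in the foreground with debug logging, pointed at a log location with more free space, looks roughly like this (the OSD id and path are only examples):

# stop the OSD via your init system first, then run it manually until it finishes chewing through the maps
ceph-osd -i 140 -f --debug-osd 20 --debug-ms 1 --log-file /mnt/spare/ceph-osd.140.log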

 

 

That's a symptom though, not the cause.  Once I got the OSDs to stop flapping, it would come back every couple of weeks.  I eventually determined that the real cause was an XFS memory-allocation issue caused by the mkfs options I used:

[osd]
  osd mkfs type = xfs
  osd mkfs options xfs = -l size=1024m -n size=64k -i size=2048 -s size=4096

Changing it to

[osd]
  osd mkfs type = xfs
  osd mkfs options xfs = -s size=4096 

and reformatting all the disks avoided the XFS deadlock.  When free memory got low, OSDs would get marked out.  After a few hours, it got to the point that the OSDs would suicide.
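If you want to check whether you are hitting the same XFS problem, the kernel logs it fairly clearly (exact wording varies by kernel version):

dmesg | grep -i "possible memory allocation deadlock"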

 

 

 

On Wed, Apr 1, 2015 at 12:17 PM, Karan Singh <karan.singh@xxxxxx> wrote:

Any pointers to fix the incomplete PGs would be greatly appreciated.

 

 

I tried the following, with no success:

 

pg scrub
pg deep-scrub
pg repair
osd out, down, rm, in
osd lost
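For reference, the full forms of those commands look something like this (the PG and OSD ids are only examples, and the commands are listed individually, not as a sequence to run in order):

ceph pg scrub 10.70
ceph pg deep-scrub 10.70
ceph pg repair 10.70
ceph osd out 88
ceph osd down 88
ceph osd rm 88
ceph osd in 88
ceph osd lost 88 --yes-i-really-mean-it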

 

 

 

# ceph -s

    cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33

     health HEALTH_WARN 7 pgs down; 20 pgs incomplete; 1 pgs recovering; 20 pgs stuck inactive; 21 pgs stuck unclean; 4 requests are blocked > 32 sec; recovery 201/986658 objects degraded (0.020%); 133/328886 unfound (0.040%)

     monmap e3: 3 mons at {pouta-s01=xx.xx.xx.1:6789/0,pouta-s02=xx.xx.xx.2:6789/0,pouta-s03=xx.xx.xx.3:6789/0}, election epoch 1920, quorum 0,1,2 pouta-s01,pouta-s02,pouta-s03

     osdmap e262813: 239 osds: 239 up, 239 in

      pgmap v588073: 18432 pgs, 13 pools, 2338 GB data, 321 kobjects

            19094 GB used, 849 TB / 868 TB avail

            201/986658 objects degraded (0.020%); 133/328886 unfound (0.040%)

                   7 down+incomplete

               18411 active+clean

                  13 incomplete

                   1 active+recovering

 

 

 

# ceph pg dump_stuck inactive

ok

pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up up_primary acting acting_primar last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp

10.70 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.152179 0'0 262813:163 [213,88,80] 213 [213,88,80] 213 0'0 2015-03-12 17:59:43.275049 0'0 2015-03-09 17:55:58.745662

3.dde 68 66 0 66 552861709 297 297 down+incomplete 2015-04-01 21:21:16.161066 33547'297 262813:230683 [174,5,179] 174 [174,5,179] 174 33547'297 2015-03-12 14:19:15.261595 28522'43 2015-03-11 14:19:13.894538

5.a2 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.145329 0'0 262813:150 [168,182,201] 168 [168,182,201] 168 0'0 2015-03-12 17:58:29.257085 0'0 2015-03-09 17:55:07.684377

13.1b6 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.139062 0'0 262813:2974 [0,176,131] 0 [0,176,131] 0 0'0 2015-03-12 18:00:13.286920 0'0 2015-03-09 17:56:18.715208

7.25b 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.113876 0'0 262813:167 [111,26,108] 111 [111,26,108] 111 27666'16 2015-03-12 17:59:06.357864 2330'3 2015-03-09 17:55:30.754522

5.19 0 0 0 0 0 0 0 down+incomplete 2015-04-01 21:21:16.199712 0'0 262813:27605 [212,43,131] 212 [212,43,131] 212 0'0 2015-03-12 13:51:37.777026 0'0 2015-03-11 13:51:35.406246

3.a2f 68 0 0 0 543686693 302 302 incomplete 2015-04-01 21:21:16.141368 33531'302 262813:3731 [149,224,33] 149 [149,224,33] 149 33531'302 2015-03-12 14:17:43.045627 28564'54 2015-03-11 14:17:40.314189

7.298 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.108523 0'0 262813:166 [221,154,225] 221 [221,154,225] 221 27666'13 2015-03-12 17:59:10.308423 2330'4 2015-03-09 17:55:35.750109

1.1e7 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.192711 0'0 262813:162 [215,232] 215 [215,232] 215 0'0 2015-03-12 17:55:45.203232 0'0 2015-03-09 17:53:49.694822

3.774 79 0 0 0 645136397 339 339 down+incomplete 2015-04-01 21:21:16.207131 33570'339 262813:168986 [162,39,161] 162 [162,39,161] 162 33570'339 2015-03-12 14:49:03.869447 2226'2 2015-03-09 13:46:49.783950

3.7d0 78 0 0 0 609222686 376 376 down+incomplete 2015-04-01 21:21:16.135599 33538'376 262813:185045 [117,118,177] 117 [117,118,177] 117 33538'376 2015-03-12 13:51:03.984454 28394'62 2015-03-11 13:50:58.196288

3.d60 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.158179 0'0 262813:169 [60,56,220] 60 [60,56,220] 60 33552'321 2015-03-12 13:44:43.502907 28356'39 2015-03-11 13:44:41.663482

4.1fc 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.217291 0'0 262813:163 [144,58,153] 144 [144,58,153] 144 0'0 2015-03-12 17:58:19.254170 0'0 2015-03-09 17:54:55.720479

3.e02 72 0 0 0 585105425 304 304 down+incomplete 2015-04-01 21:21:16.099150 33568'304 262813:169744 [15,102,147] 15 [15,102,147] 15 33568'304 2015-03-16 10:04:19.894789 2246'4 2015-03-09 11:43:44.176331

8.1d4 0 0 0 0 0 0 0 down+incomplete 2015-04-01 21:21:16.218644 0'0 262813:21867 [126,43,174] 126 [126,43,174] 126 0'0 2015-03-12 14:34:35.258338 0'0 2015-03-12 14:34:35.258338

4.2f4 0 0 0 0 0 0 0 down+incomplete 2015-04-01 21:21:16.117515 0'0 262813:116150 [181,186,13] 181 [181,186,13] 181 0'0 2015-03-12 14:59:03.529264 0'0 2015-03-09 13:46:40.601301

3.e5a 76 70 0 0 623902741 325 325 incomplete 2015-04-01 21:21:16.043300 33569'325 262813:73426 [97,22,62] 97 [97,22,62] 97 33569'325 2015-03-12 13:58:05.813966 28433'44 2015-03-11 13:57:53.909795

8.3a0 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.056437 0'0 262813:175168 [62,14,224] 62 [62,14,224] 62 0'0 2015-03-12 13:52:44.546418 0'0 2015-03-12 13:52:44.546418

3.24e 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.130831 0'0 262813:165 [39,202,90] 39 [39,202,90] 39 33556'272 2015-03-13 11:44:41.263725 2327'4 2015-03-09 17:54:43.675552

5.f7 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.145298 0'0 262813:153 [54,193,123] 54 [54,193,123] 54 0'0 2015-03-12 17:58:30.257371 0'0 2015-03-09 17:55:11.725629

[root@pouta-s01 ceph]#

 

 

##########  Example 1 : PG 10.70 ###########

 

 

10.70 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.152179 0'0 262813:163 [213,88,80] 213 [213,88,80] 213 0'0 2015-03-12 17:59:43.275049 0'0 2015-03-09 17:55:58.745662

 

 

This is how I found the location of each OSD:

 

[root@pouta-s01 ceph]# ceph osd find 88

 

{ "osd": 88,

  "ip": "10.100.50.3:7079\/916853",

  "crush_location": { "host": "pouta-s03",

      "root": "default”}}

[root@pouta-s01 ceph]#
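To check all three OSDs in the acting set in one go, a small loop works (OSD ids taken from the PG above):

for o in 213 88 80; do ceph osd find $o; done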

 

 

When I manually check the current/<pgid>_head directory, the data is not present (i.e. the data is lost from all the copies):

 

 

[root@pouta-s04 current]# ls -l /var/lib/ceph/osd/ceph-80/current/10.70_head

total 0

[root@pouta-s04 current]#

 

 

On some of the OSDs, the _head directory does not even exist:

 

[root@pouta-s03 ~]# ls -l /var/lib/ceph/osd/ceph-88/current/10.70_head

ls: cannot access /var/lib/ceph/osd/ceph-88/current/10.70_head: No such file or directory

[root@pouta-s03 ~]#

 

[root@pouta-s02 ~]# ls -l /var/lib/ceph/osd/ceph-213/current/10.70_head

total 0

[root@pouta-s02 ~]#

 

 

# ceph pg 10.70 query  --->  http://paste.ubuntu.com/10719840/

 

 

##########  Example 2 : PG 3.7d0 ###########

 

3.7d0 78 0 0 0 609222686 376 376 down+incomplete 2015-04-01 21:21:16.135599 33538'376 262813:185045 [117,118,177] 117 [117,118,177] 117 33538'376 2015-03-12 13:51:03.984454 28394'62 2015-03-11 13:50:58.196288

 

 

[root@pouta-s04 current]# ceph pg map 3.7d0

osdmap e262813 pg 3.7d0 (3.7d0) -> up [117,118,177] acting [117,118,177]

[root@pouta-s04 current]#

 

 

Data is present here, so 1 copy out of 3 is present:

 

[root@pouta-s04 current]# ls -l /var/lib/ceph/osd/ceph-117/current/3.7d0_head/ | wc -l

63

[root@pouta-s04 current]#

 

 

 

[root@pouta-s03 ~]#  ls -l /var/lib/ceph/osd/ceph-118/current/3.7d0_head/

total 0

[root@pouta-s03 ~]#

 

 

[root@pouta-s01 ceph]# ceph osd find 177

{ "osd": 177,

  "ip": "10.100.50.2:7062\/777799",

  "crush_location": { "host": "pouta-s02",

      "root": "default”}}

[root@pouta-s01 ceph]#

 

The directory is not even present here:

 

[root@pouta-s02 ~]#  ls -l /var/lib/ceph/osd/ceph-177/current/3.7d0_head/

ls: cannot access /var/lib/ceph/osd/ceph-177/current/3.7d0_head/: No such file or directory

[root@pouta-s02 ~]#

 

 

# ceph pg 3.7d0 query  --->  http://paste.ubuntu.com/10720107/

 


- Karan -

 

On 20 Mar 2015, at 22:43, Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx> wrote:

 

> osdmap e261536: 239 osds: 239 up, 238 in

 

Why is that last OSD not IN?  The history you need is probably there.

 

Run  ceph pg <pgid> query on some of the stuck PGs.  Look for the recovery_state section.  That should tell you what Ceph needs to complete the recovery.
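For example, to pull out just that section from the JSON output (a sketch, assuming jq is available):

ceph pg 10.70 query | jq '.recovery_state'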

 

 

If you need more help, post the output of a couple pg queries.

 

 

 

On Fri, Mar 20, 2015 at 4:22 AM, Karan Singh <karan.singh@xxxxxx> wrote:

Hello Guys


My Ceph cluster lost data and it is not recovering. This problem occurred while Ceph was performing recovery when one of the nodes was down.
Now all the nodes are up, but Ceph is showing PGs as incomplete, unclean, and recovering.


I have tried several things to recover them: scrub, deep-scrub, pg repair, changing primary affinity and then scrubbing, changing osd_pool_default_size, etc., but no luck.
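For reference, the usual command forms for those last two knobs are shown below; the ids and values are only examples, and note that osd_pool_default_size only affects newly created pools, so existing pools are changed per pool:

ceph osd primary-affinity osd.153 0.5
ceph osd pool set <poolname> size 3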

Could you please advise how to recover these PGs and get back to HEALTH_OK?

# ceph -s
    cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33
     health HEALTH_WARN 19 pgs incomplete; 3 pgs recovering; 20 pgs stuck inactive; 23 pgs stuck unclean; 2 requests are blocked > 32 sec; recovery 531/980676 objects degraded (0.054%); 243/326892 unfound (0.074%)
     monmap e3: 3 mons at {xxx=xxxx:6789/0,xxx=xxxx:6789:6789/0,xxx=xxxx:6789:6789/0}, election epoch 1474, quorum 0,1,2 xx,xx,xx
     osdmap e261536: 239 osds: 239 up, 238 in
      pgmap v415790: 18432 pgs, 13 pools, 2330 GB data, 319 kobjects
            20316 GB used, 844 TB / 864 TB avail
            531/980676 objects degraded (0.054%); 243/326892 unfound (0.074%)
                   1 creating
               18409 active+clean
                   3 active+recovering
                  19 incomplete




# ceph pg dump_stuck unclean
ok
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
10.70 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.534911 0'0 261536:1015 [153,140,80] 153 [153,140,80] 153 0'0 2015-03-12 17:59:43.275049 0'0 2015-03-09 17:55:58.745662
3.dde 68 66 0 66 552861709 297 297 incomplete 2015-03-20 12:19:49.584839 33547'297 261536:228352 [174,5,179] 174 [174,5,179] 174 33547'297 2015-03-12 14:19:15.261595 28522'43 2015-03-11 14:19:13.894538
5.a2 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.560756 0'0 261536:897 [214,191,170] 214 [214,191,170] 214 0'0 2015-03-12 17:58:29.257085 0'0 2015-03-09 17:55:07.684377
13.1b6 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.846253 0'0 261536:1050 [0,176,131] 0 [0,176,131] 0 0'0 2015-03-12 18:00:13.286920 0'0 2015-03-09 17:56:18.715208
7.25b 16 0 0 0 67108864 16 16 incomplete 2015-03-20 12:19:49.639102 27666'16 261536:4777 [194,145,45] 194 [194,145,45] 194 27666'16 2015-03-12 17:59:06.357864 2330'3 2015-03-09 17:55:30.754522
5.19 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.742698 0'0 261536:25410 [212,43,131] 212 [212,43,131] 212 0'0 2015-03-12 13:51:37.777026 0'0 2015-03-11 13:51:35.406246
3.a2f 0 0 0 0 0 0 0 creating 2015-03-20 12:42:15.586372 0'0 0:0 [] -1 [] -1 0'0 0.000000 0'0 0.000000
7.298 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.566966 0'0 261536:900 [187,95,225] 187 [187,95,225] 187 27666'13 2015-03-12 17:59:10.308423 2330'4 2015-03-09 17:55:35.750109
3.a5a 77 87 261 87 623902741 325 325 active+recovering 2015-03-20 10:54:57.443670 33569'325 261536:182464 [150,149,181] 150 [150,149,181] 150 33569'325 2015-03-12 13:58:05.813966 28433'44 2015-03-11 13:57:53.909795
1.1e7 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.610547 0'0 261536:772 [175,182] 175 [175,182] 175 0'0 2015-03-12 17:55:45.203232 0'0 2015-03-09 17:53:49.694822
3.774 79 0 0 0 645136397 339 339 incomplete 2015-03-20 12:19:49.821708 33570'339 261536:166857 [162,39,161] 162 [162,39,161] 162 33570'339 2015-03-12 14:49:03.869447 2226'2 2015-03-09 13:46:49.783950
3.7d0 78 0 0 0 609222686 376 376 incomplete 2015-03-20 12:19:49.534004 33538'376 261536:182810 [117,118,177] 117 [117,118,177] 117 33538'376 2015-03-12 13:51:03.984454 28394'62 2015-03-11 13:50:58.196288
3.d60 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.647196 0'0 261536:833 [154,172,1] 154 [154,172,1] 154 33552'321 2015-03-12 13:44:43.502907 28356'39 2015-03-11 13:44:41.663482
4.1fc 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.610103 0'0 261536:1069 [70,179,58] 70 [70,179,58] 70 0'0 2015-03-12 17:58:19.254170 0'0 2015-03-09 17:54:55.720479
3.e02 72 0 0 0 585105425 304 304 incomplete 2015-03-20 12:19:49.564768 33568'304 261536:167428 [15,102,147] 15 [15,102,147] 15 33568'304 2015-03-16 10:04:19.894789 2246'4 2015-03-09 11:43:44.176331
8.1d4 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.614727 0'0 261536:19611 [126,43,174] 126 [126,43,174] 126 0'0 2015-03-12 14:34:35.258338 0'0 2015-03-12 14:34:35.258338
4.2f4 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.595109 0'0 261536:113791 [181,186,13] 181 [181,186,13] 181 0'0 2015-03-12 14:59:03.529264 0'0 2015-03-09 13:46:40.601301
3.52c 65 23 69 23 543162368 290 290 active+recovering 2015-03-20 10:51:43.664734 33553'290 261536:8431 [212,100,219] 212 [212,100,219] 212 33553'290 2015-03-13 11:44:26.396514 29686'103 2015-03-11 17:18:33.452616
3.e5a 76 70 0 0 623902741 325 325 incomplete 2015-03-20 12:19:49.552071 33569'325 261536:71248 [97,22,62] 97 [97,22,62] 97 33569'325 2015-03-12 13:58:05.813966 28433'44 2015-03-11 13:57:53.909795
8.3a0 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.615728 0'0 261536:173184 [62,14,178] 62 [62,14,178] 62 0'0 2015-03-12 13:52:44.546418 0'0 2015-03-12 13:52:44.546418
3.24e 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.591282 0'0 261536:1026 [103,14,90] 103 [103,14,90] 103 33556'272 2015-03-13 11:44:41.263725 2327'4 2015-03-09 17:54:43.675552
5.f7 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.667823 0'0 261536:853 [73,44,123] 73 [73,44,123] 73 0'0 2015-03-12 17:58:30.257371 0'0 2015-03-09 17:55:11.725629
3.ae8 77 67 201 67 624427024 342 342 active+recovering 2015-03-20 10:50:01.693979 33516'342 261536:149258 [122,144,218] 122 [122,144,218] 122 33516'342 2015-03-12 17:11:01.899062 29638'134 2015-03-11 17:10:59.966372
#


PG data is there on multiple OSDs, but Ceph is not recovering the PG. For example:

# ceph pg map 7.25b
osdmap e261536 pg 7.25b (7.25b) -> up [194,145,45] acting [194,145,45]


# ls -l /var/lib/ceph/osd/ceph-194/current/7.25b_head | wc -l
17

# ls -l /var/lib/ceph/osd/ceph-145/current/7.25b_head | wc -l
0
#

# ls -l /var/lib/ceph/osd/ceph-45/current/7.25b_head | wc -l
17





Some of the PGs are completely lost, i.e. they don't have any data. For example:

# ceph pg map 10.70
osdmap e261536 pg 10.70 (10.70) -> up [153,140,80] acting [153,140,80]


# ls -l /var/lib/ceph/osd/ceph-140/current/10.70_head | wc -l
0

# ls -l /var/lib/ceph/osd/ceph-153/current/10.70_head | wc -l
0

# ls -l /var/lib/ceph/osd/ceph-80/current/10.70_head | wc -l
0



- Karan -

 

 



 

 

 





_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


