I have a 3-node Ceph cluster with 6 disks in each node.
I upgraded from Bobtail 0.56.3 to 0.56.4 last night.
Before I started the upgrade, ceph status reported HEALTH_OK.
After upgrading and restarting the first node, the status ended up at
HEALTH_WARN 133 pgs stale; 133 pgs stuck stale
After running ceph health detail, I queried a few of the stuck pgs at random. Every one returned:
# ceph pg 3.8 query
pgid currently maps to no osd
I decided to continue with the upgrade. After upgrading the second
node there were 200 stuck pgs in total, and after the third, 300.
The cluster is now at 0.56.4 but still reports 300 pgs stuck stale after
12 hours.
# ceph status
health HEALTH_WARN 300 pgs stale; 300 pgs stuck stale
monmap e1: 3 mons at
{a=192.168.6.101:6789/0,b=192.168.6.102:6789/0,c=192.168.6.103:6789/0},
election epoch 8668, quorum 0,1,2 a,b,c
osdmap e976: 18 osds: 18 up, 18 in
pgmap v428986: 5148 pgs: 4848 active+clean, 300 stale+active+clean;
5643 GB data, 11305 GB used, 35831 GB / 47137 GB avail; 0B/s rd,
1136KB/s wr, 146op/s
mdsmap e1: 0/0/1 up
Strangely, the ids of the stuck pgs all start with 3 (i.e. they all belong to pool 3),
e.g.
HEALTH_WARN 300 pgs stale; 300 pgs stuck stale
pg 3.f is stuck stale for 41062.735988, current state
stale+active+clean, last acting [11,13]
pg 3.8 is stuck stale for 46905.375678, current state
stale+active+clean, last acting [21,35]
pg 3.9 is stuck stale for 46905.375680, current state
stale+active+clean, last acting [21,14]
pg 3.a is stuck stale for 46905.375681, current state
stale+active+clean, last acting [21,24]
pg 3.b is stuck stale for 46905.375682, current state
stale+active+clean, last acting [21,33]
pg 3.4 is stuck stale for 46905.375683, current state
stale+active+clean, last acting [21,33]
pg 3.5 is stuck stale for 46905.375682, current state
stale+active+clean, last acting [21,34]
pg 3.6 is stuck stale for 46905.375682, current state
stale+active+clean, last acting [21,35]
pg 3.7 is stuck stale for 46905.375680, current state
stale+active+clean, last acting [21,13]
pg 3.0 is stuck stale for 46905.375681, current state
stale+active+clean, last acting [20,32]
pg 3.1 is stuck stale for 46905.375683, current state
stale+active+clean, last acting [20,35]
pg 3.2 is stuck stale for 41965.928295, current state
stale+active+clean, last acting [31,13]
pg 3.3 is stuck stale for 46905.375685, current state
stale+active+clean, last acting [20,34]
pg 3.128 is stuck stale for 41965.928924, current state
stale+active+clean, last acting [31,22]
pg 3.129 is stuck stale for 41062.736776, current state
stale+active+clean, last acting [11,32]
pg 3.12a is stuck stale for 41062.736779, current state
stale+active+clean, last acting [10,34]
pg 3.12b is stuck stale for 46905.376313, current state
stale+active+clean, last acting [21,15]
pg 3.124 is stuck stale for 46905.376315, current state
stale+active+clean, last acting [21,14]
pg 3.125 is stuck stale for 41062.736787, current state
stale+active+clean, last acting [11,34]
pg 3.126 is stuck stale for 41062.736788, current state
stale+active+clean, last acting [10,15]
pg 3.127 is stuck stale for 41965.928942, current state
stale+active+clean, last acting [31,35]
pg 3.120 is stuck stale for 41965.928944, current state
stale+active+clean, last acting [30,35]
pg 3.121 is stuck stale for 41062.736795, current state
stale+active+clean, last acting [10,33]
pg 3.122 is stuck stale for 41062.736796, current state
stale+active+clean, last acting [10,12]
pg 3.123 is stuck stale for 41965.928918, current state
stale+active+clean, last acting [30,13]
pg 3.11c is stuck stale for 41965.928921, current state
stale+active+clean, last acting [30,33]
pg 3.11d is stuck stale for 41965.928921, current state
stale+active+clean, last acting [30,24]
pg 3.11e is stuck stale for 46905.376347, current state
stale+active+clean, last acting [21,32]
pg 3.11f is stuck stale for 41965.928927, current state
stale+active+clean, last acting [31,33]
pg 3.118 is stuck stale for 41062.736804, current state
stale+active+clean, last acting [10,14]
pg 3.119 is stuck stale for 41062.736804, current state
stale+active+clean, last acting [10,15]
etc
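To check that observation against the full output rather than by eye, here is a minimal Python sketch that tallies the stuck pgs by pool id and by the first OSD in the last acting set (normally the primary). The name summarize_stuck and the three-entry sample are illustrative only and not part of any ceph tool. The regex assumes the line format shown above and tolerates the line wrapping added by the mail client.

```python
import re
from collections import Counter

# One entry of `ceph health detail`; \s+ also matches the wrapped line breaks.
PG_LINE = re.compile(
    r"pg\s+(?P<pool>\d+)\.(?P<seed>[0-9a-f]+)\s+is\s+stuck\s+stale\s+for\s+"
    r"(?P<secs>[\d.]+),\s+current\s+state\s+(?P<state>\S+),\s+"
    r"last\s+acting\s+\[(?P<acting>[\d,]+)\]"
)


def summarize_stuck(text):
    """Return (pgs per pool, pgs per first acting OSD, stuck durations in seconds)."""
    pools = Counter()
    primaries = Counter()
    durations = []
    for m in PG_LINE.finditer(text):
        pools[int(m.group("pool"))] += 1
        # Assumption: the first OSD listed in the acting set is the primary.
        primaries[int(m.group("acting").split(",")[0])] += 1
        durations.append(float(m.group("secs")))
    return pools, primaries, durations


# Three entries copied from the output above, wrapping included.
sample = """\
pg 3.f is stuck stale for 41062.735988, current state
stale+active+clean, last acting [11,13]
pg 3.8 is stuck stale for 46905.375678, current state
stale+active+clean, last acting [21,35]
pg 3.2 is stuck stale for 41965.928295, current state
stale+active+clean, last acting [31,13]
"""

pools, primaries, durations = summarize_stuck(sample)
print("stuck pgs per pool:", dict(pools))
print("stuck pgs per first acting osd:", dict(primaries))
print("stuck for %.1f to %.1f hours" % (min(durations) / 3600, max(durations) / 3600))
```

Feeding it the complete ceph health detail output instead of the sample should show whether every stuck pg really is in pool 3 and how they are spread across OSDs.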
[root@ceph1 ~]# ceph pg dump_stuck stale
ok
pg_stat objects mip degr unf bytes log disklog
state state_stamp v reported up acting last_scrub
scrub_stamp last_deep_scrub deep_scrub_stamp
3.f 25 0 0 0 104857600 8162 8162
stale+active+clean 2013-03-21 10:08:15.888399 57'53 46'1250
[11,13] [11,13] 57'53 2013-03-21 10:08:15.888347 0'0
2013-03-20 10:08:04.434172
3.8 21 0 0 0 88080384 6314 6314
stale+active+clean 2013-03-21 16:09:05.501311 57'41 67'1544
[21,35] [21,35] 57'41 2013-03-21 11:56:20.557416 0'0
2013-03-20 10:08:10.204040
3.9 19 0 0 0 79691776 5852 5852
stale+active+clean 2013-03-21 16:09:05.769642 50'38 67'1343
[21,14] [21,14] 50'38 2013-03-21 11:55:35.535342 0'0
2013-03-20 10:07:53.825743
3.a 17 0 0 0 71303168 2772 2772
stale+active+clean 2013-03-21 16:09:05.530051 50'18 67'888
[21,24] [21,24] 50'18 2013-03-21 11:56:21.547257 0'0
2013-03-20 10:08:12.203190
3.b 19 0 0 0 79691776 14800 14800
stale+active+clean 2013-03-21 16:09:05.827664 50'96 67'1869
[21,33] [21,33] 50'96 2013-03-21 11:55:53.535469 0'0
2013-03-20 10:07:57.442050
3.4 22 0 0 0 92274688 11258 11258
stale+active+clean 2013-03-21 16:09:05.828011 57'73 67'1005
[21,33] [21,33] 57'73 2013-03-21 11:55:38.541522 0'0
2013-03-20 10:07:55.441505
3.5 13 0 0 0 54525952 9256 9256
stale+active+clean 2013-03-21 16:09:05.885105 57'60 67'1157
[21,34] [21,34] 57'60 2013-03-21 11:55:39.544341 0'0
2013-03-20 10:07:56.169993
3.6 19 0 0 0 79691776 4774 4774
stale+active+clean 2013-03-21 16:09:05.501393 57'31 67'1517
[21,35] [21,35] 57'31 2013-03-21 11:55:54.524017 0'0
2013-03-20 10:08:03.202763
3.7 20 0 0 0 83886080 6622 6622
stale+active+clean 2013-03-21 16:09:06.320045 50'43 67'1537
[21,13] [21,13] 50'43 2013-03-21 11:56:13.529507 0'0
2013-03-20 10:08:05.203137
3.0 26 0 0 0 109051904 6930 6930
stale+active+clean 2013-03-21 11:55:53.298871 57'45 46'1406
[20,32] [20,32] 57'45 2013-03-21 11:55:53.298830 0'0
2013-03-20 10:08:00.790042
3.1 14 0 0 0 58720256 2772 2772
stale+active+clean 2013-03-21 11:55:38.303601 57'18 46'1571
[20,35] [20,35] 57'18 2013-03-21 11:55:38.303561 0'0
2013-03-20 10:07:54.482538
3.2 18 0 0 0 75497472 6808 6808
stale+active+clean 2013-03-21 11:15:28.011664 57'44 46'1114
[31,13] [31,13] 57'44 2013-03-21 11:15:28.011635 0'0
2013-03-20 10:08:04.857520
3.3 18 0 0 0 75497472 4158 4158
stale+active+clean 2013-03-21 11:55:37.330826 57'27 46'1693
[20,34] [20,34] 57'27 2013-03-21 11:55:37.330787 0'0
2013-03-20 10:07:54.190884
3.128 19 0 0 0 79691776 6468 6468
stale+active+clean 2013-03-21 11:48:37.303563 50'42 46'1488
[31,22] [31,22] 50'42 2013-03-21 11:48:37.303477 0'0
2013-03-20 10:10:42.857644
3.129 23 0 0 0 96468992 10164 10164
stale+active+clean 2013-03-21 10:13:02.920146 50'66 46'1007
[11,32] [11,32] 50'66 2013-03-21 10:13:02.920104 0'0
2013-03-20 10:09:38.296236
3.12a 18 0 0 0 75497472 6622 6622
stale+active+clean 2013-03-21 10:13:18.986584 57'43 46'1544
[10,34] [10,34] 57'43 2013-03-21 10:13:18.986545 48'1
2013-03-20 10:10:58.194099
3.12b 19 0 0 0 79691776 6006 6006
stale+active+clean 2013-03-21 16:09:05.500935 57'39 67'1192
[21,15] [21,15] 57'39 2013-03-21 12:14:57.670099 48'1
2013-03-20 10:10:28.233369
3.124 22 0 0 0 92274688 16802 16802
stale+active+clean 2013-03-21 16:09:05.771001 57'109 67'1325
[21,14] [21,14] 57'109 2013-03-21 12:14:55.689713 48'3
2013-03-20 10:10:20.239677
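Because the dump above is wrapped across several lines per pg, the columns are hard to line up by eye. The following is a rough Python sketch that re-flows one wrapped record into named fields using the header printed by ceph pg dump_stuck. The helper parse_record is hypothetical, and it assumes every *_stamp column is printed as two whitespace-separated tokens (date and time), as in the output above.

```python
# Column names copied from the dump_stuck header above.
HEADER = (
    "pg_stat objects mip degr unf bytes log disklog state state_stamp v "
    "reported up acting last_scrub scrub_stamp last_deep_scrub "
    "deep_scrub_stamp"
).split()


def parse_record(text):
    """Map one (possibly line-wrapped) dump_stuck record onto the header columns."""
    tokens = text.split()
    row = {}
    for name in HEADER:
        if name.endswith("_stamp"):
            # Timestamps are printed as "date time", i.e. two tokens.
            row[name] = tokens.pop(0) + " " + tokens.pop(0)
        else:
            row[name] = tokens.pop(0)
    if tokens:
        raise ValueError("unexpected trailing fields: %r" % tokens)
    return row


# First record from the dump above, wrapping included.
record = """\
3.f 25 0 0 0 104857600 8162 8162
stale+active+clean 2013-03-21 10:08:15.888399 57'53 46'1250
[11,13] [11,13] 57'53 2013-03-21 10:08:15.888347 0'0
2013-03-20 10:08:04.434172
"""

row = parse_record(record)
print(row["pg_stat"], row["state"], row["acting"], row["state_stamp"])
```

This only restructures what is already in the dump. It makes it easier to compare state_stamp, reported and acting across the stuck pgs.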
Notice that they have been stuck for some time, yet the cluster reported
HEALTH_OK on 0.56.3.
I have tried scrubbing the pgs, but that does not remove them from the list.
Darryl