Hi Greg (I'm a colleague of Ana),

Thank you for your reply.

On 10/17/2017 11:57 PM, Gregory Farnum wrote:
We also caught these log entries, which indeed point to a clone/snapshot problem:

 -9877> 2017-10-17 17:46:16.044077 7f234db16700 10 log_client will send 2017-10-17 17:46:13.367842 osd.78 osd.78 [XXXX:XXXX:XXXX:XXXX::203]:6880/9116 483 : cluster [ERR] 2.2fc shard 78 missing 2:3f72b543:::rbd_data.332d5a836bcc485.000000000000fcf6:466a7
 -9876> 2017-10-17 17:46:16.044105 7f234db16700 10 log_client will send 2017-10-17 17:46:13.368026 osd.78 osd.78 [XXXX:XXXX:XXXX:XXXX::203]:6880/9116 484 : cluster [ERR] repair 2.2fc 2:3f72b543:::rbd_data.332d5a836bcc485.000000000000fcf6:466a7 is an unexpected clone
 -9868> 2017-10-17 17:46:16.324112 7f2354b24700 10 log_client logged 2017-10-17 17:46:13.367842 osd.78 osd.78 [XXXX:XXXX:XXXX:XXXX::203]:6880/9116 483 : cluster [ERR] 2.2fc shard 78 missing 2:3f72b543:::rbd_data.332d5a836bcc485.000000000000fcf6:466a7
 -9867> 2017-10-17 17:46:16.324128 7f2354b24700 10 log_client logged 2017-10-17 17:46:13.368026 osd.78 osd.78 [XXXX:XXXX:XXXX:XXXX::203]:6880/9116 484 : cluster [ERR] repair 2.2fc 2:3f72b543:::rbd_data.332d5a836bcc485.000000000000fcf6:466a7 is an unexpected clone
   -36> 2017-10-17 17:48:55.771384 7f234930d700 -1 log_channel(cluster) log [ERR] : 2.2fc repair 1 missing, 0 inconsistent objects
   -35> 2017-10-17 17:48:55.771417 7f234930d700 -1 log_channel(cluster) log [ERR] : 2.2fc repair 3 errors, 1 fixed
    -4> 2017-10-17 17:48:56.046071 7f234db16700 10 log_client will send 2017-10-17 17:48:55.771390 osd.78 osd.78 [XXXX:XXXX:XXXX:XXXX::203]:6880/9116 485 : cluster [ERR] 2.2fc repair 1 missing, 0 inconsistent objects
    -3> 2017-10-17 17:48:56.046088 7f234db16700 10 log_client will send 2017-10-17 17:48:55.771419 osd.78 osd.78 [XXXX:XXXX:XXXX:XXXX::203]:6880/9116 486 : cluster [ERR] 2.2fc repair 3 errors, 1 fixed
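In case it is useful, this is roughly what we intend to run tomorrow to get more detail on the inconsistency before touching the OSD again. We have not run these yet, so take it as a sketch; the PG id is the 2.2fc from the log above:

# list the objects the last deep-scrub flagged as inconsistent in this PG
$ rados list-inconsistent-obj 2.2fc --format=json-pretty
# list clone/snapset inconsistencies, which is what the 'unexpected clone' error points at
$ rados list-inconsistent-snapset 2.2fc --format=json-pretty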
We will submit the ticket tomorrow (we are in CEST); we want to have more pairs of eyes on it when we start the OSD again.

After this crash we marked the OSD as out. The cluster rebalanced itself, but unfortunately the same issue appeared on another OSD (same PG). After several crashes of that OSD, it came back up, but now with one PG down. I assume the cluster decided it had 'finished' the ceph pg repair command and removed the 'repair' state, leaving us with a broken PG.

If you have any hints on how we can get the PG online again, we would be very grateful, so we can work on that tomorrow. (A sketch of the steps we are considering follows after the status output below.)

Thanks,
Mart

Some general info about this cluster:
- all OSDs run the same version; the monitors are also all 12.2.1 (Ubuntu Xenial)
- the cluster is a backup cluster and has min_size 1 and replication (size) 2, so only 2 copies
- the cluster was recently upgraded from jewel to luminous (3 weeks ago)
- the cluster was recently upgraded from straw to straw2 (1 week ago)
- it was in HEALTH_OK until this happened
- we use filestore only
- the cluster was originally installed with hammer, then upgraded to infernalis, jewel and now luminous

Health (noup/noout set on purpose while we are trying to recover):

$ ceph -s
  cluster:
    id:     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx
    health: HEALTH_WARN
            noup,noout flag(s) set
            Reduced data availability: 1 pg inactive, 1 pg down
            Degraded data redundancy: 2892/31621143 objects degraded (0.009%), 2 pgs unclean, 1 pg degraded, 1 pg undersized

  services:
    mon: 3 daemons, quorum ds2-mon1,ds2-mon2,ds2-mon3
    mgr: ds2-mon1(active)
    osd: 93 osds: 92 up, 92 in; 1 remapped pgs
         flags noup,noout
    rgw: 1 daemon active

  data:
    pools:   13 pools, 1488 pgs
    objects: 15255k objects, 43485 GB
    usage:   119 TB used, 126 TB / 245 TB avail
    pgs:     0.067% pgs not active
             2892/31621143 objects degraded (0.009%)
             1483 active+clean
             2    active+clean+scrubbing+deep
             1    active+undersized+degraded+remapped+backfilling
             1    active+clean+scrubbing
             1    down

  io:
    client:   340 B/s rd, 14995 B/s wr, 1 op/s rd, 2 op/s wr
    recovery: 9567 kB/s, 2 objects/s

$ ceph health detail
HEALTH_WARN noup,noout flag(s) set; Reduced data availability: 1 pg inactive, 1 pg down; Degraded data redundancy: 2774/31621143 objects degraded (0.009%), 2 pgs unclean, 1 pg degraded, 1 pg undersized
OSDMAP_FLAGS noup,noout flag(s) set
PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg down
    pg 2.2fc is down, acting [69,93]
PG_DEGRADED Degraded data redundancy: 2774/31621143 objects degraded (0.009%), 2 pgs unclean, 1 pg degraded, 1 pg undersized
    pg 2.1e9 is stuck undersized for 23741.295159, current state active+undersized+degraded+remapped+backfilling, last acting [41]
    pg 2.2fc is stuck unclean since forever, current state down, last acting [69,93]
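For completeness, this is the (very much hedged) sequence we are considering for tomorrow, mostly information gathering before we let the OSD rejoin; please shout if any of it is a bad idea:

# see why the PG is down and what peering is blocked on
$ ceph pg 2.2fc query
# confirm which OSDs the PG currently maps to
$ ceph pg map 2.2fc
# only once we are reasonably sure about the state, clear the flags again
$ ceph osd unset noup
$ ceph osd unset noout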