Inconsistent PGs because 0 copies of objects...

Fellow Cephers,

I'm scratching my head on this one. Somehow a bunch of objects were lost in my cluster, which is currently running ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e).

The symptoms are that "ceph -s" reports a bunch of inconsistent PGs:

    cluster 8a2c9e43-9f17-42e0-92fd-88a40152303d
     health HEALTH_ERR 13 pgs inconsistent; 123 scrub errors; mds0: Client sabnzbd:storage failing to respond to cache pressure; noout flag(s) set
     monmap e9: 3 mons at {guinan=10.42.6.48:6789/0,tuvok=10.42.6.33:6789/0,yar=10.42.6.43:6789/0}, election epoch 1252, quorum 0,1,2 tuvok,yar,guinan
     mdsmap e698: 1/1/1 up {0=pulaski=up:active}
     osdmap e41375: 29 osds: 29 up, 29 in
            flags noout
      pgmap v22573849: 1088 pgs, 3 pools, 32175 GB data, 9529 kobjects
            96663 GB used, 41779 GB / 135 TB avail
                1072 active+clean
                   3 active+clean+scrubbing+deep
                  13 active+clean+inconsistent
  client io 1004 kB/s rd, 2 op/s



I say the objects were "lost" because, grepping the logs of the OSDs holding the affected PGs, I see lines like:

2015-05-10 06:27:34.720648 7f2df27fc700  0 filestore(/var/lib/ceph/osd/ceph-11) write couldn't open 0.176_head/adb9ff76/10006ecde46.00000000/head//0: (61) No data available
2015-05-10 15:44:34.723479 7f2df2ffd700 -1 filestore(/var/lib/ceph/osd/ceph-11) error creating 9be4ff76/10006ee7848.00000000/head//0 (/var/lib/ceph/osd/ceph-11/current/0.176_head/DIR_6/DIR_7/DIR_F/DIR_F/10006ee7848.00000000__head_9BE4FF76__0) in index: (61) No data available
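
(In case it matters, this is roughly how I'm matching the inconsistent PGs to their OSDs and pulling those log lines; default log paths assumed.)

    ceph health detail | grep inconsistent                    # lists the 13 inconsistent PGs
    ceph pg map 0.176                                         # shows the up/acting OSD set for a given PG
    grep 'No data available' /var/log/ceph/ceph-osd.11.log    # the errors quoted above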


All the affected PGs are in pool 0, which is the data pool for CephFS. Pool 0 is replicated with size 3 and min_size 2, per "ceph osd dump | head -n 9":
epoch 41375
fsid 8a2c9e43-9f17-42e0-92fd-88a40152303d
created 2014-04-06 21:16:19.449590
modified 2015-05-10 13:57:21.376468
flags noout
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 17399 flags hashpspool crash_replay_interval 45 min_read_recency_for_promote 1 stripe_width 0
pool 1 'metadata' replicated size 4 min_size 3 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 18915 flags hashpspool min_read_recency_for_promote 1 stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool min_read_recency_for_promote 1 stripe_width 0
max_osd 29
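
(The same settings can also be queried directly, e.g.:)

    ceph osd pool get data size
    ceph osd pool get data min_size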


I'm a bit fuzzy on the timeline of when the missing objects started appearing, which is a tad alarming, and I'd appreciate any pointers for getting a better handle on the situation.
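
The only idea I've had so far is to grep the cluster log on one of the mons for the earliest scrub errors (assuming the default log location), roughly:

    zgrep -h 'ERR' /var/log/ceph/ceph.log* | grep -i scrub | sort | head

but I'm not sure that tells me when the objects actually went missing, as opposed to when a scrub first noticed.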

To make matters worse, I'm running CephFS, and a lot of the missing objects are stripe 0 of a file, which leaves me with no idea how to work out which file was affected so that I can delete it and restore it from backups. Pointers here would be useful as well. (My current method for mapping an object back to a CephFS file is to read the xattrs on the 0th stripe object and pick out the path strings, as sketched below, but that obviously falls apart when the 0th object is the one that's missing.)
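
(For reference, the xattr trick looks roughly like this, with <ino> standing in for the hex inode prefix of an object name, e.g. 10006ecde46; I'm assuming the CephFS backtrace is still stored in the 'parent' xattr on these objects:)

    rados -p data listxattr <ino>.00000000                    # check which xattrs the object carries
    rados -p data getxattr <ino>.00000000 parent | strings    # path components show up as printable strings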

Thanks in advance for any suggestions/pointers!

--
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
