Hi Greg + list,

Sorry to reply to this old'ish thread, but today one of these PGs bit us in the ass.

Running hammer 0.94.2, we are deleting pool 36, and OSDs 30, 171, and 69 all crash when trying to delete pg 36.10d. They all crash with "ENOTEMPTY suggests garbage data in osd data dir" (full log below). There is indeed some "garbage" in there:

# find 36.10d_head/
36.10d_head/
36.10d_head/DIR_D
36.10d_head/DIR_D/DIR_0
36.10d_head/DIR_D/DIR_0/DIR_1
36.10d_head/DIR_D/DIR_0/DIR_1/__head_BD49D10D__24
36.10d_head/DIR_D/DIR_0/DIR_9

Do you have any suggestions for how to get these OSDs back running? We already tried manually moving 36.10d_head to 36.10d_head.bak, but then the OSD crashes for a different reason:

    -1> 2015-07-17 15:07:42.442851 7fe11fc0b800 10 osd.69 92595 pgid 36.10d coll 36.10d_head
     0> 2015-07-17 15:07:42.443925 7fe11fc0b800 -1 osd/PG.cc: In function 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t, ceph::bufferlist*)' thread 7fe11fc0b800 time 2015-07-17 15:07:42.442902
    osd/PG.cc: 2839: FAILED assert(r > 0)

Any clues? (A rough sketch of what we're tempted to try next is at the bottom of this mail.)

Cheers, Dan


Full log of the original crash, from osd.30:

2015-07-17 14:40:54.493935 7f0ba60f4700  0 filestore(/var/lib/ceph/osd/ceph-30) error (39) Directory not empty not handled on operation 0xedd0b88 (18879615.0.1, or op 1, counting from 0)
2015-07-17 14:40:54.494019 7f0ba60f4700  0 filestore(/var/lib/ceph/osd/ceph-30) ENOTEMPTY suggests garbage data in osd data dir
2015-07-17 14:40:54.494021 7f0ba60f4700  0 filestore(/var/lib/ceph/osd/ceph-30) transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "36.10d_head",
            "oid": "10d\/\/head\/\/36"
        },
        {
            "op_num": 1,
            "op_name": "rmcoll",
            "collection": "36.10d_head"
        }
    ]
}

2015-07-17 14:40:54.606399 7f0ba60f4700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f0ba60f4700 time 2015-07-17 14:40:54.502996
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xc16) [0x975a06]
 2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x97d794]
 3: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x2a0) [0x97da50]
 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xaffdc6]
 5: (ThreadPool::WorkThread::entry()+0x10) [0xb01a10]
 6: /lib64/libpthread.so.0() [0x3fbec079d1]
 7: (clone()+0x6d) [0x3fbe8e88fd]


On Wed, Jun 17, 2015 at 11:09 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> On Wed, Jun 17, 2015 at 10:52 AM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>> On Wed, Jun 17, 2015 at 8:56 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>> After upgrading to 0.94.2 yesterday on our test cluster, we've had 3 PGs go inconsistent.
>>>
>>> First, immediately after we updated the OSDs, PG 34.10d went inconsistent:
>>>
>>> 2015-06-16 13:42:19.086170 osd.52 137.138.39.211:6806/926964 2 : cluster [ERR] 34.10d scrub stat mismatch, got 4/5 objects, 0/0 clones, 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 136/136 bytes,0/0 hit_set_archive bytes.
>>>
>>> Second, an hour later, 55.10d went inconsistent:
>>>
>>> 2015-06-16 14:27:58.336550 osd.303 128.142.23.56:6812/879385 10 : cluster [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes.
>>>
>>> Then last night 36.10d suffered the same fate:
>>>
>>> 2015-06-16 23:05:17.857433 osd.30 188.184.18.39:6800/2260103 16 : cluster [ERR] 36.10d deep-scrub stat mismatch, got 5833/5834 objects, 0/0 clones, 5758/5759 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 24126649216/24130843520 bytes,0/0 hit_set_archive bytes.
>>>
>>> In all cases, one object is missing, and in all cases the PG id is 10d. Is this an epic coincidence, or could something else be going on here?
>>
>> I'm betting on something else. What OSDs is each PG mapped to? It looks like each of them is missing one object on some of the OSDs; what are the objects?
>
> 34.10d: [52,202,218]
> 55.10d: [303,231,65]
> 36.10d: [30,171,69]
>
> So no common OSDs. I've already repaired all of these PGs, and the logs have nothing interesting, so I can't say more about the objects.
>
> Cheers, Dan
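
For anyone else who hits scrub stat mismatches like the ones quoted above: nothing exotic is needed for those checks. Looking up which OSDs serve a PG and triggering a repair are just the standard mon commands, e.g. for 36.10d:

# ceph pg map 36.10d
# ceph pg repair 36.10d

(pg map prints the up/acting sets, which is presumably where the [30,171,69]-style lists above come from; pg repair asks the primary to re-scrub the PG and fix the inconsistency.)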
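
And to be concrete about the "any clues?" above, here is the rough, completely untested sketch we're tempted to try next; please shout if it's a bad idea. The assumptions: the hammer packages ship ceph-objectstore-tool, the journal sits at the usual /var/lib/ceph/osd/ceph-$id/journal path, and --op remove can delete a whole PG offline. osd.30 is used as the example; on the OSD where we renamed the directory, 36.10d_head.bak would first be moved back to 36.10d_head so the PG metadata that peek_map_epoch() wants is intact again.

# service ceph stop osd.30
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
      --journal-path /var/lib/ceph/osd/ceph-30/journal \
      --op remove --pgid 36.10d
# service ceph start osd.30

The attraction over more mv/rm surgery is that the tool goes through the normal ObjectStore path and should clean up the PG metadata along with the collection, which hand-moving 36.10d_head clearly does not. On the other hand, it could presumably trip over the same ENOTEMPTY if the leftover DIR_* entries are the real problem, which is why we'd like a second opinion before touching these OSDs again.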