I just filed a ticket after trying ceph-objectstore-tool:
http://tracker.ceph.com/issues/12428

On Fri, Jul 17, 2015 at 3:36 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> A bit of progress: rm'ing everything from inside current/36.10d_head/
> actually let the OSD start and continue deleting other PGs.
>
> Cheers, Dan
>
> On Fri, Jul 17, 2015 at 3:26 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> Thanks for the quick reply.
>>
>> We /could/ just wipe these OSDs and start from scratch (the only other
>> pools were 4+2 ec, and recovery already brought us to 100%
>> active+clean).
>>
>> But it'd be good to understand and prevent this kind of crash...
>>
>> Cheers, Dan
>>
>> On Fri, Jul 17, 2015 at 3:18 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>> I think you'll need to use the ceph-objectstore-tool to remove the
>>> PG/data consistently, but I've not done this — David or Sam will need
>>> to chime in.
>>> -Greg
>>>
>>> On Fri, Jul 17, 2015 at 2:15 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>> Hi Greg + list,
>>>>
>>>> Sorry to reply to this old-ish thread, but today one of these PGs bit
>>>> us in the ass.
>>>>
>>>> Running hammer 0.94.2, we are deleting pool 36, and OSDs 30, 171,
>>>> and 69 all crash when trying to delete pg 36.10d. They all crash with
>>>>
>>>> ENOTEMPTY suggests garbage data in osd data dir
>>>>
>>>> (full log below). There is indeed some "garbage" in there:
>>>>
>>>> # find 36.10d_head/
>>>> 36.10d_head/
>>>> 36.10d_head/DIR_D
>>>> 36.10d_head/DIR_D/DIR_0
>>>> 36.10d_head/DIR_D/DIR_0/DIR_1
>>>> 36.10d_head/DIR_D/DIR_0/DIR_1/__head_BD49D10D__24
>>>> 36.10d_head/DIR_D/DIR_0/DIR_9
>>>>
>>>> Do you have any suggestions for getting these OSDs running again? We
>>>> already tried manually moving 36.10d_head to 36.10d_head.bak, but then
>>>> the OSD crashes for a different reason:
>>>>
>>>> -1> 2015-07-17 15:07:42.442851 7fe11fc0b800 10 osd.69 92595 pgid
>>>> 36.10d coll 36.10d_head
>>>> 0> 2015-07-17 15:07:42.443925 7fe11fc0b800 -1 osd/PG.cc: In
>>>> function 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t,
>>>> ceph::bufferlist*)' thread 7fe11fc0b800 time 2015-07-17 15:07:42.442902
>>>> osd/PG.cc: 2839: FAILED assert(r > 0)
>>>>
>>>> Any clues?
>>>>
>>>> Cheers, Dan
>>>>
>>>> 2015-07-17 14:40:54.493935 7f0ba60f4700 0
>>>> filestore(/var/lib/ceph/osd/ceph-30) error (39) Directory not empty
>>>> not handled on operation 0xedd0b88 (18879615.0.1, or op 1, counting
>>>> from 0)
>>>> 2015-07-17 14:40:54.494019 7f0ba60f4700 0
>>>> filestore(/var/lib/ceph/osd/ceph-30) ENOTEMPTY suggests garbage data
>>>> in osd data dir
>>>> 2015-07-17 14:40:54.494021 7f0ba60f4700 0
>>>> filestore(/var/lib/ceph/osd/ceph-30) transaction dump:
>>>> {
>>>>     "ops": [
>>>>         {
>>>>             "op_num": 0,
>>>>             "op_name": "remove",
>>>>             "collection": "36.10d_head",
>>>>             "oid": "10d\/\/head\/\/36"
>>>>         },
>>>>         {
>>>>             "op_num": 1,
>>>>             "op_name": "rmcoll",
>>>>             "collection": "36.10d_head"
>>>>         }
>>>>     ]
>>>> }
>>>>
>>>> 2015-07-17 14:40:54.606399 7f0ba60f4700 -1 os/FileStore.cc: In
>>>> function 'unsigned int
>>>> FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int,
>>>> ThreadPool::TPHandle*)' thread 7f0ba60f4700 time 2015-07-17 14:40:54.502996
>>>> os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")
>>>>
>>>> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
>>>> 1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned
>>>> long, int, ThreadPool::TPHandle*)+0xc16) [0x975a06]
>>>> 2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*,
>>>> std::allocator<ObjectStore::Transaction*> >&, unsigned long,
>>>> ThreadPool::TPHandle*)+0x64) [0x97d794]
>>>> 3: (FileStore::_do_op(FileStore::OpSequencer*,
>>>> ThreadPool::TPHandle&)+0x2a0) [0x97da50]
>>>> 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xaffdc6]
>>>> 5: (ThreadPool::WorkThread::entry()+0x10) [0xb01a10]
>>>> 6: /lib64/libpthread.so.0() [0x3fbec079d1]
>>>> 7: (clone()+0x6d) [0x3fbe8e88fd]
>>>>
>>>> On Wed, Jun 17, 2015 at 11:09 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>>> On Wed, Jun 17, 2015 at 10:52 AM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>>> On Wed, Jun 17, 2015 at 8:56 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> After upgrading to 0.94.2 yesterday on our test cluster, we've had 3
>>>>>>> PGs go inconsistent.
>>>>>>>
>>>>>>> First, immediately after we updated the OSDs, PG 34.10d went inconsistent:
>>>>>>>
>>>>>>> 2015-06-16 13:42:19.086170 osd.52 137.138.39.211:6806/926964 2 :
>>>>>>> cluster [ERR] 34.10d scrub stat mismatch, got 4/5 objects, 0/0 clones,
>>>>>>> 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 136/136
>>>>>>> bytes,0/0 hit_set_archive bytes.
>>>>>>>
>>>>>>> Second, an hour later, 55.10d went inconsistent:
>>>>>>>
>>>>>>> 2015-06-16 14:27:58.336550 osd.303 128.142.23.56:6812/879385 10 :
>>>>>>> cluster [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0
>>>>>>> clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0
>>>>>>> bytes,0/0 hit_set_archive bytes.
>>>>>>>
>>>>>>> Then last night 36.10d suffered the same fate:
>>>>>>>
>>>>>>> 2015-06-16 23:05:17.857433 osd.30 188.184.18.39:6800/2260103 16 :
>>>>>>> cluster [ERR] 36.10d deep-scrub stat mismatch, got 5833/5834 objects,
>>>>>>> 0/0 clones, 5758/5759 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0
>>>>>>> whiteouts, 24126649216/24130843520 bytes,0/0 hit_set_archive bytes.
>>>>>>>
>>>>>>> In all cases, one object is missing, and in all cases the PG id is 10d.
>>>>>>> Is this an epic coincidence, or could something else be going on here?
>>>>>>
>>>>>> I'm betting on something else. What OSDs is each PG mapped to?
>>>>>> It looks like each of them is missing one object on some of the OSDs.
>>>>>> What are the objects?
>>>>>
>>>>> 34.10d: [52,202,218]
>>>>> 55.10d: [303,231,65]
>>>>> 36.10d: [30,171,69]
>>>>>
>>>>> So no common OSDs. I've already repaired all of these PGs, and the logs
>>>>> have nothing interesting, so I can't say more about the objects.
>>>>>
>>>>> Cheers, Dan
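
For readers who find this thread in the archives: the consistent removal Greg suggests is normally done with ceph-objectstore-tool while the affected OSD is stopped. Below is a minimal sketch of that procedure; the paths, OSD id, and init commands are illustrative assumptions, the flags are the hammer-era ones, and, as the tracker ticket at the top of the thread shows, the tool itself crashed on this particular PG.

    # stop the OSD first; ceph-objectstore-tool needs exclusive access to the store
    service ceph stop osd.30

    # optionally export the PG before removing it, in case the data is needed later
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
        --journal-path /var/lib/ceph/osd/ceph-30/journal \
        --op export --pgid 36.10d --file /root/36.10d.export

    # remove the PG's objects and the collection itself in one consistent pass
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
        --journal-path /var/lib/ceph/osd/ceph-30/journal \
        --op remove --pgid 36.10d

    service ceph start osd.30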
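
Separately, the June half of the thread ends with the inconsistent PGs already repaired. For completeness, a typical scrub-and-repair cycle for a stat mismatch like those above looks roughly like this, with pg 36.10d as the example:

    ceph pg deep-scrub 36.10d   # re-run the deep scrub that reported the mismatch
    ceph pg repair 36.10d       # ask the primary OSD to repair the inconsistency
    ceph health detail          # confirm the PG is no longer flagged inconsistent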