Re: two OSDs stuck on peering after starting an OSD for recovery

Hi,
We took osd.71 out, and now the problem is on osd.57.
Something curious: op_rw on osd.57 is much higher than on the other OSDs.
See here: https://www.dropbox.com/s/o5q0xi9wbvpwyiz/op_rw_osd57.PNG

In this OSD's data directory I found:
> data/osd.57/current# du -sh omap/
> 2.3G    omap/
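
Just to compare, a rough sketch of how we could check whether osd.57's
omap is an outlier (assuming the same data/osd.$id/current layout as
above; the OSD ids below are only examples, adjust to the ones on each
host):

    # compare omap directory sizes across a few OSDs on this host
    for id in 23 57 71 108; do
        printf 'osd.%s: ' "$id"
        du -sh "data/osd.$id/current/omap" 2>/dev/null || echo "not on this host"
    done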

Is such a much higher op_rw on one OSD normal?
Maybe some config is set wrong (logs going to this OSD or something like that).
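
A rough sketch of how we double-check the counter itself, straight from
the perf counters (assuming the admin socket is enabled at the default
path /var/run/ceph/ceph-osd.$id.asok; adjust to your layout):

    # dump perf counters for osd.57 and pull out op_rw,
    # then run the same on a "normal" OSD and compare
    ceph --admin-daemon /var/run/ceph/ceph-osd.57.asok perf dump \
        | grep -o '"op_rw":[^,}]*'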

Today we had another crash (4 times).
Logs with debug (level 10) are here:
https://www.dropbox.com/s/vxvh8084b8ty19u/osd.57_20130628_13xx.log.tar.gz
With debugging on it is higher, but normally the osd.57 process consumes
about ~7% CPU, and iostat on the disk shows at most 4% util.

Production ceph.conf debug options:
[global]
        debug_lockdep = 0/0
        debug_context = 0/0
        debug_crush = 0/0
        debug_mds = 0/0
        debug_mds_balancer = 0/0
        debug_mds_locker = 0/0
        debug_mds_log = 0/0
        debug_mds_log_expire = 0/0
        debug_mds_migrator = 0/0
        debug_buffer = 0/0
        debug_timer = 0/0
        debug_filer = 0/0
        debug_objecter = 0/0
        debug_rados = 0/0
        debug_rbd = 0/0
        debug_journaler = 0/0
        debug_objectcacher = 0/0
        debug_client = 0/0
        debug_optracker = 0/0
        debug_objclass = 0/0
        debug_journal = 0/0
        debug_ms = 0/0
        debug_mon = 0/0
        debug_monc = 0/0
        debug_paxos = 0/0
        debug_tp = 0/0
        debug_auth = 0/0
        debug_finisher = 0/0
        debug_heartbeatmap = 0/0
        debug_perfcounter = 0/0
        debug_hadoop = 0/0
        debug_asok = 0/0
        debug_throttle = 0/0
[osd]
        debug osd = 1
        debug filestore = 1 ; local object storage
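
When we need the level-10 logs mentioned above, a rough sketch of how we
bump the levels at runtime instead of editing ceph.conf (assuming
injectargs works like this on this version; we revert when done):

    # temporarily raise debug levels on osd.57 only
    ceph tell osd.57 injectargs '--debug-osd 10 --debug-filestore 10'
    # ...reproduce / collect logs, then go back to the production levels
    ceph tell osd.57 injectargs '--debug-osd 1 --debug-filestore 1'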

--
Regards
Dominik

2013/6/13 Gregory Farnum <greg@xxxxxxxxxxx>:
> On Thu, Jun 13, 2013 at 6:33 AM, Sławomir Skowron <szibis@xxxxxxxxx> wrote:
>> Hi, sorry for the late response.
>>
>> https://docs.google.com/file/d/0B9xDdJXMieKEdHFRYnBfT3lCYm8/view
>>
>> Logs are in the attachment and on Google Drive, from today.
>>
>> https://docs.google.com/file/d/0B9xDdJXMieKEQzVNVHJ1RXFXZlU/view
>>
>> We had this problem again today, and the new logs with today's date are on Google Drive.
>>
>> What is strange is that the problematic osd.71 uses about 10-15% more
>> space than the other OSDs in the cluster.
>>
>> Today, within one hour, osd.71 failed 3 times in the mon log, and after
>> the third failure recovery got stuck and many 500 errors appeared in the
>> HTTP layer on top of RGW. When it's stuck, restarting osd.71, osd.23,
>> and osd.108, all from the stuck PG, helps, but I even ran a repair on
>> this OSD, just in case.
>>
>> I have a theory that the RGW object index is on this PG, or that one of
>> the OSDs in this PG has a problem with its local filesystem or the drive
>> below it (the RAID controller reports nothing), but I do not see any
>> problem in the system.
>>
>> How can we find which PG/OSD the object index of an RGW bucket lives on?
>
> You can find the location of any named object by grabbing the OSD map
> from the cluster and using the osdmaptool: "osdmaptool <mapfile>
> --test-map-object <objname> --pool <poolid>".
>
> You're not providing any context for your issue though, so we really
> can't help. What symptoms are you observing?
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
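
For the record, a rough sketch of applying Greg's osdmaptool suggestion
to an RGW bucket index object (the bucket name, the marker value
default.4120.1 and the pool id 5 below are made-up examples; as far as I
know the index objects are named .dir.<bucket marker>):

    # find the bucket marker/id for the bucket we suspect
    radosgw-admin bucket stats --bucket=mybucket    # note the "id"/"marker" field
    # grab the current OSD map and the numeric id of the pool holding the .dir.* objects
    ceph osd getmap -o /tmp/osdmap
    ceph osd dump | grep pool
    # map the index object to its PG and OSDs
    osdmaptool /tmp/osdmap --test-map-object .dir.default.4120.1 --pool 5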



-- 
Regards
Dominik