Hi Dominik,
What about performance on osd.87 at the moment? Do you have any related measurements?

On Tue, Jul 2, 2013 at 1:37 PM, Dominik Mostowiec <dominikmostowiec@xxxxxxxxx> wrote:
Hi,
I got it.
ceph health detail
HEALTH_WARN 3 pgs peering; 3 pgs stuck inactive; 5 pgs stuck unclean;
recovery 64/38277874 degraded (0.000%)
pg 5.df9 is stuck inactive for 138669.746512, current state peering,
last acting [87,2,151]
pg 5.a82 is stuck inactive for 138638.121867, current state peering,
last acting [151,87,42]
pg 5.80d is stuck inactive for 138621.069523, current state peering,
last acting [151,47,87]
pg 5.df9 is stuck unclean for 138669.746761, current state peering,
last acting [87,2,151]
pg 5.ae2 is stuck unclean for 139479.810499, current state active,
last acting [87,151,28]
pg 5.7b6 is stuck unclean for 139479.693271, current state active,
last acting [87,105,2]
pg 5.a82 is stuck unclean for 139479.713859, current state peering,
last acting [151,87,42]
pg 5.80d is stuck unclean for 139479.800820, current state peering,
last acting [151,47,87]
pg 5.df9 is peering, acting [87,2,151]
pg 5.a82 is peering, acting [151,87,42]
pg 5.80d is peering, acting [151,47,87]
recovery 64/38277874 degraded (0.000%)
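For reference, the per-PG detail below was pulled with "ceph pg <pgid> query"; a minimal sketch for collecting it for all three stuck PGs, assuming it is run on a node with the admin keyring:

# dump the query output of each stuck PG into a file
for pg in 5.df9 5.a82 5.80d; do
    ceph pg $pg query > pg_${pg}.json
done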
osd pg query for 5.df9:
{ "state": "peering",
"up": [
87,
2,
151],
"acting": [
87,
2,
151],
"info": { "pgid": "5.df9",
"last_update": "119454'58844953",
"last_complete": "119454'58844953",
"log_tail": "119454'58843952",
"last_backfill": "MAX",
"purged_snaps": "[]",
"history": { "epoch_created": 365,
"last_epoch_started": 119456,
"last_epoch_clean": 119456,
"last_epoch_split": 117806,
"same_up_since": 119458,
"same_interval_since": 119458,
"same_primary_since": 119458,
"last_scrub": "119442'58732630",
"last_scrub_stamp": "2013-06-29 20:02:24.817352",
"last_deep_scrub": "119271'57224023",
"last_deep_scrub_stamp": "2013-06-23 02:04:49.654373",
"last_clean_scrub_stamp": "2013-06-29 20:02:24.817352"},
"stats": { "version": "119454'58844953",
"reported": "119458'42382189",
"state": "peering",
"last_fresh": "2013-06-30 20:35:29.489826",
"last_change": "2013-06-30 20:35:28.469854",
"last_active": "2013-06-30 20:33:24.126599",
"last_clean": "2013-06-30 20:33:24.126599",
"last_unstale": "2013-06-30 20:35:29.489826",
"mapping_epoch": 119455,
"log_start": "119454'58843952",
"ondisk_log_start": "119454'58843952",
"created": 365,
"last_epoch_clean": 365,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "119442'58732630",
"last_scrub_stamp": "2013-06-29 20:02:24.817352",
"last_deep_scrub": "119271'57224023",
"last_deep_scrub_stamp": "2013-06-23 02:04:49.654373",
"last_clean_scrub_stamp": "2013-06-29 20:02:24.817352",
"log_size": 135341,
"ondisk_log_size": 135341,
"stats_invalid": "0",
"stat_sum": { "num_bytes": 1010563373,
"num_objects": 3099,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_degraded": 0,
"num_objects_unfound": 0,
"num_read": 302,
"num_read_kb": 0,
"num_write": 32264,
"num_write_kb": 798650,
"num_scrub_errors": 0,
"num_objects_recovered": 8235,
"num_bytes_recovered": 2085653757,
"num_keys_recovered": 249061471},
"stat_cat_sum": {},
"up": [
87,
2,
151],
"acting": [
87,
2,
151]},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 119454},
"recovery_state": [
{ "name": "Started\/Primary\/Peering\/GetLog",
"enter_time": "2013-06-30 20:35:28.545478",
"newest_update_osd": 2},
{ "name": "Started\/Primary\/Peering",
"enter_time": "2013-06-30 20:35:28.469841",
"past_intervals": [
{ "first": 119453,
"last": 119454,
"maybe_went_rw": 1,
"up": [
87,
2,
151],
"acting": [
87,
2,
151]},
{ "first": 119455,
"last": 119457,
"maybe_went_rw": 1,
"up": [
2,
151],
"acting": [
2,
151]}],
"probing_osds": [
2,
87,
151],
"down_osds_we_would_probe": [],
"peering_blocked_by": []},
{ "name": "Started",
"enter_time": "2013-06-30 20:35:28.469765"}]}
For other PGs: https://www.dropbox.com/s/q5iv8lwzecioy3d/pg_query.tar.tz
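The recovery_state above shows the primary (osd.87) sitting in Started/Primary/Peering/GetLog. A rough sketch of what could be checked on that OSD before restarting it, assuming the default admin socket path and a sysvinit-style install (availability of these admin socket commands on 0.61 is an assumption):

# see what the stuck primary is currently working on
ceph --admin-daemon /var/run/ceph/ceph-osd.87.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.87.asok perf dump
# if nothing obvious shows up, restarting the primary retriggers peering
/etc/init.d/ceph restart osd.87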
--
Regards
Dominik
2013/6/30 Andrey Korolyov <andrey@xxxxxxx>:
> That's not a loop as it looked, sorry - I have reproduced the issue many
> times and there is no such cpu-eating behavior in most cases, only
> locked pgs are present. Also I may celebrate the return of the 'wrong
> down mark' bug, at least for the 0.61.4 tag. For the first one, I'll send
> a link with a core dump as soon as I am able to reproduce it on my test
> env; the second one is linked with 100% disk utilization, so I'm not sure
> whether this behavior is right or wrong.
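For the core/backtrace mentioned above, a rough sketch, assuming the stuck daemon was started as "ceph-osd -i 87" (the osd ids here are placeholders) and that gdb and gcore are installed:

# reproduce: mark one osd out and look for a neighbor stuck peering
ceph osd out 12                 # osd.12 is a placeholder
ceph pg dump_stuck inactive
# capture an all-thread backtrace and a core from the stuck daemon
pid=$(pgrep -f 'ceph-osd -i 87')
gdb -p "$pid" -batch -ex 'thread apply all bt' > osd.87.backtrace.txt
gcore -o osd.87.core "$pid"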
>
> On Sat, Jun 29, 2013 at 1:28 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>> On Sat, 29 Jun 2013, Andrey Korolyov wrote:
>>> There is almost the same problem with the 0.61 cluster, at least with the
>>> same symptoms. It can be reproduced quite easily - remove an osd and then
>>> mark it as out, and with quite high probability one of its neighbors will
>>> be stuck at the end of the peering process with a couple of peering pgs
>>> whose primary copy is on it. Such an osd process seems to be stuck in some
>>> kind of lock, eating exactly 100% of one core.
>>
>> Which version?
>> Can you attach with gdb and get a backtrace to see what it is chewing on?
>>
>> Thanks!
>> sage
>>
>>
>>>
>>> On Thu, Jun 13, 2013 at 8:42 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>> > On Thu, Jun 13, 2013 at 6:33 AM, Sławomir Skowron <szibis@xxxxxxxxx> wrote:
>>> >> Hi, sorry for the late response.
>>> >>
>>> >> https://docs.google.com/file/d/0B9xDdJXMieKEdHFRYnBfT3lCYm8/view
>>> >>
>>> >> Logs are in the attachment, and on google drive, from today.
>>> >>
>>> >> https://docs.google.com/file/d/0B9xDdJXMieKEQzVNVHJ1RXFXZlU/view
>>> >>
>>> >> We had this problem again today, and new logs are on google drive with today's date.
>>> >>
>>> >> What is strange is that the problematic osd.71 has about 10-15% more space
>>> >> used than the other osds in the cluster.
>>> >>
>>> >> Today osd.71 failed 3 times within one hour in the mon log, and after the
>>> >> third failure recovery got stuck and many 500 errors appeared in the http
>>> >> layer on top of rgw. When it is stuck, restarting osd.71, osd.23, and
>>> >> osd.108, all from the stuck pg, helps, but I ran a repair on this osd as
>>> >> well, just in case.
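The restart-plus-repair workaround described above might look roughly like this on a sysvinit deployment (the init script path is an assumption; the osd ids are the ones named in the message):

# restart the three osds in the stuck pg's acting set
for id in 71 23 108; do
    /etc/init.d/ceph restart osd.$id
done
# ask osd.71 to repair itself, just in case
ceph osd repair 71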
>>> >>
>>> >> I have a theory that this pg holds the rgw index of objects, or that one
>>> >> of the osds in this pg has some problem with its local filesystem or the
>>> >> drive below it (the raid controller reports nothing about that), but I do
>>> >> not see any problem in the system.
>>> >>
>>> >> How can we find in which pg/osd the index of objects in an rgw bucket exists?
>>> >
>>> > You can find the location of any named object by grabbing the OSD map
>>> > from the cluster and using the osdmaptool: "osdmaptool <mapfile>
>>> > --test-map-object <objname> --pool <poolid>".
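A worked example of that osdmaptool invocation; the object name and pool id below are placeholders (rgw bucket index objects are typically named ".dir.<bucket marker>", and this thread does not say which pool holds the index):

# grab the current osdmap and test-map a hypothetical bucket index object
ceph osd getmap -o /tmp/osdmap
osdmaptool /tmp/osdmap --test-map-object .dir.default.4567.1 --pool 5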
>>> >
>>> > You're not providing any context for your issue though, so we really
>>> > can't help. What symptoms are you observing?
>>> > -Greg
>>> > Software Engineer #42 @ http://inktank.com | http://ceph.com
Regards
Dominik
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com