Hi

We have a cluster of 4 physical nodes running Jewel. Our app talks S3 to the cluster and, no doubt, uses the S3 index heavily. We've had several big outages in the past that seem to have been caused by a deep scrub on one of the PGs in the S3 index pool. Generally it starts with a deep scrub on one such PG, then lots of slow requests block and accumulate, which eventually brings the whole cluster down. In events like this we have to set the noup/nodown/noout flags so that the OSDs don't suicide during the deep scrub, as shown below.
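For context, "setting noup/nodown/noout" here just means the standard cluster-wide flags, roughly:

    # keep OSDs from being marked out/down/up while the scrub grinds on
    ceph osd set noout
    ceph osd set nodown
    ceph osd set noup

    # and clear them again once things settle
    ceph osd unset noup
    ceph osd unset nodown
    ceph osd unset noout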
In a recent outage, the deep scrub of one PG took 2 hours to finish. After it finished, I happened to try listing all the omap keys of the objects in that PG and found that listing the keys of one particular object could cause the same outage described above. That suggests to me that the index object was corrupted, but I can't find anything in the logs. Interestingly (to me), 2 days later that index object seems to have fixed itself: listing its omap keys is quick and easy, and deep-scrubbing the same PG only takes 3 seconds.

The deep-scrub that took 2 hours to finish:

The command I used to list all omap keys:

Most recent deep-scrub kicked off manually:
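For anyone who wants to poke at their own index objects, the general shape of these operations is something like the following; the pool name, object name and PG id are placeholders, not the actual ones from our cluster:

    # list all omap keys held by one bucket index object
    rados -p .rgw.buckets.index listomapkeys .dir.<bucket-instance-id>

    # manually kick off a deep scrub on the suspect PG
    ceph pg deep-scrub <pgid>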
Setting debug_leveldb to 20/5 didn't log any useful information for the event, sorry, but a perf record shows that most (83%) of the time was spent in LevelDB operations (the screenshot or perf file can be supplied if anybody is interested, since it's over the 150KB size limit).
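For reference, the debug level can be bumped at runtime and the busy OSD profiled with something like the following (the OSD id, pid and duration are placeholders):

    # raise leveldb logging on the OSD serving the index PG
    ceph tell osd.<id> injectargs '--debug_leveldb 20/5'

    # profile that OSD process while the slow listomapkeys/deep-scrub is in flight
    perf record -g -p <osd-pid> -- sleep 60
    perf report --stdio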
I wonder if anybody has come across a similar issue before, or can explain what happened to the index object to make it unusable before but usable 2 days later? One thing that might fix the index object is LevelDB compaction, I guess.
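If compaction really is what healed it, the only knob I'm aware of on Jewel is compacting the omap LevelDB at OSD start, assuming leveldb_compact_on_mount does what I think it does, something like:

    # ceph.conf on the nodes hosting that PG
    [osd]
    leveldb_compact_on_mount = true

    # then restart the affected OSD so its omap store gets compacted on startup
    systemctl restart ceph-osd@<id>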
By the way, the above problematic index object has ~30k keys; the biggest index object in our cluster holds about 300k keys.

Regards
Stanley

--
Stanley Zhang | Senior Operations Engineer |