Re: Data not accessible after replacing OSD with larger volume

Hi,

"Yesterday I replaced one of the 100 GB volumes with a new 2 TB volume which includes creating a snapshot, detaching the old volume, attaching the new volume, then using parted to correctly set the start/end of the data partition. This all went smoothly and no issues reported from AWS or the server."
While this method should work, I think you would be better off adding the new 2TB OSD first and changing the weight of the old OSD to 0, so its data drains off, before unmounting, detaching and deleting it.
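For reference, a rough sketch of that drain-then-remove sequence (osd.7 is just a placeholder id, and the service command assumes a systemd-based install; adjust to your deployment):

ceph osd crush reweight osd.7 0   # CRUSH stops mapping PGs to the old OSD, data drains off
ceph -s                           # wait until backfill finishes and the cluster is healthy again
ceph osd out 7
sudo systemctl stop ceph-osd@7
ceph osd crush remove osd.7
ceph auth del osd.7
ceph osd rm 7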

David is right, your weight and reweight values are off.

Do you have more info on your cluster status? Is anything reported like an OSD being nearfull? Is the data in a pool with triple replication?
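Something along these lines would tell us ("images" below is just a placeholder pool name):

ceph status
ceph health detail
ceph osd df tree                # per-OSD utilisation, shows nearfull/full OSDs
ceph osd pool get images size   # replication factor of the pool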

Side note: since you run Ceph in AWS, you might be interested in this piece from the folks at GitLab: https://about.gitlab.com/2016/11/10/why-choose-bare-metal/ 

Cheers,
Maxime

On Mon, 1 May 2017 at 06:40 David Turner <drakonstein@xxxxxxxxx> wrote:
The crush weight should match the size of your OSDs. The 100GB OSDs have 0.090, probably because the weight is derived from GiB rather than GB. Your 2TB OSDs should have a weight of 2.000, or thereabouts. Your reweight values will be able to go back much closer to 1 once you fix the weights of the larger OSDs. Fixing that might allow your cluster to finish backfilling.
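Roughly like this (osd.36 is standing in for one of your 2TB OSDs; the id is illustrative):

ceph osd crush reweight osd.36 2.000   # CRUSH weight should roughly match the device size in TB
ceph osd reweight 36 1.0               # move the override reweight back towards 1 once backfill settles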

How do you access your images? Is it through CephFS, RGW, or RBD? Your current health doesn't look like it should prevent access to your images. The only thing I can think of, other than the MDS or RGW not running, would be to issue a deep scrub on some of the PGs on the newly enlarged OSD to see if there are any inconsistent PGs on it.
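Deep scrubs can be kicked off per PG, something like this (the osd id and pgid are placeholders):

ceph pg ls-by-osd 36      # list the PGs that have a copy on that OSD
ceph pg deep-scrub 2.1a   # deep-scrub one of the listed PGs
ceph health detail        # any inconsistencies will show up here afterwards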

On Sun, Apr 30, 2017, 10:40 AM Scott Lewis <scott@xxxxxxxxxxxxxx> wrote:
Hi,

I am a complete n00b to Ceph and cannot seem to figure out why my cluster isn't working as expected. We have 39 OSDs, 36 of which are 100 GB volumes and 3 of which are 2 TB volumes, all managed under AWS EC2.

Yesterday I replaced one of the 100 GB volumes with a new 2 TB volume, which involved creating a snapshot, detaching the old volume, attaching the new volume, and then using parted to correctly set the start/end of the data partition. This all went smoothly and no issues were reported from AWS or the server.
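(The parted step was roughly of this shape; the device name, partition number and layout below are illustrative rather than the exact values used:)

sudo parted /dev/xvdf unit s print        # inspect the partition table on the new 2 TB volume
sudo parted /dev/xvdf resizepart 1 100%   # extend the data partition to the end of the volume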

However, when I started reweighting the OSDs, the health status went to HEALTH_WARN with over 500 PGs stuck unclean and about 14% of objects misplaced. I am adding the health detail, crush map, and OSD tree here:


We use Ceph to store our image inventory, which is about 5 million images. If you do a search on our site, https://iconfinder.com, none of the images show up.

This all started after doing the reweights when the new volume was added. I tried setting all of the weights back to their original values, but that did not help.

The only other thing I changed was raising kernel.pid_max to the maximum allowed value. I have since reset it to the original setting, but that didn't help either:

sudo sysctl -w kernel.pid_max=32768


Thanks in advance for any help.

Scott Lewis
Sr. Developer & Head of Content
Iconfinder Aps


"Helping Designers Make a Living Doing What They Love" 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
