On Tue, Oct 29, 2019 at 9:09 PM Jérémy Gardais <jeremy.gardais@xxxxxxxxxxxxxxx> wrote:
>
> Thus spake Brad Hubbard (bhubbard@xxxxxxxxxx) on Tuesday, 29 October 2019 at 08:20:31:
> > Yes, try and get the pgs healthy, then you can just re-provision the down OSDs.
> >
> > Run a scrub on each of these pgs and then use the commands on the following page to find out more information for each case.
> >
> > https://docs.ceph.com/docs/luminous/rados/troubleshooting/troubleshooting-pg/
> >
> > Focus on the commands 'list-missing', 'list-inconsistent-obj', and 'list-inconsistent-snapset'.
> >
> > Let us know if you get stuck.
> >
> > P.S. There are several threads about these sorts of issues in this mailing list that should turn up when doing a web search.
>
> I found this thread:
> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg53116.html

That looks like the same issue.

>
> And I started to gather additional information to solve PG 2.2ba:
>
> 1. rados list-inconsistent-snapset 2.2ba --format=json-pretty
> {
>     "epoch": 192223,
>     "inconsistents": [
>         {
>             "name": "rbd_data.b4537a2ae8944a.000000000000425f",
>             "nspace": "",
>             "locator": "",
>             "snap": 22772,
>             "errors": [
>                 "headless"
>             ]
>         },
>         {
>             "name": "rbd_data.b4537a2ae8944a.000000000000425f",
>             "nspace": "",
>             "locator": "",
>             "snap": "head",
>             "snapset": {
>                 "snap_context": {
>                     "seq": 22806,
>                     "snaps": [
>                         22805,
>                         22804,
>                         22674,
>                         22619,
>                         20536,
>                         17248,
>                         14270
>                     ]
>                 },
>                 "head_exists": 1,
>                 "clones": [
>                     {
>                         "snap": 17248,
>                         "size": 4194304,
>                         "overlap": "[0~2269184,2277376~1916928]",
>                         "snaps": [
>                             17248
>                         ]
>                     },
>                     {
>                         "snap": 20536,
>                         "size": 4194304,
>                         "overlap": "[0~2269184,2277376~1916928]",
>                         "snaps": [
>                             20536
>                         ]
>                     },
>                     {
>                         "snap": 22625,
>                         "size": 4194304,
>                         "overlap": "[0~2269184,2277376~1916928]",
>                         "snaps": [
>                             22619
>                         ]
>                     },
>                     {
>                         "snap": 22674,
>                         "size": 4194304,
>                         "overlap": "[266240~4096]",
>                         "snaps": [
>                             22674
>                         ]
>                     },
>                     {
>                         "snap": 22805,
>                         "size": 4194304,
>                         "overlap": "[0~942080,958464~901120,1875968~16384,1908736~360448,2285568~1908736]",
>                         "snaps": [
>                             22805,
>                             22804
>                         ]
>                     }
>                 ]
>             },
>             "errors": [
>                 "extra_clones"
>             ],
>             "extra clones": [
>                 22772
>             ]
>         }
>     ]
> }
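That output tells you most of what you need here. If you want a second opinion on which replicas are affected, the companion command from that same page is run the same way, something like:

  rados list-inconsistent-obj 2.2ba --format=json-pretty

For a snapset problem like this one the list-inconsistent-snapset output above is usually the interesting part, but list-inconsistent-obj reports errors per shard, so it can help confirm which OSDs hold the bad copy.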
>
> 2.a ceph-objectstore-tool from osd.29 and osd.42:
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29/ --pgid 2.2ba --op list rbd_data.b4537a2ae8944a.000000000000425f
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":17248,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":20536,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22625,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22674,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22772,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22805,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":-2,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
>
> 2.b ceph-objectstore-tool from osd.30:
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30/ --pgid 2.2ba --op list rbd_data.b4537a2ae8944a.000000000000425f
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":17248,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":20536,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22625,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22674,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22805,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":-2,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
>
> I needed to shut down the OSD service (30, 29 then 42) to be able to get any result. Otherwise I only got these errors:
> Mount failed with '(11) Resource temporarily unavailable'
> Or
> OSD has the store locked

Yes, the object store tool requires the OSD to be shut down.

>
> Without doing anything else, 2 OSDs started flapping (osd.38 and osd.27) with 1 PG switching between inactive, down and up…:

Maybe you should set nodown and noout while you do these maneuvers? That will minimise peering and recovery (data movement).

>
> HEALTH_ERR 2 osds down; 12128/37456062 objects misplaced (0.032%); 4 scrub errors; Reduced data availability: 1 pg inactive, 1 pg down; Possible data damage: 2 pgs inconsistent; Degraded data redundancy: 2264342/37456062 objects degraded (6.045%), 859 pgs degraded
> OSD_DOWN 2 osds down
>     osd.27 (root=default,datacenter=IPR,room=11B,rack=baie2,host=r730xd3) is down
>     osd.38 (root=default,datacenter=IPR,room=11B,rack=baie2,host=r740xd1) is down
> OBJECT_MISPLACED 12128/37456062 objects misplaced (0.032%)
> OSD_SCRUB_ERRORS 4 scrub errors
> PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg down
>     pg 2.448 is down, acting [0]
> PG_DAMAGED Possible data damage: 2 pgs inconsistent
>     pg 2.2ba is active+clean+inconsistent, acting [42,29,30]
>     pg 2.2bb is active+clean+inconsistent, acting [25,42,18]
>     pg 2.371 is active+undersized+degraded+remapped+inconsistent+backfill_wait, acting [42,9]
> …
>
> If I correctly understood the previous thread, I should remove the snapid 22772 from osd.29 and osd.42:
> ceph-objectstore-tool --pgid 2.2ba --data-path /var/lib/ceph/osd/ceph-29/ ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22772,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}] remove
> ceph-objectstore-tool --pgid 2.2ba --data-path /var/lib/ceph/osd/ceph-42/ ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22772,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}] remove

That looks right.

>
> Do I still need to shut down the service first, or am I missing an important thing?

Yes.

>
> Sorry for the noob noise, I'm not really comfortable with the current state of my cluster -_-

You should probably try and work out what caused the issue and take steps to minimise the likelihood of a recurrence. This is not expected behaviour in a correctly configured and stable environment.
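For what it's worth, a rough sketch of the whole sequence for one OSD might look like the following. This is untested, assumes systemd-managed OSDs, and the export step (and the backup filename) is only a suggested safety net, not something the tool requires:

  ceph osd set noout
  ceph osd set nodown

  # stop the OSD so ceph-objectstore-tool can take the store lock
  systemctl stop ceph-osd@29

  # optional: keep a full copy of the PG before deleting anything
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29/ --pgid 2.2ba --op export --file /root/pg-2.2ba-osd29.export

  # remove the orphaned clone; note the JSON object needs to be quoted for the shell
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29/ --pgid 2.2ba '["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22772,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]' remove

  # bring the OSD back
  systemctl start ceph-osd@29

Repeat the stop/export/remove/start part for osd.42, then unset the flags (ceph osd unset nodown; ceph osd unset noout), deep-scrub the pg again (ceph pg deep-scrub 2.2ba), and check that the inconsistency clears.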
>
> --
> Gardais Jérémy
> Institut de Physique de Rennes
> Université Rennes 1
> Telephone: 02-23-23-68-60
> Mail & best practices: http://fr.wikipedia.org/wiki/Nétiquette
> -------------------------------

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com