Last time I had to do this, I used the command outlined here:
https://tracker.ceph.com/issues/10098

On Mon, Mar 4, 2019 at 11:05 AM Daniel K <sathackr@xxxxxxxxx> wrote:
>
> Thanks for the suggestions.
>
> I've tried both -- setting osd_find_best_info_ignore_history_les = true
> and restarting all OSDs, as well as 'ceph osd force-create-pg' -- but
> both PGs still show incomplete.
>
> PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
>     pg 18.c is incomplete, acting [32,48,58,40,13,44,61,59,30,27,43,37] (reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs for 'incomplete')
>     pg 18.1e is incomplete, acting [50,49,41,58,60,46,52,37,34,63,57,16] (reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs for 'incomplete')
>
> The OSDs in down_osds_we_would_probe have already been marked lost.
>
> When I ran the force-create-pg command, the PGs went to peering for a
> few seconds, but then went back to incomplete.
>
> Updated ceph pg 18.1e query: https://pastebin.com/XgZHvJXu
> Updated ceph pg 18.c query: https://pastebin.com/N7xdQnhX
>
> Any other suggestions?
>
> Thanks again,
>
> Daniel
>
> On Sat, Mar 2, 2019 at 3:44 PM Paul Emmerich <paul.emmerich@xxxxxxxx> wrote:
>>
>> On Sat, Mar 2, 2019 at 5:49 PM Alexandre Marangone
>> <a.marangone@xxxxxxxxx> wrote:
>> >
>> > If you have no way to recover the drives, you can try to reboot the
>> > OSDs with `osd_find_best_info_ignore_history_les = true` (revert it
>> > afterwards); you'll lose data. If the PGs are still down after this,
>> > you can mark the OSDs blocking them from becoming active as lost.
>>
>> This should work for PG 18.1e, but not for 18.c. Try running "ceph osd
>> force-create-pg <pgid>" to reset the PGs instead.
>> Data will obviously be lost afterwards.
>>
>> Paul
>>
>> >
>> > On Sat, Mar 2, 2019 at 6:08 AM Daniel K <sathackr@xxxxxxxxx> wrote:
>> >>
>> >> They all just started having read errors. Bus resets. Slow reads.
>> >> Which is one of the reasons the cluster didn't recover fast enough
>> >> to compensate.
>> >>
>> >> I tried to be mindful of the drive type and specifically avoided
>> >> the larger-capacity Seagates that are SMR. Used 1 SM863 for every
>> >> 6 drives for the WAL.
>> >>
>> >> Not sure why they failed. The data isn't critical at this point, I
>> >> just need to get the cluster back to normal.
>> >>
>> >> On Sat, Mar 2, 2019, 9:00 AM <jesper@xxxxxxxx> wrote:
>> >>>
>> >>> Did they break, or did something go wrong while trying to replace
>> >>> them?
>> >>>
>> >>> Jesper
>> >>>
>> >>> Sent from myMail for iOS
>> >>>
>> >>> Saturday, 2 March 2019, 14.34 +0100 from Daniel K <sathackr@xxxxxxxxx>:
>> >>>
>> >>> I bought the wrong drives trying to be cheap. They were 2TB WD
>> >>> Blue 5400rpm 2.5-inch laptop drives.
>> >>>
>> >>> They've been replaced now with HGST 10K 1.8TB SAS drives.
>> >>>
>> >>> On Sat, Mar 2, 2019, 12:04 AM <jesper@xxxxxxxx> wrote:
>> >>>
>> >>> Saturday, 2 March 2019, 04.20 +0100 from sathackr@xxxxxxxxx <sathackr@xxxxxxxxx>:
>> >>>
>> >>> 56 OSDs, 6-node 12.2.5 cluster on Proxmox.
>> >>>
>> >>> We had multiple drives fail (about 30%) within a few days of each
>> >>> other, likely faster than the cluster could recover.
>> >>>
>> >>> How did so many drives break?
>> >>>
>> >>> Jesper
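
For anyone who finds this thread later: a rough sketch of the recovery
path discussed above, assuming a Luminous (12.2.x) cluster. The PG IDs
(18.c, 18.1e) are the ones from this thread; the OSD id is a placeholder.
Every step below throws away whatever data was left in the affected PGs,
so treat it as a last resort, not a recipe:

    # 1) Let peering ignore last_epoch_started history. On Luminous, set
    #    this in ceph.conf on the OSD hosts and restart the OSDs (revert
    #    it and restart again once the PGs are active):
    #      [osd]
    #      osd_find_best_info_ignore_history_les = true
    systemctl restart ceph-osd.target          # on each OSD host

    # 2) Mark the dead OSDs listed in down_osds_we_would_probe as lost
    #    (osd.12 is a placeholder id):
    ceph osd lost 12 --yes-i-really-mean-it

    # 3) If a PG still will not go active, recreate it empty:
    ceph osd force-create-pg 18.c
    ceph osd force-create-pg 18.1e

    # 4) Verify, then remove the setting again:
    ceph pg 18.c query
    ceph health detail
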
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
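
The tracker issue linked at the top (https://tracker.ceph.com/issues/10098)
is, as far as I can tell, about ceph-objectstore-tool's mark-complete
operation, which forcibly marks a PG complete on the OSD holding the best
remaining copy so it can peer again. A hedged sketch of the usual
invocation -- the OSD id, data path, and FileStore journal path are
placeholders, and the OSD must be stopped while the tool runs:

    systemctl stop ceph-osd@48

    ceph-objectstore-tool \
        --data-path /var/lib/ceph/osd/ceph-48 \
        --journal-path /var/lib/ceph/osd/ceph-48/journal \
        --pgid 18.c \
        --op mark-complete

    systemctl start ceph-osd@48

Like the steps above, this only lets the PG go active again; it does not
bring back the objects that were lost with the failed drives.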