Happy new year to all!

Over the holidays I suffered a disk failure, but I also hit an 'inconsistent pg' error, and I would like to understand what happened. Ceph 12.2.12, filestore.

Starting from 27/12 I got the classic disk errors:

  Dec 27 20:52:21 capitanmarvel kernel: [345907.286795] ata1.00: exception Emask 0x0 SAct 0xfe00000 SErr 0x0 action 0x0
  Dec 27 20:52:21 capitanmarvel kernel: [345907.286849] ata1.00: irq_stat 0x40000008
  Dec 27 20:52:21 capitanmarvel kernel: [345907.286880] ata1.00: failed command: READ FPDMA QUEUED
  Dec 27 20:52:21 capitanmarvel kernel: [345907.286920] ata1.00: cmd 60/00:a8:20:87:3b/04:00:00:00:00/40 tag 21 ncq dma 524288 in
  Dec 27 20:52:21 capitanmarvel kernel: [345907.286920]          res 41/40:00:46:8a:3b/00:00:00:00:00/40 Emask 0x409 (media error) <F>
  Dec 27 20:52:21 capitanmarvel kernel: [345907.287018] ata1.00: status: { DRDY ERR }
  Dec 27 20:52:21 capitanmarvel kernel: [345907.287046] ata1.00: error: { UNC }
  Dec 27 20:52:21 capitanmarvel kernel: [345907.288676] ata1.00: configured for UDMA/133
  Dec 27 20:52:21 capitanmarvel kernel: [345907.288698] sd 1:0:0:0: [sdc] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
  Dec 27 20:52:21 capitanmarvel kernel: [345907.288702] sd 1:0:0:0: [sdc] tag#21 Sense Key : Medium Error [current]
  Dec 27 20:52:21 capitanmarvel kernel: [345907.288705] sd 1:0:0:0: [sdc] tag#21 Add. Sense: Unrecovered read error - auto reallocate failed
  Dec 27 20:52:21 capitanmarvel kernel: [345907.288708] sd 1:0:0:0: [sdc] tag#21 CDB: Read(10) 28 00 00 3b 87 20 00 04 00 00
  Dec 27 20:52:21 capitanmarvel kernel: [345907.288711] print_req_error: I/O error, dev sdc, sector 3902022

but also:

  Dec 27 20:52:24 capitanmarvel ceph-osd[3852]: 2019-12-27 20:52:24.714716 7f821fbfd700 -1 log_channel(cluster) log [ERR] : 4.9b missing primary copy of 4:d97871c4:::rbd_data.142b816b8b4567.0000000000012ae1:head, will try copies on 8,14

The OSD 'flip-flopped' a bit for some days. At the first scrub I got:

  cluster:
    id:     8794c124-c2ec-4e81-8631-742992159bd6
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 5 daemons, quorum blackpanther,capitanmarvel,4,2,3
    mgr: hulk(active), standbys: blackpanther, deadpool, thor, capitanmarvel
    osd: 12 osds: 12 up, 12 in

  data:
    pools:   3 pools, 768 pgs
    objects: 671.04k objects, 2.54TiB
    usage:   7.62TiB used, 9.66TiB / 17.3TiB avail
    pgs:     766 active+clean
             1   active+clean+inconsistent
             1   active+clean+scrubbing+deep

Finally the OSD died, and so (after the automatic remapping) I got:

  cluster:
    id:     8794c124-c2ec-4e81-8631-742992159bd6
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 5 daemons, quorum blackpanther,capitanmarvel,4,2,3
    mgr: hulk(active), standbys: blackpanther, deadpool, thor, capitanmarvel
    osd: 12 osds: 11 up, 11 in

  data:
    pools:   3 pools, 768 pgs
    objects: 674.26k objects, 2.55TiB
    usage:   7.65TiB used, 8.71TiB / 16.4TiB avail
    pgs:     767 active+clean
             1   active+clean+inconsistent

To fix the issue I tried to read the docs (looking for 'OSD_SCRUB_ERRORS') and found:

  https://docs.ceph.com/docs/doc-12.2.0-major-changes/rados/operations/health-checks/

but the link it contains points to a missing page:

  https://docs.ceph.com/docs/doc-12.2.0-major-changes/rados/operations/pg-repair/

After fiddling a bit with Google, I found:

  https://ceph.io/geen-categorie/ceph-manually-repair-object/

which let me fix the issue easily with 'ceph pg repair'.
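For the archives, the repair boils down to roughly the following (a rough sketch, not a verbatim record of my session; '<pgid>' is just a placeholder for whatever PG 'ceph health detail' reports as inconsistent):

  ceph health detail                                        # lists the inconsistent PG(s)
  rados list-inconsistent-obj <pgid> --format=json-pretty   # shows which object/shard failed the scrub
  ceph pg repair <pgid>                                     # instruct the PG's primary OSD to repair it
  ceph pg deep-scrub <pgid>                                 # optionally re-verify afterwards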
Two questions:

1) Is the missing page on 'pg-repair' a documentation bug? Is there something I can do about it?

2) What exactly happened?

 - Why, if the OSD was not able to write the data, were the objects not automatically relocated to another OSD? Doesn't this violate the crushmap?

 - Why, when the failing OSD went out, was the inconsistent PG not automatically fixed? I have a replica count of 3; are the other two copies not coherent? But if so, how was Ceph able to fix them?

Sorry... and thanks. ;)

-- 
  dott. Marco Gaiarin                        GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''        http://www.lanostrafamiglia.it/
  Polo FVG - Via della Bontà, 7 - 33078 - San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

  Donate your 5 PER MILLE to LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
  (tax code 00307430132, category ONLUS or RICERCA SANITARIA)