cluster state:
     osdmap e3240: 24 osds: 12 up, 12 in
      pgmap v46050: 1088 pgs, 2 pools, 20322 GB data, 5080 kobjects
            22224 GB used, 61841 GB / 84065 GB avail
            4745644/10405374 objects degraded (45.608%); 3688079/10405374 objects misplaced (35.444%)
                   5 stale+active+clean
                  59 active+clean
                  74 active+undersized+degraded+remapped+backfilling
                  53 active+remapped
                 577 active+undersized+degraded
                  37 down+peering
                 283 active+undersized+degraded+remapped+wait_backfill
  recovery io 844 MB/s, 211 objects/s
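The 37 PGs sitting in down+peering above are the ones that would block client IO to the objects they hold; PGs that are merely undersized or degraded stay active and keep serving requests. If it helps, a rough sketch of how to see what a stuck PG is waiting on with stock commands (the PG id 1.2f3 is only a placeholder, substitute one reported by the cluster):

  ceph health detail            # lists the PGs reported as down/peering/stuck
  ceph pg dump_stuck inactive   # PGs currently unable to serve IO
  ceph pg 1.2f3 query           # look at "peering_blocked_by" and
                                # "down_osds_we_would_probe" in recovery_state

The query output should name the down OSDs a PG wants to probe before it will go active again.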
On Wed, Jul 15, 2015 at 2:29 PM, Mallikarjun Biradar
<mallikarjuna.biradar@xxxxxxxxx> wrote:
> Sorry for the delay in replying to this; I was doing some retries on
> this issue to summarise.
>
> Tony,
> Setup details:
> Two storage boxes (each with 12 drives), each connected to 4 hosts.
> Each host owns 3 disks from the storage box. Total of 24 OSDs.
> Failure domain is at the chassis level.
>
> OSD tree:
> -1  164.2     root default
> -7   82.08        chassis chassis1
> -2   20.52            host host-1
>  0    6.84                osd.0   up  1
>  1    6.84                osd.1   up  1
>  2    6.84                osd.2   up  1
> -3   20.52            host host-2
>  3    6.84                osd.3   up  1
>  4    6.84                osd.4   up  1
>  5    6.84                osd.5   up  1
> -4   20.52            host host-3
>  6    6.84                osd.6   up  1
>  7    6.84                osd.7   up  1
>  8    6.84                osd.8   up  1
> -5   20.52            host host-4
>  9    6.84                osd.9   up  1
> 10    6.84                osd.10  up  1
> 11    6.84                osd.11  up  1
> -8   82.08        chassis chassis2
> -6   20.52            host host-5
> 12    6.84                osd.12  up  1
> 13    6.84                osd.13  up  1
> 14    6.84                osd.14  up  1
> -9   20.52            host host-6
> 15    6.84                osd.15  up  1
> 16    6.84                osd.16  up  1
> 17    6.84                osd.17  up  1
> -10  20.52            host host-7
> 18    6.84                osd.18  up  1
> 19    6.84                osd.19  up  1
> 20    6.84                osd.20  up  1
> -11  20.52            host host-8
> 21    6.84                osd.21  up  1
> 22    6.84                osd.22  up  1
> 23    6.84                osd.23  up  1
>
> The cluster had ~30 TB of data, with client IO in progress.
> After chassis1 underwent a power cycle:
> 1> all OSDs under chassis2 were intact, up & running;
> 2> all OSDs under chassis1 were down, as expected.
>
> But client IO was paused until all the hosts/OSDs under chassis1
> came up. This issue was observed twice out of 5 attempts.
>
> Size is 2 & min_size is 1.
>
> -Thanks,
> Mallikarjun
>
>
> On Thu, Jul 9, 2015 at 8:01 PM, Tony Harris <nethfel@xxxxxxxxx> wrote:
>> Sounds to me like you've put yourself at too much risk. *If* I'm reading
>> your message right about your configuration, you have multiple hosts
>> accessing OSDs that are stored on a single shared box, so if that single
>> shared box (a single point of failure for multiple nodes) goes down, it's
>> possible for multiple replicas to disappear at the same time, which could
>> halt the operation of your cluster if the masters and the replicas are both
>> on OSDs within that single shared storage system...
>>
>> On Thu, Jul 9, 2015 at 5:42 AM, Mallikarjun Biradar
>> <mallikarjuna.biradar@xxxxxxxxx> wrote:
>>>
>>> Hi all,
>>>
>>> Setup details:
>>> Two storage enclosures, each connected to 4 OSD nodes (shared storage).
>>> Failure domain is chassis (enclosure) level. Replication count is 2.
>>> Each host is allotted 4 drives.
>>>
>>> I have active client IO running on the cluster (random write profile
>>> with 4M block size & 64 queue depth).
>>>
>>> One of the enclosures had a power loss, so all OSDs from the hosts
>>> connected to this enclosure went down, as expected.
>>>
>>> But client IO got paused. After some time the enclosure & the hosts
>>> connected to it came back up, and all OSDs on those hosts came up.
>>>
>>> Until then, the cluster was not serving IO. Once all hosts & OSDs
>>> pertaining to that enclosure came up, client IO resumed.
>>>
>>> Can anybody help me understand why the cluster was not serving IO
>>> during the enclosure failure, or is this a bug?
>>>
>>> -Thanks & regards,
>>> Mallikarjun Biradar
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
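For what it's worth: with size 2, min_size 1 and a chassis-level failure domain, losing one whole chassis should still leave one replica of every PG on the surviving chassis, so IO would normally continue once peering finishes. A quick sketch of how to double-check the settings discussed in the thread (the pool name rbd is only a placeholder; take real names from "ceph osd lspools"):

  ceph osd pool get rbd size       # replica count (2 in this setup)
  ceph osd pool get rbd min_size   # replicas a PG needs in order to accept IO (1 here)
  ceph osd crush rule dump         # confirm the rule does chooseleaf over type "chassis"
  ceph osd tree                    # verify every host sits under the intended chassis bucket

If those all look right, a prolonged IO pause after a chassis failure points more towards PGs stuck in peering (as in the status at the top of this message) than towards missing replicas.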