Can you restart the node that failed to complete the upgrade with
"debug filestore = 20" and "debug osd = 20", and post the log after an
hour or so of running? The upgrade process might legitimately take a
while.
-Sam

On Sat, Jul 7, 2012 at 1:19 AM, Yann Dupont <Yann.Dupont@xxxxxxxxxxxxxx> wrote:
> On 06/07/2012 19:01, Gregory Farnum wrote:
>
>> On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont <Yann.Dupont@xxxxxxxxxxxxxx> wrote:
>>>
>>> On 05/07/2012 23:32, Gregory Farnum wrote:
>>>
>>> [...]
>>>
>>>>> OK, so as all the nodes were identical, I have probably hit a btrfs
>>>>> bug (like an erroneous out-of-space condition) at more or less the
>>>>> same time. And when 1 OSD was out,
>>>
>>> Oh, I didn't finish the sentence... When 1 OSD was out, the missing
>>> data was copied to other nodes, probably speeding up the btrfs
>>> problem on those nodes (I suspect erroneous out-of-space conditions).
>>
>> Ah. How full are/were the disks?
>
> The OSD volumes were mostly below 50% full (all are 5 TB volumes):
>
> osd.0 : 31%
> osd.1 : 31%
> osd.2 : 39%
> osd.3 : 65%
> no osd.4 :)
> osd.5 : 35%
> osd.6 : 60%
> osd.7 : 42%
> osd.8 : 34%
>
> All the volumes were using btrfs with lzo compression.
>
> [...]
>
>>>> Oh, interesting. Are the broken nodes all on the same set of arrays?
>>>
>>> No. There are 4 completely independent RAID arrays, in 4 different
>>> locations. They are similar (same brand & model, but slightly
>>> different disks, and 1 different firmware), and all the arrays are
>>> multipathed. I don't think the RAID arrays are the problem. We have
>>> been using these particular models for 2-3 years, and in the logs I
>>> don't see any problems that could be caused by the storage itself
>>> (like SCSI or multipath errors).
>>
>> I must have misunderstood then. What did you mean by "1 Array for 2 OSD
>> nodes"?
>
> I have 8 OSD nodes, in 4 different locations (several km apart). In
> each location I have 2 nodes and 1 RAID array.
> At each location, the RAID array has 16 2 TB disks and 2 controllers
> with 4x 8 Gb FC channels each. The 16 disks are organized in two RAID 5
> sets (8 disks for one, 7 for the other). Each RAID set is primarily
> attached to 1 controller, and each OSD node at that location has access
> to the controller via 2 distinct paths.
>
> There was no correlation between the failed nodes and the RAID arrays.
>
> Cheers,
>
> --
> Yann Dupont - Service IRTS, DSI Université de Nantes
> Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@xxxxxxxxxxxxxx
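For reference, a minimal sketch of where Sam's two debug settings would
go on the affected node before restarting its ceph-osd daemon; the
"osd.3" used further below is only a placeholder for whichever OSD
failed to finish the upgrade:

    # /etc/ceph/ceph.conf on the affected node
    [osd]
        ; verbose logging for the upgrade debugging -- the log grows quickly
        debug osd = 20
        debug filestore = 20

With the sysvinit script Ceph ships, restarting just that daemon would
then look something like "service ceph restart osd.3" (or
"/etc/init.d/ceph restart osd.3"), and the log to post afterwards is
that OSD's log under /var/log/ceph/.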
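On the suspected bogus out-of-space: one classic way btrfs returns
ENOSPC while df still shows plenty of free space is its metadata chunks
filling up, so a quick sanity check on each affected volume is btrfs'
own accounting (the mount point below is just an example path):

    # compare data vs. metadata allocation on an OSD's btrfs volume
    btrfs filesystem df /srv/osd.3

If the Metadata line is close to full while Data still has plenty of
room, that would point at btrfs accounting rather than at the RAID
arrays.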