On Wed, 2005-05-25 at 19:18, Bryan J. Smith wrote:

> From: Les Mikesell <lesmikesell@xxxxxxxxx>
> > Agreed. But note that the standards are set long before that...
> ??? By "standard", what do you mean? ???

Most of the committee decisions.

> Ironically, being an American company, this is where Red Hat has done
> a phenomenal job of any "western" distro company, IMHO, of pushing
> _hard_ for UTF-8. Red Hat made this major shift with CL3.0 (Red Hat
> Linux 8.0), which then went into EL3, which was based on CL3.1 (Red
> Hat Linux 9). Typical bitching and moaning was present, condemnations
> of both Perl maintainers of 5.8 and Red Hat Linux 8.0, etc...

I guess I could narrow my complaint down to the specific Red Hat policy
of not shipping version-number upgrades of applications within their own
distribution versions. In this instance, they built OS versions with
options that required UTF-8 (etc.) character set support along with a
perl version that didn't handle it correctly (which I can understand,
because that was the best they could do at the time), then did *not*
provide updates to those broken distributions to perl 5.8.3+, which
would have fixed them in RH 8.0 -> RHEL3.x, but instead expected users
to move to the next RH/Fedora release, which introduces new broken
things. Maybe the problems have been fixed in recent backed-in patches
to RHEL3/CentOS 3, but I don't think so.

> > Or you need mysql 4.x for transactions.
> Again, there is a licensing issue with MySQL 4 that MySQL AB introduced
> that most people, short of Red Hat, are ignoring.

But CentOS 4 includes it, and I assume RHEL4 does too. We've already
covered why it isn't reasonable to run those. But why can't there be an
application upgrade to 4.x on a distribution that is usable today, and
one that will continue to keep itself updated with a stock 'yum update'
command? I think this is just a policy issue, not based on any
practical problems.

> > I guess my real complaint is that the bugfixes and improvements to
> > the old releases that you are forced by the situation to run are
> > minimal as soon as it becomes obvious that the future belongs to
> > a different version - that may be a long time in stabilizing to
> > a usable point.
> So what's your solution?

Allow multiple versions of apps in the update repositories, I think.
Why can't we explicitly update to an app version beyond the stock
release if we want it, and then have yum (etc.) track that instead of
the old one? If I had the perl, mysql, and dovecot versions from
CentOS 4 backed into CentOS 3, I'd be happy for a while. I know it
wouldn't be horribly hard to do this myself, but I really hate to break
automatic updates and introduce problems that may be unique to each
system.

> > It will be interesting to see how this works out for Ubuntu. I think it
> > would be possible to be more aggressive with the application versions
> > but hold back more than RH on the kernel changes.
> And that's the real question. What is the "best balance"?

I'd be extremely conservative about changes that increase the chances
of crashing the whole system (i.e. kernel, device drivers, etc.) and
stay fairly close to the developer's version of applications that just
run in user mode. Even better, make it easy to pick which version of
each you want, but make the update-tracking system automatically follow
what you picked. Then if you need a 2.4 kernel, perl 5.8.5, and mysql
4.1 in the same bundle, you can have it.
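To be concrete, something like this is the sort of setup I have in
mind - untested, and the repo name, URLs, and package list here are
invented for illustration - where the stock repo stops touching the
apps I want newer and a separate repo carries the later builds, so a
plain 'yum update' follows them from then on:

    # /etc/yum.conf on a CentOS 3 box (sketch only)
    [base]
    name=CentOS 3 base
    baseurl=http://mirror.example.com/centos/3/os/i386/
    # keep the stock repo away from the apps I want newer
    # (per-repo excludes need a recent enough yum)
    exclude=perl* mysql* dovecot*

    # hypothetical repo carrying the backported builds
    [newer-apps]
    name=Newer application versions for CentOS 3
    baseurl=http://mirror.example.com/centos3/newer-apps/

Nothing exotic - the point is that once you opt in, the automatic
update tracking keeps working instead of breaking.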
> > How was the CIPE author supposed to know that what would be
> > released as FC2 would have a changed kernel interface?
> My God, I actually can't believe you could even make such a statement.
> At this point, it's _futile_ to even really debate you anymore, you keep
> talking from a total standing of "assumption" and "unfamiliarity."
> Something a maintainer of a package would not be, unless he honestly
> didn't care.

I'm talking about the CIPE author, who had to be involved to write the
1.6 version, not an RPM maintainer, who probably couldn't have.

> Fedora Core 2 (2004May) was released 6 months _after_ Linux 2.6 (2003Dec).

So how does any of this relate to the CIPE author, who didn't write
CIPE for Fedora and almost certainly didn't have an experimental 2.6
kernel on some unreleased distribution, knowing that CIPE wasn't going
to work? On the other hand, someone involved in building FC2 must have
known, and I don't remember seeing any messages going to the CIPE list
asking if anyone was working on it.

> > He did respond with 1.6 as quickly as could be expected after a
> > released distribution didn't work with it and a user reported
> > problems on the mailing list.
> How can you blame this on distributions? Honestly, I don't see it at all!

Who else knew about the change? Do you expect every author of something
that has been rpm-packaged to keep checking with Linus to see if he
feels like changing kernel interfaces this month, so as not to disrupt
the FC release schedule?

> Many drivers were actually _deprecated_ in kernel 2.6, and not built by
> default because people didn't come forward and take ownership. I know,
> the lack of the "advansys.c" SCSI driver got me good. ;->

I can understand people backing away from a changing interface.

> > And if you want the things that work to not change???
> Then you do _not_ adopt the latest distro that just came out -- especially
> not an "early adopter" release. It was well known that Fedora Core 2
> was changing a lot of things, just like SuSE Linux 9.0/9.1.

And, as much as you want this to not be about RH/Fedora policies, you
are then stuck with something unnecessarily inconvenient because of
their policy of not upgrading apps within a release.

> > Where do you find the real info?
> > Is any distro likely to autodetect at bootup in time to cleanly connect
> > the RAID and then keep working?
> When you disconnect a drive from the RAID array, there is no way
> for the software to assume whether or not the drive is any good anymore
> when it reconnects unless you _manually_ tell it (or configure it to always
> assume it is good).

That's not the issue - I don't expect a hot-plugged drive to go into
the raid automatically. I do want it to pair them up on a clean reboot,
as it would if they were both directly IDE-connected. So far nothing
has (the sort of configuration I keep expecting is sketched below).

> This is not a Red Hat issue either.

Isn't it? I see different behavior with Knoppix and Ubuntu. I think
their startup order and device probing is somewhat different.

> > Maybe it matters that the filesystem is reiserfs - I see some bug
> > reports about that, but rarely have problems when the internal IDE
> > is running as a broken mirror.
> Filesystem shouldn't matter, it's a Disk Label (aka Partition Table)
> consideration.

Separate issues - I'm able to use mdadm to add the firewire drive to
the raid and it will re-sync, but if I leave the drive mounted and
busy, every 2.6-kernel-based distro I've tried so far will crash after
several hours.
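What I keep expecting to find, for reference, is something like this in
/etc/mdadm.conf - the UUID here is invented for illustration - so that
a clean reboot re-assembles the mirror by array UUID no matter which
bus or device name the external member shows up on:

    # /etc/mdadm.conf (sketch; UUID invented)
    # consider every partition a candidate member
    DEVICE partitions
    # assemble md0 from whatever devices carry this array UUID,
    # whether the firewire member appears as a different sd*
    # device on this boot or not
    ARRAY /dev/md0 UUID=3aaa0122:29827cfa:5331ad66:ca767371

    # then at boot (or by hand):
    #   mdadm --assemble --scan

The distros I've tried just don't seem to get the firewire drive probed
and the array assembled in the right order for this to work.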
I can get a copy by unmounting the partition, letting the raid resync,
then removing the external drive (being able to take a snapshot offsite
is the main point anyway). I've seen some bug reports about reiserfs on
raid that may relate to the crash problem when running with the raid
active. This didn't happen under FC1, which never crashed between
weekly disk swaps. There could also be some problems with my drive
carriers. A firmware update on one type seems to have changed things,
but none of the problems are strictly reproducible, so it is taking a
long time to pin anything down.

> There were efforts to get cipe to work -- both in FC2 and then in
> Fedora Extras for FC2, but they were eventually dropped because of
> lack of interest by the CIPE developers themselves in getting
> "up-to-speed" on 2.5/2.6.

There's really only one CIPE 'developer', and I don't think he has any
particular interest in any specific distributions. If anyone else was
talking about it, in any place other than the CIPE mailing list, I'm
not surprised that it did not have useful results.

> > The real world is a complicated place. If you want to substitute
> > a different program for an old but non-standard one you need to
> > make sure it handles all the same things.
> But what if those either conflict with standards or are broken?

You use the one that works and has a long history of working until the
replacement handles all the needed operations. A committee decision
isn't always the most reliable way to do something, even if you follow
the latest of their dozens of revisions.

> > Until just recently, star didn't even come close to getting
> > incrementals right, and still, unlike gnutar, requires the runs to
> > be on filesystem boundaries for incrementals to work. And it
> > probably doesn't handle the options that amanda needs. Theory,
> > meet practice.
> You assume "GNU Tar" is the "standard." ;->

No, but I assume that GNU tar will be available anywhere I need it.
Given that I've compiled it under DOS, linked to both an ASPI SCSI
driver and a TCP stack so it could read/feed rsh on another machine,
that seems like a reasonable assumption. I can't think of anything
less likely to work...

> It's not, never has been, and is years late to the POSIX-2001/SUSv3
> party.

So which is more important when I want to read something from my
1990s-vintage tapes?

> As far as not crossing filesystem boundaries, that is easily
> accommodated.

Maybe, maybe not. I always set up backups on filesystem boundaries
anyway, so I can prevent them from wandering into CDs or NFS mounts by
accident, but I can imagine times when you'd want to include them and
still do correct incrementals - which gnutar can (see the sketch
below).

> > It wasn't so much the cloning as it was setting the IP address
> > with their GUI tool (and I did that because when I just made
> > the change to the /etc/sysconfig/network-scripts/ifcfg-xxx file
> > the way that worked in earlier RH versions it sometimes mysteriously
> > didn't work). Then I shipped the disks in their hot-swap carriers to
> > the remote sites to be installed.
> Again, I will repeat that "configuration management" is part of avoiding
> "professional negligence." You should have tested those carriers before
> shipping them out by putting them in another box.

You aren't following the scenario. The drives worked as shipped. They
were running CentOS 3.x, which isn't supposed to have behavior-changing
updates.
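(Going back to the gnutar point above for a second: this is the sort of
run I mean - a sketch, untested here, with invented paths - where the
snapshot file carries the incremental state and crossing mount points
is the default, with --one-file-system as an *option* rather than
star's unconditional requirement:

    # level 0: GNU tar records file state in the snapshot file
    tar --create --listed-incremental=/var/backups/home.snar \
        --file=/dev/nst0 /home

    # later runs against the same snapshot file dump only what
    # changed; keep a copy of the .snar per level if you want
    # distinct levels. Sub-mounts under /home are included
    # unless you explicitly ask otherwise:
    tar --create --listed-incremental=/var/backups/home.snar \
        --one-file-system --file=/dev/nst0 /home

That flexibility is exactly what I'd lose by switching.)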
I did a 'yum update' from the nicely-running remote boxes that didn't
include a kernel and thus didn't do a reboot immediately afterwards. I
normally test on a local system, then one or a few of the remotes, make
sure nothing breaks, then proceed with the rest of the remotes. So,
after all that, I ended up with a flock of running remote boxes that
were poised to become unreachable on the next reboot. And even if I had
rebooted a local box after the corresponding update, it wouldn't have
had the problem, because I would have either installed that one in
place or assigned the IP from its own console after swapping the disk
in.

> > That procedure can't be uncommon.
> No, it's not uncommon, I didn't say that. I'm just saying that vendors
> can't test for a lot of different issues.

But they could at least think about what a behavior change is likely
to do in different situations, and this one is pretty obvious. If eth0
is your only network interface and you refuse to start it at bootup,
remote servers that used to work become unreachable. I do understand
the opposite problem they were trying to fix, where a change in kernel
detection order changes the interface names and has the potential to
make a DHCP server start on the wrong interface, handing out addresses
that don't work. But it's the kind of change that should have come at
a version revision, or along with the kernel with the detection change.

> No offense, but when I ship out components to remote sites, I do my
> due diligence and test for every condition I can think of. But maybe
> I'm anal because I ship things out to places like White Sands, Edwards,
> etc... and techs can't be expected to debug such concepts, so
> I always test with at least a partially replicated environment (which
> would be at least 1 change of physical system ;-).

Note that I did test everything I could, and everything I could have
tested worked, because the pre-shipping behavior was to include the
hardware address in the
/etc/sysconfig/networking/profiles/defaults/xxxx file but to ignore it
at startup. So even when I tested the cloned disks after moving them to
a second box, they worked. The 'partially replicated environment' to
catch this would have had to be a local machine with its IP set while
the drive was in a different box, and then rebooted after installing an
update that didn't require it. I suppose if lives were at stake I might
have gone that far.

> > If I hadn't already known about it by then, or if a lot of the remote
> > servers had been rebooted and came up with no network access, it
> > might have ruined my whole day.
> I think you're assigning blame in the wrong direction. You might think
> that's rude and arrogant, but in reality, if you keep blaming vendors for
> those types of mistakes, you're going to have a lot more of them coming
> your way until you change that attitude. No offense. ;->

You are right, of course. I take responsibility for what happened,
along with credit for catching it before it caused any real downtime
(which was mostly dumb luck from seeing the message on the screen,
because I happened to be at one of the remote locations when the first
one was rebooted for another reason). Still, it gives me a queasy
feeling about what to expect from vendors - and I've been burned in the
other direction too, by not staying up-to-the-minute with updates, so
you can't just skip them. Hmmm, now I wonder if the code was intended
to use the hardware address all along but was broken as originally
shipped.
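For anyone following along, the mechanism, as far as I can tell, is the
HWADDR line in the interface config - sketched here with invented
addresses - which the updated initscripts started checking against the
NIC actually present at boot, refusing to bring the interface up on a
mismatch, i.e. exactly when a cloned disk wakes up in a different
chassis:

    # /etc/sysconfig/network-scripts/ifcfg-eth0
    # (sketch; addresses invented for illustration)
    DEVICE=eth0
    ONBOOT=yes
    BOOTPROTO=static
    IPADDR=192.168.1.10
    NETMASK=255.255.255.0
    # once enforced, a mismatch between this and the real NIC
    # keeps eth0 down at boot; delete the line, or edit it to
    # match the new box's card, to get the old behavior back
    HWADDR=00:0C:29:AB:CD:EF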
It would be a bit more comforting if it was included in an update
because someone thought it was a bugfix, instead of someone thinking it
was a good idea to change currently working behavior.

--
Les Mikesell
lesmikesell@xxxxxxxxx