So, I decided to update my system to the next version of Ubuntu, 13.04 "Raring Ringtail", this time less than a week after it was released. Usually I wait a few weeks for the bugs to be worked out. There are always a few bugs that slip through, because the developers are on a strict release schedule and do not have enough beta testing resources available to them (nobody has enough, really).
When I got my new computer this year I installed Ubuntu 12.10 "Quantal Quetzal" and I had zero problems, not even with my Nvidia graphics card. Lots of people had problems when upgrading: apt-get had a bug of some kind that either installed the wrong driver or disabled a working driver, or some such thing. I did a clean install (since I had a brand-new blank hard disk), simply selected the Nouveau driver, and it just worked as though by magic.
I have been updating Ubuntu since about version 10.04 "Lucid Lynx" and it had never given me a problem. Not this time; this time it was a bit of a fiasco.
I do accept partial responsibility. After all, I was using the B-Tree File System "btrfs", a new file system that is clearly marked "Experimental" in the Linux kernel. I am sure the Ubuntu developers tested the upgrade on Btrfs and it worked fine. But they didn't test it in a low disk space situation, and that is where they are responsible for the minor disaster that ensued.
What I Learned
- "Nouveau" has a U after the "No" (seriously, I couldn't find it because I was misspelling it in the package search!).
- Btrfs file systems live much more comfortably in a single large partition that takes up your whole disk. I had split my disk into a "root" and "home" partition (as always) but formatted both as Btrfs. Don't do that.
- Reformatting a partition changes its UUID, and this can confuse Grub and make your system unbootable.
- The /tmp directory is no longer a RAM disk "tmpfs", for various reasons which I don't understand. However, it still needs to be world-writable with the sticky bit set. If it is not, all kinds of problems will crop up in most applications, because that is the directory used by system calls to create temporary files, which means all applications are built on the assumption that /tmp is a world-writable directory with the sticky bit set.
Too Long; Didn't Read
In short, Ubuntu's do-release-upgrade program detects a Btrfs file system and intelligently installs the system updates into a separate subvolume so you can easily roll back if the system update fails. However, since my root partition was too small, the new subvolume containing the updated system filled up the entire partition, which broke the update process.
I fixed the problem by reformatting my root partition to an Ext4 file system (which changed the UUID of that file system), then I restored the previous system from a tar archive backup of the root and boot file systems. But I had to update /boot/grub/grub.cfg and /boot/initrd.img because these files (restored from the backup) contain references to the old UUID of the root file system. The initrd.img contains a RAM-disk file system which mounts the root and home file systems, and the /etc/fstab file in this initrd.img mounts Root and Home by referring to their UUID.
Finally, once the old operating system was bootable and running again, I ran do-release-upgrade once more, and this time it worked as expected. It did not create any Btrfs subvolumes, as it was just an ordinary Ext4 file system; it installed the update by overwriting the previous system, which doesn't take up as much disk space and could therefore be done within the limited space of my root partition without incident.
So what happened?
Well, first off, I backed up my existing system:
tar czvf "$HOME/sys-backup.tgz" /boot /etc /lib /bin /sbin /usr /var /lib64 /srv /vmlinuz /vmlinuz.old /initrd.img /initrd.img.old
Then I made sure I had a USB memory stick with the Ubuntu Live install image installed onto it with USB Startup-Disk Creator. This is just common sense: if anything goes wrong, you need a backup and a live operating system so you can at least boot your computer to the point where you can copy the backup.
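A backup is only useful if it can be read back, so it may be worth listing the archive right after creating it. The following is a minimal sketch assuming GNU tar; backup_paths is a hypothetical helper name of my own, not something Ubuntu ships.

```shell
#!/bin/sh
# Hypothetical helper: archive the given paths, then list the archive
# to confirm that tar can read it back from start to finish.
backup_paths() {
    archive=$1
    shift
    tar czf "$archive" "$@" && tar tzf "$archive" > /dev/null
}

# Example (run as root so every system file is readable):
# backup_paths "$HOME/sys-backup.tgz" /boot /etc /lib /bin /sbin /usr /var
```

The function returns a non-zero status if either creating or listing the archive fails, so it can be used in a condition before starting the upgrade.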
Then I ran the command do-release-upgrade. It downloaded the updated packages and began installation, and then failed about half-way through with the message "no space left on device." Well, I had this problem with my tiny old laptop. I remember running do-release-upgrade and watching the remaining disk space disappear, waiting with bated breath as the update ran, praying to the disk gods that I wouldn't run out of space, which would cause the update to fail.
When I got my new computer I doubled the size of my root partition so I would never have to worry about free space while upgrading again. Imagine how enraged I was when I saw the words "no space left on device." I checked the disk space, there was still 15GB remaining! What was going on?
So I am not defeated yet. I can just reboot into the Live USB system and figure out what is wrong. Once the Live USB system was up and running, I mounted Root and immediately noticed that there weren't any files in the root. There was instead what looked like two directories, one called @ and another called @apt-snapshot-release-upgrade-raring-2013-04-29. Aha! Btrfs is the culprit here.
Less than five minutes of Googling later, I see what happened. Btrfs is designed to take up an entire disk with a single large partition. Instead of partitioning, you create "subvolumes", which can very easily be frozen into snapshots and reverted without rebooting, backed up, transferred to other media, added to RAID volumes, and all kinds of handy things.
So what happened? Ubuntu's do-release-upgrade saw I was using Btrfs and very wisely created a separate Btrfs "subvolume" for installing the new system, the idea being that if something went wrong, you could easily revert to your old system. Unfortunately, I learned this the hard way.
THE PROBLEM IS that since my root file system was so small, creating a second subvolume just ate up all the remaining space until it failed. I had 15 gigabytes remaining of a 30 gigabyte partition. 15 just wasn't enough space for a copy of the previous file system plus the new file system with all the downloaded package files.
Ubuntu developers should have tested this do-release-upgrade scheme on smaller volumes. 20 or 30 gigabytes would be good. Anyway, the reasoning for creating a separate subvolume was intelligent, even if it was prone to fail on smaller file systems, which is unfortunately how I had very intentionally set up my system. Had I known at the time that Ubuntu would detect a Btrfs file system and behave differently, and that making a separate subvolume was Ubuntu's strategy for easily undoing an install gone bad, I would have simply looked up how to revert the system. Instead, what I did was much more foolhardy: I went into the @apt-snapshot-release-upgrade-raring-2013-04-29 directory and typed rm -Rf *. Imagine my surprise when I saw the rm command fail with the error message "no space left on device." How can it take space to remove something?
Well, it turns out, when a Btrfs volume runs out of space, it treats this situation as a catastrophic failure and simply freezes up. When I say "freezes up", I don't mean it freezes the computer, I mean it just freezes all the data: it doesn't let you touch anything. It might make more sense to set a "read-only" flag or something so rm returns a "read-only volume" error instead of a "no space left on device" error. But for whatever reason, Btrfs decides to defend your data by returning a "no space left on device" error message for anything you try to change, even removing files. Apart from reading files, you cannot touch it.
So I am like, "screw this," and umount /mnt/rootfs ; mkfs.ext4 /dev/sda2. *Poof*, reformatted, no more Btrfs, now things will go back to making sense.
Next I revert the system using the tar backup I created before this little glitch. Very simple, just go to the /mnt/rootfs mount point and tar xzvf /mnt/home/@home/ramin/sys-backup.tgz (my Home volume was also Btrfs).
In short, I just did what Btrfs was supposed to do for me, revert from a backup, except I didn't have to look up any commands to do it. I did it entirely using commands I already knew... except for one little problem: now my root file system has a different UUID because I had reformatted it, but the UUIDs recorded in the /boot/grub/grub.cfg and /boot/initrd.img files still refer to the old file system's UUID.
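One way to see which restored files still mention the old UUID is to search for it directly. This is only a sketch: files_with_uuid is a hypothetical helper, the UUID in the example is a placeholder, and checking /etc/fstab on the restored root is my own assumption about where else a UUID might be recorded.

```shell
#!/bin/sh
# Hypothetical helper: print every given file that still mentions a UUID.
# grep -l lists matching file names; -F treats the UUID as a fixed string.
files_with_uuid() {
    uuid=$1
    shift
    grep -lF -- "$uuid" "$@" 2>/dev/null
}

# Example from a live session, with the restored root mounted at /mnt/rootfs.
# The UUID below is a placeholder, not a real one.
# files_with_uuid 0123abcd-ef45-6789-0abc-def012345678 \
#     /mnt/rootfs/boot/grub/grub.cfg /mnt/rootfs/etc/fstab
# The new UUID of the reformatted partition can be read with:
#     blkid -s UUID -o value /dev/sda2
```

Any file this prints will need to be regenerated or edited before the system can boot from the reformatted partition.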
So I reboot, thinking everything might go back to normal, and if not, I will just reinstall Grub. Well, it wasn't my day. As I partially expected but was really hoping wouldn't happen, Grub starts complaining that it can't find the disk that has the Linux kernel. It was supposed to be on a disk with UUID=0123abcd-ef45-6789-0abc-def012345678, and there aren't any partitions with that UUID. So I realize my mistake: I should have fixed the Grub config file after I reformatted my root file system. So I go back into the Live USB system, do the chroot thing and run the grub-install command and...
Embedding is not possible. GRUB can only be installed in this setup by using blocklists. However, blocklists are UNRELIABLE and their use is discouraged.
I forgot to use the "--force" command line option to ignore this error. But at this point I am frustrated and have forgotten everything I know about Grub, because I only have to work with it once a year or so when something really goes wrong and I am in a hurry to try and fix things and it never works right the first time, and one year is enough time for me to forget everything I learned, so I have to go back to the manuals and start reading it all over again to figure out how to fix things.
Fortunately, I did not use the "--force" option (sorry Yoda) and I instead started freaking out, shouting at my computer. Had I not done that, it would not have occurred to me that it wasn't necessary to reinstall the boot loader. I just needed to rebuild the /boot/grub/grub.cfg file, which is done simply by the chroot trick and running the grub-mkconfig -o /boot/grub/grub.cfg command.
So that solved that problem, and the kernel now loads and begins booting, but it fails to boot all the way and kicks me into a recovery shell. Why? The initrd.img contains a file system which contains an /etc/fstab file which has not been updated to reflect the modified UUIDs, which means the initial RAM file system that mounts all the other file systems, including Root and Home, cannot find Root because the UUID is wrong. So I need to rebuild the initrd.img with the update-initramfs -u command.
But at this point I say to myself "screw it" again and go back into the Live USB system. I reformat the root file system again and do a fresh install of Ubuntu 13.04 from the Live USB system installer. I reboot and it is ready to go, except not. When I log in, the screen goes dark and eventually gnome-session crashes and kicks me back into unity-greeter. So I say "screw it" again, and install Xfce4, which I really love, but not as much as Unity.
So I set about tweaking my Xfce4 desktop environment so it works just right, trying to figure out whether or not it is working with the Nouveau graphics driver, trying to get the audio to work when I play a YouTube video. Then something occurs to me. The reason I am using Ubuntu in the first place is because everything just works, or that was how it is supposed to be. Looking through all the little details of Xfce4, seeing how many things I have to install and tweak, trying to get the audio to work, trying to get the Nvidia drivers to work, this is what I did in grad school, this isn't what a busy professional should be doing, tinkering with the stuff in his computer that should "just work." I need to get a clean Ubuntu installation working with all the defaults and proprietary drivers installed all in one go without any tinkering or hassle, like it should.
And it had worked before! Ubuntu was working, the Nouveau drivers were running smoothly, the audio worked without me giving it a second thought. Why couldn't I get it to work this time? I shouldn't settle for just "whatever I can get working," no matter how much fun it is to tinker with Xfce4. I need to do this properly.
So I back up my system again (in case I should regret my next move) and erase the entire Root and Boot file systems. Then I go back to the sys-backup.tgz archive I made before this fiasco began and unarchive it, putting everything back the way it was. I reboot into a recovery shell, chroot, and install the new /boot/grub/grub.cfg. For good measure, I also run update-initramfs -u to make sure the /etc/fstab in the initial RAM file system also uses the correct UUIDs for the reformatted root file system.
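The "chroot trick" can be sketched roughly as follows, assuming a live USB session with the restored root file system already mounted at /mnt/rootfs. The run and repair_boot helpers and the DRY_RUN switch are hypothetical names of my own; only mount, chroot, grub-mkconfig and update-initramfs are real commands. If /boot lives on its own partition, it would also have to be mounted under the root mount point first.

```shell
#!/bin/sh
# Sketch of the chroot repair, meant to be run as root from a live session.
# DRY_RUN defaults to 1, so commands are only printed unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# root_mnt is where the restored root file system is already mounted.
repair_boot() {
    root_mnt=$1
    # The chroot needs the kernel's virtual file systems to be visible.
    for fs in dev proc sys; do
        run mount --bind "/$fs" "$root_mnt/$fs" || return 1
    done
    # Regenerate the Grub menu and the initial RAM disk inside the chroot.
    run chroot "$root_mnt" grub-mkconfig -o /boot/grub/grub.cfg
    run chroot "$root_mnt" update-initramfs -u
    for fs in sys proc dev; do
        run umount "$root_mnt/$fs"
    done
}

# Example: DRY_RUN=0 repair_boot /mnt/rootfs
```

Leaving DRY_RUN at its default prints the eight commands without touching anything, which is a cheap way to double-check the mount point before doing it for real.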
And boom, the system is back and breathing, but laboriously. When I log in from the beautiful graphical unity-greeter, the screen blinks, going black for just a moment, then I am back in the unity-greeter. But I can still log in by pressing CTRL-ALT-F1 to switch to a TTY terminal.
Now I have my old system back and apart from the gnome-session everything seems to be running OK. I check the /var/log/Xorg.0.log and Nouveau is back and running smoothly. Now I have an "ext4" root file system instead of "btrfs", which means do-release-upgrade should work more predictably, without creating any subvolumes that eat up all of the space and cause catastrophic installation failures. So I run do-release-upgrade and... finally something works right!
Now I have a functioning Ubuntu 13.04 "Raring Ringtail" installation, and it saw I was using Nouveau and installed the latest Nouveau driver for me, and it installed the proprietary codecs (MP3 and AAC), and it set up the Grub configuration file properly. Everything went smoothly this time.
Except for one thing
When I log in from unity-greeter it still blinks (the screen goes black for just a moment) and then immediately goes back to the unity-greeter. Something is wrong with gnome-session. Of course, often the simplest problems can take the longest time to figure out.
After a lot of Googling, and looking at the /var/log/Xorg.0.log hoping I haven't misread it and hoping Nouveau is actually still working correctly (it isn't just my imagination how nice the unity-greeter looks, is it?), and writing a throw-away Bash script that figures out exactly which log files are updated after a failed login, and running diff on the log files from before and after a gnome-session crash, I notice from /var/log/Xorg.0.log that the session is failing because of two things: PulseAudio cannot create a socket, /tmp/tmp.cwJeZDKn2z, and the X keyboard device was failing because of an error (and I am paraphrasing) "xkbcomp could not compile the key map, possibly due to a mistake in the xkeyboard-config". But to where was xkbcomp compiling its data file? Of course, /tmp. So there are two unrelated systems, both failing due to a similar problem, in this case writing to a file system. That indicates a permissions problem.
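The throw-away script is long gone, but the idea of finding out which log files change after a failed login can be sketched with find and a marker file. This is a reconstruction under my own assumptions rather than the original script, and logs_changed_since is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under a directory that were
# modified after the marker file was last touched.
logs_changed_since() {
    marker=$1
    log_dir=$2
    find "$log_dir" -type f -newer "$marker" 2>/dev/null | sort
}

# Example:
#   touch "$HOME/login-marker"     # just before the failed login
#   ...attempt to log in from unity-greeter, then switch to a TTY...
#   logs_changed_since "$HOME/login-marker" /var/log
```

Copying the listed files before the login attempt and running diff against them afterwards narrows the search down to the new lines only.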
In all the things I had tried in fixing my system I had erased the entire root file system, including the /tmp mount point, and /tmp was not included in my sys-backup.tgz archive. While I was in the recovery shell recovering the old system, I had created an ordinary /tmp directory with ordinary user permissions, and this directory was never replaced during the do-release-upgrade process. So when the system rebooted, the /tmp directory was just a plain old directory with restricted permissions, such that it could only be written to by the root user.
So one final command: sudo chmod 1777 /tmp ; sudo reboot
Then I log in, and I am running Ubuntu 13.04 "Raring Ringtail" as if it had always been that way. You win, game over!
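For next time, the check I wish I had run right after restoring the backup looks something like this. It is a sketch that assumes GNU stat (as on Ubuntu), and ensure_sticky_tmp is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: make sure a directory is world-writable with the
# sticky bit set (mode 1777), which is what /tmp is expected to be.
# Uses GNU stat; -c '%a' prints the permission bits in octal.
ensure_sticky_tmp() {
    dir=$1
    mode=$(stat -c '%a' "$dir") || return 1
    if [ "$mode" != 1777 ]; then
        echo "$dir has mode $mode, changing it to 1777"
        chmod 1777 "$dir"
    fi
}

# Example (as root): ensure_sticky_tmp /tmp
```

It stays silent when the mode is already correct, so it is harmless to run after every restore.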
What Ubuntu could do better
I wish Ubuntu developers would do just two things to prevent something like this from happening:
- If you detect someone using Btrfs, check how much space is available before creating a subvolume for the distribution upgrade. The upgrade will fail if there is not enough space, and worse yet, this will freeze up the Btrfs volume, which for Btrfs beginners could be very difficult to fix. The 1/3rds rule is a classic engineering heuristic which you would be wise to follow: if the amount of space used by the present operating system installation on the Btrfs partition is more than 1/3rd of all available space, don't create a subvolume for the installation, instead just treat it like you would an Ext file system.
- Run a permissions check on all of the most important files. This includes all the directories in the root file system (especially /tmp), all the program files in /etc /bin /sbin /usr/bin /usr/sbin, and possibly also the "$HOME/.ssh" directories of every user. Then repair the permissions of the files and directories that are not right. This is a simple way to prevent a lot of seemingly very complicated problems.
In fact, I ought to submit that as a bug or feature request to the Ubuntu people directly.
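The one-third check from the first suggestion is simple enough to sketch. This is only an illustration of the heuristic, not how do-release-upgrade actually decides anything, and snapshot_is_safe is a hypothetical name. Real numbers could come from something like df -Pk / (third column used, second column total), though df figures on Btrfs may not tell the whole story.

```shell
#!/bin/sh
# Hypothetical check for the one-third rule: succeed only when the space
# in use is at most one third of the file system's total size.
# Both arguments are sizes in the same unit (for example 1K blocks).
snapshot_is_safe() {
    used=$1
    total=$2
    [ $((used * 3)) -le "$total" ]
}

# Example with the numbers from this post (15 GB used of 30 GB):
if snapshot_is_safe 15 30; then
    echo "snapshot first, then upgrade"
else
    echo "not enough room, upgrade in place"
fi
```

With my partition the check fails, which is exactly the case where the upgrade should have fallen back to overwriting the system in place.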
As for Btrfs...
I am still using it for my /home partition. I am always making backups of this file system, and I think with Btrfs this could become much easier. I will write another blog post if I ever figure out how to make this work for me. So Btrfs stays, and I intend to play with it more.
However, there is no real need to use Btrfs for your root or boot file systems; it is easier to just use the old "Ext4" file system format. This especially goes for a small laptop, unless you intend to run several different operating systems and want to drop in one or the other at a whim, or you intend to make a lot of changes to your system and you need an easy way to undo mistakes without backing up everything by hand using tar. If you must use Btrfs for root, just make sure you allow Btrfs to take up the entire disk, not just a single partition.