Tuesday, April 30, 2013

Installing Ubuntu 13.04 "Raring Ringtail"

So, I decided to update my system to the latest version of Ubuntu, 13.04 "Raring Ringtail", this time less than a week after it was released. Usually I wait a few weeks for the bugs to be worked out. There are always a few bugs that they let through, because they are on a strict release schedule and do not have enough beta-testing resources available to them (nobody has enough, really).

When I got my new computer this year I installed Ubuntu 12.10 "Quantal Quetzal" and I had zero problems, not even with my Nvidia graphics card. Lots of people had problems when upgrading: apt-get had a bug of some kind that either installed the wrong driver or disabled a working driver, or some such thing. I did a clean install (since I had a brand-new blank hard disk), simply selected the Nouveau driver, and it just worked as though by magic.

I have been upgrading Ubuntu since about version 10.04 "Lucid Lynx" and until now it had never given me a problem. This time it was a bit of a fiasco.

I do accept partial responsibility. After all, I was using the B-Tree File System, "Btrfs", a new file system that is clearly marked "Experimental" in the Linux kernel. And I am sure the Ubuntu developers tested the upgrade on Btrfs and it worked fine. But they didn't test it in a low disk space situation, and that is where they are responsible for the minor disaster that ensued.

What I Learned

  1. "Nouveau" has a U after the "No" (seriously, I couldn't find it because I was misspelling it in the package search!).
  2. Btrfs file systems live much more comfortably in a single large partition that takes up your whole disk. I had split my disk into a "root" and "home" partition (as always) but formatted both as Btrfs. Don't do that.
  3. Reformatting a partition changes its UUID, and this can confuse Grub and make your system unbootable.
  4. The /tmp directory is no longer a RAM disk "tmpfs" for various reasons which I don't understand. However, it still needs to be world-writable with the sticky bit set. Getting this wrong will cause all kinds of problems for most applications, because /tmp is the directory used by the library calls that create temporary files, which means all applications are built on the assumption that /tmp is a world-writable directory with the sticky bit set.

Too Long; Didn't Read

In short, Ubuntu's do-release-upgrade program detects a Btrfs file system and intelligently installs the system updates into a separate subvolume so you can easily roll back if the system update fails. However, since my root partition was too small, the new subvolume containing the updated system filled up the entire partition, which broke the update process.

I fixed the problem by reformatting my root partition to an Ext4 file system (which changed the UUID of that file system), then I restored the previous system from a tar archive backup of the root and boot file systems. But I had to update the /boot/grub/grub.cfg and the /boot/initrd.img because these files (restored from the backup) contain references to the old UUID of the root file system. The initrd.img contains a RAM-disk file system which mounts the root and home file systems, and the /etc/fstab file in this initrd.img mounts Root and Home by referring to their UUID.

Finally, once the old operating system was bootable and running again, I ran do-release-upgrade once more, and this time it worked as expected. It did not create any Btrfs subvolumes, as this was now just an ordinary Ext4 file system. It installed the update by overwriting the previous system, which doesn't take up as much disk space and could therefore be done within the limited space of my root partition without incident.

So what happened?

Well, first off, I backed up my existing system:
tar czvf "$HOME/sys-backup.tgz" /boot /etc /lib /bin /sbin /usr /var /lib64 /srv /vmlinuz /vmlinuz.old /initrd.img /initrd.img.old
and then I made sure I had a USB memory stick with the Ubuntu Live install image written onto it with USB Startup Disk Creator. This is just common sense: if anything goes wrong, you need a backup and a live operating system so you can at least boot your computer to the point where you can copy the backup back.
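A backup is only useful if it can be read back, so it is worth listing the archive before trusting it. The following is a minimal sketch, not part of the original procedure: backup_contains is a made-up helper name, and it assumes GNU tar and an archive created with directory arguments, as in the command above.

```shell
# Sketch: check that a tar backup is readable and has the top-level entries
# you expect. backup_contains is a made-up helper name.
# Usage: backup_contains ARCHIVE PATH [PATH...]
backup_contains() {
    local archive="$1"; shift
    local listing entry missing=0
    # If tar cannot read the archive at all, give up with status 2.
    listing=$(tar tzf "$archive") || return 2
    for entry in "$@"; do
        # tar strips the leading "/" on creation, so /etc is stored as "etc/".
        entry="${entry#/}"
        if ! printf '%s\n' "$listing" | grep -xF -e "$entry" -e "$entry/" >/dev/null; then
            echo "missing: /$entry"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, backup_contains "$HOME/sys-backup.tgz" /boot /etc /usr /var prints nothing and succeeds when all four are present, and prints a "missing:" line for anything that was left out of the archive.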

Then I ran the command do-release-upgrade. It downloaded the updated packages and began installation -- and then failed about half-way through with the message "no space left on device." Well, I had this problem with my tiny old laptop. I remember running do-release-upgrade and watching the remaining disk space disappear, waiting with bated breath as the update ran, praying to the disk gods that I wouldn't run out of space and cause the update to fail.

When I got my new computer I doubled the size of my root partition so I would never have to worry about free space while upgrading again. Imagine how enraged I was when I saw the words "no space left on device." I checked the disk space, there was still 15GB remaining! What was going on?

So I am not defeated yet. I can just reboot into the Live USB system and figure out what is wrong. Once the Live USB system was up and running, I mounted Root and immediately noticed that there weren't any files in the root. There were instead what looked like two directories, one called @, which contained the usual root directories, and another called @apt-snapshot-release-upgrade-raring-2013-04-29. Aha! Btrfs is the culprit here.

Less than five minutes of Googling later, and I see what happened. Btrfs is designed to take up an entire disk with a single large partition. Instead of partitioning, you create "subvolumes", which can very easily be frozen into snapshots and reverted without rebooting, backed up, transferred to other media, added to RAID volumes, and all kinds of handy things.

So what happened? Ubuntu's do-release-upgrade saw I was using Btrfs and very wisely created a separate Btrfs "subvolume" for installing the new system, the idea being that if something went wrong, you could easily revert to your old system. Unfortunately, I learned this the hard way. THE PROBLEM IS that, since my root file system was so small, creating a second subvolume just ate up all the remaining space until it failed. I had 15 gigabytes remaining of a 30 gigabyte partition. 15 just wasn't enough space for a copy of the previous file system plus the new file system with all the downloaded package files.

Ubuntu developers should have tested this do-release-upgrade scheme on smaller volumes. 20 or 30 gigabytes would be good. Anyway, the reasoning for creating a separate subvolume was intelligent, even if it was prone to fail on smaller file systems -- which is unfortunately how I had very intentionally set up my system. Had I known at the time that Ubuntu would detect a Btrfs file system and behave differently, and that making a separate subvolume was Ubuntu's strategy for easily undoing an install gone bad, I would have simply looked up how to revert the system. Instead what I did was much more foolhardy -- I went into the @apt-snapshot-release-upgrade-raring-2013-04-29 directory and typed rm -Rf *. Imagine my surprise when I saw the rm command fail with the error message "no space left on device." How can it take space to remove something?

Well, it turns out, when a Btrfs volume runs out of space, it treats this situation as a catastrophic failure and simply freezes up. When I say "freezes up", I don't mean it freezes the computer, I mean it just freezes all the data -- it doesn't let you touch anything. It might make more sense to set a "read-only" flag or something so rm returns a "read-only volume" error instead of a "no space left on device" error. But for whatever reason, Btrfs decides to defend your data by returning a "no space left on device" error message for anything you try to change, even removing files. Apart from reading files, you cannot touch it.
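For anyone stuck at this point, a couple of read-only commands show what Btrfs itself thinks is going on. This is a minimal sketch, assuming the volume is mounted at /mnt/rootfs as in this post; the helper names run and btrfs_inspect are hypothetical, and by default the commands are only printed (DRY_RUN=1), so nothing is touched until you ask. Btrfs accounts for data and metadata separately, so its own report can look different from what plain df says.

```shell
# Sketch: print (or run) read-only btrfs commands for inspecting a full volume.
# DRY_RUN=1 (the default here) only prints the commands.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: btrfs_inspect MOUNTPOINT
btrfs_inspect() {
    local mnt="$1"
    # Lists subvolumes such as @ and any upgrade snapshots.
    run btrfs subvolume list "$mnt"
    # Shows data and metadata allocation separately.
    run btrfs filesystem df "$mnt"
}
```

Running DRY_RUN=0 btrfs_inspect /mnt/rootfs as root executes the two commands for real. Neither of them modifies the file system.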

So I am like, "screw this," and umount /mnt/rootfs ; mkfs.ext4 /dev/sda2 ; *Poof* reformatted, no more Btrfs, now things will go back to making sense.

Next I revert the system using the tar backup I created before this little glitch. Very simple, just go to the /mnt/rootfs mount point and tar xzvf /mnt/home/@home/ramin/sys-backup.tgz (my Home volume was also Btrfs).

In short, I just did what Btrfs was supposed to do for me, revert from a backup, except I didn't have to look up any commands to do it. I did it entirely using commands I already knew... except for one little problem: now my root file system has a different UUID because I had reformatted it, but the UUIDs recorded in the /boot/grub/grub.cfg and /boot/initrd.img files still refer to the old file system's UUID.
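A quick way to see whether a restored config file still points at the old file system is to count the lines that mention the old UUID. The sketch below is only an illustration: count_uuid_refs is a hypothetical helper, and the device name and paths in the trailing comments are the ones used in this post, so adjust them for your own disk.

```shell
# Sketch: count the lines in a file that still mention a given UUID.
# count_uuid_refs is a hypothetical helper name.
# Usage: count_uuid_refs FILE UUID
count_uuid_refs() {
    local file="$1" uuid="$2"
    # -F treats the UUID as a fixed string, -c prints the number of matching
    # lines. grep exits with 1 when nothing matches, which is not an error here.
    grep -cF -- "$uuid" "$file" || [ $? -eq 1 ]
}

# Possible use from the live system (run as root):
#   blkid -s UUID -o value /dev/sda2      # UUID of the freshly formatted root
#   count_uuid_refs /mnt/rootfs/boot/grub/grub.cfg "$old_uuid"
#   count_uuid_refs /mnt/rootfs/etc/fstab "$old_uuid"
```

Any count above zero means the file was restored with references to a file system that no longer exists.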

So I reboot, thinking everything might go back to normal, and if not, I will just reinstall Grub. Well, it wasn't my day. As I partially expected but was really hoping wouldn't happen, Grub starts complaining that it can't find the disk that has the Linux kernel. It was supposed to be on a disk with UUID=0123abcd-ef45-6789-0abc-def012345678, and there aren't any partitions with that UUID. So I realize my mistake: I should have fixed the Grub config file after I reformatted my root file system. So I go back into the Live USB system, do the chroot thing and run the grub-install command and...

  Embedding is not possible. GRUB can only be installed in this setup by using blocklists.
  However, blocklists are UNRELIABLE and their use is discouraged.
I forgot to use the "--force" command line option to ignore this error. But at this point I am frustrated and have forgotten everything I know about Grub, because I only have to work with it once a year or so, when something really goes wrong and I am in a hurry to try and fix things, and it never works right the first time, and one year is enough time for me to forget everything I learned, so I have to go back to the manuals and start reading all over again to figure out how to fix things.

Fortunately, I did not use the "--force" option (sorry Yoda) and I instead started freaking out, shouting at my computer. Had I not done that, it would not have occurred to me that it wasn't necessary to reinstall the boot loader. I just needed to rebuild the /boot/grub/grub.cfg file, which is done simply by doing the chroot trick and running the grub-mkconfig -o /boot/grub/grub.cfg command.

So that solved that problem, and the kernel now loads and begins booting, but it fails to boot all the way and kicks me into a recovery shell. Why? The initrd.img contains a file system which contains an /etc/fstab file which has not been updated to reflect the modified UUIDs, which means the initial RAM file system that mounts all the other file systems, including Root and Home, cannot find Root because the UUID is wrong. So I need to rebuild the initrd.img with the update-initramfs -u command.
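The "chroot trick" plus the two rebuild commands can be sketched as one sequence. This is a sketch under assumptions, not a tested recovery script: it assumes the root partition is /dev/sda2 mounted at /mnt/rootfs as in this post, that /boot lives on the same partition (a separate Boot partition would need mounting at /mnt/rootfs/boot first), and that the /etc/fstab on the restored root has already been edited to use the new UUID. The helper names run and repair_boot are made up, and by default the commands are only printed.

```shell
# Sketch of the chroot repair from a live system. With DRY_RUN=1 (the
# default) the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Usage: repair_boot ROOT_DEVICE MOUNTPOINT
repair_boot() {
    local dev="$1" mnt="$2" fs
    run mount "$dev" "$mnt"
    # The chroot needs the live system's /dev, /proc and /sys.
    for fs in dev proc sys; do
        run mount --bind "/$fs" "$mnt/$fs"
    done
    # Regenerate grub.cfg so it refers to the new root UUID.
    run chroot "$mnt" grub-mkconfig -o /boot/grub/grub.cfg
    # Rebuild the initial RAM disk image.
    run chroot "$mnt" update-initramfs -u
    # Unmount in reverse order.
    for fs in sys proc dev; do
        run umount "$mnt/$fs"
    done
    run umount "$mnt"
}
```

Calling repair_boot /dev/sda2 /mnt/rootfs prints the ten commands for review. Running it as root with DRY_RUN=0 executes them.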

But at this point I say to myself "screw it" again and go back into the Live USB system. I reformat the root file system again and do a fresh install of Ubuntu 13.04 from the Live USB system installer. I reboot and it is ready to go, except not. When I log in, the screen goes dark and eventually gnome-session crashes and kicks me back into unity-greeter. So I say "screw it" again, and install Xfce4, which I really love, but not as much as Unity.

So I set about tweaking my Xfce4 desktop environment so it works just right, trying to figure out whether or not it is working with the Nouveau graphics driver, trying to get the audio to work when I play a YouTube video. Then something occurs to me. The reason I am using Ubuntu in the first place is because everything just works, or that is how it is supposed to be. Looking through all the little details of Xfce4, seeing how many things I have to install and tweak, trying to get the audio to work, trying to get the Nvidia drivers to work: this is what I did in grad school. This isn't what a busy professional should be doing, tinkering with the stuff in his computer that should "just work." I need to get a clean Ubuntu installation working with all the defaults and proprietary drivers installed all in one go without any tinkering or hassle, like it should.

And it had worked before! Ubuntu was working, the Nouveau drivers were running smoothly, the audio worked without me giving it a second thought. Why couldn't I get it to work this time? I shouldn't settle for just "whatever I can get working," no matter how much fun it is to tinker with Xfce4. I need to do this properly.

So I back up my system again (in case I should regret my next move) and erase the entire Root and Boot file systems. Then I go back to the sys-backup.tgz archive I made before this fiasco began and unarchive it, putting everything back the way it was. I reboot into a recovery shell, chroot, and install the new /boot/grub/grub.cfg. For good measure, I also run update-initramfs -u to make sure the /etc/fstab in the initial RAM file system also uses the correct UUIDs for the reformatted root file system.

And boom, the system is back and breathing, but laboriously. When I log in from the beautiful graphical unity-greeter, the screen blinks, going black for just a moment, then I am back in the unity-greeter. But I can still log in by pressing CTRL-ALT-F1 to switch to a TTY terminal.

Now I have my old system back and apart from the gnome-session everything seems to be running OK. I check the /var/log/Xorg.0.log and Nouveau is back and running smoothly. Now I have an "ext4" root file system instead of "btrfs", which means do-release-upgrade should work more predictably without creating any subvolumes that eat up all of the space and cause catastrophic installation failures. So I run do-release-upgrade and... finally something works right!

Now I have a functioning Ubuntu 13.04 "Raring Ringtail" installation, and it saw I was using Nouveau and installed the latest Nouveau driver for me, and it installed the proprietary codecs (MP3 and AAC), and it set up the Grub configuration file properly. Everything went smoothly this time.

Except for one thing

When I log in from unity-greeter it still blinks (the screen goes black for just a moment) and then immediately goes back to the unity-greeter. Something is wrong with gnome-session. Of course, often the simplest problems can take the longest time to figure out.

After a lot of Googling, and looking at the /var/log/Xorg.0.log hoping I haven't misread it and hoping Nouveau is actually still working correctly (it isn't just my imagination, how nice the unity-greeter looks?), and writing a throw-away Bash script that figures out exactly which log files are updated after a failed login, and running diff on the log files from before and after a gnome-session crash, I notice in the logs that the session is failing because of two things: Pulse Audio cannot create a socket /tmp/tmp.cwJeZDKn2z, and the X keyboard device is failing because of an error (and I am paraphrasing) "xkbcomp could not compile the key map, possibly due to a mistake in the xkeyboard-config". But where was xkbcomp compiling its data file to? Of course, /tmp. So there are two unrelated systems, both failing due to a similar problem, in this case writing to a file system. That indicates a permissions problem.
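The throw-away script itself is long gone, but the idea can be sketched in a few lines. What follows is a reconstruction under assumptions, not the original: logs_changed_since is a made-up name, and it relies on find's -newer test to list files modified after a marker file was created.

```shell
# Sketch: list files under a log directory that were modified after a marker
# file. logs_changed_since is a made-up helper name.
# Usage: logs_changed_since MARKER [LOGDIR]
logs_changed_since() {
    local marker="$1" logdir="${2:-/var/log}"
    # -newer compares each file's modification time with the marker's.
    # Unreadable subdirectories are skipped silently.
    find "$logdir" -type f -newer "$marker" 2>/dev/null | sort
}
```

One way to use it: from a TTY, create a marker with touch "$HOME/login-marker", attempt the graphical login, then run logs_changed_since "$HOME/login-marker" as root to see which logs the failed login touched. The marker goes in the home directory here because, as it turned out, /tmp was the problem.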

In all the things I had tried in fixing my system I had erased the entire root file system, including the /tmp mount point, and /tmp was not included in my sys-backup.tgz archive. While I was in the recovery shell recovering the old system, I had created an ordinary /tmp directory with ordinary permissions, and this directory was never replaced during the do-release-upgrade process. So when the system rebooted, the /tmp directory was just a plain old directory with restricted permissions, such that it could only be written to by the root user.

So one final command:
sudo chmod 1777 /tmp ; sudo reboot ;
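The mode can be checked before and after the fix. A healthy /tmp shows up as drwxrwxrwt in ls -ld /tmp, where the trailing "t" is the sticky bit. Here is a minimal sketch of the same check as a function, assuming GNU stat; has_tmp_mode is a hypothetical name.

```shell
# Sketch: succeed only if a directory has mode 1777 (world-writable plus
# sticky bit). Uses GNU stat; has_tmp_mode is a hypothetical helper name.
# Usage: has_tmp_mode DIR
has_tmp_mode() {
    local dir="$1" mode
    # %a prints the permission bits in octal, e.g. 755 or 1777.
    mode=$(stat -c '%a' "$dir") || return 2
    [ "$mode" = 1777 ]
}
```

For example, has_tmp_mode /tmp || echo "/tmp has the wrong mode" would have flagged the problem right after the restore.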
Then I log in, and I am running Ubuntu 13.04 "Raring Ringtail" as if it had always been that way. You win, game over!

What Ubuntu could do better

I wish Ubuntu developers would do just two things to prevent something like this from happening:

  1. If you detect someone using Btrfs, check how much space is available before creating a subvolume for the distribution upgrade. The upgrade will fail if there is not enough space, and worse yet, this will freeze up the Btrfs volume, which for Btrfs beginners could be very difficult to fix. The 1/3rds rule is a classic engineering heuristic which you would be wise to follow: if the amount of space used by the present operating system installation on the Btrfs partition is more than 1/3rd of all available space, don't create a subvolume for the installation, instead just treat it like you would an Ext file system.
  2. Run a permissions check on all of the most important files. This includes all the directories in the root file system (especially /tmp), all the program files in /etc /bin /sbin /usr/bin /usr/sbin, and possibly also the "$HOME/.ssh" directories of every user. Then repair the permissions of the files and directories that are not right. This is a simple way to prevent a lot of seemingly very complicated problems.
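The one-third heuristic from the first suggestion is simple enough to write down. This is only a sketch of the proposed rule, not how Ubuntu's upgrader actually decides anything; snapshot_is_safe is a hypothetical name, and both numbers must be in the same unit.

```shell
# Sketch of the one-third heuristic: a snapshot-based upgrade is considered
# safe only if the current installation uses at most a third of the volume.
# snapshot_is_safe is a hypothetical helper name.
# Usage: snapshot_is_safe USED TOTAL   (same unit for both, e.g. GB or KiB)
snapshot_is_safe() {
    local used="$1" total="$2"
    # Integer arithmetic: "used <= total / 3" is written as "used * 3 <= total".
    [ $(( used * 3 )) -le "$total" ]
}
```

With the numbers from this post, snapshot_is_safe 15 30 fails, because 15 GB used on a 30 GB partition is well over a third, so under this rule the upgrade would have skipped the subvolume and upgraded in place.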

In fact, I ought to submit that as a bug or feature request to the Ubuntu people directly.

As for Btrfs...

I am still using it for my /home partition. I am always making backups of this file system, and I think with Btrfs this could become much easier. I will write another blog post if I ever figure out how to make this work for me. So Btrfs stays, and I intend to play with it more.

However, there is no real need to use Btrfs for your root or boot file systems. It is easier to just use the old "Ext4" file system format. This especially goes for a small laptop, unless you intend to run several different operating systems and want to drop in one or the other at a whim, or you intend to make a lot of changes to your system and you need an easy way to undo mistakes without backing up everything by hand using tar. If you must use Btrfs for root, just make sure you allow Btrfs to take up the entire disk, not just a single partition.
