2.2. System Recovery

When a system crashes and fails to restart, the normal boot process has to be altered. This section describes solutions to problems that can occur at the following stages of the boot process.

Booting Stage | Type of error | Suggested Solution
INIT | Corrupt root filesystem or a faulty /etc/fstab entry | Use the root login prompt
INIT | A kernel module fails to load or an RC script fails | Override INIT (see below Section 2.2.1, “Overriding the INIT stage”) or use an alternative runlevel
KERNEL | Kernel panic | See below Section 2.2.2, “Errors at the end of the kernel stage”
KERNEL | Hardware initialisation errors (often with older kernels on the latest motherboards) | Pass an appropriate bootloader parameter, e.g. acpi=off. See Section 2.2.4, “Bootloader Kernel Parameters”
BOOTLOADER | Not installed or broken | Use a rescue disk or a boot disk (see below Section 2.2.3, “Misconfigured Bootloaders”)

2.2.1. Overriding the INIT stage

This is necessary if the boot process fails because of a faulty init script. Once the kernel has located the root filesystem it attempts to run /sbin/init. The kernel can instead be instructed to run a shell, which gives us access to the system before any services are started.

At the LILO or GRUB boot prompt add the following kernel parameter:

init=/bin/bash

At the end of the kernel boot stage you should get a bash prompt. Read-write access to the root filesystem is achieved with the following:

mount /proc
mount -o remount,rw /

2.2.2. Errors at the end of the kernel stage

  • If the kernel can't mount the root filesystem it will print the following message:

    Kernel panic: VFS: Unable to mount root fs on 03:05
    

    The number 03 is the major number for the first IDE controller, and 05 is the 5th partition on the disk. The problem is that the kernel is missing the proper modules to access the disk.

    We need to boot the system using an alternative method. The fix then involves creating a custom initrd and using it for the normal boot process.

    Question: In the case above since the drive isn't a SCSI drive what could have caused the problem?

  • If the wrong root device is passed to the kernel by the boot loader (LILO or GRUB) then the INIT stage cannot start, since /sbin/init will be missing:

    Kernel Panic: No init found. Try passing init= option to kernel
    

    Again we need to boot the system using a different method, then edit the bootloader's configuration file (telling the kernel to use another device as the root filesystem), and reboot.
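
The device pair printed in the first panic message can be decoded by hand, but a small helper makes the numbering explicit. The following is a minimal sketch, not part of any standard tool: the function name decode_rootdev is hypothetical, the pair is assumed to be hexadecimal in the MAJOR:MINOR form shown above, and only the classic IDE (major 3 and 22) and SCSI (major 8) numbering is handled.

```shell
#!/bin/sh
# decode_rootdev MAJOR:MINOR
# Hypothetical helper: turns the hexadecimal pair from the panic message
# into a classic device name. Runs in a subshell so no variables leak.
decode_rootdev() (
    major=$((0x${1%%:*}))    # text before the colon, read as hex
    minor=$((0x${1##*:}))    # text after the colon, read as hex
    case $major in
        3)  prefix=hd; letters="a b"; per_disk=64 ;;   # first IDE controller
        22) prefix=hd; letters="c d"; per_disk=64 ;;   # second IDE controller
        8)  prefix=sd; letters="a b c d e f g h i j k l m n o p"; per_disk=16 ;;
        *)  echo "unhandled major number $major" >&2; exit 1 ;;
    esac
    index=$((minor / per_disk))   # which disk on this major
    part=$((minor % per_disk))    # partition number, 0 means the whole disk
    set -- $letters
    [ "$index" -lt "$#" ] || exit 1
    shift "$index"
    if [ "$part" -eq 0 ]; then
        echo "/dev/$prefix$1"
    else
        echo "/dev/$prefix$1$part"
    fi
)

decode_rootdev 03:05    # prints /dev/hda5
```

Knowing the device name makes it easier to check whether the bootloader was given the wrong root device or whether the kernel simply lacks the driver for it.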

In both scenarios above a rescue disk isn't always necessary. Often it is enough to boot with a properly configured kernel. But what happens if we don't have that option? What if the bootloader was reconfigured with the wrong kernels, with no initial root disks, or with the wrong root filesystem to mount?

This leads us to the next possible cause of booting problems.

2.2.3. Misconfigured Bootloaders

At this stage we need to use a rescue method to boot the system.

2.2.3.1. Using a rescue disk

We already know from 101 that any Linux distribution CD can be used to start a system in rescue mode. The advantage of these CDs is that they work on any Linux system.

The rescue process can be broken down into the following steps:

  1. Boot from the CD and find the appropriate option (often called `rescue' or `boot an existing system')

  2. In most cases the root device for the existing system is automatically detected and mounted on a subdirectory of the initial root device (in RAM)

  3. If the mount point is called /system it can become the root of the filesystem for our current shell by typing:

    chroot /system
    
  4. At this stage the entire system is available and the bootloader can be fixed
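
The steps above can be sketched as a short script. This is only an illustration under several assumptions: the system uses LILO, the rescue environment mounted the installed system on /system, and /proc has to be mounted inside it before the bootloader tools will work. The helper names run and repair_bootloader are hypothetical. By default the script only prints the commands; it executes them when DRY_RUN=0 is set and it is run as root from the rescue shell.

```shell
#!/bin/sh
# run CMD ARGS...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# repair_bootloader SYSROOT: reinstall LILO from inside the mounted system.
repair_bootloader() {
    sysroot=$1
    run mount -t proc proc "$sysroot/proc"   # assumption: tools need /proc
    run chroot "$sysroot" /sbin/lilo -v      # rerun the map installer
    run umount "$sysroot/proc"
}

repair_bootloader /system
```

Passing a command to chroot, as done here, runs just that command inside the new root instead of starting an interactive shell.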

When a bootloader is misconfigured one can use an alternative bootloader (on a floppy or a CD). This bootloader will load a kernel which is instructed to use the root device on the hard drive.

This method is called a boot disk and is used to recover a specific system.

2.2.3.2. Custom Boot Disk 1

All we need is a floppy with a Linux kernel image that can boot, and this image must be told where to find the root device on the hard drive.

Assuming that we are using a pre-formatted DOS floppy, the following creates a bootable floppy which will launch a Linux kernel image:

dd if=/boot/vmlinuz of=/dev/fd0

Next, rdev is used to tell the kernel where the root device is. The command must be run on the system we wish to protect, and the floppy with the kernel must be in the drive:

rdev /dev/fd0 /dev/hda2
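
A raw copy with dd only produces a usable disk if the kernel image fits on it. The following check is a minimal sketch, assuming a standard 1.44 MB floppy (1440 KiB, that is 1474560 bytes). The function name fits_on_floppy is hypothetical, and the kernel path defaults to the one used above.

```shell
#!/bin/sh
FLOPPY_BYTES=1474560    # 1440 KiB, the capacity of a 1.44 MB floppy

# fits_on_floppy FILE: succeeds if FILE is readable and small enough.
fits_on_floppy() {
    [ -r "$1" ] || return 2
    size=$(wc -c < "$1")
    [ "$((size))" -le "$FLOPPY_BYTES" ]
}

kernel=${1:-/boot/vmlinuz}
if fits_on_floppy "$kernel"; then
    echo "$kernel fits, dd if=$kernel of=/dev/fd0 can proceed"
else
    echo "$kernel is too large (or unreadable) for a 1.44 MB floppy"
fi
```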

2.2.3.3. Custom Boot Disk 2

The syslinux package installs a binary called syslinux that can be used to create bootable floppies. The procedure (taken from the package's documentation) is as follows:

  1. Make a DOS bootable disk. This can be done either by specifying the /s option when formatting the disk in DOS, or by running the DOS command SYS (this can be done under DOSEMU if DOSEMU has direct device access to the relevant drive):

    format a: /s
    

    or

    sys a:
    
  2. Boot Linux. Copy the DOS boot sector from the disk into a file:

    dd if=/dev/fd0 of=dos.bss bs=512 count=1
    
  3. Run syslinux on the disk:

    syslinux /dev/fd0
    
  4. Mount the disk and copy the DOS boot sector file to it. The file must have extension .bss:

    mount -t msdos /dev/fd0 /mnt
    cp dos.bss /mnt
    
  5. Copy the Linux kernel image(s), initrd(s), etc to the disk, and create/edit syslinux.cfg and help files if desired.

    For example if your root device is /dev/sda1 then syslinux.cfg would be:

    DEFAULT linux
    LABEL linux
    	KERNEL vmlinuz
    	APPEND initrd=initrd.img root=/dev/sda1
    

    then

    cp /boot/vmlinuz /mnt
    cp /boot/initrd.img /mnt
    
  6. Unmount the disk (if applicable.)

    umount /mnt
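
The configuration file from step 5 can also be generated rather than typed by hand, which helps when the same boot disk layout is reused for systems with different root devices. This is a minimal sketch: the function name write_syslinux_cfg is hypothetical, and the kernel and initrd file names are the ones used in the example above.

```shell
#!/bin/sh
# write_syslinux_cfg ROOTDEV: print a syslinux.cfg for the given root device.
write_syslinux_cfg() {
    rootdev=$1
    cat <<EOF
DEFAULT linux
LABEL linux
  KERNEL vmlinuz
  APPEND initrd=initrd.img root=$rootdev
EOF
}

write_syslinux_cfg /dev/sda1
```

While the floppy is still mounted on /mnt, the output would be redirected into place with write_syslinux_cfg /dev/sda1 > /mnt/syslinux.cfg.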
    
Note

Although syslinux can be installed on a CD it is recommended to use the isolinux bootloader instead (see Section 4.4.2, “Alternatives without disk emulation”).

2.2.4. Bootloader Kernel Parameters

load_ramdisk=n | If n is 1 then load a ramdisk; the default is 0
prompt_ramdisk=n | If n is 1, prompt to insert a floppy disk containing a ramdisk
nosmp or maxcpus=N | Disable SMP or limit the number of CPUs
apm=off | Disable APM, sometimes needed to boot on motherboards that are not yet supported
init= | Defaults to /sbin/init but may also be a shell or an alternative process
root= | Set the root filesystem device (can be set with rdev[a])
mem= | Assign available RAM size
vga= | Change the console video mode (can be changed with rdev[a])

[a] The rdev manual pages say: "The rdev utility, when used other than to find a name for the current root device, is an ancient hack that works by patching a kernel image at a magic offset with magic numbers. It does not work on architectures other than i386. Its use is strongly discouraged. Use a boot loader like SysLinux or LILO instead"
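
After booting with extra parameters it is useful to confirm what the kernel actually received. On a running system the command line is available in /proc/cmdline. The following is a minimal sketch of a lookup helper; the name kernel_param is hypothetical, and returning the last occurrence of a parameter is an assumption made here, not documented kernel behaviour.

```shell
#!/bin/sh
# kernel_param NAME [CMDLINE]
# Prints the value of NAME from CMDLINE (default: /proc/cmdline).
# Exits 0 if the parameter is present, 1 otherwise. A bare flag such as
# nosmp is reported as present with an empty value.
kernel_param() (
    name=$1
    cmdline=${2-$(cat /proc/cmdline)}
    set -f                  # do not glob-expand words from the command line
    found=1
    value=
    for word in $cmdline; do
        case $word in
            "$name")   found=0; value= ;;
            "$name"=*) found=0; value=${word#*=} ;;
        esac
    done
    [ "$found" -eq 0 ] && echo "$value"
    exit "$found"
)

kernel_param init "root=/dev/hda2 ro init=/bin/bash acpi=off"    # prints /bin/bash
```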

2.2.5. Troubleshooting LILO

When installing LILO the bootloader mapper, /sbin/lilo, will back up the existing bootloader.

For example if you install LILO on a floppy, the original bootloader will be saved to /boot/boot.0200.

Similarly when changing the bootloader on an IDE or a SCSI disk the files will be called boot.0300 and boot.0800 respectively. The original bootloader can be restored with:

lilo -u
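
The names of the backup files follow from the device numbers: the three examples above are consistent with a two-digit hexadecimal major number followed by a two-digit hexadecimal minor number. The following is a minimal sketch under that assumption; the function name lilo_backup_name is hypothetical.

```shell
#!/bin/sh
# lilo_backup_name MAJOR MINOR: print the expected backup file name
# for the device with the given (decimal) major and minor numbers.
lilo_backup_name() {
    printf '/boot/boot.%02X%02X\n' "$1" "$2"
}

lilo_backup_name 2 0    # floppy:          prints /boot/boot.0200
lilo_backup_name 3 0    # first IDE disk:  prints /boot/boot.0300
lilo_backup_name 8 0    # first SCSI disk: prints /boot/boot.0800
```

Checking that the expected file exists before running lilo -u avoids attempting a restore when no backup was ever made.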

By default the second stage of LILO is called /boot/boot.b, and when it is successfully loaded it will present you with a boot: prompt.

Here are the possible errors during the boot stage (taken from the LILO README):

nothing

LILO is either not installed or the partition isn't active.

L

The first stage loader has been loaded but the second stage has failed

LI

The second stage boot loader has loaded but was unable to execute.

This can happen if /boot/boot.b was moved and /sbin/lilo wasn't rerun.

LIL

The second stage boot loader has been started, but it can't load the descriptor table from the map file or the second stage boot loader has been loaded at an incorrect address.

This can happen if /boot/boot.b was moved and /sbin/lilo wasn't rerun.

LIL-

The descriptor table is corrupt.

This can happen if /boot/map was moved and /sbin/lilo wasn't rerun.

Scrolling 010101 errors

This happens when the second stage boot loader is on a slave device.
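
The list above can be kept at hand as a small lookup. This is a minimal sketch that only restates the descriptions given in this section; the function name lilo_diagnose is hypothetical and the argument is whatever was left on the screen when the boot stopped.

```shell
#!/bin/sh
# lilo_diagnose OUTPUT: map the partial LILO prompt to its likely cause.
lilo_diagnose() {
    case $1 in
        "")   echo "LILO is not installed or the partition is not active" ;;
        L)    echo "first stage loaded, second stage failed" ;;
        LI)   echo "second stage loaded but unable to execute (was /boot/boot.b moved?)" ;;
        LIL)  echo "second stage started but cannot load the descriptor table (was /boot/boot.b moved?)" ;;
        LIL-) echo "descriptor table is corrupt (was /boot/map moved?)" ;;
        01*)  echo "second stage boot loader is on a slave device" ;;
        *)    echo "no entry for: $1"; return 1 ;;
    esac
}

lilo_diagnose LI
```

In the cases that mention a moved file, the suggested remedy is the same: boot by another method and rerun /sbin/lilo.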