Linux: A Network Solution for Your Office




Chapter 21: Diagnosing Your System: Case Studies






Case Studies



A Bad Hard Disk
The Case of the Broken Tape Drive
Swap File Trouble
The Case of the Dead CD-ROM
A Broken BIOS



The remainder of this chapter contains a few real-life cases that
I've encountered recently.

A Bad Hard Disk
Some of the scariest messages that ever appear in a Linux system's
log files are those that indicate read or write errors on your main system disk.
This is precisely what happened on my Linux machine that hosts the online game
MUD2.COM.
The messages in the log files indicated that a specific sector
on the hard drive had a defect. Whenever the system attempted to use this sector
as part of a file, an error occurred. If the file in question was a key system
file, the system crashed.
As a temporary solution, I attempted to mark the bad areas of
the disk using the -c option with e2fsck. However,
this proved to be insufficient; soon, additional defects developed. Because
the system was also used by several users, I feared that the problem might have
more serious consequences if it remained unfixed.
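A minimal sketch of the bad-block scan mentioned above. The function name is only for illustration, and the device in the usage note is hypothetical because the text does not say which partition was affected. The file system should be unmounted, or mounted read-only, while the check runs.

```shell
# Sketch: scan an ext2 partition for bad blocks and record them so the
# file system stops allocating them to files.
# e2fsck's -c option runs badblocks(8) in read-only mode and adds any
# blocks it finds to the file system's bad-block list.
mark_bad_blocks() {
    device="$1"    # partition to check; supplied by the caller
    e2fsck -c "$device"
}
```

It would be run as root, for example as `mark_bad_blocks /dev/hda1`. As the text notes, this only hides existing defects; it does nothing to stop a failing drive from developing new ones.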
I had spare hard drives that I could use with this system, but
there was a catch--whereas the failed drive was an IDE unit, my spare drives
were all SCSI. Therefore, I not only needed to replace the disk, I also had
to update the kernel (the system in question used a custom kernel image with
no SCSI drivers).
First, I created a new kernel image, adding a driver for the SCSI
card I intended to use. With this new kernel image installed, I shut down the
system, inserted the new SCSI card and hard drive (without removing the old
IDE card and drive), and rebooted. Initially, boot messages indicated that the
new SCSI card was not found by the driver. A search on the Internet with
the correct keywords revealed a possible cause, namely that the driver in question
cannot always detect the proper hardware parameters. Armed with this information,
I booted the kernel with a boot parameter that supplied those settings explicitly,
and this time the SCSI card was successfully initialized.
Second, I formatted the new SCSI disk and created a file system on it. I also
reserved space for a swap area.
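The formatting step might look roughly like the sketch below. It assumes the disk has already been partitioned (for example, with fdisk); the function name and the partition names in the usage note are hypothetical.

```shell
# Sketch: put an ext2 file system and a swap area on a freshly
# partitioned disk. Both partition names are passed in by the caller.
prepare_new_disk() {
    root_part="$1"   # partition that will hold the file system
    swap_part="$2"   # partition reserved for swap
    mke2fs "$root_part" &&
    mkswap "$swap_part"
}
```

A call such as `prepare_new_disk /dev/sda1 /dev/sda2`, run as root, erases whatever is on those partitions, so the names are worth checking twice.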
Third, I copied the entire contents of the old IDE drive to the
new SCSI drive. I also used the lilo command with
a special option to initialize the new drive:
lilo -r /mnt

Here, /mnt is the empty directory
I used as the mount point for the newly installed SCSI drive. The -r
option basically tells lilo to pretend that the
system's root directory is the specified directory and perform its operations
there. Therefore, instead of installing a bootable image on the real root drive
/ (which was still the old IDE drive), the command
installed an image on the freshly mounted SCSI drive at /mnt.
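The copy and boot-loader steps can be sketched as two small functions. The text does not say how the files were copied, so `cp -ax` is an assumption here, and the function names and the partition name in the usage note are hypothetical.

```shell
# Sketch: clone the running root file system onto a new disk.
copy_root() {
    new_part="$1"   # partition on the new disk
    target="$2"     # empty mount point (/mnt in the text)
    mount "$new_part" "$target" &&
    # -a preserves ownership, permissions, links, and device nodes;
    # -x stays on the root file system, so /proc and the contents of
    # the target itself are not copied.
    cp -ax / "$target"
}

# Sketch: install the boot loader using the copied tree as its root.
# lilo reads etc/lilo.conf relative to that directory, so the boot=
# and root= entries in the copy should name the new disk first.
install_boot_loader() {
    target="$1"
    lilo -r "$target"
}
```

For example: `copy_root /dev/sda1 /mnt`, then edit `/mnt/etc/lilo.conf` and `/mnt/etc/fstab` to refer to the new drive, then `install_boot_loader /mnt`.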
Now it was time to shut down the system, remove the IDE hardware,
and make an attempt to reboot the system from the SCSI drive. There were a few
glitches, but shortly, the system was up and running with the new drive and
kernel image.
After some tests to ensure that everything worked as expected,
I booted the system in multiuser mode, and it has been working reliably ever
since.

The Case of the Broken Tape Drive
As chance would have it, just as I was writing this chapter, one
Linux system under my care went dead. This is a system located elsewhere, one
that I manage remotely (over the Internet or using a modem). The first indication
of trouble came early one Sunday morning when, instead of receiving the usual
weekly email about a successful backup, I received a boot message instead. (This
system is configured to send me email every time it reboots.)
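One way such a boot notice could be set up is sketched below. The text does not describe the actual mechanism; this assumes a `mail` command is installed and that the function is called from a late boot script such as rc.local. The function name and the address in the usage note are hypothetical.

```shell
# Sketch: send a short notice each time the machine boots.
notify_boot() {
    recipient="$1"   # address to notify; supplied by the caller
    {
        echo "System booted at: $(date)"
        echo "Kernel: $(uname -r)"
    } | mail -s "$(uname -n) rebooted" "$recipient"
}
```

A line such as `notify_boot admin@example.com` near the end of the boot script is enough. Because the message is sent late in the boot sequence, a boot that fails partway through produces no email, which matches what the log files show in this case.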
"What the devil?" I asked. Was it yet another unannounced
power test in the building that houses the system? Night owl that I am, I was
still awake when this happened, so I immediately connected to the system via
telnet and examined its log files. Looking into /var/log/messages,
I found messages describing not one, but two separate boot events, less than
a minute apart. Obviously, the first of these boots didn't successfully
complete, which is why I only received email after the second occurrence.
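A quick way to spot boot events in a log file is sketched below. It assumes the kernel's start-up banner ("Linux version ...") is logged once per boot, which is usual but depends on the syslog configuration; the function name is hypothetical.

```shell
# Sketch: list the boot events recorded in a syslog file.
# Each match carries the syslog timestamp, so two boots less than a
# minute apart show up as two adjacent lines.
list_boots() {
    logfile="${1:-/var/log/messages}"
    grep 'kernel:.*Linux version' "$logfile"
}
```

Running `list_boots` with no argument reads /var/log/messages; piping the result through `tail` shows only the most recent boots.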
I was still cursing and calling the building management company
names when it occurred to me to log on to another machine located with the Linux
system. I found out that the other machine had been up for several days without
interruption. No power failure there, after all. So what?
Since the reboot event occurred at the time when the weekly backup
is usually taking place, the next obvious suspect was the system's tape
drive. I logged on again, listed the root user's crontab
entries, and with some copy-and-paste magic I extracted the weekly backup command.
I entered it at the command line and guess what? Within seconds, I lost my connection
to the system, and when I was able to reconnect, I found that it had just rebooted
again.
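The copy-and-paste step can also be done with a small filter, sketched here. It assumes the usual five time fields before each command; the function name is hypothetical.

```shell
# Sketch: print only the command part of each crontab entry read from
# standard input. Comment lines, blank lines, and VAR=value lines are
# skipped; the five time fields (minute, hour, day of month, month,
# day of week) are stripped from what remains.
cron_commands() {
    grep -v -e '^[[:space:]]*#' \
            -e '^[[:space:]]*$' \
            -e '^[[:space:]]*[A-Za-z_][A-Za-z0-9_]*[[:space:]]*=' |
    sed -E 's/^[[:space:]]*([^[:space:]]+[[:space:]]+){5}//'
}
```

As root, `crontab -l | cron_commands` prints the commands one per line, ready to be run by hand. Given what happened next, it is wise to do this only when an unexpected reboot can be tolerated.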
A bad tape drive, I thought. Just recently, they began rotating
tapes at this office (previously, they used a single tape over and over again,
which, of course, isn't a very smart idea). Could it be that the sudden
mechanical stresses caused a short circuit or something in the tape unit? Probably.

Because this occurred on a Sunday, I couldn't do much to
further diagnose the problem. But the next day, I performed some diagnostics
over the telephone with the help of the onsite technical staff. I asked them
to reseat the tape to ensure that it was inserted properly; to our surprise,
the system rebooted when they touched the tape drive. Worse yet, it began a
boot cycle, going through the BIOS boot messages over and over again. Therefore,
I suggested that the system be turned off and the tape drive be disconnected
internally. After this was done, the system failed to even power up anymore.

So it wasn't the tape drive after all. The next suspect was the power supply;
sure enough, after it was replaced, the machine powered up fine, even when the
tape drive was reattached. I once again issued the usual weekly backup command
by hand, and the backup successfully ran to completion.
As an added quirk, however, I noticed something unusual in the
most recent boot sequence: Messages pertaining to the parallel port and one
serial port were missing. So once again, I called the onsite staff, who powered
down the system and reseated the interface cards inside. That cured the problem,
and the system has been working reliably since. The "broken" tape
drive executes its weekly backups without a glitch.

Swap File Trouble
Yet another problem that occurred as I was writing this book affected
the swap file on a Linux system. Having been out of my office for a few hours,
I knew immediately that there was trouble when I returned, seeing that the disk
activity light was continuously lit on this machine, and there was a regular
clickety-click sound coming from it.
First, I suspected a hard drive failure. But when I turned on
the attached monitor and saw screen after screen scrolling by, containing messages
such as
swap_free: swap-space map bad (entry 00022a00)

I knew what the clickety-click sound was: The system kept adding
entries to the system logs on a continuous basis, and what I heard was the disk
activity as these new entries were written out.
I'd seen enough. Corrupt swap space means trouble, and the
sooner the system is rebooted, the better. Hitting Ctrl+Alt+Delete invoked /sbin/shutdown
normally (so the system wasn't quite dead yet), and I rebooted into single-user
mode. First, I made sure that all hard disks were mounted read-only, and then
I checked them for corruption:
$ mount -n /dev/sda1 -o remount,ro
$ /sbin/e2fsck /dev/sda1
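Before running the check, it can help to confirm that no disk is still mounted read-write. The sketch below assumes /proc is mounted and reads the kernel's own mount table; because `mount -n` does not update /etc/mtab, that file may be out of date. The function name is hypothetical.

```shell
# Sketch: list disk file systems that are still mounted read-write.
# Reads a table in /proc/mounts format (device, mount point, type,
# options, ...); the options field begins with "rw" or "ro".
list_rw_mounts() {
    table="${1:-/proc/mounts}"
    awk '$1 ~ /^\/dev\// && $4 ~ /^rw(,|$)/ { print $1, $2 }' "$table"
}
```

If `list_rw_mounts` prints nothing, every disk partition is mounted read-only.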

The check didn't find anything significant (complaints about
a "zero dtime" on deleted blocks and minor bitmap differences are
not abnormal for a file system that's in use), so I knew the disks were
probably all right. An examination of the log files in /var/log
showed that the problem began a few hours earlier, with the following message
being the first sign of trouble:
Hmm.. Trying to use unallocated swap (00020700)
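Finding that first message among megabytes of repeated errors is easier with a filter. The sketch below searches for the two messages quoted in this section; other kernel versions may word them differently, and the function name is hypothetical.

```shell
# Sketch: show the first log line that mentions swap trouble, which
# marks roughly when the problem began. The output is prefixed with
# the line number within the log file.
first_swap_error() {
    logfile="${1:-/var/log/messages}"
    grep -n -e 'swap_free: swap-space map bad' \
            -e 'Trying to use unallocated swap' "$logfile" | head -n 1
}
```

The syslog timestamp on the line it prints gives the time the trouble started.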

Because the system had been up for more than two months, experiencing
moderate-to-heavy use at times, I concluded that I was probably the victim of
a software bug; the system "lost its marbles," for lack of a more
scientific term. After rebooting into multiuser mode, the system continued to
operate normally.
Note that during these few hours, several megabytes of log entries
were produced, with many error messages appearing in the log files every second.

The Case of the Dead CD-ROM
On yet another Linux system, I was testing an old CD-ROM drive.
When the CD-ROM drive was connected, the system completely failed to boot; not
even the initial BIOS messages appeared. This situation demonstrates why it's
imperative that you remove all nonessential components when diagnosing your
system: A faulty component can cause unusual behavior in seemingly unrelated
parts of the system. Were it not for the fact that I had just inserted that CD-ROM
drive myself, this system would have appeared, for all intents and purposes,
like one with a dead processor or motherboard.

A Broken BIOS
My last Linux war story concerns an older Pentium motherboard
that I was hoping to reuse in a Linux system. Unfortunately, I found that the
board was completely broken; when I installed it, it showed no signs of life.

Rather than giving up, I decided to investigate. I attempted to
boot this motherboard with an ISA video card, but it still failed; no image
ever appeared on the monitor. I used a diagnostic card, and it showed some activity;
the BIOS initialization stopped with a code that, according to the manual, referred
to the system's floppy drive. However, I was driving blind, because the
monitor remained dark.
Then it occurred to me to use another ISA video card, a really
old one with no acceleration features. Surprise! With this card, I suddenly
saw messages appear on the monitor that informed me of a BIOS checksum failure
and that the system was attempting to boot from a recovery floppy. Quickly,
I visited the Web site of the motherboard manufacturer, where I found out the
details for the emergency BIOS recovery procedure. I made a bootable floppy
disk according to the manufacturer's instructions and booted the old motherboard
with this floppy. I was able to launch the BIOS flash utility, which successfully
updated the motherboard's BIOS. The board has been functioning ever since
in one of my Linux boxes.




© Copyright Macmillan USA. All rights
reserved.