
Medical Electronics Manufacturing Magazine

Medical Electronics Manufacturing Fall 1998

Software

Zero Downtime: Setting the Standard for Reliability with Microkernel RTOS Technology

Real-time operating system architecture can contribute to and enhance the overall reliability, scalability, and availability of systems and devices.

Mal Raddalgoda

If you heard a claim that there was a medical system or device that rarely failed and required minimal downtime for system maintenance, would you believe it?

How you answer the question reveals a lot about your expectations and past experiences with software-based systems. Complex systems are expected to fail, resulting in downtime, so any claim of a failure-proof system seems unbelievable. However, for some systems and devices, failure and downtime are simply not options.

Reliability, availability, and performance demands are placed on every operating system, but for some systems and devices, meeting these demands is imperative because of the functions they perform. The right operating system can go a long way toward making these systems and devices failure proof.

Mission-critical computer systems and embedded systems use real-time operating systems (RTOS). An RTOS with a true microkernel architecture will, ideally, eliminate maintenance downtime, allow for failure recovery, and offer hardware-platform independence, software reusability, source-code portability, increased product quality, decreased testing, shorter time-to-market, and increased return on investment.

Why OS Architecture Matters

The RTOS forms the foundation of the medical device or system. Just as a house shouldn't be built on sand, a device shouldn't be built without understanding its foundation—the operating system.

There are some basic truths about all operating system architectures:

  • The kernel is the component of the operating system that provides the essential services required by all other parts of the system and its applications. Failures of the kernel are often fatal: if any software module running in kernel mode fails (the operating system kernel, any operating system component, or a software driver), the system can be recovered only by rebooting.

  • Smaller is better. As the kernel decreases in size and complexity, the reliability of the system can increase exponentially. The likelihood of any single faulty software module crashing a system is significantly decreased by reducing the number of software components that are running without memory protection in kernel mode.

  • Any component, such as a device driver, running without memory protection in kernel mode can cause the failure of the entire system.

  • The refinement and optimization of a reliable kernel takes time to perfect and to test in the field.
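The "smaller is better" point can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that each component running in kernel mode has the same independent probability of a fatal fault over some interval; the function name and the 1% figure are hypothetical and do not come from any measured system.

```python
def kernel_survival_probability(p_fault: float, n_components: int) -> float:
    """Chance that no kernel-mode component faults, assuming independent faults."""
    return (1.0 - p_fault) ** n_components

# Hypothetical figure: a 1% fault chance per component over some interval.
for n in (3, 30, 300):
    print(n, round(kernel_survival_probability(0.01, n), 3))
```

Under these assumptions the survival probability falls off geometrically as components are added to kernel mode, which is why moving drivers and services out of the kernel pays off so quickly.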

When the characteristics of various RTOS architectures are understood, the benefits of an advanced operating system architecture become clear. There are three basic RTOS architectures: real-time executive, monolithic kernel, and microkernel.

The major difference among the three types is how they use the memory-management unit (MMU), a component of the processor fundamental to operating system reliability. In a nutshell, the MMU manages memory paging and allocates address space to programs. Most of today's processors (MIPS, PowerPC, and a huge range of x86 and x86-compatible chips) include an onboard MMU; however, some older processor designs don't have one.

Real-Time Executive Architecture

With real-time executives, the executable environment is multithreaded with no memory protection. This means all processes and operating system components run unprotected in kernel mode—everything is essentially part of the kernel. As a result, a failure (like a corrupted pointer) in any one of these kernel components will cause the system to crash or, what can be worse, become unstable. The only way to recover from such a failure is to reboot.
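A toy model may make the corrupted-pointer scenario concrete. In the sketch below, one shared buffer stands in for a flat address space, and the region boundaries, function names, and values are all hypothetical. The second function imitates what an MMU would do, faulting only the offending task instead of letting the write land in a neighbor's memory.

```python
# Toy model of a flat (unprotected) address space: every task shares one buffer.
MEMORY = bytearray(32)
TASK_A = range(0, 16)    # hypothetical region owned by task A
TASK_B = range(16, 32)   # hypothetical region owned by task B

def task_a_write(index: int, value: int) -> None:
    """Task A writes relative to its own base; nothing checks the bound."""
    MEMORY[TASK_A.start + index] = value

def task_a_write_protected(index: int, value: int) -> None:
    """Same write with an MMU-like bounds check that faults only task A."""
    if index not in range(len(TASK_A)):
        raise MemoryError("task A touched memory outside its region")
    MEMORY[TASK_A.start + index] = value

task_a_write(20, 0xFF)   # errant index: silently lands in task B's region
corrupted = MEMORY[TASK_B.start + 4] == 0xFF
print("task B corrupted:", corrupted)
```

In the unprotected case nothing signals the error at the moment it happens, which is what makes the resulting instability so hard to trace.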


Figure 1. Real-time executive architecture.

This architecture makes it impossible to add or change application code or drivers without taking down the system (see Figure 1).

Monolithic-Kernel Architecture

A monolithic-kernel architecture is superior to real-time executives because every application process running in user mode has MMU-based protection against other processes, ensuring that its memory space cannot be overwritten. This allows application code to be upgraded without taking down the system.

However, many operating system components and system drivers run in kernel mode without memory protection. This is a particular problem for embedded systems and specialized systems that require the development of custom software drivers. The debugging and testing tools available for kernel-level development environments often lack the robustness of application-level development tools. Consequently, it is much more difficult for testing and verification to eliminate software errors in drivers.


Figure 2. Monolithic kernel architecture.

To include custom software drivers, the kernel of the operating system must be rebuilt, and the drivers then run in kernel mode as part of it. As a result, a single software fault, such as an errant pointer or array subscript in a device driver or any other kernel component, can cause kernel faults and an entire system failure. The system must be reset to recover. Adding custom drivers to a thoroughly verified and validated operating system can also negate the testing efforts of the operating system vendor. As the number of components running in kernel mode without memory protection increases, the likelihood of kernel faults increases significantly (see Figure 2).

Microkernel Architecture

A microkernel architecture implements only core services (interprocess communications, interrupt handling, and scheduling) in the kernel. All higher-level operating system services (file system, device I/O, networking, etc.) are provided by optional processes. If the microkernel is properly designed, these system services are virtually indistinguishable from user-written applications. This allows user-written extensions, such as new drivers, to be added to the RTOS without compromising the reliability of the kernel.

Some RTOS architectures offer MMU-based memory protection in their development environment to provide robust debugging capabilities. But at run time, most of them, including most microkernel designs, don't implement memory protection for both application processes and operating system components, mainly because doing so would degrade performance. This means that all processes run in a flat memory architecture where a software fault, such as an errant pointer, can overwrite another application process or the kernel itself. In effect, this is no better than the architecture of a real-time executive.

A true microkernel architecture that implements full memory protection at run time is inherently the most robust RTOS architecture. Very little code runs in kernel mode that could cause kernel failure. This architecture also allows individual processes and operating system components to be started and stopped dynamically. As a result, these components can be updated or changed on the fly without having to bring down the system.


Figure 3. Microkernel architecture.

Properly designed, a microkernel RTOS can support full MMU-based memory protection for both application processes and operating system components without degrading performance (see Figure 3).

Advantages of Microkernels

To have a solid foundation for failure-proof products, a true microkernel RTOS architecture that implements MMU-based memory protection between all processes and all operating system components at run time is needed. From a reliability perspective, this ensures that the operating system can be extended to incorporate new user-written drivers and operating system components without compromising the integrity of the kernel; it also ensures that all application processes and operating system components run in their own MMU-protected address spaces.

"Hot Swap" Software in the Field Eliminates Maintenance Downtime. A true microkernel RTOS allows all components of the operating system and all application processes to be updated dynamically (or "hot swapped") in the field as long as the hardware has been designed to permit this. Systems no longer need to be removed from service to perform application upgrades or software maintenance. This means system maintenance no longer has to be performed during off-peak hours. This benefits systems managers because these systems are always available.

Software Watchdogs Can Prevent Disaster. With a flat memory architecture, the only way to correct a software fault at run time is to reset the system. With a memory-protected system, the operating system can catch the event and pass control to a user-written process called a software watchdog. The process is programmed to intelligently decide how best to recover from the failure.

For example, instead of forcing a full reset, the software watchdog could simply restart the failed process. Or the watchdog could terminate any related processes, then restart the entire set in a coordinated manner. As a result, a code error in one component need not cause system failure or degradation; the system can recover intelligently.
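The restart policy described above can be sketched on any operating system that runs each process in its own protected address space. The following is a minimal illustration using ordinary Python child processes, not the API of any particular RTOS; the worker, which is written to fail on its first run, and the restart limit are assumptions made for the example.

```python
import subprocess
import sys

# Hypothetical worker: exits with an error on its first run, succeeds afterwards.
WORKER = "import sys; sys.exit(1 if sys.argv[1] == '0' else 0)"

def watchdog(max_restarts: int = 3) -> int:
    """Run the worker in its own process and restart it if it fails.

    Returns the number of restarts that were needed.
    """
    for attempt in range(max_restarts + 1):
        result = subprocess.run([sys.executable, "-c", WORKER, str(attempt)])
        if result.returncode == 0:
            return attempt            # worker finished cleanly
        print(f"worker failed (exit {result.returncode}); restarting")
    raise RuntimeError("worker kept failing; escalate to a full shutdown")

restarts = watchdog()
```

Because the worker runs in a separate process, its failure never touches the watchdog's own memory, so the watchdog is always in a position to decide what happens next. The final `raise` marks the point where a real design would choose between an orderly shutdown and a hardware reset.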

Of course, in some systems and devices, a system shutdown will always be the safest, most logical action to take when there is any software fault—even if the system could, in theory, recover intelligently. In that case, software watchdogs can complement the function of a hardware watchdog timer by providing an additional layer of protection. Using software watchdogs, subtle errors or deviations from operating tolerances within the system can be detected, with the system then being shut down in the most appropriate manner—whether by an immediate hardware reset or by an orderly shutdown of all software processes. Better yet, the watchdog can also generate a "dump file" that allows the system developer to pinpoint the exact location of the software fault.

Graceful Recovery from Hardware Faults. With the ability to restart operating system components on the fly, a system can also recover from a variety of hardware faults without rebooting. For example, when a power spike causes the failure of an Ethernet networking card, the driver needs to be restarted to support the Ethernet address of the replacement card. With a "process-based" approach, the driver could be terminated, a new Ethernet card hot swapped, and the driver then restarted with the new card's parameters.

This same dynamic control over drivers can address something as catastrophic as a hard disk crash. If, for example, a local hard disk for the system fails, the file system process could be restarted using a capability of the operating system called transparent networking. All file operations are then redirected to a standby disk on another, network-connected machine without losing valuable data. This fault-recovery capability can be provided for any hardware device using any standard driver; drivers need no modification to support it because the intelligence and functionality are in the RTOS, not the driver.
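The redirection idea can be modeled in a few lines. The sketch below is only an analogy: the class names are hypothetical, the "disks" are in-memory dictionaries, and it does not represent how transparent networking is implemented in any RTOS. What it shows is that clients keep calling the same service while the service switches to a standby store after a failure.

```python
class DiskFailure(Exception):
    """Raised by a store whose underlying disk has failed."""

class Store:
    """Hypothetical stand-in for a disk plus its driver."""
    def __init__(self, name, healthy=True):
        self.name, self.healthy, self.data = name, healthy, {}

    def write(self, path, content):
        if not self.healthy:
            raise DiskFailure(self.name)
        self.data[path] = content

class FileService:
    """Clients call write(); they never see which store handled it."""
    def __init__(self, primary, standby):
        self.active, self.standby = primary, standby

    def write(self, path, content):
        try:
            self.active.write(path, content)
        except DiskFailure:
            self.active = self.standby      # redirect all further operations
            self.active.write(path, content)

local, remote = Store("local disk"), Store("standby disk")
service = FileService(local, remote)
service.write("/log/a", "first")
local.healthy = False                       # simulate the disk crash
service.write("/log/b", "second")           # caller is unaware of the switch
print(service.active.name)
```

The sketch does not model how data written before the failure reaches the standby; it illustrates only that the recovery logic sits in the service layer, not in the driver or the client.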

Software checks alone cannot deliver true hardware hot-swap capability. The hardware itself needs to integrate power-surge protection to ensure that no components will fail during a hot swap. Considerable industry effort is currently under way to establish standards and certification for hardware and software that support hot swap. PCMCIA and VME specifications have already been completed, and CompactPCI standards are under development.

Hardware Maintenance

A card that uses another hardware architecture or chip set can be plugged in, then an appropriate driver can be started. The RTOS automatically detects the removal of a hardware card, halts the driver, then starts the appropriate driver for a new card when it is inserted. This RTOS feature allows device developers to introduce enhanced hardware diagnostic accessories for deployed products without affecting the customer's system.

Reliability and maintainability are key in the minds of customers. A true microkernel RTOS with full MMU-based memory protection for all processes and operating system subsystems offers system and device developers a level of reliability previously unseen.

Importance to the Vendor

In addition to significant reliability and maintainability enhancements to existing products, a true microkernel RTOS offers a system or device vendor numerous business advantages.

Software Reusability and Hardware Platform Independence. A true microkernel RTOS provides both software and hardware independence. This enables a system or device vendor to maintain a single source-code stream for a wide range of products. Any software application can then be integrated seamlessly and ported quickly and reliably across multiple hardware platforms at various price points and for custom applications (see Figure 4).


Figure 4. A single software stream and time-to-market.

Source-Code Portability. Given that most of the software in the medical industry has been created on UNIX systems, it makes sense to adopt RTOS technology that provides a certified POSIX/UNIX API. This makes porting existing source streams a straightforward process, since most of the porting effort is achieved simply through recompiling.

While POSIX/UNIX environments have a reputation for being resource intensive, a microkernel architecture can be used to provide a POSIX/UNIX API without the architectural "weight" of a traditional UNIX kernel.

Increased Product Quality, Decreased Testing. With a monolithic architecture, the smallest changes to the source tree can require extensive retesting. The problem is that since all programs reside in one address space, each new product version—even if it involves a minor change to one module—requires a relink of the entire run-time image. This results in a different binary image, with different offset addresses. Consequently, a system component that was overwriting an unused data area or filler memory (used for structure or module alignment) in a previous release may now overwrite a more critical memory area. As a result, modules that were not modified can now introduce faults and errors. Hence the need for testing a large proportion of the software base.
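A small simulation shows how a relink can turn a harmless bug into a harmful one. The toy linker below places modules in one image and pads for alignment; the module names, sizes, alignment values, and the stray write at offset 104 are all hypothetical, chosen only to reproduce the effect described above.

```python
def link(modules):
    """Place modules in one image, in order, honoring each module's alignment.

    modules: list of (name, size, alignment). Returns {name: (start, end)}.
    """
    layout, offset = {}, 0
    for name, size, align in modules:
        offset = -(-offset // align) * align    # round up to the alignment
        layout[name] = (offset, offset + size)
        offset += size
    return layout

def hit_by(layout, address):
    """Module containing address, or 'filler' if it falls in alignment padding."""
    for name, (start, end) in layout.items():
        if start <= address < end:
            return name
    return "filler"

def stray_write_target(layout):
    # Hypothetical bug: module A writes to offset 104, just past its 100-byte area.
    return hit_by(layout, layout["A"][0] + 104)

release_1 = link([("X", 32, 1), ("A", 100, 1), ("B", 200, 16)])
release_2 = link([("X", 44, 1), ("A", 100, 1), ("B", 200, 16)])  # only X changed
print(stray_write_target(release_1), stray_write_target(release_2))
```

Modules A and B are untouched between the two releases, yet A's latent bug moves from filler memory into B's data simply because X grew by 12 bytes. That is the reason an unchanged module cannot be assumed safe after a relink of a single-address-space image.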

A true microkernel RTOS contributes to increased product quality and reduced testing requirements. With full memory protection for all application modules, operating system components, and hardware drivers, each software component runs in its own memory-protected address space. As a result, modules that have had no modifications are binary identical to the code in earlier versions. This enables the reuse of verified, validated, and field-proven code in derivative products, and permits developers to focus on adding new functionality without having to retest the entire software base. All this significantly reduces the verification and validation task in incremental testing of new functionality and software fixes.


Figure 5. Software portability and product quality.

Shorter Time-to-Market and Increased Return on Investment. In any system, the time and resources spent in verification, validation, and maintenance increase exponentially with the complexity of the system. By reducing the maintenance and verification burden to new and modified components, a microkernel RTOS can significantly contribute to decreased time-to-market (see Figure 5).

Currently, changing a simple hardware component or software module can cause months of delay in the delivery of a device. A true microkernel RTOS provides two significant business advantages: porting costs to new hardware are minimized, and only incremental testing is required for new software.

Furthermore, a single source stream for multiple products running on a common microkernel RTOS offers the optimal return on investment of R&D resources. By developing an application once and having the flexibility to reuse the proven code across multiple platforms, the business case for customer specials and small niche markets can be addressed.

Conclusion

With RTOS technology, architecture is the critical differentiator. More than any other characteristic, RTOS architecture can contribute to and enhance the overall reliability, scalability, and availability of systems and devices. It is possible to build a truly reliable complex system or device. Moreover, it can be done with shorter time-to-market, increased flexibility, and significantly less R&D effort.

Mal Raddalgoda is senior technology analyst for QNX Software Systems, Ltd. (Kanata, Ontario, Canada). He can be reached at mal@qnx.com.


Copyright ©1998 Medical Electronics Manufacturing