SatMagazine

March 2008 Edition

HAL Would Be Proud

by Dr. Len Losik

In the 1968 movie 2001: A Space Odyssey, moviegoers were treated to a vision of what the future could hold for computer advances that improve the safety of NASA’s astronauts. 2001 revealed a future wherein technological advances dominate the world.

The sound and visual special effects were spellbinding for the film standards at that time. Aside from the technical plusses, the movie stands strong as one that has something important to say about humankind, where the human race is heading, the increasing reliance on machines, and our unquenchable thirst for discovery.

Hal, the persona of the on-board computer, controlled the spaceship transporting astronauts to Jupiter. Hal talked to the astronauts in a calm and surreal voice, providing information about the spaceship and control of it. Hal was able to explain to the astronauts what was happening to the spaceship and why. This was the first popular depiction of a computer predicting the behavior of spaceship equipment.

Forty years after Hal, and after two Space Shuttle accidents that killed 14 astronauts, science fiction is becoming science fact. We are now a lot closer to having a “Hal” protect our astronauts. Talking computers are available today, as is the technology to protect astronauts. Proactive diagnostics, also known as prognostics, use a technology that is certainly of “Star Wars” class. Prognostics give prognosticians the information to predict which electronic equipment is going to fail, and when; a computer can then tell the astronauts what is happening.

The technology that will protect astronauts from harm also benefits everyone. When electronic equipment failures can be identified before they occur, both high-reliability aerospace equipment and consumer electronic products benefit. Televisions, automobiles, in fact any electrical equipment that is going to fail, can be identified quickly while the products are still at the factory. This will eliminate the frustrating experience of having to return a defective product.

Whys and Wherefores

Defects occur in satellites and launch vehicles. Third-party manufacturers and suppliers, at facilities other than the prime contractor’s, produce most of the equipment in satellites and launch vehicles. When infant mortality failures occur in satellites and launch vehicles, the factories that ship their equipment to satellite and launch vehicle builders are the ones responsible for the defects. Satellite and rocket equipment is rigorously tested before it is launched into space. However, as infant mortality failures continue to occur, equipment testing must be improved to prevent them.

One-in-four satellite owners file a claim with their insurance company within the first year of satellite operation for a substantial loss to their equipment. This totals billions of dollars a year. The failure rate is so high that insurance companies rank satellite builders by the number of failures per year those builders suffer. There is no warranty for satellites and launch vehicles that fail early, as there commonly is for consumer electronics. Purchasers can’t return them to the factory for a refund. There is only the insurance, the cost of which is absorbed by the American taxpayer. In fact, American taxpayers are forking over $10B/year for failed, taxpayer-owned satellites and launch vehicles.

Prognostics developed when engineers, intimately familiar with the design and performance of their equipment, took an expanded look at data after an electrical or electro-mechanical system failed. Engineers originally only looked at the information available immediately surrounding a failure while conducting their analysis. When they expanded their view to include behavior from long before the failure, they found a surprising connection between precursors and subsequent malfunctions.

The Mortality Of Equipment

Infant mortalities occur when products and equipment fail after first use. They occur in all complex electrical systems and, when they do, it should remind us that there is something wrong with the manufacturing processes that allow them to occur. To minimize such problems for consumer electronic products, companies offer a return policy or a warranty on a product to counter a defective unit. When a satellite or launch vehicle suffers an infant mortality failure, the builder takes no action unless the purchaser pays for the activity.

The organizations that should be interested in eliminating infant mortalities are the customers who purchase satellites and launch vehicles: NASA, Intelsat and the U.S. Air Force (the U.S. taxpayer), as well as the private insurance companies insuring comsats for the first year of in-orbit operation. Private insurance companies offer a warranty to purchasers of satellites and launch vehicles, which NASA and the U.S.A.F. avail themselves of using appropriated funds. Certainly, they should be highly motivated to evaluate any technology that promises a cessation of taxpayer losses.

A Historical View

Today’s process for the design and test of high-tech electrical equipment traces back to Germany’s Adolf Hitler and Russia’s Joseph Stalin. In the 1930s, the German government designed the roadway system known as the autobahn. American transportation specialists flew to Germany to learn about it, and our freeway systems are based on Hitler’s autobahn.

Developed and used in missiles, rockets and satellites, our manufacturing process evolved out of the Cold War. In 1953, U.S. intelligence services were caught unaware when Russia mounted nuclear bombs on their ICBMs and pointed them at the U.S. The U.S. missile program finally had its first enemy to focus its actions upon, and this reinvigorated U.S. government support for missile production. The Army Air Corps created the Western Development Laboratory in Los Angeles, CA (now Space & Missile Systems). All U.S. missile development and installations were consolidated into that organization.

Vehicle test data was recorded using antiquated processes, with teams of data processors hand-logging and hand-plotting, and the practice developed an industry reputation as expensive, unreliable, complicated, and unnecessary. As instrumentation was meager and information was inadequate for identifying failure mechanisms with certainty, engineers created a list of the most likely failure mechanisms and offered to redesign the affected equipment, if the military was willing to pay for the work.

This approach undermined the need for test instrumentation data to identify exact failure mechanisms. As a result, the necessary information that would have led to the ability to predict individual vehicle performance was not developed.

Needing missiles quickly because of the Cold War, the Army Air Corps forced missile builders to use a radical manufacturing approach called concurrent manufacturing. This meant simultaneously designing, manufacturing, testing and fielding missile systems.

Under concurrent manufacturing, American missile reliability was about 25 percent. To offset the huge failure rate, rather than demand from industry the ability to predict individual vehicle performance, the U.S. military simply purchased four times as many missiles as it had targets. This action, the only one that seemed appropriate at that time, discouraged any improvement in missile reliability by builders, as there would be far fewer missiles purchased if they became too reliable.
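
The arithmetic behind the four-to-one purchase can be sketched in a few lines. This is only an illustration: it assumes each missile works independently of the others, which the historical record does not establish, and the function name is invented for the example.

```python
def prob_at_least_one_success(reliability: float, missiles: int) -> float:
    """Probability that at least one of `missiles` independent shots works."""
    return 1.0 - (1.0 - reliability) ** missiles


reliability = 0.25   # per-missile reliability quoted in the text
per_target = 4       # missiles purchased per target, as in the text

# Expected number of working missiles aimed at one target.
expected_successes = reliability * per_target
# Chance that the target is covered by at least one working missile.
coverage = prob_at_least_one_success(reliability, per_target)

print(f"expected working missiles per target: {expected_successes:.2f}")
print(f"chance at least one works: {coverage:.3f}")
```

Under the independence assumption, four missiles per target yield one working missile on average, but only about a 68 percent chance that at least one of the four works.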

Missile builders said they were unable to predict the performance of any single vehicle. With a 75 percent failure rate, vehicle unreliability validated their conclusion. As there was no need to improve processes or practices, because missile unreliability was countered with larger contracts, missile builders used the least demanding processes for equipment manufacture and test. These same processes are used today in missile, rocket, and satellite factories.

In 1957, jet aircraft under test at the Dryden Air Force Base in California were flying so quickly they were crashing and killing test pilots before aircraft performance information could be recorded. Jet aircraft under test were fitted with instrumentation to transmit test data back to the aircraft factory during flight test to counter this problem. Recognizing the importance of test data during flight, missiles were back-fitted with telemetry systems providing important information to determine equipment performance as well as equipment failures.

After 1958, missiles were back-fitted with telemetry systems and then used to launch satellites into space. The same missile builders became rocket suppliers and satellite builders, applying the same minimal processes that were used to build unreliable missiles. As a consequence, rockets and satellites suffer from the same unreliability as missiles.

The Origins Of Telemetry Prognostication

In 1973, the Global Positioning System was created by the U.S.A.F., combining into a single program three satellite navigation-based projects. In 1974, North American Rockwell (now Boeing) won a contract to build 12 Global Positioning System satellites. The first was successfully launched in 1978 and was followed by many others.

In 1980, the U.S.A.F. officers in charge of the GPS program asked a very important question of the Boeing GPS satellite engineering team—could atomic clock failures be predicted for GPS satellites? At the time, the GPS Air Force program was competing against both the Navy NRL TIMATION and APL’s TRANSIT satellite-based navigation programs for funding.

The atomic clocks were the weak link in the all-new GPS satellite technology. If multi-services testing could be done while the atomic clocks were operating at their peak performance, results would be spectacular, ensuring program funding for GPS by the Department of Defense (DoD).

Physicists developed quantum mechanics in the 1920s to explain atomic behavior, including why electrons do not simply orbit an atom’s nucleus while radiating energy. Quantum mechanics explains the how and the why of GPS atomic clock operation.

The author collected and processed satellite telemetry and Kalman filter data for more than six years, while leading a group of more than 80 engineers and personnel from Boeing, Lockheed Martin, General Dynamics, the Aerospace Corporation, and the Air Force. This team analyzed the functional and performance behavior of every satellite bus, payload and atomic clock and completed failure analysis and root-cause analysis. With more than 25 years of in-orbit research data accumulated, the results were published by Boeing engineers and then used by the Air Force to schedule future multi-service system testing and to improve satellite equipment and atomic clocks planned for use in future GPS satellites. The failure analysis and root-cause analysis that came out of this work were used to identify the presence of suspect precursors.

As there was no a priori (before examination or analysis) information available for satellite equipment measurements once the equipment reaches space, model-based approaches for identifying changes from normal behavior were not appropriate. Pattern recognition technology uses a priori information, as well. For some electronic equipment, factory specialists do develop a definitive understanding and predictive capability for their equipment, but this isn’t true for aerospace equipment, which is built in low quantities and has numerous design changes.

The Boeing engineering team had little in the way of real-time data but did have some stored historical information available, and used it to develop data-driven algorithms that could identify failure precursors. The GPS in-orbit satellites shared telemetry collection resources with U.S. spy satellites, so the team couldn’t acquire the data as needed. Only 40 minutes/day/satellite of satellite bus data, and almost 8 hours of Kalman filter data, were available for predicting failures. Because of this lack of resources, the Air Force paid for the design of its own dedicated control network, bypassing the spy satellites’ resources.

Interest in on-board satellite atomic clock performance behavior decreased with the award of the contract for the next 28 GPS satellites to Boeing. Boeing atomic clock engineers had accumulated enough information to increase the reliability to a value that satisfied the Air Force and its GPS groups. Predicting on-board satellite failures was regarded as not needing factory-engineering resources. Instead, those resources were applied to designing the next block of satellites. Boeing engineers dedicated to the performance analysis of GPS satellites continued to provide analysis of atomic clock and satellite behavior to the GPS Air Force officers.

Most of the design changes for the next block of 28 GPS satellites yielded increased reliability and performance, mostly due to failure analysis, root-cause analysis, and failure prediction technology. However, Boeing management stopped the effort for its satellite factory engineers when the GPS Air Force halted funding of the contract.

In 1984, the author took failure prediction technology to Space Systems/LORAL and used the process from 1984 through 1994 in the design, test, and launch of the NASA GOES Next geostationary weather satellite, the INTELSAT 7 & 7A geostationary communications satellites, and the SCC SUPERBIRD geostationary communications satellite.

In 1994 and 1995, the author used telemetry prognostic technology while at the U.C. Berkeley Space Sciences Laboratory on the NASA/U.C. Berkeley Hubble telescope sister satellite, the Extreme Ultraviolet Explorer low earth orbit satellite, and was able to predict many satellite failures. Failure prediction has now become popular in industries such as nuclear power, commercial and military aircraft, and high-reliability telecommunications computer servers.

The Approach and Application

There are two approaches currently available for prognostics: model-based and data-driven. Model-based prognostics uses a priori information to determine what normal behavior is and then compares actual with normal behavior. If a change from normal behavior has occurred, then a failure is assumed. Experts decide what normal behavior is, and as equipment can operate from 10 to 15 years, the prediction has to be accurate for the long term. That can be extremely difficult and the results are suspect. Long-term behavior must factor in all environmental conditions, operating scenarios, equipment degradation behavior, and sensor aging characteristics. For simple prognostics, upper and lower limits can be effective.
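
A minimal sketch of the simplest case, upper and lower limit checking, follows. The readings, limits, and function name are hypothetical and chosen only to illustrate the idea; they do not come from any flight system.

```python
from typing import List, Tuple


def check_limits(samples: List[float], lower: float, upper: float) -> List[Tuple[int, float]]:
    """Return (index, value) for every sample outside [lower, upper]."""
    return [(i, v) for i, v in enumerate(samples) if v < lower or v > upper]


# Hypothetical bus-voltage readings in volts; the limits are illustrative.
bus_voltage = [28.1, 28.0, 27.9, 28.2, 29.6, 28.0, 26.2]
violations = check_limits(bus_voltage, lower=26.5, upper=29.0)
print(violations)
```

The limits here are the a priori information: someone has to decide in advance what normal looks like, and the check is only as good as that decision.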

NASA uses a commercially available, limit-checking prognostic software program to track Space Shuttle propellant tank pressure and temperatures on the launch pad prior to lift-off. Sun Microsystems uses model-based prognostics for predicting server failures and added more than 1,000 sensors per system to use the technology.

A priori information is usually expensive to generate, requiring an expert’s time and resources over the long term. Model-based prognostics can be implemented using pattern recognition technology, so deviations from normal can be easily identified. Model-based prognostics are appropriate for measurements that are not dynamic, such as circuit voltages and temperature-controlled environments. This is a good solution for stationary equipment with little measurement variation.

Data-driven prognostics uses recent data to determine what normal behavior is and does not use a priori information. Data-driven prognostics offer the same performance as model-based prognostics without requiring any a priori information. They also eliminate the need to factor in such elements as equipment degradation and sensor aging characteristics, which must be accurately applied for the model-based solution. Data-driven prognostics are insensitive to the amount of data available, to the reliability of the data (as a noise reduction algorithm is used), to the equipment the data comes from, and to the type of data available. The one requirement is that the equipment be instrumented so that it provides some data for analysis.
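
As a rough illustration of letting recent data define normal behavior, the sketch below flags a sample when it departs sharply from a trailing window of earlier samples. The trailing-window test is a simple stand-in chosen for this example, not the algorithm described in the article, and the window size, threshold, and temperature values are all assumptions.

```python
from statistics import mean, stdev
from typing import List


def find_precursors(samples: List[float], window: int = 10, threshold: float = 4.0) -> List[int]:
    """Flag indices whose value departs from the trailing-window baseline.

    "Normal" is defined only by the previous `window` samples; no a priori
    model or fixed limit is used.
    """
    flagged = []
    for i in range(window, len(samples)):
        recent = samples[i - window:i]
        baseline, spread = mean(recent), stdev(recent)
        # Skip perfectly flat windows, where the spread carries no information.
        if spread > 0 and abs(samples[i] - baseline) > threshold * spread:
            flagged.append(i)
    return flagged


# Hypothetical temperature telemetry: a steady small ripple, then a jump.
telemetry = [20.0, 20.1, 19.9, 20.0, 20.1, 19.9, 20.0, 20.1, 19.9, 20.0,
             20.1, 19.9, 21.5, 20.0]
print(find_precursors(telemetry))
```

Because the baseline moves with the data, slow drift from sensor aging is absorbed into what counts as normal, which is the property the paragraph above attributes to the data-driven approach.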

Telemetry prognostics uses telemetry to provide the information that prognostic algorithms need to reveal failure precursors, which indicate that the equipment will fail. Telemetry is the generation of analog information, which is then digitized and sent to another location, where it is converted back to analog information and displayed for analysis.
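
The digitize-and-reconvert round trip can be sketched as follows. The 0 to 5 volt range, the 8-bit word size, and the function names are assumptions made for the illustration; real telemetry formats vary.

```python
def digitize(value: float, v_min: float, v_max: float, bits: int = 8) -> int:
    """Quantize an analog value in [v_min, v_max] to an unsigned integer count."""
    levels = (1 << bits) - 1
    clamped = min(max(value, v_min), v_max)
    return round((clamped - v_min) / (v_max - v_min) * levels)


def reconstruct(count: int, v_min: float, v_max: float, bits: int = 8) -> float:
    """Convert a received count back to engineering units."""
    levels = (1 << bits) - 1
    return v_min + count / levels * (v_max - v_min)


# Hypothetical 0-5 V sensor channel sampled with 8 bits.
analog = 3.30
count = digitize(analog, 0.0, 5.0)        # value sent over the link
restored = reconstruct(count, 0.0, 5.0)   # value seen on the ground
print(count, round(restored, 3))
```

The restored value differs slightly from the original because of quantization; with these assumed settings the error is at most half of one step, or about 0.01 volts.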

Telemetry is used across many industries, including agriculture for remotely located equipment, medical devices, hospitals, rockets, satellites, aircraft, and computers.

Failure Analysis’ telemetry prognostics rely on the failure precursors that appear in electrical piece-part performance behavior before equipment failures occur, and use that information to predict equipment failure. Once the precursor is identified, an algorithm is used to determine how long the equipment will operate and the day the failure will occur. Conventional piece-part reliability prediction, by contrast, relies on probability analysis.

Satellites and launch vehicles are built from electrical piece-parts assembled into systems of electrical and electro-mechanical equipment. Reliability specialists quantify the reliability of electronic piece-parts. Their results give engineers the information needed to compare performance.

The Role Of Randomness

Today’s satellites and launch vehicles use hundreds of thousands of piece-parts. These parts come from the same factories that produce electrical parts for other products, as well. In an attempt at predicting the reliability of electrical parts used in satellites and launch vehicles, random analysis and probability theory are used to quantify piece-part reliability for quantities of electrical parts in the hundreds of thousands. On the belief that only random failures occur for piece-parts, as long as there are no design flaws or manufacturing process problems, the length of time a piece-part lasts after power is applied is described using a normal (Gaussian) distribution.

In the absence of any better measurements to quantify the reliability of piece-parts, probability theory has provided the only information available for decision-making. For prognostics, with large quantities of piece-parts (in the hundreds of thousands), this means the length of time that a piece-part’s performance will remain within the bounds needed to meet design life is random.

Research conducted by the author revealed that the duration of remaining usable life (RUL) for an electrical piece-part is also random once the piece-part’s performance characteristics have changed enough for the change in behavior to become visible in data. The probability of failure over that remaining life is likewise a random function.

We know that all piece-parts will fail if under long-term electrical stress. Some will fail sooner than others, and some will operate longer than others, according to a normal (Gaussian) distribution curve. The random variables affecting the duration of remaining usable life of each piece-part include environment, operational use and operational conditions. We know that the higher the operating temperature, the quicker the piece-part and the system fail. This is not a random influence and is quantifiable and predictable.
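
If piece-part lifetimes are taken to follow a normal distribution, the share of parts expected to fail before a given design life follows directly from the curve. The mean, standard deviation, and design life below are invented purely for the illustration.

```python
from statistics import NormalDist

# Hypothetical lifetime model: mean 20 years, standard deviation 4 years.
lifetime = NormalDist(mu=20.0, sigma=4.0)
design_life_years = 12.0

# Area under the curve to the left of the design life.
early_fraction = lifetime.cdf(design_life_years)
print(f"fraction failing before design life: {early_fraction:.4f}")
```

With these assumed numbers the design life sits two standard deviations below the mean, so a little over 2 percent of parts would be expected to fail early. Across hundreds of thousands of piece-parts, that is still thousands of parts.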

The distribution of independent random errors (which exhibit themselves as piece-part failures and identifiable piece-part failure behavior) approaches a normal distribution as the number of piece-parts becomes large (again, for space equipment, piece-parts number in the hundreds of thousands).
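
This tendency toward a normal distribution can be checked with a small simulation. The sketch assumes each part contributes an independent, deliberately non-normal (exponential) random error, then asks how often the total lands within one standard deviation of its mean. The part count, trial count, and seed are arbitrary choices for the example.

```python
import random
from math import sqrt


def fraction_within_one_sigma(parts: int, trials: int, seed: int = 0) -> float:
    """Share of trials whose summed error falls within one sigma of the mean."""
    rng = random.Random(seed)
    # A sum of `parts` unit-exponential draws has mean `parts` and sigma sqrt(parts).
    mu, sigma = float(parts), sqrt(parts)
    hits = 0
    for _ in range(trials):
        total = sum(rng.expovariate(1.0) for _ in range(parts))
        if abs(total - mu) <= sigma:
            hits += 1
    return hits / trials


frac = fraction_within_one_sigma(parts=500, trials=2000)
print(f"fraction within one sigma: {frac:.3f} (a normal curve predicts about 0.683)")
```

Even though no single contribution is normally distributed, the total comes close to the normal-curve figure once a few hundred parts are summed.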

Satellites can operate in space for many years. When something occurs unexpectedly, only limited information is available from the telemetry system to diagnose what caused the unexpected behavior and why it occurred. When space equipment is on the ground and a problem occurs, additional diagnostic data is directly available to complete root cause analysis.

There is little process improvement done at vehicle factories as the builders seldom receive feedback regarding problems their satellites have suffered in space. The builders assume they have done the best job possible, regardless of how well or how poorly a satellite behaves. Prognostics add a new level of performance to design and manufacturing, replacing still-used, antiquated processes.

Although we cannot predict when an individual piece-part will fail without first testing the piece-part, which is prohibitive in both cost and schedule, we can predict with 100 percent certainty that the part will fail someday. Recent simulation indicates that an eight-month burn-in period is necessary to eliminate 80 percent of the infant mortality failures among large quantities of piece-parts prior to their integration into equipment.

We can now also predict the probability of the duration of remaining life, once the change in piece-part performance accelerates to the point that the behavior change can be identified in test data. We also know a piece-part will fail sooner when its behavior changes from the performance specified at purchase. The change in performance of a piece-part is an indication that molecular changes are occurring faster than expected and will continue to occur until the piece-part fails catastrophically.

Once piece-part behavior changes electrically in an active circuit, the duration of remaining usable life is predictable. Using information that identifies failure precursors in test data from electrical and electro-mechanical equipment, the remaining usable life can be determined.
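
One simple way to turn a detected drift into a remaining-life estimate is to fit a straight line to the measurements taken after the change and extrapolate to a failure level. This is a sketch under strong assumptions (a linear trend and a known failure level) and is not the author’s algorithm; the power readings and the 90-watt failure level are hypothetical.

```python
from typing import List, Optional


def estimate_remaining_days(days: List[float], values: List[float],
                            failure_level: float) -> Optional[float]:
    """Least-squares line through (days, values), extrapolated to failure_level.

    Returns days remaining after the last sample, or None if the trend
    is flat or moving away from the failure level.
    """
    n = len(days)
    mean_t = sum(days) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in days)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(days, values))
    if sxx == 0 or sxy == 0:
        return None
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    crossing_day = (failure_level - intercept) / slope
    remaining = crossing_day - days[-1]
    return remaining if remaining > 0 else None


# Hypothetical drift: output power falling 0.5 W per day from 100 W.
days = [0, 1, 2, 3, 4]
power = [100.0, 99.5, 99.0, 98.5, 98.0]
remaining = estimate_remaining_days(days, power, failure_level=90.0)
print(remaining)
```

In this made-up case the trend reaches the failure level on day 20, leaving 16 days after the last sample. Real degradation is rarely this linear, so an estimate like this would need to be refreshed as new telemetry arrives.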

Prognostication Program

In 2006, Failure Analysis began offering telemetry prognostics technology to missile builders, missile equipment suppliers, space equipment suppliers, and satellite and launch vehicle builders, as well as to NASA, the U.S. Air Force and commercial spacecraft owners and operators. Using Failure Analysis’ telemetry prognostic technology, the first-year in-orbit satellite failure rate could be reduced from one-in-four to as low as one-in-20. When satellite and launch vehicle telemetry systems are upgraded to use prognostics, first-year in-orbit satellite failures could be as low as one-in-40.

The author would like to thank the Aerospace Corporation and the U.S. Air Force for designing and developing the GPS technology, which has resulted in rewarding work for hundreds of companies, products and services that affect the U.S. military and American citizens in almost all aspects of daily life. For further information regarding Failure Analysis, the author, and the technology, please visit the company website at: www.failureanalysisco.com

Dr. Len Losik is founder of Failure Analysis, a space systems services provider and world leader in Telemetry Prognostic technology. Dr. Losik has instrumented some of the world’s most complex satellites and launch vehicles and has used Telemetry Prognostics to predict flight equipment failures on NASA, Air Force and commercial communications satellites and launch vehicles. He has published two books on Telemetry Prognostics, available at bookstores and at Amazon.com and eBay. Dr. Losik will be presenting a paper about telemetry prognostics titled, “Telemetry Prognostics, Upgrading Space Flight Equipment Design, Manufacture, Test, Integration, Launch and on Orbit Operations” at the Space 2008 conference, held in San Diego, CA in September 2008.

Dr. Losik has earned a Ph.D. in Electrical Engineering, an M.A. in Electrical Engineering, an M.S. in Education, an M.A. in Education, and a B.S. in Physics and Mathematics. Dr. Losik has worked for most of the major aerospace companies in the US, with over 32 years of satellite and launch vehicle design and test experience.