Electronic systems are used for a wide range of applications from consumer to medical, industrial, power, aerospace, and even outer space. Regardless of application, it is necessary for the designer to consider the reliability of the system. Of course, the degree to which the reliability is considered will vary depending upon the application. For example:
- Space / Aerospace / Military – Ensure both the safety of the operators / passengers and the success of the mission.
- Telecommunications – Ensure no outage of service which impacts revenue flow and reputation.
- Industrial and process control – Ensure minimal down time, safety of operation, and fail-safe operation in the event of a failure.
- Commercial – Ensure warranty periods are achieved and minimise the reputational impact and cost which are associated with a recall.
At the highest level, we can split reliability into two distinct areas. The first is confidence that the system will operate for the required lifetime; this is where the mean time between failure (MTBF), probability of success (P(s)), and the well-known bathtub curve are of use. The second, often called functional safety, is ensuring the design will continue to function, or fail safe, should an erroneous event occur (see the TechSpotlight on functional safety).
What are the MTBF and P(s), and how do we calculate them?
The definition of Mean Time Between Failure (MTBF) is “the predicted elapsed time between failures of a system during operation.” The MTBF is calculated by taking the reciprocal of the sum of the individual component failure rates. These failure rates are generally referred to as FIT rates, where one FIT equals one failure per 10^9 device-hours (a failure rate of 1 × 10^-9 per hour). Failure rates can normally be obtained from the component supplier, or from a standard such as MIL-HDBK-217F (although this is very out of date, it remains a popular reference on the subject). The relationship between MTBF and failure rate is:

MTBF = 1 / λ, where λ = λ1 + λ2 + … + λn is the sum of the individual component failure rates.
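As a minimal sketch of this calculation, the system failure rate can be found by summing the component FIT rates and taking the reciprocal to obtain the MTBF. The component names and FIT values below are purely illustrative, not figures from any supplier or standard:

```python
# Illustrative bill of materials with hypothetical FIT rates
# (one FIT = one failure per 1e9 device-hours).
fit_rates = {
    "voltage_regulator": 20.0,
    "fpga": 50.0,
    "sdram": 15.0,
    "oscillator": 10.0,
}

# Sum the FIT rates and convert to failures per hour.
lambda_total = sum(fit_rates.values()) * 1e-9

# MTBF is the reciprocal of the total failure rate.
mtbf_hours = 1.0 / lambda_total

print(f"System failure rate: {lambda_total:.3e} per hour")
print(f"MTBF: {mtbf_hours:,.0f} hours")
```

With these example values (95 FIT total), the MTBF works out to roughly 10.5 million hours.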
Figure 1: MTBF and Product Reliability
You may be forgiven for assuming the equipment will therefore work for the period specified by the MTBF. However, this is not the case. The MTBF is a statistical representation of the failure rate which can be expected during the useful life; it does not represent the predicted operating life of the product. To obtain the predicted operating life of the product, we need to consider the probability of success, given by the equation below, where t is the desired operating time in hours:

P(s) = e^(-t / MTBF)
Plotting the probability of success shows that, as the desired operating time approaches the MTBF, the probability of success falls to approximately 0.37 (e^-1). This means the probability that a single module will still be working at an elapsed time equal to the MTBF is only 0.37. If you are considering a batch of units, then on average only 37% of them will still be functioning at that point.
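The figure of 0.37 can be checked directly from the exponential reliability formula. A short sketch, using an arbitrary example MTBF of one million hours:

```python
import math

def probability_of_success(t_hours: float, mtbf_hours: float) -> float:
    """Reliability R(t) = exp(-t / MTBF), assuming a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)

mtbf = 1_000_000  # illustrative MTBF in hours

# At t = MTBF the probability of success is e^-1, about 0.37.
print(probability_of_success(mtbf, mtbf))

# At t = 10% of the MTBF the probability of success is still about 0.90.
print(probability_of_success(0.1 * mtbf, mtbf))
```

Note that the value at t = MTBF is e^-1 regardless of the actual MTBF chosen.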
Figure 2: Probability of success against time
Solutions which need to ensure an acceptable probability of success over their mission life therefore require an MTBF significantly higher than the intended operating life. For example, a five-year operating life with a 0.99 probability of success requires an MTBF of 4,361,048 hours (roughly 497 years), considerably longer than the operating life itself.
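That required MTBF follows from inverting the probability-of-success equation: MTBF = -t / ln(P(s)). A sketch reproducing the five-year example, assuming 8,766 hours per (mean) year:

```python
import math

def required_mtbf(mission_hours: float, target_reliability: float) -> float:
    """Invert R(t) = exp(-t / MTBF) to find the MTBF needed
    to meet a target probability of success over the mission."""
    return -mission_hours / math.log(target_reliability)

HOURS_PER_YEAR = 8766          # mean year, including leap years
mission = 5 * HOURS_PER_YEAR   # five-year operating life

mtbf = required_mtbf(mission, 0.99)
print(f"Required MTBF: {mtbf:,.0f} hours ({mtbf / HOURS_PER_YEAR:.0f} years)")
```

Running this reproduces the figure quoted above: approximately 4.36 million hours, or about 497 years.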
What impacts the reliability of a system?
There are many factors which impact the reliability of a system. One of the main factors is the operating environment. The operating environment includes the operating temperature range, vibration and shock envelopes, electromagnetic compatibility (EMC) and, for aerospace and space applications, the radiation environment.
Operating temperature has one of the largest impacts on reliability. Continuous operation at higher junction temperatures degrades device reliability, as failure rates rise steeply with temperature. Higher operating temperatures also increase power dissipation (for example, through increased leakage currents) and may affect the system timing.
While you may think of vibration or shock as mainly causing physical failures on the electronic circuit board (such as a broken device connection), they can also lead to electrical failures. These may be transient, such as bond wires touching and causing a state machine to lock up.
It is unlikely that our system will operate alone and away from other electronic equipment. As such, we need our system to be able to operate with other electronic systems that may generate electrical noise. If not correctly addressed, this electrical noise could lead to corruption of signals or unexpected drifts and biases in sensitive analogue components.
Finally, if the application is designed to operate in space, then we need to ensure that both ionising and heavy ion radiation cannot result in the destruction of electronic devices or corruption of the data / application, leading to an unexpected operation or loss of control.
How can we mitigate the factors affecting reliability?
One of the simplest things we can do which has a large positive impact on reliability is selecting the correct component quality. Typically, components are available in one of several different grades, for example: military, industrial, extended, automotive, or commercial. However, not all components are available in all grades. Depending upon the grade, these quality levels are screened not only for wider operating temperature ranges but also for vibration and shock levels. For example, automotive components are tested in accordance with AEC-Q100. When it comes to military grade components used in space / aerospace and defence, there are additional classifications of components. The table below shows the different standards required for ICs, Hybrids, and Discrete components for military grades.
| Quality Level | Equivalent Class |
|---------------|------------------|
| QML Q         | Class B          |
| QML V         | Class S          |
Table 1: Different Standards Required for ICs, Hybrids and Discrete components for Military Grades
Be warned, however, that as component quality increases so too does the cost, which in turn impacts the unit price of the final product. If components are not available in a specific grade, it may be possible to have them up-screened by a third party. However, this brings a risk of failure during screening, and the results are typically valid only for devices from the same batch / lot.
Once we have completed the initial schematic design, we can also perform a part stress analysis (PSA) to ensure the thermal and electrical stresses placed on components are acceptable. For many applications, the acceptable stresses are determined by a derating standard such as the European Space Agency’s ECSS-Q-ST-30-11. Alternatively, if no derating rules are applied, the PSA should still show that all components are within their operational parameters as defined in the device data sheet.
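A PSA boils down to comparing each applied stress against a derated fraction of the data-sheet rating. The sketch below illustrates the idea only: the component names, applied values, ratings, and the flat 80% derating limit are all hypothetical, not taken from ECSS-Q-ST-30-11 or any other standard (real derating rules vary by component type and parameter):

```python
# Hypothetical derating limit: use no more than 80% of the rated value.
DERATING_LIMIT = 0.80

# (parameter, applied value, data-sheet rating) -- illustrative figures only.
components = [
    ("decoupling_cap_voltage", 2.5, 6.3),    # volts
    ("series_resistor_power", 0.09, 0.125),  # watts
    ("ldo_input_voltage", 5.2, 6.0),         # volts
]

for name, applied, rating in components:
    stress = applied / rating                # stress ratio, 0..1
    status = "OK" if stress <= DERATING_LIMIT else "OVERSTRESSED"
    print(f"{name}: stress ratio {stress:.2f} -> {status}")
```

In this example the LDO input sits at 87% of its rating, so it would be flagged for a design change before layout.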
Of course, if components are identified as not being sufficiently derated, the design can be updated before correction becomes too costly. When it comes to calculating the MTBF and probability of success, including the component stresses in the calculation provides a much more accurate result than a simple parts-count reliability prediction.
What additional considerations do we need to take?
In addition to the design analysis outlined above, there are several design techniques we may wish to include in our design to achieve the reliability target. The most common approach is to introduce redundancy, such that if one element fails, the redundant channel can take over. Introducing redundancy brings with it several considerations. The designers must perform rigorous analysis to ensure a single point of failure cannot affect both the primary and redundant channels, resulting in a total system failure. Common single points of failure include power supplies, oscillators, and IO structures, as both the primary and redundant channels need to interact with the same sensors, interfaces, and actuators. A similar analysis must also be performed for the propagation of faults, to ensure a fault in one module, either primary or redundant, cannot propagate and, again, result in a complete system failure. One common failure propagation path is thermal: a failed component heats the system such that the redundant channel also fails.
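The benefit of a dual-redundant architecture can be quantified with the probability-of-success formula. The sketch below assumes the idealised case the analysis above is meant to protect: two fully independent channels with no shared single point of failure and a perfect switchover (the 500,000-hour channel MTBF is an illustrative figure):

```python
import math

def channel_reliability(t_hours: float, mtbf_hours: float) -> float:
    """Single-channel reliability R(t) = exp(-t / MTBF)."""
    return math.exp(-t_hours / mtbf_hours)

def dual_redundant_reliability(r: float) -> float:
    """Two independent channels: the system survives if either
    channel survives, so R_sys = 1 - (1 - R)^2."""
    return 1.0 - (1.0 - r) ** 2

# Five-year mission on a hypothetical 500,000-hour MTBF channel.
r = channel_reliability(5 * 8766, 500_000)
print(f"Single channel: {r:.4f}")
print(f"Dual redundant: {dual_redundant_reliability(r):.4f}")
```

Here redundancy lifts a single-channel probability of success of about 0.92 to roughly 0.993; a shared single point of failure would erode exactly this gain, which is why the analysis above matters.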
What about inside the FPGA?
Many systems contain a processing element (either a heterogeneous SoC or FPGA) as the central processing element. These devices provide a wide range of design and implementation techniques which can be used to work towards achieving the reliability target. Let’s take the example of a Xilinx FPGA or Heterogeneous SoC. Within the Vivado tool chain, we can define the available power budget, signal activity, and cooling solution, enabling accurate calculations of device power dissipation and junction temperature. These calculations also include a confidence metric to indicate the level of confidence in the tool’s power and thermal prediction.
Figure 3: Defining the power budget, signal activity, and cooling solution within the Vivado tool chain
When it comes to implementing redundancy and other design techniques which increase reliability, we can leverage the parallel nature of the programmable logic to implement several redundancy schemes, such as global, large-grain, and local Triple Modular Redundancy (TMR). There are other design techniques we may also wish to leverage: built-in tests on memory and processor subsystems, error-correcting codes on memory and, of course, the internal System Monitor ADC, which can monitor device junction temperatures and supply voltages to provide prognostics for early detection of failures.
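The reliability gain from TMR can be expressed in the same probabilistic terms as before. A minimal sketch, assuming three independent modules and a perfect majority voter (in practice the voter itself, and common-mode effects, must also be analysed):

```python
def tmr_reliability(r: float) -> float:
    """Triple Modular Redundancy with a majority voter: the system
    succeeds if at least two of the three modules work.
    R_TMR = 3R^2 - 2R^3, assuming independent modules and an
    ideal voter."""
    return 3 * r**2 - 2 * r**3

# Effect of TMR for a range of single-module reliabilities.
for r in (0.99, 0.95, 0.90):
    print(f"module R = {r:.2f} -> TMR R = {tmr_reliability(r):.4f}")
```

Note the crossover at R = 0.5: TMR only improves on a single module when each module is itself more likely than not to survive the mission.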
Reliability is a mindset which must be considered from day one. Correct component grade selection and design analysis to ensure alignment with appropriate derating standards / data sheet maximums heavily influence a system’s reliability. FPGAs and Heterogeneous SoCs such as the Zynq and Zynq MPSoC provide the resources, implementation, and analysis tools required to achieve a sufficiently reliable implementation.