The Bathtub Curve, Nuclear Safety, and Run-to-Failure

Disaster by Design: Safety by Intent #7

Disaster by Design

The bathtub curve (Fig. 1) is a common way of showing the failure rate as a function of time. The observed failure rate (blue curve) reflects the overall failure rate. It is the sum of three individual failure rates: (1) the failure rate due to infant mortality (red dotted line), (2) the failure rate due to random causes (green line), and (3) the failure rate caused by wear out (yellow dotted line).

For products, the infant mortality failures include material imperfections, assembly errors, and user mistakes with a new widget.

Products can fail due to factors unrelated to whether they are new or old. For example, products can be dropped and broken, destroyed by a fire started by something else in the area, or backed over by a car. Failures due to such causes are represented by the green line showing a constant failure rate.

Wear out failures are caused by rusting, embrittlement, hardening of soft parts, and softening of hard parts.
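For readers who like to see the arithmetic, the three contributions described above can be added together in a few lines of Python. This is only a rough sketch: it assumes each contribution can be approximated with a Weibull hazard rate (a common reliability modeling choice, not something taken from Fig. 1), and the function names and every number in it are made up for illustration.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0.

    shape < 1 gives a falling rate, shape > 1 gives a rising rate.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Overall failure rate at age t (years) as the sum of three parts.

    All parameter values are illustrative assumptions, not measured data.
    """
    infant_mortality = weibull_hazard(t, shape=0.5, scale=10.0)  # falls with age
    random_causes = 0.02                                         # constant with age
    wear_out = weibull_hazard(t, shape=5.0, scale=20.0)          # rises with age
    return infant_mortality + random_causes + wear_out


if __name__ == "__main__":
    for age in (0.1, 1, 5, 10, 15, 20, 25):
        print(f"age {age:>4} years: failure rate {bathtub_hazard(age):.3f} per year")
```

With these assumed numbers the rate starts high, drops toward the middle of the product's life, and climbs again at the end, which is the bathtub shape. It never reaches zero because the constant random-cause term is always present.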

The failure rate is never zero at any point over the entire curve. Nevertheless, the three curves that form the overall failure rate curve can be depressed by some measures and inflated by others. Consider these three curves applied to a newly purchased laptop computer. The infant mortality failure curve can be lowered by buying one from an established company with a solid track record of high quality. Conversely, the infant mortality failure curve rises for knock-off brands and ones sold from the back of vans at truck stops.

The wear out failure curve can be lowered by using the computer in environment-controlled spaces and raised by taking the computer out onto the beach on a windy day. And even the random cause failure curve can be lowered by careful handling of the laptop and raised by using it to squash insects scurrying across a countertop.

Some recent events illustrate how the bathtub curve applies to nuclear power plant safety.

San Onofre

The tubes inside the steam generators at the two pressurized water reactors operating at the San Onofre nuclear plant in California were wearing out. Workers inspected the steam generator tubes during each refueling outage and determined that they were close to, if not already within, the wear out region of the bathtub curve. So, the owner spent nearly $700 million to replace the two steam generators on each operating reactor, seeking to avoid the wear out region’s increasing failure rate. But the bathtub has two ends. The replacement steam generators began on the left-hand side of the bathtub curve where the failure rate is initially high. The tubes in the replacement steam generators experienced a high infant mortality failure rate. The owner opted to permanently shut down both reactors rather than replace or repair the replacements.

James A. FitzPatrick

Tube degradation was also a problem at the James A. FitzPatrick nuclear plant in New York. But this boiling water reactor does not have steam generators and therefore the plant was not experiencing steam generator tube wear. Instead, the tubes inside the main condenser were wearing out. The owner replaced the condenser tubes during a refueling outage in 1995. Unlike the replacement steam generator tubes at San Onofre, the replacement condenser tubes successfully navigated through the left-hand side of the bathtub curve. It’s a good news, bad news situation—the high failure rate on the left-hand side was avoided only to approach the high failure rate on the right-hand side of the bathtub curve. The replacement condenser tubes had a 15-year service life. The owner originally planned to replace the replacement condenser tubes during a refueling outage in 2012, but deferred that task until a refueling outage in fall 2014. (Yes, 2012 is already two years past the 15-year service life of the replacement tubes and 2014 pushed the tubes even farther past their expected lifetimes.) Worn-out condenser tubes began breaking left and right, and center, and top, and bottom. The operators had to reduce the reactor power level to 50% several times each week to allow workers to find and plug the broken tubes.
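The service life arithmetic in the FitzPatrick story is simple enough to check by hand, but here it is spelled out as a short Python sketch. The dates and the 15-year service life come from the account above; the function name is hypothetical.

```python
def years_past_service_life(installed_year, service_life_years, year):
    """Return how many years beyond its service life a component is in `year`."""
    end_of_life = installed_year + service_life_years
    return max(0, year - end_of_life)


INSTALLED = 1995      # replacement condenser tubes installed
SERVICE_LIFE = 15     # years

if __name__ == "__main__":
    plans = (("original replacement plan", 2012), ("deferred replacement", 2014))
    for label, year in plans:
        over = years_past_service_life(INSTALLED, SERVICE_LIFE, year)
        print(f"{label} ({year}): {over} years past the {SERVICE_LIFE}-year service life")
```

The tubes reached the end of their service life in 2010, so the original plan already ran them two years over and the deferral stretched that to four.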

Run-to-Failure

The NRC’s Operating Experience Branch issued a report on component failures from 2007 to 2011. More than 75% of the failures examined in this study involved components used longer than their recommended service lifetimes. The NRC’s report referred to it as being “run-to-failure.” In other words, instead of monitoring the condition of safety equipment and replacing, repairing, or refurbishing the equipment before safety margins are compromised, owners are waiting until the equipment breaks. The failure rate curve gets replaced by a number—100% chance of failure because they are going to use equipment until it breaks. The NRC’s Office of the Inspector General (OIG) followed up the Operating Experience Branch’s study by auditing the NRC’s oversight of regulatory requirements governing component aging. The OIG’s audit report was critical of the agency’s efforts:
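To see why a run-to-failure policy turns a failure rate into a near certainty, consider the chance that a wearing-out component has failed by a given age. The sketch below assumes a Weibull wear-out model with made-up parameters; it is not the method used in the NRC's report, and the function name is hypothetical.

```python
import math


def failure_probability_by(age, shape, scale):
    """Chance a component has failed by `age` under a Weibull model.

    F(age) = 1 - exp(-(age/scale)**shape). Parameters are illustrative only.
    """
    return 1.0 - math.exp(-((age / scale) ** shape))


# Assumed wear-out behavior: failures become likely around 20 years of use.
SHAPE, SCALE = 5.0, 20.0

if __name__ == "__main__":
    for age in (10, 15, 20, 30, 40):
        p = failure_probability_by(age, SHAPE, SCALE)
        print(f"in service {age} years: {p:.1%} chance of having failed")
```

Under these assumptions, a component replaced on schedule spends its whole life where the chance of failure is modest. One left in service indefinitely climbs toward 100%, which is the number that replaces the curve.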

“The NRC’s approach for oversight of licensee’s management of active component aging is not focused or coordinated. This has occurred because NRC has not conducted a systematic evaluation of program needs for overseeing licensees’ aging management for active components since the establishment of the Reactor Oversight Process (ROP) in 2000, and does not have mechanisms for systematic and continual monitoring, collecting, and trending of age-related data for active components. Consequently, NRC cannot be fully assured that it is effectively overseeing licensees’ management of aging active components.”

Safety by Intent

The bathtub curve shows the failure rate over time. The chance of failure is never zero. Nor is failure guaranteed (unless, of course, one deliberately invites failure with a “run-to-failure” policy).

The quality assurance measures mandated in Appendix B to 10 CFR Part 50 (a federal regulation, not a voluntary or optional objective) prescribe steps to manage the failure rate to a reasonably low level. But these federally mandated safety requirements only provide safety when they are followed.

“Run-to-failure” policies cannot apply to safety equipment at nuclear power plants. For when they succeed, failures occur. The U.S. nuclear industry has enough problems avoiding safety equipment failures without deliberately seeking even more.

Deliberately running safety components beyond service lifetimes until they fail is not only irresponsible, it is illegal. The NRC must not run away from its charter and obligation to protect public health and safety by failing to stop this lawbreaking nonsense.

UCS’s Disaster by Design/Safety by Intent series of blog posts is intended to help readers understand how a seemingly unrelated assortment of minor problems can coalesce to cause disaster and how addressing pre-existing problems can lead to a more effective defense-in-depth protection.