MSL Sol-200 Anomaly

Archived from Mars Science Laboratory (MSL) Sol-200 Anomaly on 2020-12-18.

Abstract

Six months after landing on Mars, uncorrectable errors in the NAND flash memory left the Mars Science Laboratory (MSL) prime computer unable to turn off for its normal recharge session. Ground controllers commanded a swap to the backup computer, leaving the MSL rover with single-string avionics of questionable reliability prior to a Mars solar conjunction. Recovery from the anomaly was possible because of system and mission design features; nine recommendations are provided to mitigate the risk to future missions.

Driving Event

MSL has two lithium ion batteries that are recharged several times per day. These batteries enable the Curiosity rover's power subsystem to meet the peak power demands of Rover activities when the demand temporarily exceeds the onboard multi-mission radioisotope thermoelectric generator (MMRTG) steady output level of ~100 watts. The flight computers (labeled "RCEs" in Figure 1) are always shut down prior to these recharge cycles.

Six months after landing on Mars (i.e., Mars sol-200), telemetry reported uncorrectable errors in the NAND flash memory (Reference (1)). Analysis revealed that several flight software (FSW) tasks had hung, leaving the prime computer (Rover Compute Element (RCE) 'A') unable to turn off for its normal recharge session. Normally, fault protection would intervene: a watchdog timer would count down to zero and trigger a computer reboot. Instead, the watchdog timers were still being reset, so they never expired. Left uncorrected, this would have led to a power brown-out of the Rover in three to six days. Such a loss of commandability, with the Rover left discharging, is a potentially mission-catastrophic event.
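The watchdog behavior described above can be illustrated with a minimal sketch. It is not the MSL flight software: the `Watchdog` class, the `simulate` helper, the tick counts, and the idea of a separate "kicker" that keeps resetting the timer while the application is hung are all assumptions made for illustration. The source states only that the timers were being reset, not by what mechanism.

```python
from dataclasses import dataclass


@dataclass
class Watchdog:
    """Countdown timer that requests a reboot when it reaches zero (hypothetical)."""
    timeout_ticks: int
    remaining: int = 0
    fired: bool = False

    def __post_init__(self) -> None:
        self.remaining = self.timeout_ticks

    def kick(self) -> None:
        """Reset the countdown, as healthy software does periodically."""
        self.remaining = self.timeout_ticks

    def tick(self) -> None:
        """Advance one time step; fire once the countdown expires."""
        if self.fired:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.fired = True


def simulate(total_ticks: int, timeout_ticks: int, kicker_alive: bool) -> bool:
    """Return True if the watchdog fired during the run.

    The application task is assumed hung for the whole run; only the
    (hypothetical) kicker decides whether the countdown is reset.
    """
    wd = Watchdog(timeout_ticks)
    for _ in range(total_ticks):
        if kicker_alive:
            wd.kick()
        wd.tick()
    return wd.fired


if __name__ == "__main__":
    # Nothing resets the timer: the countdown expires and a reboot is requested.
    print("no resets  ->", simulate(100, 10, kicker_alive=False))
    # Timer keeps being reset although the application is hung: no reboot ever.
    print("with resets ->", simulate(100, 10, kicker_alive=True))
```

In the second case the watchdog never fires no matter how long the run lasts, which is the kind of failure a resettable timer alone cannot catch.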

Within 16 hours of the initial error message, mission controllers at the NASA/Caltech Jet Propulsion Laboratory (JPL) bypassed the FSW and commanded a swap from RCE-A to the backup computer (RCE-B). Tones (i.e., "signals") were subsequently received from the Rover confirming that the "backup" string had become "prime" and had entered safe mode. Information that the new prime computer gathered from the failed computer indicated that errors in the FSW had exacerbated a hardware fault in the flash memory. The MSL Flight Team faced a situation where the Rover was effectively left with "single string" avionics only 35 days prior to the Mars solar conjunction (when the spacecraft would not be commandable for 25 sols). Also, since the same FSW and flash memory were present in the new prime computer (RCE-B), the remaining string was of questionable reliability.

Failure investigation indicated that a single chip in the flash memory array was generating errors during erase cycles, likely due to a connectivity problem on the circuit board, or due to infant mortality of the commercial part. (Pre-flight testing had only erased the NAND ~12 times, as compared to an additional 38 erases (Reference (2)) after launch. The NAND part should have a life of 100,000 cycles.) Spacecraft functionality was recovered by segregating the bad flash memory; a direct hardware reset then rebooted RCE-A to operate with a half-size flash file system. (Because the data storage volume was sized with substantial margin, the loss of half the memory does not impact the mission.) Also, an additional (maximum up-time) watchdog timer was added to the flight software to strengthen fault protection. However, JPL would have been unable to diagnose the problem were it not for an avionics architecture that allowed the non-prime computer to be powered and to provide telemetry on its "health" even when the (RAM-based) FSW was not running on it.
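One way to picture a maximum up-time watchdog is as a second countdown that no software action can reset, running alongside the ordinary resettable one. The sketch below is an illustration under that assumption only; the `DualWatchdog` class, the `first_reboot` helper, and the tick values are hypothetical and do not describe the timer actually added to the MSL flight software.

```python
from typing import Optional, Tuple


class DualWatchdog:
    """Resettable countdown plus a non-resettable maximum up-time limit (hypothetical)."""

    def __init__(self, timeout_ticks: int, max_uptime_ticks: int) -> None:
        self.timeout_ticks = timeout_ticks
        self.countdown = timeout_ticks
        self.uptime_left = max_uptime_ticks

    def kick(self) -> None:
        """Reset only the ordinary countdown; the up-time limit is untouched."""
        self.countdown = self.timeout_ticks

    def tick(self) -> Optional[str]:
        """Advance one time step and return a reboot reason, or None."""
        self.countdown -= 1
        self.uptime_left -= 1
        if self.countdown <= 0:
            return "timeout"
        if self.uptime_left <= 0:
            return "max_uptime"
        return None


def first_reboot(wd: DualWatchdog, ticks: int,
                 kick_every_tick: bool) -> Optional[Tuple[int, str]]:
    """Run for up to `ticks` steps; return (tick, reason) of the first reboot."""
    for t in range(1, ticks + 1):
        if kick_every_tick:
            wd.kick()
        reason = wd.tick()
        if reason is not None:
            return (t, reason)
    return None


if __name__ == "__main__":
    # The ordinary timer keeps being reset, yet the up-time limit still forces a reboot.
    print(first_reboot(DualWatchdog(10, 50), 100, kick_every_tick=True))
    # Nothing resets the ordinary timer, so it expires first.
    print(first_reboot(DualWatchdog(10, 50), 100, kick_every_tick=False))
```

The design point is that the second limit bounds how long the computer can stay up even when something keeps resetting the ordinary timer.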

References:

  1. "Double Bit Error was observed on NVMCAM NAND on Sol-200," JPL Incident Surprise Anomaly (ISA) No. 54013, February 27, 2013.

  2. James A. Donaldson, "The MSL Sol-200 Anomaly: 'The Perfect Storm' or 'How the MSL Avionics Architecture Enabled the Recovery of the Curiosity Rover'," September 24, 2013.

Lesson(s) Learned

According to Reference (2), recovery from the "MSL Sol-200 Anomaly" was only possible because:

  1. The flight computer design included a capability to provide the telemetry data needed to assess spacecraft status under faulted operations.

  2. The fault history associated with the computer's operation was stored in a location that could easily be read without the involvement of the 'isolated' computer's FSW.

  3. Hardware commands existed that could swap prime strings independent of FSW.

  4. Communications across redundant strings and hardware-assisted 'autopsy' capabilities enabled diagnosis of the problem and recovery of the RCE-A string to a 'science-worthy' state.

  5. The NAND flash data store was architected with plenty of margin, with data that spanned multiple physical devices.

  6. Designers were available during the Operations phase to evaluate diagnostic evidence and develop recovery solutions.

Recommendation(s)

  1. When operating nominally, 'non-prime' computers must provide telemetry regarding their health to the prime computer or to the support equipment upon request.

  2. Perform more complete screening of commercial parts, including burn-in of flight parts to eliminate infant mortality flaws.

  3. Add additional thermal cycles and erase cycles to the environmental box test program.

  4. Due to its known low reliability, avoid using NAND flash memory to store critical parameters.

  5. Re-architect the function that provides FSW health monitoring to cover more FSW issues, including improving the watchdog timing architecture.

  6. Even if NAND memory cannot reliably store data, the spacecraft must be capable of generating (a) real-time telemetry without a file system operational and (b) Data Products from files stored in the RAM File System.

  7. Maintain a capability to debug FSW while in "crippled mode" (e.g., operation in a low power state, without the File System, etc.).

  8. Implement a minimum level of hardware commands capable of recovering from flight system faults without recourse to FSW.

  9. Provide software support for reformatting flash memory to any desired size starting at any address, based on a value received prior to booting.
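Recommendation 9 can be sketched as a small planning step that takes a start address and size supplied before boot and turns them into the set of erase blocks to format. Everything in this sketch is an assumption made for illustration: the `FlashWindow` type, the `plan_format` and `window_avoiding` helpers, and the device and block sizes are hypothetical and are not taken from the MSL design. The half-array split mirrors only the "half-size flash file system" mentioned in the recovery.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlashWindow:
    """Region of flash to format, as received before boot (hypothetical)."""
    start: int  # byte offset of the first usable byte
    size: int   # length of the region in bytes

    @property
    def end(self) -> int:
        return self.start + self.size


def plan_format(window: FlashWindow, device_size: int, block_size: int) -> range:
    """Validate the window and return the erase-block indices it covers."""
    if window.size <= 0:
        raise ValueError("window size must be positive")
    if window.start < 0 or window.end > device_size:
        raise ValueError("window lies outside the flash array")
    if window.start % block_size or window.size % block_size:
        raise ValueError("window must be aligned to erase blocks")
    return range(window.start // block_size, window.end // block_size)


def window_avoiding(bad_address: int, device_size: int) -> FlashWindow:
    """Pick the half of the array that does not contain the bad address."""
    half = device_size // 2
    if bad_address < half:
        return FlashWindow(start=half, size=device_size - half)
    return FlashWindow(start=0, size=half)


if __name__ == "__main__":
    DEVICE_SIZE = 1024  # illustrative values only
    BLOCK_SIZE = 128
    window = window_avoiding(bad_address=100, device_size=DEVICE_SIZE)
    print(window, list(plan_format(window, DEVICE_SIZE, BLOCK_SIZE)))
```

Validating alignment and bounds before any erase is issued keeps a bad boot parameter from silently formatting the wrong region.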

Evidence of Recurrence Control Effectiveness

JPL has referenced this lesson learned as additional rationale and guidance supporting Paragraph 4.4.6.1 ("Information System Design: Telemetry Visibility - Visibility of s/c status") in the JPL standard "Design, Verification/Validation and Operations Principles for Flight Systems (Design Principles)," JPL Document D-17868, Rev. 6, October 4, 2012.

