SpaceX, a pioneer in commercial space transportation, recently carried astronauts to the International Space Station aboard its Crew Dragon spacecraft. SpaceX has essentially gone from a blank sheet of paper and textbooks to safely completing one of the riskiest and most challenging aspects of space transportation: ferrying humans.
In modern space flight, mission-critical systems increasingly depend on software to perform critical functions. Unfortunately, software failures are among the most common causes of failure in today’s mission-critical systems. Such failures can degrade mission performance or even cause complete mission failure, incurring a heavy scientific and economic penalty. Some major failures over the years include:
- Mariner I – Missing Hyphen in code
- Ariane-5 – Unhandled floating point exception
- Mars Pathfinder – Priority inversion / scheduling bug
- Mars Climate Orbiter – Navigation system failure, metric to imperial units conversion failure
These programs were developed by organizations with many years’ experience in building software systems for space flight, which shows how real the challenges are in crossing the newest boundary in human endeavor.
Given these challenges, it is refreshing to see a company like SpaceX climb the rungs of increasing complexity in space flight so quickly (18 years) and so successfully.
How SpaceX develops software – uncovering the DNA
Progress in rocketry is incremental. The basic science of liquid-fuelled rockets hasn’t changed much since the days of Robert Goddard, and solid-fuelled rockets go back centuries. The advances are made at the margins, and SpaceX is doing well by bringing innovation there. SpaceX’s rockets are modular: the Falcon 1 is a single-engine model, the Falcon 9 has nine engines, and the Falcon Heavy has 27, in three clusters of nine. This enables reuse and streamlines both production and software development. It works in the same way that building different car bodies atop similar chassis and components helps car manufacturers keep costs down.
Triplex redundancy system
To provide the reliability and cost effectiveness required for these systems, along with addressing some of the challenges introduced by the environment in which the rockets operate (e.g. radiation), SpaceX uses a triplex redundancy system. The triple redundancy gives the system radiation tolerance without the need for expensive radiation hardened components:
- Most flight control systems are triple redundant for reliability (“triplex”).
- Radiation-hardened components are not needed for a suborbital flight control system like the one used on Falcon rockets, because the flight control is not exposed to enough radiation over a long enough period to induce a fault in the processor, bus, etc.
- Systems that are on orbit or used for deep space control would generally use radiation-hardened silicon-on-insulator or silicon-on-sapphire processors, such as the hardened PowerPC.
The triplex redundancy system implements the Actor-Judge algorithm to provide redundancy and reliability for SpaceX’s rockets and spacecraft. The Falcon 9 has three dual-core x86 processors running an instance of Linux on each core, with the flight software written in C/C++. For each calculation or decision, a “flight string” compares the results from both of its cores. If the results are inconsistent, the string is considered bad and sends no commands. If both cores return the same result, the string sends the command to the microcontrollers on the rocket that control things like the engines and grid fins. If all three strings agree, the PPC microcontroller executes the command; if one of the three is bad, it goes with the strings that have previously been correct. The Falcon 9 can successfully complete its mission with a single flight string.
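The voting scheme described above can be sketched in a few lines of Python. This is only an illustration, not SpaceX’s flight code (which, as noted, is written in C/C++). The names `FlightString`, `propose`, `judge` and the `history_ok` track-record counter are hypothetical, and breaking ties by track record is an assumption about what "going with the strings that have previously been correct" could look like.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FlightString:
    """One flight string: a dual-core computer whose two cores must agree."""
    name: str
    history_ok: int = 0  # hypothetical: how often this string matched the final decision

    def propose(self, core_a: str, core_b: str) -> Optional[str]:
        # If the two cores disagree, the string is treated as bad and stays silent.
        return core_a if core_a == core_b else None


def judge(strings: List[FlightString], proposals: List[Optional[str]]) -> Optional[str]:
    """Return the command a controller would execute, or None if every string is silent."""
    live = [(s, p) for s, p in zip(strings, proposals) if p is not None]
    if not live:
        return None
    votes = Counter(p for _, p in live)

    def score(cmd: str):
        # Rank by number of agreeing strings, then by their combined track record.
        record = sum(s.history_ok for s, p in live if p == cmd)
        return (votes[cmd], record)

    winner = max(votes, key=score)
    for s, p in live:
        if p == winner:
            s.history_ok += 1  # reward strings that agreed with the decision
    return winner


if __name__ == "__main__":
    strings = [FlightString("A"), FlightString("B"), FlightString("C")]
    proposals = [
        strings[0].propose("throttle 70%", "throttle 70%"),
        strings[1].propose("throttle 70%", "throttle 65%"),  # cores disagree: silent
        strings[2].propose("throttle 70%", "throttle 70%"),
    ]
    print(judge(strings, proposals))
```

In this sketch a single healthy string is enough for `judge` to return a command, which mirrors the statement that the Falcon 9 can complete its mission with one flight string.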
Design and architecture reuse
As we can see, reuse in both design and architecture is key to SpaceX’s success, among a few other things. As of 2019, SpaceX had only 50 developers building the software for its 9 vehicles. This is more than an order of magnitude fewer than such systems have traditionally required: a traditional space program would have about 2,500 developers doing the same work, roughly 50x what SpaceX has today.
There are four software teams at SpaceX:
- Flight Software
- Ground Software
- Avionics Test
- Enterprise Information Systems
Continuous testing in software development
The use of commodity components (x86, unhardened PPC processors and Linux) allows a single workstation to simulate every controller and processor, which makes automated testing possible en masse. SpaceX tests all flight software on what can be called a table rocket: engineers lay out all the computers and flight controllers of the Falcon 9 on a table and connect them as they would be on the actual rocket. For integration testing, they run a complete simulated flight on the components, monitoring performance and potential failures. For stress testing, engineers perform what they call “cutting the strings”, where they randomly shut off a flight computer mid-simulation to see how the system responds. This level of simulation, combined with a significant amount of automation, is how such a small group of developers achieves high output. In fact, SpaceX can push software into production 17,000 times a day with confidence!
What we can derive from how SpaceX builds software is that reuse, DevOps and continuous testing workflows are key to its success. More and more companies are deploying DevOps and continuous testing workflows similar to SpaceX’s and, as a result, have been able to make big leaps in innovation.
The software projects that stay ahead of the curve have put the correct foundation for continuous testing in place. They have adopted the following five steps and developed a plan that is continuously optimized, maintained and adjusted, because things change in the market and on your product road map.
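A "cutting the strings" style stress test can be sketched as a small randomized fault-injection loop. This is a toy model under stated assumptions, not SpaceX’s test harness: `simulate_flight` and `run_trial` are hypothetical names, the "flight" is just a counter of time steps, and the pass criterion (at least one computer still running at every step) is taken from the earlier statement that the Falcon 9 can complete its mission with a single flight string.

```python
import random
from typing import Dict, List


def simulate_flight(steps: int, cuts: Dict[int, List[int]], n_computers: int = 3) -> List[int]:
    """Return how many flight computers are still running at each time step."""
    alive = [True] * n_computers
    survivors = []
    for t in range(steps):
        for victim in cuts.get(t, []):
            alive[victim] = False  # "cut the string": power off this computer
        survivors.append(sum(alive))
    return survivors


def run_trial(seed: int, n_cuts: int, steps: int = 100, n_computers: int = 3) -> bool:
    """Shut off n_cuts distinct computers at random times; pass if one always survives."""
    rng = random.Random(seed)  # seeded so a failing trial can be replayed
    victims = rng.sample(range(n_computers), n_cuts)
    cuts: Dict[int, List[int]] = {}
    for victim in victims:
        cuts.setdefault(rng.randrange(steps), []).append(victim)
    return min(simulate_flight(steps, cuts, n_computers)) >= 1


if __name__ == "__main__":
    for n_cuts in (1, 2, 3):
        passed = sum(run_trial(seed, n_cuts) for seed in range(1000))
        print(f"{n_cuts} cut(s): {passed}/1000 simulated flights kept a computer running")
```

Seeding each trial is the main design choice here: when a randomized run fails, the same seed reproduces the exact fault sequence for debugging.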
Five steps to implement continuous testing
- Risk vs. Reward
  - Code coverage is important, but being exhaustive may be economically infeasible. Using quality metrics and understanding your quality deficit can help you optimize what is done to maximize the value of continuous testing.
- Automate end-to-end testing
  - Implement the right tests. Make sure your continuous testing test buckets are correct and leverage reporting appropriately.
  - Use Change Impact Analysis to run tests for each code commit as part of a consolidated CI process.
- Have a stable lab and test environment
  - The test environment needs to be easy to replicate.
  - Use containers or virtual machine snapshots so that your lab can quickly and easily replicate the test environment.
- Analyse your continuous integration / continuous testing data and reports
  - Use Artificial Intelligence (AI) & Machine Learning (ML) to help optimize your continuous testing suite and reduce the time spent on release activities.
- Software delivery pipeline and DevOps toolchain
  - Continuous testing needs to work seamlessly with everything.
  - Build a continuous integration architecture. No matter the environment and dependencies, continuous testing needs to pick up all the appropriate tests, execute them automatically and provide feedback for a GO/NO GO decision on the release.
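Two of the ideas in the list above, change impact analysis and an automatic GO/NO GO decision, can be sketched together. This is a minimal illustration under assumptions: the `COVERAGE` map, the file and test names, and the functions `impacted_tests` and `gate` are all hypothetical, and a real pipeline would derive the map from coverage data rather than hard-code it.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical coverage map: test name -> source files that test exercises.
COVERAGE: Dict[str, Set[str]] = {
    "test_guidance": {"guidance.py", "math_utils.py"},
    "test_telemetry": {"telemetry.py"},
    "test_engine_control": {"engine.py", "math_utils.py"},
}


def impacted_tests(changed_files: Iterable[str], coverage: Dict[str, Set[str]]) -> List[str]:
    """Select only the tests that touch at least one file changed by the commit."""
    changed = set(changed_files)
    return sorted(name for name, files in coverage.items() if files & changed)


def gate(changed_files: Iterable[str],
         coverage: Dict[str, Set[str]],
         run_test: Callable[[str], bool]) -> str:
    """Run the impacted tests and report GO only if every one of them passes."""
    selected = impacted_tests(changed_files, coverage)
    if not selected:
        # No known coverage for the change: be conservative and run the whole suite.
        selected = sorted(coverage)
    return "GO" if all(run_test(name) for name in selected) else "NO GO"


if __name__ == "__main__":
    # Stand-in test runner for the demo: pretend the engine control test fails.
    def fake_runner(name: str) -> bool:
        return name != "test_engine_control"

    print(impacted_tests(["math_utils.py"], COVERAGE))
    print(gate(["telemetry.py"], COVERAGE, fake_runner))
    print(gate(["engine.py"], COVERAGE, fake_runner))
```

Falling back to the full suite when no test maps to a changed file is a deliberately cautious choice, since an unmapped file is more likely a gap in the coverage data than a risk-free change.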
References
- Wikipedia – Software bugs in space flight applications
- Wikipedia – Robert H. Goddard
- We are SpaceX Software Engineers – We Launch Rockets into Space – AMA – https://www.reddit.com/r/IAmA/comments/1853ap/we_are_spacex_software_engineers_we_launch/
- How SpaceX does software for 9 vehicles with only 50 developers – https://acquisitiontalk.com/2019/11/how-spacex-does-software-for-9-vehicles-with-only-50-developers-and-govt-requiring-50x-the-staff/