I interned at a company called Stratus which did hardware fault-tolerant computers in the 80s/90s. I think they called it a “pair and spare” approach, where every component had 3 copies running and comparing state every cycle. If one component’s state stopped matching the other 2, the failing component would be taken offline and the system would call home for a replacement to be FedExed overnight. I think just about every component was hot-swappable too. Pretty cool, but expensive, and other architectures for improving availability, or mitigating the impact from loss of availability, won out (except for a handful of exotic use cases).
I'm a big fan of dissimilar redundancy (but didn't know that was the term until today) for building system software.
Build for various Linux distros, and some of the BSDs. Weird compile errors and edge cases will pop up. I've often found that these expose undefined behaviour or incorrect assumptions you wouldn't notice if you were building for a single platform.
The engineering behind Artemis and SLS is a masterclass in safety-critical design. The quad-redundant primary runs on a quadruple-config PPC-750 CPU with the Green Hills INTEGRITY OS and an ARINC 653 framework, while the backup is on a LEON3 (SPARC V8) CPU using VxWorks and NASA's cFS framework. (https://github.com/nasa/cFS)
NASA actually makes all of this information publicly available on their NTRS server.
Primary and BFS Info: https://ntrs.nasa.gov/api/citations/20190000011/downloads/20... Orion BFS: https://ntrs.nasa.gov/api/citations/20230002185/downloads/FS...
I recall OpenBSD operated in a similar way, building the system on various architectures, big- and little-endian: VAX, SPARC, Luna88K, etc. That quickly highlighted any hardware assumptions and helped make the base system more robust.
Candidly, while I understand the need for some amount of redundancy, I'm curious what this level of redundancy adds in terms of complexity to the system as a whole, and whether that added complexity almost outweighs the extra redundancy. I'm sure NASA has calculated the trade-off, but I'd be curious to see the thinking behind it.
I feel similarly when learning of certain aircraft accidents over the years, where it feels like the redundancy of certain systems, and the complexity it adds, has been the indirect cause of accidents instead of preventing them. I suppose there's not really a way to quantify the accidents it has prevented to be able to compare them directly.
There’s an obvious example of this with twin-engine airplanes. Having two engines obviously makes you a lot safer since you still have power if one fails. But dealing with an engine failure takes some skill, and your probability of experiencing a failure doubles. Airlines train their pilots to handle it, but if you’re a more casual pilot and you’re flying a twin, you have to be careful to ensure it’s actually making you safer.
Two engines also give you a lot more options for control-surface failures. It's objectively safer, and it's why all commercial airliners have (at least) two engines. But it does require more training for the pilot.
> Orion utilizes two Vehicle Management Computers, each containing two Flight Control Modules, for a total of four FCMs. But the redundancy goes even deeper: each FCM consists of a self-checking pair of processors.
Who sits down and determines that 8 is the correct number? Why not 4? Or 2? Or 16 or 32?
They probably set an acceptable total loss rate for the mission and worked backwards to determine how many replicas of each system they need to achieve that while minimizing total cost/weight.
So the answer is "some engineers sat down after talking to management".
This is correct.
Eight shall be the number thou shalt count, and the number of the counting shall be eight. Nine shalt thou not count, neither count thou seven, excepting that thou then proceed to eight.
Ten is right out!
Given a list of estimates of failure probabilities, finding the right mix of redundancy becomes a very tractable problem, maybe even freshman-level.
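For a rough sense of that arithmetic, here's a minimal sketch in C. All the numbers are invented for illustration (p_fail, budget), failures are assumed independent, and the function is assumed to survive as long as at least one replica does; a real analysis would also have to model common-mode failures and weigh mass, power, and cost.

    /* Sketch: smallest replica count whose loss-of-function
       probability fits under an assumed mission budget. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double p_fail = 0.01; /* assumed per-replica failure probability */
        const double budget = 1e-9; /* assumed acceptable loss-of-function probability */
        for (int n = 1; n <= 32; n++) {
            double p_loss = pow(p_fail, n); /* all n replicas fail at once */
            if (p_loss <= budget) {
                printf("smallest replica count meeting budget: %d\n", n);
                return 0;
            }
        }
        printf("no replica count up to 32 meets the budget\n");
        return 0;
    }

With those made-up numbers it lands on 5 replicas; the hard part, as noted below, is getting trustworthy probabilities in the first place.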
Getting the probabilities could be very difficult though, especially for issues that never occurred before.
The fault tolerance is mostly focused on background radiation flipping bits. We've got half a century of data on the frequency of those upsets and the extent to which they're correlated under different space conditions, not to mention the ability to irradiate prototypes of the flight computer with representative amounts of shielding in ground-based facilities...
For issues that have never occurred before, probabilities are the wrong tool. The right thing to do is list all the behaviour the vehicle must never exhibit and think of ways it still might, despite all redundancies -- maybe even despite every single component working as intended.
Lots of mission failures in history were caused by unexpected interactions between fully functional components. Probabilities of failures don't help with that.
And that's why you test till failure (ideally under real or similar conditions): to surface the failures that have never occurred before, and start collecting data on them.
That is what you hire an army of engineers for.
Why use an even number? If they use a voting-style consensus mechanism, wouldn't an odd number make more sense?
Once you've lost more than ~2 processors, you're probably into the realm of common mode failures and voting won't save you. At that point, it's entirely possible you're just working with random data coming out of all your processors.
Interesting. In safety components we use lockstep microcontrollers, which do something similar on a much smaller scale.
https://en.wikipedia.org/wiki/Lockstep_(computing)
Example: https://www.st.com/resource/en/datasheet/spc574k72e5.pdf
Lockstep processors were used here, as well.
> each FCM consists of a self-checking pair of processors.
Never take two clocks to sea. Always sail with one or three.
For the Airbus they used different CPUs because CPUs have bugs too...
Not just CPUs: they run a whole different (but also simpler) fallback program in case the main computers fail. I think they were more worried about programming errors, but this should avoid all shared failures between the main computers (be they programming or hardware).
It does not.
Even if different teams write software in different languages, they end up creating very similar bugs because the bugs crop up in the complexities of the domain and insufficiencies of the specification.
N-version programming doesn't work as well as you think. See Knight and Leveson (1986).
(N-version programming does guard against "random" errors like typos or accidentally swapping parameters to a subroutine call. But so does a good test suite and a powerful compiler.)
> The self-checking pairs ensure that if a CPU performs an erroneous calculation due to a radiation event, the error is detected immediately and the system responds.
How does a pair determine which of the pair did the calculation correctly?
It doesn't have to. It raises an error that the system can detect and take action on. Usually that'll be some combination of interrupt/reset and an external pin to let the rest of the system know what's happened.
In simple terms, this works by XORing the two outputs and, if they disagree, performing fault recovery.
There are also space systems that use 3 processors and a majority vote for the correct output, but that's different.
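To make the distinction concrete, here's a minimal sketch in C (the 32-bit word width, function names, and demo values are all invented for illustration): a self-checking pair can only detect a disagreement, while a bitwise three-way majority vote can mask a single bad output.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Self-checking pair: the XOR of the two outputs is nonzero iff
       they disagree somewhere; the system then triggers fault
       recovery (e.g. an interrupt/reset and an error pin, as above). */
    bool pair_agrees(uint32_t out_a, uint32_t out_b) {
        return (out_a ^ out_b) == 0;
    }

    /* Triple modular redundancy: bitwise majority of three outputs
       masks any single faulty value without identifying the culprit. */
    uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c) {
        return (a & b) | (a & c) | (b & c);
    }

    int main(void) {
        uint32_t good  = 0xCAFEF00Du;
        uint32_t upset = good ^ (1u << 7); /* inject one flipped bit */
        printf("pair agrees: %d\n", pair_agrees(good, upset));  /* 0: fault detected */
        printf("voted: 0x%08X\n",
               (unsigned)tmr_vote(good, upset, good));          /* prints the good value */
        return 0;
    }

A pair that detects a mismatch can't say which half was right, which is why recovery is typically a retry or reset, as the next comment notes.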
You just run the calculation again until both agree.
What I would like to see is the fault data. Also a graph of the number of in-sync FCMs over time, and how well it correlated with predictions.
In other words, how over-engineered is it?
The astronauts must need a huge amount of training.
When the Apollo astronauts learned that they might need to repair the computer if it breaks they joked they might as well learn brain surgery if they end up needing that too.
(This was when they planned on sending a modular computer with them. In the end they settled for sending up a fully assembled spare computer instead, which made replacement easier.)