System integration, no matter the approach (modular, big bang, regressive, controlled or ad hoc), can often be one of the most critical phases of a project. This is amplified when the final product consists of many components brought together in unison through a number of non-trivial interfaces.
In our current world we have many different interfaces:
- Interfaces between distributed software components (SW APIs)
- Interfaces between hardware modules like VME, PCI and PC104 buses
- Network interfaces like Ethernet, fibre, CAN bus, serial, wireless, Bluetooth
- Other possible proprietary buses
These interfaces are what give the final product the technological edge to make it “tomorrow’s must-have item”.
In such a dynamic, changing world, the fundamental lesson we gleaned from our forefathers, who punched holes in cards to write programs, has remained true: if you don’t know what you’re sending and receiving over an interface, then short of pure luck, or an extended integration phase staffed by ‘tiger teams’ with an infinite budget, don’t expect the problem to be solved quickly. In today’s market, the once-simple embedded processor with a few interrupts has become a design in which every interface is compressed into a single microchip no bigger than your thumbnail. Without a strategy to attack this problem, the harsh reality is that your project is doomed to fail. So, let’s look at how we can solve it.
System integration – the rules
- Observe: You need to capture data at interfaces – if you can’t see it, you can’t debug it.
- Understand: You need to understand your data in real-time. If you need a long-winded process to archive and restore, you may miss the core problem.
- Simulate: You need to be able to quickly create scenarios in your debug environment to see the response of your system to events.
If we are observing the system, we want to be able to comprehend it passively, without intruding on its behavior. After all, one must draw back to first principles and Heisenberg’s Uncertainty Principle: the act of observing a system can change the very behavior we are trying to capture.
What do we need to observe our system?
- System errors – user errors and kernel errors
- Events – interrupts, semaphores, process context switch
- Data – messages or inter/intra process communication
The latter is, of course, the lifeblood of a well-designed system.
If your system is doing any real-time I/O processing, then the data being processed (which these days can run to gigabits per second) needs a threshold mechanism. While the system sits in a standby state, no one is concerned with its behavior; it is usually when we enter an operational state that one or two interfaces ramp up in usage, and Murphy’s Law takes effect. As the Pareto principle suggests, 80% of our problems live in 20% of our system. A good start, then, is to be able to discard the 80% of the data we don’t care about. A filtering mechanism in our observation tool will save us time and storage and, of course, limit our overhead on the system.
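The filtering idea can be sketched in a few lines. The record layout, interface names and thresholds below are illustrative assumptions, not the API of any real capture tool:

```python
from dataclasses import dataclass

@dataclass
class Record:
    interface: str   # hypothetical interface name, e.g. "can0", "eth1"
    length: int      # payload size in bytes
    payload: bytes

def make_filter(interfaces, min_length=0):
    """Return a predicate that keeps records from the named
    interfaces whose payload meets the length threshold."""
    wanted = set(interfaces)
    def keep(rec: Record) -> bool:
        return rec.interface in wanted and rec.length >= min_length
    return keep

# Drop the traffic we don't care about before it is ever stored.
keep = make_filter({"can0"}, min_length=8)
trace = [Record("can0", 8, b"\x01" * 8),
         Record("eth1", 1500, b"\x00" * 1500),
         Record("can0", 4, b"\x02" * 4)]
captured = [r for r in trace if keep(r)]
```

Because the predicate runs at capture time, the cost of the discarded 80% is paid once per record rather than in storage and later analysis.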
Is your system talking to you?
Now that we have our data, we need to be able to comprehend the system. There’s no point letting the system execute for many hours, only to realize the problem occurred in the first few minutes of running. Using standard operating-system and programming concepts, one can pre-build a knowledge of the system and use it to study the contents of the data. The power of catching a message being transmitted over an interface, and seeing it decoded into human-readable form, can put any engineer into a state of euphoria.
It is almost as if your system were speaking to you, leading you down the path to its issues. At this point, the debugging effort almost falls in line with solving the simplest of crossword puzzles: just check your IDDs (interface description documents) and walk your way through the problem. Even better, once you have it working, why not archive the trace data and use it the next time you make a change and need to regression-test your system? Suddenly our complex system doesn’t look so complex.
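Decoding captured bytes against an IDD can be as simple as a struct format plus a table of message IDs. The layout and IDs below are hypothetical, standing in for whatever your own IDD specifies:

```python
import struct

# Hypothetical message layout from an IDD: a 2-byte message ID,
# a 2-byte sequence number and a 4-byte status word, little-endian.
MSG_FORMAT = "<HHI"
MSG_NAMES = {0x0010: "HEARTBEAT", 0x0020: "MODE_CHANGE"}  # illustrative IDs

def decode(raw: bytes) -> str:
    """Turn a raw wire message into the human-readable form an
    engineer can check against the IDD."""
    msg_id, seq, status = struct.unpack(MSG_FORMAT, raw)
    name = MSG_NAMES.get(msg_id, f"UNKNOWN(0x{msg_id:04X})")
    return f"{name} seq={seq} status=0x{status:08X}"

line = decode(struct.pack("<HHI", 0x0010, 7, 0x00000001))
```

The same decoder can run over an archived trace, which is what makes the captured data reusable as a regression baseline.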
Understand your system behavior in real-time
While observing the behavior of our system is a big step forward in our debugging process, there is still room for improvement. What if the problem is intermittent or, even worse, occurs at some obscure point in time, perhaps after several hours of execution? How do we address this situation? Do we station someone by the system to make sure the data is logged at the very instant of the failure? How about some intelligence instead?
Introduce the concept of states into your system. Have your debug tool understand which state your system is in, so that at each state transition it stores only the required data. Even better, what if we have a system error and no one is there to debug it? We surely can’t have a source-code debugger hooked in from the word go: there are too many places to look, and, of course, we would lose any semblance of real-time behavior in the system.
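A state-aware capture tool can be sketched as a small logger that stores data only while the system is in states we care about. The state names here are assumptions for illustration:

```python
class StateAwareLogger:
    """Stores incoming data only while the system is in a state
    of interest; everything else is dropped at the source."""

    CAPTURE_STATES = {"OPERATIONAL", "ERROR"}  # illustrative state names

    def __init__(self):
        self.state = "STANDBY"
        self.stored = []

    def on_transition(self, new_state):
        # The debug tool is told about each state transition.
        self.state = new_state

    def on_data(self, record):
        # Only keep data captured in the states we care about.
        if self.state in self.CAPTURE_STATES:
            self.stored.append((self.state, record))

log = StateAwareLogger()
log.on_data(b"boot chatter")          # STANDBY: ignored
log.on_transition("OPERATIONAL")
log.on_data(b"msg1")                  # stored, tagged with its state
```

Tagging each record with the state it was captured in also answers the "when did this happen?" question for free.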
Yes, our old friend Heisenberg again. Well, what if our tool could freeze the system at the point the error occurred? A system-level breakpoint spanning multiple architectures and processes. The system could still service low-level needs, but at the process level it could be halted, waiting for an engineer to step through the error with a source-code debugger. Debugging our system this way ensures that we interfere with it only during the erroneous moments.
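On a POSIX host, one minimal way to approximate such a system-level breakpoint is to freeze the application’s processes with SIGSTOP when the error fires, and resume them with SIGCONT once an engineer has attached a debugger. This is only a sketch of the idea on Linux/Unix, not a portable multi-architecture tool:

```python
import os
import signal
import subprocess

def freeze(pids):
    """Halt every process in the list; a stopped process keeps its
    full state, ready for a debugger to attach."""
    for pid in pids:
        os.kill(pid, signal.SIGSTOP)

def thaw(pids):
    """Resume the processes once inspection is done."""
    for pid in pids:
        os.kill(pid, signal.SIGCONT)

# Demonstration with a throwaway child process standing in for
# one of the application's processes.
child = subprocess.Popen(["sleep", "5"])
freeze([child.pid])        # "system-level breakpoint" hit
# ... an engineer could now attach gdb to child.pid ...
thaw([child.pid])          # step past the breakpoint
child.terminate()
child.wait()
```

A real tool would discover the process list itself and leave interrupt-level services running; the freeze/thaw primitive is the part sketched here.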
Simulate and inject scenarios into the system
Now let’s take the situation where we have an engineer in the lab. We think we know where the problem is, but what can we do? Why not have a smart scripting interface that allows us to quickly mimic an interface and inject scenarios into the system? Imagine integrating two software sub-systems, one developed internally, the other from an external supplier. Why not use the observation tool as the stub as well? Or even use it as a workaround for an error in the integration between the two sub-systems? Simply feed the modules the required data sequence, and then allow the system to progress and communicate as normal. Need to verify a system thread? Just inject a single message, which can be quickly edited to observe the thread’s behavior.
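The stub idea can be sketched as a small object that replays a canned reply sequence into the system in place of the missing sub-system. The transport here is a plain callback, and the message contents are invented for illustration:

```python
class StubSubsystem:
    """Stands in for an absent sub-system by answering each incoming
    request with the next message from a canned sequence."""

    def __init__(self, send, canned_replies):
        self.send = send              # callback that injects data into the system
        self.replies = iter(canned_replies)

    def on_request(self, request):
        try:
            self.send(next(self.replies))
        except StopIteration:
            pass  # sequence exhausted; let the system carry on

# The internal sub-system's receive path, modelled as a list here.
received = []
stub = StubSubsystem(received.append, [b"ACK-1", b"ACK-2"])
stub.on_request(b"REQ-1")
stub.on_request(b"REQ-2")
```

Editing the canned sequence is all it takes to replay a different scenario, which is exactly the single-message thread-verification case described above.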
Developing a test procedure
Need an event injected 10 minutes and 56 seconds after another event? Why not script it? Python has become a commonly used scripting and prototyping language, and integrating it into our test and debugging tool allows us to develop repeatable, regression-ready test procedures. If hard real-time capabilities are needed during testing, specific languages like CAPL are the appropriate answer. Giving these scripting languages access to internals of the application allows a script to monitor internal behavior and stimulate the application via direct function calls.
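Such a timed, scripted injection can be sketched with Python’s standard sched module. The 10-minute-56-second offset is scaled down here so the example runs in a fraction of a second, and the event names are assumptions:

```python
import sched
import time

DELAY_S = 0.05  # stands in for the article's 10 * 60 + 56 seconds
injected = []

def inject(event):
    """Placeholder for the call that pushes an event into the system."""
    injected.append((event, time.monotonic()))

s = sched.scheduler(time.monotonic, time.sleep)
s.enter(0.0, 1, inject, argument=("trigger_event",))
s.enter(DELAY_S, 1, inject, argument=("follow_up_event",))
s.run()  # blocks until the whole scripted sequence has played out
```

Because the sequence is plain code, it can be checked into version control and replayed unchanged as a regression test after every build.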
We have discussed the three rules that are critical to system integration: observe, understand and simulate. Most systems today already have the pieces required to build a tool that achieves this; they just haven’t been put together, or thought of early enough in the process.
If you are planning the integration of a complex system, take the time to answer these questions:
- Where can potential system errors come from?
- What types of events in the system can trigger specific behaviors?
- What interfaces are required for communicating messages?
Following the proposed procedure will ensure that when a fault occurs, you’ll be ready to debug it quickly and thoroughly.