Software projects produce many artifacts over time: the source code, obviously, but also requirements and design documents, test cases, bug reports, CI scripts, installation scripts, and a host of other formal and informal data. If you ask managers and developers where the “value” lies in these artifacts, source code would no doubt get the most votes, because without it you have no product. In many ways the software “is” the product, and it is the trusted record of what an application really does. Ask a developer a question about a feature, and chances are they’ll consult the source code, not a Jira story or a DOORS requirement. In most cases, the source code is the single source of truth.
Single source of truth (SSoT)
A single source of truth is simply the place that stakeholders agree holds the data that is to be “believed”. The SSoT idea originated in database management and business intelligence (think customer numbers or product prices), but over the last few years it has started to trend in software development circles.
The idea is that knowledge about the application shouldn’t be spread across multiple artifacts and people, but should live in a single repository where it can be easily shared and used by tools to improve quality.
Is source code the best single source of truth?
Source code is valuable not only because it takes a lot of effort to create, but mostly because, over time, it captures all of the edge cases encountered in real-world use. I often joke that every ugly chunk of software started its life as a beautifully conceived algorithm with an elegant control flow … and then the real world and all of its edge cases intruded, and a variety of fixes were applied, resulting in what you see now.
Although we have bug reports and commit logs to document each incremental change, it is unlikely that all of the other software artifacts are kept up to date with these changes.
This leaves us with the source code as the best, and often only, single source of truth.
Extracting value from your source code
If the source code is the SSoT, then how can we best extract value? The following three classes of tools are widely used:
- Static analysis tools that use pattern matching to provide insights into the code and catch common errors like divide-by-zero and null pointer dereferences. Additionally, these tools can help with call and data flow analysis, as well as building data dictionaries of type and variable usage.
- Code coverage tools that instrument the source code to capture statement execution and decision outcomes during testing. Coverage analysis provides valuable feedback on untested or under-tested sections of an application.
- Language-sensitive editors that provide not only code completion and “go to definition” features, but often connect to static analysis, coverage, and test execution tools to provide real-time error feedback as the code changes.
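To make the coverage idea concrete, here is a minimal sketch of how an instrumentation-style coverage tool works, using Python’s tracing hook to record which lines of a toy function (`classify`, invented for this example) actually execute for a given input:

```python
import sys

def measure_coverage(func, *args):
    """Record which line numbers of *func* execute for the given inputs."""
    executed = set()
    code = func.__code__

    def tracer(frame, event, arg):
        # Only record 'line' events that occur inside the function under test.
        if event == "line" and frame.f_code is code:
            executed.add(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return executed

def classify(n):
    # Toy function under test, with one decision and two outcomes.
    if n < 0:
        return "negative"
    return "non-negative"

# Different inputs exercise different lines, revealing which branches
# a given test actually covers -- and, by omission, which it misses.
neg_lines = measure_coverage(classify, -1)
pos_lines = measure_coverage(classify, 5)
```

Real coverage tools do essentially this at scale, then map the executed-line sets back onto the source to highlight untested regions.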
Can baseline testing help?
Another way to extract value from source code is to formalize its existing behavior with a set of tests that achieve 100% code coverage, and then use these tests to validate future changes. This approach is incredibly valuable when working with under-tested legacy applications. Often these code bases have test suites that produce 50% or less code coverage, which means that code changes are likely to break existing features.
Baseline tests solve this problem by capturing the existing black-box behavior of each function or library. As the underlying logic is changed, those same tests can be re-run as a sanity check, allowing legacy applications to be reused and refactored with confidence. The only downside of this approach is the time required to build these tests — but what if you could build them automatically?
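The baseline idea can be sketched in a few lines: record the current input/output behavior of a function, then replay those pairs against a modified version. The `legacy_discount` function and its inputs below are invented for illustration:

```python
def record_baseline(func, inputs):
    # Capture the current black-box behavior as (input, output) pairs.
    return [(i, func(i)) for i in inputs]

def check_baseline(func, baseline):
    # Return every input whose output no longer matches the baseline.
    return [i for i, expected in baseline if func(i) != expected]

def legacy_discount(amount):
    # Legacy behavior we want to preserve: 10% off strictly above 100.
    if amount > 100:
        return amount * 0.9
    return amount

baseline = record_baseline(legacy_discount, [0, 50, 100, 101, 500])

def refactored_discount(amount):
    # A "refactoring" that accidentally moves the boundary condition.
    if amount >= 100:
        return amount * 0.9
    return amount

regressions = check_baseline(refactored_discount, baseline)
# regressions == [100]: the baseline flags the changed boundary case.
```

Notice that the baseline flags exactly the boundary input whose behavior drifted, which is precisely the kind of subtle regression that under-tested legacy code invites.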
Automatic baseline testing
Automatic baseline testing builds test cases backwards from the code, with the goal of generating statement and branch coverage of 90–100% with no human intervention. The process requires a logic model of the application and a mathematical solver to generate the required data sets. The model contains all data sources, intermediate computations, condition values, decision outcomes, and data sinks; the solvers use Satisfiability Modulo Theories (SMT) to compute the “solution”: the input test vectors.
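As a rough illustration of the goal — finding one input per decision outcome — here is a toy sketch. A real tool would hand the branch conditions to an SMT solver; this stand-in simply brute-forces a candidate range, and the function `branch_id` is invented for the example:

```python
def branch_id(n):
    # Toy function with one decision and two outcomes we want tests for.
    if n * 3 + 1 > 40:
        return "high"
    return "low"

def find_covering_inputs(func, candidates):
    # Brute-force stand-in for an SMT solver: keep the first input
    # ("witness") observed for each distinct outcome of the function.
    witnesses = {}
    for c in candidates:
        witnesses.setdefault(func(c), c)
    return witnesses

tests = find_covering_inputs(branch_id, range(-50, 50))
# tests maps each outcome ("low", "high") to an input that triggers it,
# i.e. a generated test vector covering both decision outcomes.
```

Where the brute-force search enumerates candidates, an SMT solver instead treats `n * 3 + 1 > 40` (and its negation) as constraints and derives satisfying values directly, which is what makes the approach scale to real code.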
There is quite a bit of research being done in this area, including by my group at Vector. To learn more, follow the links below.
- CBMC – C Bounded Model Checker
- KLEE – Symbolic execution
- Pex and Moles – Isolation and White Box Unit Testing