Find code clones with their DNA

Clone spotting: Find code clones with their DNA

Unidentical twins

Code clones can be elusive. First of all, we’re not talking about the dumb copy/paste textual clones, we’re interested in the harder to see algorithmic kind (we met both in a previous post).
And algorithmic clones are the sneaky ones: they disguise using different variable names, muddy the waters with varying number of statements, and hide behind diverging comments.

Déjà-vu

Let’s start with a simple example: here are two functions that are algorithmic clones.

  • They have different names
  • Out of the two attributes, the first is renamed, the second is the same
  • They share the same algorithm, a sequence of eight ‘if’ statements, returning 0 or 1
  • Finally, all ‘if’ tests have a lot in common, with a few differences (renamed variables)
Algorithmic clones on visually similar code

Have we met before?

Now look at the same first function, next to an attempt at obfuscation.

Unlike the previous example, this is not a case of “clone at first sight”.

And yet they really are the exact same algorithms.

If you are skeptical, keep on reading!

Algorithmic clones on visually different code

Algorithmic DNA matching

Using the control flow chart

As discussed in our Anatomy of a source code post, algorithms can be displayed as a chart, showing statements, conditions, control flow change, etc.

Using this representation, all functions from our previous examples have the exact same chart. Yes, even the obfuscated one.

One control flow chart to describe them all

So this method can be used to focus on the very structure of algorithms,.
To do that, we leave out all names, comments and block delimiters. In a way, we are looking at the algorithm’s DNA, and matching it changes clone spotting into a fun graphical game 😊.

Spot the clone!

Here is a first example of real code.

Look at these two functions, and quickly tell if they are algorithmically similar or not.

At first glance, they seem to be the same functions.

But to be sure, you really need to compare each individual line, check variable names and string changes.

This is no fun.    

These functions look the same, but are they?

Now look at the control flow charts for both functions. It should only take you a few seconds to hit the “Found a Clone” buzzer!

Same control flow graph means same algorithm

Recognizing algorithms from a long way away

Here is another example.

This function is not that huge (under 300 lines), but if you were asked to look for algorithm similarities, time would suddenly feel very long.

Who wants to search repeating algorithmic patterns in this code?

Now look at the corresponding control flow chart. In a few seconds, your eyes will detect similar patterns, and you will just be eager to look into the code and work on these clones!

Visual algorithmic patterns are much easier to detect

Going further

Now that we know how to look for these clones, the next steps are:

  • Search
  • Select
  • Simplify

First, based on the control flow DNA as an algorithmic signature, we can look for all clones in a project, sorting them from most to less similar (using a threshold to only keep significant clones).

Next, we can select clones which are relevant.
Discarding for example those from legacy code (which may be risky to modify), or from code needing to run on separate targets.

Finally, the selected clones can be simplified (by way of factorization).

Here is the result on our previous example.

=+
A code containing clones before and after factorization

The benefits of this factorization are immediate, and durable:

  • Smaller footprint
    The cloned part (repeated three times) now only appears one time, the resulting two functions are smaller, their complexity is reduced.
  • Simpler tests
    It is easier to test smaller functions., and to ensure test coverage.
  • Better maintainability
    In the future, debug and enhancements won’t suffer from a complex function you’ll have to understand before modifying.
  • Optimized debug
    If a bug is found in the factorized code (which used to be cloned), it will only have to be fixed in one place, and there is no risk of forgetting to fix it everywhere this code was cloned.

Looking ahead

Clone detection leading to factorization is more than an exercise of visual recognition.

We started using terms like complexity, coverage, maintainability, risk. For each one, there are fields of study, standards, methods, best practices, tools.

Software development can greatly benefit from these fields, which assist us in writing better code.
In the next posts, let’s continue exploring the building blocks of a greater edifice.
You guessed it: we’re talking about software quality.

Further readings

Share:

Share on linkedin
Share on twitter
Share on whatsapp

Leave a Comment

Related Posts

Continuous Intetgration on Coderskitchen: Building the ultimate CI machine

Building the ultimate CI machine

Software is traditionally designed, coded, then tested. However, poor quality coding and a testing phase at the end of the process can add a significant

Visual Testing Coderskitchen

Visual testing of a Unity-based 3D world

We provide a 3D animation environment for the development and testing of advanced driver assistance systems (ADAS) and autonomous vehicles. Developing and maintaining this environment

Hey there!

Subscribe and get an email every time we’ve got a new quality piece on here.