Clone spotting: Find code clones with their DNA

Flavien Huynh

Unidentical twins

Code clones can be elusive. First of all, we’re not talking about the dumb copy/paste textual clones, we’re interested in the harder to see algorithmic kind (we met both in a previous post).
And algorithmic clones are the sneaky ones: they disguise using different variable names, muddy the waters with varying number of statements, and hide behind diverging comments.

Déjà-vu

Let’s start with a simple example: here are two functions that are algorithmic clones.

They have different names
Out of the two attributes, the first is renamed, the second is the same
They share the same algorithm, a sequence of eight ‘if’ statements, returning 0 or 1
Finally, all ‘if’ tests have a lot in common, with a few differences (renamed variables)

Algorithmic clones on visually similar code

Have we met before?

Now look at the same first function, next to an attempt at obfuscation.

Unlike the previous example, this is not a case of “clone at first sight”.

And yet they really are the exact same algorithms.

If you are skeptical, keep on reading!

Algorithmic clones on visually different code

Algorithmic DNA matching

Using the control flow chart

As discussed in our Anatomy of a source code post, algorithms can be displayed as a chart, showing statements, conditions, control flow change, etc.

Using this representation, all functions from our previous examples have the exact same chart. Yes, even the obfuscated one.

One control flow chart to describe them all

So this method can be used to focus on the very structure of algorithms,.
To do that, we leave out all names, comments and block delimiters. In a way, we are looking at the algorithm’s DNA, and matching it changes clone spotting into a fun graphical game 😊.

Spot the clone!

Here is a first example of real code.

Look at these two functions, and quickly tell if they are algorithmically similar or not.

At first glance, they seem to be the same functions.

But to be sure, you really need to compare each individual line, check variable names and string changes.

This is no fun.

These functions look the same, but are they?

Now look at the control flow charts for both functions. It should only take you a few seconds to hit the “Found a Clone” buzzer!

Same control flow graph means same algorithm

Recognizing algorithms from a long way away

Here is another example.

This function is not that huge (under 300 lines), but if you were asked to look for algorithm similarities, time would suddenly feel very long.

Who wants to search repeating algorithmic patterns in this code?

Now look at the corresponding control flow chart. In a few seconds, your eyes will detect similar patterns, and you will just be eager to look into the code and work on these clones!

Visual algorithmic patterns are much easier to detect

Going further

Now that we know how to look for these clones, the next steps are:

Search
Select
Simplify

First, based on the control flow DNA as an algorithmic signature, we can look for all clones in a project, sorting them from most to less similar (using a threshold to only keep significant clones).

Next, we can select clones which are relevant.
Discarding for example those from legacy code (which may be risky to modify), or from code needing to run on separate targets.

Finally, the selected clones can be simplified (by way of factorization).

Here is the result on our previous example.

A code containing clones before and after factorization

The benefits of this factorization are immediate, and durable:

Smaller footprint
The cloned part (repeated three times) now only appears one time, the resulting two functions are smaller, their complexity is reduced.
Simpler tests
It is easier to test smaller functions., and to ensure test coverage.
Better maintainability
In the future, debug and enhancements won’t suffer from a complex function you’ll have to understand before modifying.
Optimized debug
If a bug is found in the factorized code (which used to be cloned), it will only have to be fixed in one place, and there is no risk of forgetting to fix it everywhere this code was cloned.

Looking ahead

Clone detection leading to factorization is more than an exercise of visual recognition.

We started using terms like complexity, coverage, maintainability, risk. For each one, there are fields of study, standards, methods, best practices, tools.

Software development can greatly benefit from these fields, which assist us in writing better code.
In the next posts, let’s continue exploring the building blocks of a greater edifice.
You guessed it: we’re talking about software quality.

Further readings

Contact me

Flavien Huynh

Flavien Huynh works as a Senior Field Application Engineer at Vector. He is in charge of assisting customers reaching efficient and relevant monitoring solutions, by way of project management, training and technical support. Originating from the research field, he wrote several articles on the topic of Software Project monitoring, and managed R&D projects with research partners.

Legal Notice

This text is the intellectual property of the author and is copyrighted by coderskitchen.com. You are welcome to reuse the thoughts from this blog post. However, the author must always be mentioned with a link to this post!

Clone spotting: Find code clones with their DNA

Flavien Huynh

Unidentical twins

Déjà-vu

Have we met before?

Algorithmic DNA matching

Using the control flow chart

Spot the clone!

Recognizing algorithms from a long way away

Going further

Looking ahead

Further readings

Flavien Huynh

Share:

Legal Notice

Leave a Comment Cancel reply

Related Posts

Why is everything an int?

Measuring and monitoring technical debt

Break data silos, don’t get lost in translation

Inside software quality: Big picture and little details

Software quality is not a myth: Real examples

Software quality: To the rescue!

Follow us on LinkedIn