SPO600 Project Stage 2, Part 2: "Where Art Thou, Clones?"
Introduction I’ll keep the introduction short - welcome back! The code will be broken down into parts and explained as we go along. Let’s find those function variants! Identifying the Function Variants Identifying the clones require a reference point - the base function serves as the “source of truth” with its respective basic blocks, statements, control flow, etc., and becomes the template to compare against when analyzing the variants. Identifying the Base Function // ~/gccsrc/gcc/gcc/tree-prune-clones.cc ... std::string base_function; ... unsigned int pass_prune_clones::execute(function *func) { ... const std::string name(node->name()); if (base_function.empty() && name.find(".resolver") != std::string::npos) { base_function = name.substr(0, name.find(".resolver")); fprintf(dump_file, "[BASE] Found: %s\n", base_function.c_str()); } } ... bool pass_prune_clones::is_base_function(const std::string &name) const { return name.find(base_function) == 0 && name.find(".resolver") == std::string::npos; } The first thing to do is to get the name of the function when entering the call graph. Note that the name of function can have any of the variant extensions (e.g. .constprop.0, .ssa, etc.). .resolver is the source function because ifuncs get resolved during runtime - allow multiple implementations during runtime. pass_prune_clones::execute extracts the function name from .resolver. pass_prune_clones::is_base_function checks if the current node is function variant by comparing it the base function name Fingerprinting Functions gcc documentation provides a list of accessors that simplifies the extraction inGIMPLE. The combo of get_gimple_vars and record_function_info is being used to record a structured fingerprint of a function — not based on specific variable names, but on how it behaves structurally (e.g., statement types, number of operands, variable usage patterns). get_gimple_vars(...) processes a single GIMPLE statement by: Extract Variables For GIMPLE_COND, GIMPLE_ASSIGN ,GIMPLE_CALL, and GIMPLE_DEBUG_BIND, it collects involved tree variables (like lhs/rhs, debug values) into a std::vector vars — rather than identifying variables by their names, it tracks their first occurrence and usage position for structural comparisons across different functions. Map First Occurrence For each variable in vars: - If this is the **first time** it appears, it's assigned a new ID `i`. - If it's been seen before, it reuses the same ID. The output is stored in: first_occurrence_map: mapping tree → ID out_map: mapping ID of this statement → ID of first occurrence This abstract variable names away and focusses on their role — a = b + c and x = y + z would look the same if variable roles match. // ~/gccsrc/gcc/gcc/tree-prune-clones.cc ... int pass_prune_clones::get_gimple_vars(std::map &first_occurrence_map, int i, gimple *stmt, std::map &out_map) const { std::vector vars; switch (gimple_code(stmt)) { case GIMPLE_COND: { tree left_op = gimple_cond_lhs(stmt); tree right_op = gimple_cond_rhs(stmt); if (left_op) vars.push_back(left_op); if (right_op) vars.push_back(right_op); break; } case GIMPLE_CALL: { tree left_op = gimple_call_lhs(stmt); if (left_op) vars.push_back(left_op); break; } case GIMPLE_ASSIGN: { tree left_op = gimple_assign_lhs(stmt); vars.push_back(left_op); int rhs_ops = gimple_num_ops(stmt) - 1; for (int i = 1; i second; } else { first_occurrence_map[debug_var] = i; out_map[i] = i; } ++i; } return i; } Putting it Together Finally, there’s a way to identify the base function and to fingerprint each function for comparison! Let’s put the code all together! // ~/gccsrc/gcc/gcc/tree-prune-clones.cc #include #include #include #include #include "config.h" #include "system.h" #include "coretypes.h" #include "backend.h" #include "tree.h" #include "gimple.h" #include "tree-pass.h" #include "ssa.h" #include "tree-pretty-print.h" #include "gimple-iterator.h" #include "gimple-walk.h" #include "internal-fn.h" #include "gimple-pretty-print.h" #include "cgraph.h" #include "gimple-ssa.h" #include "attribs.h" #include "pretty-print.h" #include "tree-inline.h" #include "intl.h" #include "function.h" #include "basic-block.h" namespace { const pass_data prune_clones_pass_data = { GIMPLE_PASS, /* type */ "prune_clones", /* name */ OPTGROUP_NONE, /* optinfo_flags */ TV_NONE, /* tv_id */ PROP_cfg, 0, 0, 0, 0, }; class pass_prune_clones : public gimple_opt_pass { public: pass_prune_clones(gcc::context *ctxt) : gi

Introduction
I’ll keep the introduction short - welcome back! The code will be broken down into parts and explained as we go along. Let’s find those function variants!
Identifying the Function Variants
Identifying the clones require a reference point - the base function serves as the “source of truth” with its respective basic blocks, statements, control flow, etc., and becomes the template to compare against when analyzing the variants.
Identifying the Base Function
// ~/gccsrc/gcc/gcc/tree-prune-clones.cc
...
std::string base_function;
...
unsigned int pass_prune_clones::execute(function *func)
{
...
const std::string name(node->name());
if (base_function.empty() && name.find(".resolver") != std::string::npos)
{
base_function = name.substr(0, name.find(".resolver"));
fprintf(dump_file, "[BASE] Found: %s\n", base_function.c_str());
}
}
...
bool pass_prune_clones::is_base_function(const std::string &name) const
{
return name.find(base_function) == 0 && name.find(".resolver") == std::string::npos;
}
The first thing to do is to get the name of the function when entering the call graph. Note that the name of function can have any of the variant extensions (e.g. .constprop.0
, .ssa
, etc.). .resolver
is the source function because ifuncs get resolved during runtime - allow multiple implementations during runtime.
-
pass_prune_clones::execute
extracts the function name from.resolver
. -
pass_prune_clones::is_base_function
checks if the current node is function variant by comparing it the base function name
Fingerprinting Functions
gcc
documentation provides a list of accessors that simplifies the extraction inGIMPLE
.
The combo of get_gimple_vars
and record_function_info
is being used to record a structured fingerprint of a function — not based on specific variable names, but on how it behaves structurally (e.g., statement types, number of operands, variable usage patterns).
get_gimple_vars(...)
processes a single GIMPLE statement by:
-
Extract Variables
For
GIMPLE_COND
,GIMPLE_ASSIGN
,GIMPLE_CALL
, andGIMPLE_DEBUG_BIND
, it collects involvedtree
variables (like lhs/rhs, debug values) into astd::vector
— rather than identifying variables by their names, it tracks their first occurrence and usage position for structural comparisons across different functions.vars -
Map First Occurrence
For each variable in
vars
:
- If this is the **first time** it appears, it's assigned a new ID `i`.
- If it's been seen before, it reuses the same ID.
The output is stored in:
-
first_occurrence_map
: mappingtree → ID
-
out_map
: mappingID of this statement → ID of first occurrence
This abstract variable names away and focusses on their role — a = b + c
and x = y + z
would look the same if variable roles match.
// ~/gccsrc/gcc/gcc/tree-prune-clones.cc
...
int pass_prune_clones::get_gimple_vars(std::map<tree, int> &first_occurrence_map, int i, gimple *stmt, std::map<int, int> &out_map) const
{
std::vector<tree> vars;
switch (gimple_code(stmt))
{
case GIMPLE_COND:
{
tree left_op = gimple_cond_lhs(stmt);
tree right_op = gimple_cond_rhs(stmt);
if (left_op)
vars.push_back(left_op);
if (right_op)
vars.push_back(right_op);
break;
}
case GIMPLE_CALL:
{
tree left_op = gimple_call_lhs(stmt);
if (left_op)
vars.push_back(left_op);
break;
}
case GIMPLE_ASSIGN:
{
tree left_op = gimple_assign_lhs(stmt);
vars.push_back(left_op);
int rhs_ops = gimple_num_ops(stmt) - 1;
for (int i = 1; i <= rhs_ops; ++i)
{
tree op = gimple_op(stmt, i);
vars.push_back(op);
}
break;
}
case GIMPLE_DEBUG_BIND:
{
tree debug_var = gimple_debug_bind_get_var(stmt);
tree debug_value = gimple_debug_bind_get_value(stmt);
if (debug_var)
vars.push_back(debug_var);
if (debug_value)
vars.push_back(debug_value);
break;
}
default:
break;
}
for (tree debug_var : vars)
{
std::map<tree, int>::iterator it = first_occurrence_map.find(debug_var);
if (it != first_occurrence_map.end())
{
out_map[i] = it->second;
}
else
{
first_occurrence_map[debug_var] = i;
out_map[i] = i;
}
++i;
}
return i;
}
Putting it Together
Finally, there’s a way to identify the base function and to fingerprint each function for comparison! Let’s put the code all together!
// ~/gccsrc/gcc/gcc/tree-prune-clones.cc
#include
#include
#include
#include
#include "config.h"
#include "system.h"
#include "coretypes.h"
#include "backend.h"
#include "tree.h"
#include "gimple.h"
#include "tree-pass.h"
#include "ssa.h"
#include "tree-pretty-print.h"
#include "gimple-iterator.h"
#include "gimple-walk.h"
#include "internal-fn.h"
#include "gimple-pretty-print.h"
#include "cgraph.h"
#include "gimple-ssa.h"
#include "attribs.h"
#include "pretty-print.h"
#include "tree-inline.h"
#include "intl.h"
#include "function.h"
#include "basic-block.h"
namespace
{
const pass_data prune_clones_pass_data = {
GIMPLE_PASS, /* type */
"prune_clones", /* name */
OPTGROUP_NONE, /* optinfo_flags */
TV_NONE, /* tv_id */
PROP_cfg,
0,
0,
0,
0,
};
class pass_prune_clones : public gimple_opt_pass
{
public:
pass_prune_clones(gcc::context *ctxt)
: gimple_opt_pass(prune_clones_pass_data, ctxt) {}
bool gate(function *) final override { return true; }
void set_pass_param(unsigned int, bool) override {}
unsigned int execute(function *) final override;
private:
int get_gimple_vars(std::map<tree, int> &first_occurrence_map, int i, gimple *stmt, std::map<int, int> &out_map) const;
bool is_base_function(const std::string &name) const;
void record_function_info(function *func);
std::string base_function;
int gimple_bb_count_1 = 0;
int gimple_stmt_count_1 = 0;
std::vector<enum gimple_code> gimple_sequence_1;
std::vector<unsigned> gimple_ops_count_1;
std::vector<enum tree_code> gimple_op_types_1;
std::map<int, int> gimple_var_map_1;
};
int pass_prune_clones::get_gimple_vars(std::map<tree, int> &first_occurrence_map, int i, gimple *stmt, std::map<int, int> &out_map) const
{
std::vector<tree> vars;
switch (gimple_code(stmt))
{
case GIMPLE_COND:
{
tree left_op = gimple_cond_lhs(stmt);
tree right_op = gimple_cond_rhs(stmt);
if (left_op)
vars.push_back(left_op);
if (right_op)
vars.push_back(right_op);
break;
}
case GIMPLE_CALL:
{
tree left_op = gimple_call_lhs(stmt);
if (left_op)
vars.push_back(left_op);
break;
}
case GIMPLE_ASSIGN:
{
tree left_op = gimple_assign_lhs(stmt);
vars.push_back(left_op);
int rhs_ops = gimple_num_ops(stmt) - 1;
for (int i = 1; i <= rhs_ops; ++i)
{
tree op = gimple_op(stmt, i);
vars.push_back(op);
}
break;
}
case GIMPLE_DEBUG_BIND:
{
tree debug_var = gimple_debug_bind_get_var(stmt);
tree debug_value = gimple_debug_bind_get_value(stmt);
if (debug_var)
vars.push_back(debug_var);
if (debug_value)
vars.push_back(debug_value);
break;
}
default:
break;
}
for (tree debug_var : vars)
{
std::map<tree, int>::iterator it = first_occurrence_map.find(debug_var);
if (it != first_occurrence_map.end())
{
out_map[i] = it->second;
}
else
{
first_occurrence_map[debug_var] = i;
out_map[i] = i;
}
++i;
}
return i;
}
bool pass_prune_clones::is_base_function(const std::string &name) const
{
return name.find(base_function) == 0 && name.find(".resolver") == std::string::npos;
}
void pass_prune_clones::record_function_info(function *func)
{
basic_block bb;
int i = 0;
std::map<tree, int> first_occurrence_map;
FOR_EACH_BB_FN(bb, func)
{
gimple_bb_count_1++;
for (gimple_stmt_iterator gsi = gsi_start_bb(bb); !gsi_end_p(gsi); gsi_next(&gsi))
{
gimple_stmt_count_1++;
i = get_gimple_vars(first_occurrence_map, i, gsi_stmt(gsi), gimple_var_map_1);
gimple_sequence_1.push_back(gimple_code(gsi_stmt(gsi)));
gimple_ops_count_1.push_back(gimple_num_ops(gsi_stmt(gsi)));
if (is_gimple_assign(gsi_stmt(gsi)))
{
gimple_op_types_1.push_back(gimple_assign_rhs_code(gsi_stmt(gsi)));
}
}
}
}
unsigned int pass_prune_clones::execute(function *func)
{
if (!dump_file || !func || !func->cfg)
return 0;
struct cgraph_node *node = cgraph_node::get(func->decl);
if (!node)
{
fprintf(dump_file, "ERROR: cgraph_node::get(func->decl)\n");
return 0;
}
const std::string name(node->name());
if (base_function.empty() && name.find(".resolver") != std::string::npos)
{
base_function = name.substr(0, name.find(".resolver"));
fprintf(dump_file, "[BASE] Found: %s\n", base_function.c_str());
}
if (!is_base_function(name))
return 0;
// Record info for the first matching function
if (gimple_bb_count_1 == 0 && gimple_stmt_count_1 == 0)
{
record_function_info(func);
}
return 0;
}
} // anonymous namespace
gimple_opt_pass *make_pass_prune_clones(gcc::context *ctxt)
{
return new pass_prune_clones(ctxt);
}
Results
The test cases were generously provided by Professor Chris Tyler.
Setup
- Build
gcc
with the newly added passcd ~/gcc-build-002
-
time make -j$(nproc) |& tee rebuild_log.log
make install
-
cd ~
-
tar xvf /public/spo600-test-clone.tgz
: copy the test cases into current directory -
cd ~/spo600/examples/test-clone
-
vim Makefile
and uncommentDUMP_ALL
-
PATH=”$HOME/gcc-test-001/bin:$PATH"
: set defaultgcc
to the newly built version -
make
: execute test cases
Note: ls
will reveal all the dumpfiles created, find the name of the pass. In my case my pass was named prune-clones
.
Test Results
japablo@x86-001:~/spo600/examples/test-clone$ cat clone-test-x86-prune-clone-test-core.c.228t.prune_clones | grep "DIFF"
;; Function scale_samples.arch_x86_64_v3 (scale_samples.arch_x86_64_v3, funcdef_no=25, decl_uid=3985, cgraph_uid=30, symbol_order=28)
[DIFF] BB count: 40 vs 31
japablo@x86-001:~/spo600/examples/test-clone$ cat clone-test-x86-prune-clone-test-core.c.228t.prune_clones
;; Function scale_samples.resolver (scale_samples.resolver, funcdef_no=27, decl_uid=4477, cgraph_uid=32, symbol_order=30)
[BASE] Found: scale_samples
Running the tests shows when a base function has been found and shows the differences in basic block count between the base and variant!
Reflection
This stage of the project was challenging but rewarding. There was a lot of trial and error along the way. My initial approach to identifying function variants—by checking file extensions, which partially worked.
Resources like the GCC documentation, YouTube tutorials, and StackOverflow were invaluable in helping me discover the right functions to traverse and extract information for fingerprinting.
A big thank you to Professor Chris Tyler as well for providing the test cases, which were conveniently automated using a Makefile.
Conclusion
The hardest part is done! Time to tackle the last part: Prune Suggestion.