SPO600 Project Stage 2, Part 2: "Where Art Thou, Clones?"

Introduction I’ll keep the introduction short - welcome back! The code will be broken down into parts and explained as we go along. Let’s find those function variants! Identifying the Function Variants Identifying the clones require a reference point - the base function serves as the “source of truth” with its respective basic blocks, statements, control flow, etc., and becomes the template to compare against when analyzing the variants. Identifying the Base Function // ~/gccsrc/gcc/gcc/tree-prune-clones.cc ... std::string base_function; ... unsigned int pass_prune_clones::execute(function *func) { ... const std::string name(node->name()); if (base_function.empty() && name.find(".resolver") != std::string::npos) { base_function = name.substr(0, name.find(".resolver")); fprintf(dump_file, "[BASE] Found: %s\n", base_function.c_str()); } } ... bool pass_prune_clones::is_base_function(const std::string &name) const { return name.find(base_function) == 0 && name.find(".resolver") == std::string::npos; } The first thing to do is to get the name of the function when entering the call graph. Note that the name of function can have any of the variant extensions (e.g. .constprop.0, .ssa, etc.). .resolver is the source function because ifuncs get resolved during runtime - allow multiple implementations during runtime. pass_prune_clones::execute extracts the function name from .resolver. pass_prune_clones::is_base_function checks if the current node is function variant by comparing it the base function name Fingerprinting Functions gcc documentation provides a list of accessors that simplifies the extraction inGIMPLE. The combo of get_gimple_vars and record_function_info is being used to record a structured fingerprint of a function — not based on specific variable names, but on how it behaves structurally (e.g., statement types, number of operands, variable usage patterns). get_gimple_vars(...) processes a single GIMPLE statement by: Extract Variables For GIMPLE_COND, GIMPLE_ASSIGN ,GIMPLE_CALL, and GIMPLE_DEBUG_BIND, it collects involved tree variables (like lhs/rhs, debug values) into a std::vector vars — rather than identifying variables by their names, it tracks their first occurrence and usage position for structural comparisons across different functions. Map First Occurrence For each variable in vars: - If this is the **first time** it appears, it's assigned a new ID `i`. - If it's been seen before, it reuses the same ID. The output is stored in: first_occurrence_map: mapping tree → ID out_map: mapping ID of this statement → ID of first occurrence This abstract variable names away and focusses on their role — a = b + c and x = y + z would look the same if variable roles match. // ~/gccsrc/gcc/gcc/tree-prune-clones.cc ... int pass_prune_clones::get_gimple_vars(std::map &first_occurrence_map, int i, gimple *stmt, std::map &out_map) const { std::vector vars; switch (gimple_code(stmt)) { case GIMPLE_COND: { tree left_op = gimple_cond_lhs(stmt); tree right_op = gimple_cond_rhs(stmt); if (left_op) vars.push_back(left_op); if (right_op) vars.push_back(right_op); break; } case GIMPLE_CALL: { tree left_op = gimple_call_lhs(stmt); if (left_op) vars.push_back(left_op); break; } case GIMPLE_ASSIGN: { tree left_op = gimple_assign_lhs(stmt); vars.push_back(left_op); int rhs_ops = gimple_num_ops(stmt) - 1; for (int i = 1; i second; } else { first_occurrence_map[debug_var] = i; out_map[i] = i; } ++i; } return i; } Putting it Together Finally, there’s a way to identify the base function and to fingerprint each function for comparison! Let’s put the code all together! // ~/gccsrc/gcc/gcc/tree-prune-clones.cc #include #include #include #include #include "config.h" #include "system.h" #include "coretypes.h" #include "backend.h" #include "tree.h" #include "gimple.h" #include "tree-pass.h" #include "ssa.h" #include "tree-pretty-print.h" #include "gimple-iterator.h" #include "gimple-walk.h" #include "internal-fn.h" #include "gimple-pretty-print.h" #include "cgraph.h" #include "gimple-ssa.h" #include "attribs.h" #include "pretty-print.h" #include "tree-inline.h" #include "intl.h" #include "function.h" #include "basic-block.h" namespace { const pass_data prune_clones_pass_data = { GIMPLE_PASS, /* type */ "prune_clones", /* name */ OPTGROUP_NONE, /* optinfo_flags */ TV_NONE, /* tv_id */ PROP_cfg, 0, 0, 0, 0, }; class pass_prune_clones : public gimple_opt_pass { public: pass_prune_clones(gcc::context *ctxt) : gi

Apr 6, 2025 - 04:29
 0
SPO600 Project Stage 2, Part 2: "Where Art Thou, Clones?"

Introduction

I’ll keep the introduction short - welcome back! The code will be broken down into parts and explained as we go along. Let’s find those function variants!

Identifying the Function Variants

Identifying the clones require a reference point - the base function serves as the “source of truth” with its respective basic blocks, statements, control flow, etc., and becomes the template to compare against when analyzing the variants.

Identifying the Base Function

// ~/gccsrc/gcc/gcc/tree-prune-clones.cc
...

std::string base_function;

...

unsigned int pass_prune_clones::execute(function *func)
{
    ...
    const std::string name(node->name());
  if (base_function.empty() && name.find(".resolver") != std::string::npos)
  {
      base_function = name.substr(0, name.find(".resolver"));
      fprintf(dump_file, "[BASE] Found: %s\n", base_function.c_str());
  }
}

...

bool pass_prune_clones::is_base_function(const std::string &name) const
{
    return name.find(base_function) == 0 && name.find(".resolver") == std::string::npos;
}

The first thing to do is to get the name of the function when entering the call graph. Note that the name of function can have any of the variant extensions (e.g. .constprop.0, .ssa, etc.). .resolver is the source function because ifuncs get resolved during runtime - allow multiple implementations during runtime.

  1. pass_prune_clones::execute extracts the function name from .resolver.
  2. pass_prune_clones::is_base_function checks if the current node is function variant by comparing it the base function name

Fingerprinting Functions

gcc documentation provides a list of accessors that simplifies the extraction inGIMPLE.

The combo of get_gimple_vars and record_function_info is being used to record a structured fingerprint of a function — not based on specific variable names, but on how it behaves structurally (e.g., statement types, number of operands, variable usage patterns).

get_gimple_vars(...) processes a single GIMPLE statement by:

  1. Extract Variables

    For GIMPLE_COND, GIMPLE_ASSIGN ,GIMPLE_CALL, and GIMPLE_DEBUG_BIND, it collects involved tree variables (like lhs/rhs, debug values) into a std::vector vars — rather than identifying variables by their names, it tracks their first occurrence and usage position for structural comparisons across different functions.

  2. Map First Occurrence

    For each variable in vars:

- If this is the **first time** it appears, it's assigned a new ID `i`.
- If it's been seen before, it reuses the same ID.

The output is stored in:

  • first_occurrence_map: mapping tree → ID
  • out_map: mapping ID of this statement → ID of first occurrence

This abstract variable names away and focusses on their role — a = b + c and x = y + z would look the same if variable roles match.

// ~/gccsrc/gcc/gcc/tree-prune-clones.cc

...

int pass_prune_clones::get_gimple_vars(std::map<tree, int> &first_occurrence_map, int i, gimple *stmt, std::map<int, int> &out_map) const
{
    std::vector<tree> vars;

    switch (gimple_code(stmt))
    {

    case GIMPLE_COND:
    {
        tree left_op = gimple_cond_lhs(stmt);
        tree right_op = gimple_cond_rhs(stmt);
        if (left_op)
            vars.push_back(left_op);
        if (right_op)
            vars.push_back(right_op);
        break;
    }

    case GIMPLE_CALL:
    {
        tree left_op = gimple_call_lhs(stmt);
        if (left_op)
            vars.push_back(left_op);
        break;
    }

    case GIMPLE_ASSIGN:
    {
        tree left_op = gimple_assign_lhs(stmt);
        vars.push_back(left_op);

        int rhs_ops = gimple_num_ops(stmt) - 1;
        for (int i = 1; i <= rhs_ops; ++i)
        {
            tree op = gimple_op(stmt, i);
            vars.push_back(op);
        }
        break;
    }

    case GIMPLE_DEBUG_BIND:
    {
        tree debug_var = gimple_debug_bind_get_var(stmt);
        tree debug_value = gimple_debug_bind_get_value(stmt);
        if (debug_var)
            vars.push_back(debug_var);
        if (debug_value)
            vars.push_back(debug_value);
        break;
    }

    default:
        break;
    }

    for (tree debug_var : vars)
    {
        std::map<tree, int>::iterator it = first_occurrence_map.find(debug_var);

        if (it != first_occurrence_map.end())
        {
            out_map[i] = it->second;
        }
        else
        {
            first_occurrence_map[debug_var] = i;
            out_map[i] = i;
        }

        ++i;
    }

    return i;
}

Putting it Together

Finally, there’s a way to identify the base function and to fingerprint each function for comparison! Let’s put the code all together!

// ~/gccsrc/gcc/gcc/tree-prune-clones.cc
#include 
#include 
#include 
#include 

#include "config.h"
#include "system.h"
#include "coretypes.h"
#include "backend.h"
#include "tree.h"
#include "gimple.h"
#include "tree-pass.h"
#include "ssa.h"
#include "tree-pretty-print.h"
#include "gimple-iterator.h"
#include "gimple-walk.h"
#include "internal-fn.h"
#include "gimple-pretty-print.h"
#include "cgraph.h"
#include "gimple-ssa.h"
#include "attribs.h"
#include "pretty-print.h"
#include "tree-inline.h"
#include "intl.h"
#include "function.h"
#include "basic-block.h"

namespace
{
    const pass_data prune_clones_pass_data = {
        GIMPLE_PASS,    /* type */
        "prune_clones", /* name */
        OPTGROUP_NONE,  /* optinfo_flags */
        TV_NONE,        /* tv_id */
        PROP_cfg,
        0,
        0,
        0,
        0,
    };

    class pass_prune_clones : public gimple_opt_pass
    {
    public:
        pass_prune_clones(gcc::context *ctxt)
            : gimple_opt_pass(prune_clones_pass_data, ctxt) {}

        bool gate(function *) final override { return true; }
        void set_pass_param(unsigned int, bool) override {}
        unsigned int execute(function *) final override;

    private:
        int get_gimple_vars(std::map<tree, int> &first_occurrence_map, int i, gimple *stmt, std::map<int, int> &out_map) const;
        bool is_base_function(const std::string &name) const;
        void record_function_info(function *func);

        std::string base_function;
        int gimple_bb_count_1 = 0;
        int gimple_stmt_count_1 = 0;
        std::vector<enum gimple_code> gimple_sequence_1;
        std::vector<unsigned> gimple_ops_count_1;
        std::vector<enum tree_code> gimple_op_types_1;
        std::map<int, int> gimple_var_map_1;
    };

    int pass_prune_clones::get_gimple_vars(std::map<tree, int> &first_occurrence_map, int i, gimple *stmt, std::map<int, int> &out_map) const
    {
        std::vector<tree> vars;

        switch (gimple_code(stmt))
        {

        case GIMPLE_COND:
        {
            tree left_op = gimple_cond_lhs(stmt);
            tree right_op = gimple_cond_rhs(stmt);
            if (left_op)
                vars.push_back(left_op);
            if (right_op)
                vars.push_back(right_op);
            break;
        }

        case GIMPLE_CALL:
        {
            tree left_op = gimple_call_lhs(stmt);
            if (left_op)
                vars.push_back(left_op);
            break;
        }

        case GIMPLE_ASSIGN:
        {
            tree left_op = gimple_assign_lhs(stmt);
            vars.push_back(left_op);

            int rhs_ops = gimple_num_ops(stmt) - 1;
            for (int i = 1; i <= rhs_ops; ++i)
            {
                tree op = gimple_op(stmt, i);
                vars.push_back(op);
            }
            break;
        }

        case GIMPLE_DEBUG_BIND:
        {
            tree debug_var = gimple_debug_bind_get_var(stmt);
            tree debug_value = gimple_debug_bind_get_value(stmt);
            if (debug_var)
                vars.push_back(debug_var);
            if (debug_value)
                vars.push_back(debug_value);
            break;
        }

        default:
            break;
        }

        for (tree debug_var : vars)
        {
            std::map<tree, int>::iterator it = first_occurrence_map.find(debug_var);

            if (it != first_occurrence_map.end())
            {
                out_map[i] = it->second;
            }
            else
            {
                first_occurrence_map[debug_var] = i;
                out_map[i] = i;
            }

            ++i;
        }

        return i;
    }

    bool pass_prune_clones::is_base_function(const std::string &name) const
    {
        return name.find(base_function) == 0 && name.find(".resolver") == std::string::npos;
    }

    void pass_prune_clones::record_function_info(function *func)
    {
        basic_block bb;
        int i = 0;
        std::map<tree, int> first_occurrence_map;

        FOR_EACH_BB_FN(bb, func)
        {
            gimple_bb_count_1++;
            for (gimple_stmt_iterator gsi = gsi_start_bb(bb); !gsi_end_p(gsi); gsi_next(&gsi))
            {
                gimple_stmt_count_1++;
                i = get_gimple_vars(first_occurrence_map, i, gsi_stmt(gsi), gimple_var_map_1);
                gimple_sequence_1.push_back(gimple_code(gsi_stmt(gsi)));
                gimple_ops_count_1.push_back(gimple_num_ops(gsi_stmt(gsi)));
                if (is_gimple_assign(gsi_stmt(gsi)))
                {
                    gimple_op_types_1.push_back(gimple_assign_rhs_code(gsi_stmt(gsi)));
                }
            }
        }
    }

    unsigned int pass_prune_clones::execute(function *func)
    {
        if (!dump_file || !func || !func->cfg)
            return 0;

        struct cgraph_node *node = cgraph_node::get(func->decl);
        if (!node)
        {
            fprintf(dump_file, "ERROR: cgraph_node::get(func->decl)\n");
            return 0;
        }

        const std::string name(node->name());
        if (base_function.empty() && name.find(".resolver") != std::string::npos)
        {
            base_function = name.substr(0, name.find(".resolver"));
            fprintf(dump_file, "[BASE] Found: %s\n", base_function.c_str());
        }

        if (!is_base_function(name))
            return 0;

        // Record info for the first matching function
        if (gimple_bb_count_1 == 0 && gimple_stmt_count_1 == 0)
        {
            record_function_info(func);
        }

        return 0;
    }

} // anonymous namespace

gimple_opt_pass *make_pass_prune_clones(gcc::context *ctxt)
{
    return new pass_prune_clones(ctxt);
}

Results

The test cases were generously provided by Professor Chris Tyler.

Setup

  1. Build gcc with the newly added pass
    1. cd ~/gcc-build-002
    2. time make -j$(nproc) |& tee rebuild_log.log
    3. make install
  2. cd ~
  3. tar xvf /public/spo600-test-clone.tgz: copy the test cases into current directory
  4. cd ~/spo600/examples/test-clone
  5. vim Makefile and uncomment DUMP_ALL
  6. PATH=”$HOME/gcc-test-001/bin:$PATH": set default gcc to the newly built version
  7. make: execute test cases

Note: ls will reveal all the dumpfiles created, find the name of the pass. In my case my pass was named prune-clones.

Test Results

japablo@x86-001:~/spo600/examples/test-clone$ cat clone-test-x86-prune-clone-test-core.c.228t.prune_clones | grep "DIFF"
;; Function scale_samples.arch_x86_64_v3 (scale_samples.arch_x86_64_v3, funcdef_no=25, decl_uid=3985, cgraph_uid=30, symbol_order=28)
[DIFF] BB count: 40 vs 31

japablo@x86-001:~/spo600/examples/test-clone$ cat clone-test-x86-prune-clone-test-core.c.228t.prune_clones
;; Function scale_samples.resolver (scale_samples.resolver, funcdef_no=27, decl_uid=4477, cgraph_uid=32, symbol_order=30)
[BASE] Found: scale_samples

Running the tests shows when a base function has been found and shows the differences in basic block count between the base and variant!

Reflection

This stage of the project was challenging but rewarding. There was a lot of trial and error along the way. My initial approach to identifying function variants—by checking file extensions, which partially worked.

Resources like the GCC documentation, YouTube tutorials, and StackOverflow were invaluable in helping me discover the right functions to traverse and extract information for fingerprinting.

A big thank you to Professor Chris Tyler as well for providing the test cases, which were conveniently automated using a Makefile.

Conclusion

The hardest part is done! Time to tackle the last part: Prune Suggestion.