# Presentation Speech Script

## Slide 1: Opening

“Hello everyone, my name is Hung Quoc To, and I am here to present our work on ‘Functional Overlap Reranking for Neural Code Generation.’ 

This research is co-authored with Minh Huynh Nguyen and Nghi D. Q. Bui.

The authors of this research are from FPT Software AI Center, Vietnam. The paper appears in the Findings of the Association for Computational Linguistics: ACL 2024.”

## Slide 2: Presentation Outline

“Our presentation will follow this outline:

1. Introduction
2. Related Work
3. Methodology
4. Experimental Setup
5. Experimental Results
6. Analysis and Ablation Study
7. Conclusion

Let’s start with the introduction.”


## Slide 4: Introduction

"Firstly, I'll talk about the advancements in Code Large Language Models, or CodeLLMs, which have significantly boosted code generation capabilities. These models, like Codex and WizardCoder, can generate code from natural language descriptions.

However, likelihood-based decoding methods like greedy search often lead to errors, making it necessary to rerank the generated solutions.

Despite these advancements, selecting the best outputs remains challenging due to their variability.

Previous approaches fail to capture the functional similarities and interactions between solution clusters.

To address this, we introduce SRank, which focuses on modeling inter-cluster relationships using a metric called functional overlap.

This novel reranking strategy not only improves performance but also shows robustness even with a limited number of solutions."


## Slide 5: Introduction

"Now, let's dive deeper into how SRank quantifies functional overlap to improve solution ranking. On the left, we have a concrete example.

Given a coding problem, the model is required to generate code solutions to address it, and we are provided with a pre-defined set of four test inputs: [0, 1, 2, 3].
Here, the CodeLLM's sampled solutions form three clusters. You can think of a cluster as a group of solutions that produce the same execution outputs when run on the pre-defined test inputs.
Each cluster has its own distinct set of execution outputs.
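As a minimal sketch of this clustering step (the `execute` helper is an assumption, not from the paper):

```python
from collections import defaultdict

def cluster_by_outputs(solutions, test_inputs, execute):
    # execute(solution, test_input) is an assumed helper that runs one
    # candidate solution on one test input and returns its output.
    clusters = defaultdict(list)
    for solution in solutions:
        outputs = tuple(execute(solution, x) for x in test_inputs)
        clusters[outputs].append(solution)
    # Solutions with identical output tuples land in the same cluster.
    return clusters
```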


Here, three clusters are tested with four inputs. Cluster 1 produces outputs [10, 20, 30, 40], Cluster 2 outputs [11, 20, 30, 40], and Cluster 3 outputs [10, 20, 30, 41]. Clusters overlap if they produce the same outputs for the same inputs. For instance, Cluster 1 overlaps with Cluster 2 on three values, indicating a 3/4 overlap, and similarly with Cluster 3.

The functional overlapping score is calculated by summing these overlaps. Cluster 1's total score is 6, making it the most representative cluster.

On the right, Python pseudocode illustrates the process of calculating the functional overlap score (a runnable sketch follows this list):
- It defines the execution outputs of each cluster.
- The functional overlap of two clusters is computed as the number of matching outputs between them.
- The functional overlap score of each cluster is then the sum of its functional overlaps with all other clusters.
- Finally, the most representative cluster is identified.
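Here is a minimal runnable version of that pseudocode, using the example outputs above (the variable names are illustrative):

```python
# Execution outputs of each cluster on the four test inputs [0, 1, 2, 3].
cluster_outputs = {
    "cluster_1": [10, 20, 30, 40],
    "cluster_2": [11, 20, 30, 40],
    "cluster_3": [10, 20, 30, 41],
}

def functional_overlap(outputs_a, outputs_b):
    # Number of test inputs on which the two clusters agree.
    return sum(a == b for a, b in zip(outputs_a, outputs_b))

# Score each cluster as the sum of its overlaps with every other cluster.
scores = {
    name: sum(
        functional_overlap(outputs, other_outputs)
        for other_name, other_outputs in cluster_outputs.items()
        if other_name != name
    )
    for name, outputs in cluster_outputs.items()
}

best_cluster = max(scores, key=scores.get)
print(scores)        # {'cluster_1': 6, 'cluster_2': 5, 'cluster_3': 5}
print(best_cluster)  # cluster_1
```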

By focusing on functional overlap, SRank significantly enhances the ranking of code solutions, improving accuracy and reliability."


## Slide 7: Related Work

"Let's look at the related work in reranking code generation. There are several approaches:

1. **Empirical Methods**: These rank solutions by the mean log probability of their tokens (a minimal sketch follows this list).
2. **Coder-Reviewer**: This method employs mutual information-based reranking, aligning solutions with natural language instructions.
3. **Execution-Based Methods**: Approaches such as MBR-exec, AlphaCode, LEVER, and CodeT. Some of these cluster solutions based on their execution outputs: AlphaCode groups solutions that produce identical outputs on the test inputs, while CodeT clusters solutions that pass model-generated test cases.
4. **Our Approach**: Unlike the others, our method is execution-based but doesn't require model training or fine-tuning. It complements methods like LEVER by focusing on functional overlap and inter-cluster relationships, providing a more robust reranking strategy."
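As a minimal sketch of the first family above, mean log-probability reranking might look like this (the `(code, token_logprobs)` pair format is an assumption for illustration):

```python
import numpy as np

def rank_by_mean_logprob(candidates):
    # candidates: list of (code, token_logprobs) pairs, where token_logprobs
    # holds the log probability of each generated token (assumed format).
    return sorted(
        candidates,
        key=lambda c: float(np.mean(c[1])),
        reverse=True,  # higher mean log probability ranks first
    )
```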


## Slide 4: Experimental Setup

"Now, let's go over our experimental setup.

**Models**: We evaluated on Codex, WizardCoder, StarCoder, and CodeGen, covering sizes from 6B to 34B parameters.

**Metrics**: We used pass@1 to assess the functional correctness of code solutions.

**Baselines**: We compared SRank with methods like Coder-Reviewer and CodeT.

**Benchmarks**: We used HumanEval, MBPP-S, and APPS benchmarks, ensuring no real test cases were exposed.

**Hyper-parameters**: For each problem, we sampled 100 code solutions and 100 sequences of test cases, with a temperature of 0.8, top-p of 0.95, and a 5-second timeout for executions."

## Slide 17: Experimental Results

"Let's review our experimental results.

The tables on the right show detailed results for HumanEval, MBPP-S, and APPS benchmarks. SRank's performance improvements are evident across various models, including WizardCoder and Codex.

Overall, these results highlight SRank's effectiveness and robustness in enhancing code solution rankings."

**Performance**: SRank consistently outperforms other techniques across three benchmarks. On the HumanEval benchmark, SRank achieves average improvements of 3.63% over CodeT and 8.81% over Coder-Reviewer in pass@1 scores. 

**Robustness**: On the APPS benchmark, SRank significantly outperforms all other baselines by a considerable margin. For example, using Codex002, SRank achieved a pass@1 score of 37.79% in the Introduction category, compared to CodeT's 34.60%, Greedy's 27.20%, and Random sampling's 20.35%. These results demonstrate that SRank is robust and scales well with different difficulty levels.

## Slide 19: Analysis

**Transition from Previous Slide**:
"Having seen the strong performance of SRank, let's delve deeper into the analysis to understand why it works so well."

**Assumption**: 
"We started with the assumption that incorrect solutions are diverse, leading to a low probability of functional agreement among them."

**Chart Analysis and Assumption Validation**:
"Now, let's analyze our results with the chart on the left. The x-axis represents functional overlap, which measures how much two clusters' outputs match. The y-axis shows the probability of pairs of incorrect solutions falling within different ranges of functional overlap. Each line on the chart represents a different model.

As functional overlap increases, the probability of finding pairs of incorrect solutions within that range generally decreases. Initially, there’s a steep decline, and it levels out towards the right.
This trend indicates that incorrect solutions are diverse in their functionality. When solutions are incorrect, they are less likely to produce similar outputs, resulting in low functional overlap.

These findings align with our assumption and confirm that incorrect solutions are functionally diverse, leading to a low probability of agreement among them."
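As a hedged sketch of how such a distribution could be computed (the input format is an assumption for illustration), one can histogram the pairwise overlaps among incorrect solutions:

```python
from itertools import combinations

import numpy as np

def overlap_fraction(outputs_a, outputs_b):
    # Fraction of test inputs on which two solutions agree.
    return sum(a == b for a, b in zip(outputs_a, outputs_b)) / len(outputs_a)

def incorrect_pair_distribution(incorrect_outputs, n_bins=10):
    # incorrect_outputs: one execution-output tuple per incorrect solution
    # (assumed format). Returns the probability of incorrect pairs falling
    # into each functional-overlap range, plus the bin edges.
    overlaps = [
        overlap_fraction(a, b) for a, b in combinations(incorrect_outputs, 2)
    ]
    counts, edges = np.histogram(overlaps, bins=n_bins, range=(0.0, 1.0))
    return counts / len(overlaps), edges
```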

## Slide 20: Ablation Study

"Now, let's examine our ablation study on scaling the number of generated test cases, shown in the figure on the right.

Each graph represents a different model, with the x-axis showing the number of generated test cases ranging from 0 to 50, and the y-axis showing the pass@1 performance. The solid lines represent our method with interaction, and the dashed lines represent methods without interaction.

As we can see, increasing the number of test cases generally improves performance.

The solid lines are consistently above the dashed lines, indicating that incorporating functional overlap enhances performance over just using cluster features.

The gap between reranking with and without interaction increases as the number of test cases grows, highlighting our method's scalability.

In our main experimental setting, we sample 100 sequences of test cases independently from CodeLLMs, with each sequence containing multiple test cases. This setup shows that even with a limited number of test cases, our method remains robust and effective.

Empirically, we observe that generating at least 30 test cases is recommended to fully benefit from cluster interaction.

**Comparison with Baselines**: SRank consistently outperforms other methods, including CodeT, across the full range of generated test cases."


## Slide 21: Ablation Study

"Continuing from our previous slide on the number of generated test cases, let's now look at the scaling number of sampled solutions.

The graph on the right-hand side is similar to the previous one, except that the x-axis presents the number of sampled code solutions, ranging from 1 to 50.

We draw the same observations as when scaling the number of generated test cases:

- As seen before, increasing the number of sampled solutions generally improves performance.

- The solid lines consistently outperform the dashed lines, indicating that incorporating functional overlap enhances performance over just using cluster features. This trend highlights our method's scalability and robustness, even with fewer sampled solutions.

- SRank consistently outperforms other methods, including CodeT, across the range of sampled code solutions."

## Slide 23: Conclusion

**Transition from Previous Slide**:
"To wrap up our presentation, let's summarize the key points and conclusions from our study."

**Novel Reranking Method**:
"We introduced SRank, a novel reranking strategy designed to extract optimal code generation solutions from CodeLLMs. SRank focuses on modeling inter-cluster relationships and leverages functional overlap to identify clusters with the highest functional interaction, which often indicates correct functionality.

**Extensive Evaluations**:
"Our extensive evaluations demonstrated that SRank consistently outperforms existing methods like CodeT and Coder-Reviewer across various well-known CodeLLMs. We showcased state-of-the-art performance on pass@1."

**Robustness and Effectiveness**:
"Moreover, our thorough analysis validated the robustness and effectiveness of SRank in realistic scenarios, even with a limited number of solutions and test cases. These findings are crucial for addressing the challenges of code generation in real-world applications, providing strategies for selecting superior solutions within constrained coding environments."

**Closing**:
"In conclusion, SRank not only sets a new benchmark in code generation reranking but also provides a practical and scalable solution for real-world applications."


## Slide 24: Thanks!

"Thank you for your attention.

If you have any questions or would like to discuss further, feel free to reach out to me at tqh262@gmail.com. You can also scan the QR codes on the right to access the full paper and our GitHub repository. 

Once again, thank you for your time, and I look forward to your questions."