I was testing the SketchBLEU implementation via CodeBLEU (from `validation/evaluation_scripts/codebleu`). I was able to get it running after a couple of modifications, but I ran into an issue where the syntax match score is extremely low, even when evaluating two identical repos.
Steps to Reproduce
- Clone the repo and set up the environment (conda environment details provided below).
- Apply the following fixes to make CodeBLEU run without errors:
  - In `CodeS/validation/evaluation_scripts/codebleu/codebleu/syntax_match.py`, update the `to_str` function in the `FileOrNode` class:

    ```python
    field_names.append(cursor.current_field_name)  # instead of cursor.field_name
    ```
  - In `CodeS/validation/evaluation_scripts/codebleu/codebleu/__main__.py`, update the `main` function to properly handle `repo_bleu` runs:

    ```python
    def main(
        ref_files: List[str],
        hyp_file: str,
        lang: str,
        weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
        repo_bleu: bool = False,
    ) -> None:
        if repo_bleu:
            repo_bleu_score = calc_repobleu(
                [Path(ref_file) for ref_file in ref_files],
                [Path(hyp_file)],
                lang,
                weights=weights,
            )
            print("Repo-level CodeBLEU score: ", repo_bleu_score)
        else:
            code_bleu_score = calc_codebleu(
                references,
                hypothesis,
                lang,
                weights=weights,
            )
            ...
    ```
- Create two dummy repos with identical files:

  `tmp.py`:

  ```python
  # This script reads from 'input.txt' and writes its content to 'output.txt'
  with open('input.txt', 'r') as infile:
      data = infile.read()
  with open('output.txt', 'w') as outfile:
      outfile.write(data)
  ```

  `codebleu.py`: copied directly from this repo's `codebleu.py`.
- Run the command:

  ```shell
  python -m codebleu --refs "../repo_1" --hyp "../repo_2" --lang python --repo
  ```
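For convenience, the dummy-repo setup above can be scripted. This is just a sketch: the directory layout is illustrative (I actually used sibling directories `../repo_1` and `../repo_2`), and `codebleu.py` still has to be copied in from the repo by hand.

```python
import os
import tempfile

# Contents of tmp.py, identical in both dummy repos.
TMP_PY = (
    "# This script reads from 'input.txt' and writes its content to 'output.txt'\n"
    "with open('input.txt', 'r') as infile:\n"
    "    data = infile.read()\n"
    "with open('output.txt', 'w') as outfile:\n"
    "    outfile.write(data)\n"
)

base = tempfile.mkdtemp()  # illustrative location; any two sibling dirs work
for repo in ("repo_1", "repo_2"):
    repo_path = os.path.join(base, repo)
    os.makedirs(repo_path, exist_ok=True)
    with open(os.path.join(repo_path, "tmp.py"), "w") as f:
        f.write(TMP_PY)
    # codebleu.py is copied in manually from this repo's codebleu.py (omitted here).
```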
Expected Behavior
Since both repos are identical, I expected all scores (including syntax match) to be 1.0 (or very close to it).
Actual Behavior
The output was:
```
Repo-level CodeBLEU score:  {
    'codebleu': np.float64(0.7503026634382567),
    'ngram_match_score': 1.0,
    'weighted_ngram_match_score': 1.0,
    'syntax_match_score': 0.0012106537530266344,
    'dataflow_match_score': np.float64(1.0)
}
```
The syntax match score is ~0.001, even though the repos are identical.
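Incidentally, the reported syntax score is the reciprocal of an integer. If the score is a simple matched/total subtree ratio (an assumption on my part, based on how syntax match is usually computed), that would mean exactly one subtree matched out of 826 candidates:

```python
score = 0.0012106537530266344  # syntax_match_score from the output above
ratio = 1 / score
# The reciprocal comes out as a whole number, consistent with a 1/826 match ratio.
assert round(ratio) == 826
```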
Environment
Conda environment (relevant libraries):
python 3.11.13
numpy 2.3.3
tree-sitter 0.20.1
types-tree-sitter 0.20.1.20240311
codebleu 0.4.0
Notes
- This might be related to the `tree-sitter` API changes (hence the fix needed in `syntax_match.py`).
- Possibly the repo-level aggregation is not correctly handling syntax trees across files.
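To illustrate the API-drift point: rather than pinning the code to either attribute name, a small compatibility accessor could cover both spellings. This is only a sketch — the exact attribute names and whether they are properties or methods vary across `py-tree-sitter` releases, so both branches should be treated as assumptions.

```python
def cursor_field_name(cursor):
    """Return the field name at the cursor, tolerating tree-sitter API drift."""
    # Some py-tree-sitter releases expose `field_name`; others use
    # `current_field_name` (sometimes as a method rather than a property).
    name = getattr(cursor, "field_name", None)
    if name is None:
        name = getattr(cursor, "current_field_name", None)
    if callable(name):
        name = name()
    return name
```

With a helper like this, the line in `syntax_match.py` would read `field_names.append(cursor_field_name(cursor))` regardless of the installed `tree-sitter` version.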
Could you clarify whether this is expected behavior or if the syntax match computation is incorrect at the repo level?