The evolution of Large Language Models (LLMs) has ushered in a transformative era in scientific discovery, positioning them as powerful tools for streamlining research, from idea generation to verification and publication writing. This study evaluates LLMs on generating code from algorithm descriptions in recent NLP papers. The task requires two key competencies: (1) algorithm comprehension: synthesizing information from papers and the academic literature to understand implementation logic, and (2) coding expertise: identifying dependencies and correctly implementing the necessary APIs. To facilitate rigorous evaluation, we introduce SciReplicate-Bench, a benchmark of 100 tasks from 36 NLP papers published in 2024, featuring detailed annotations and comprehensive test cases. To assess algorithm understanding, we introduce reasoning graph accuracy, which quantifies the similarity between generated and reference reasoning graphs derived from code comments and structure. To evaluate implementation quality, we employ execution accuracy, CodeBLEU, and repository dependency/API recall metrics.
Building on SciReplicate-Bench, we propose Sci-Reproducer, a multi-agent framework consisting of a Paper Agent that interprets algorithmic concepts from the literature and a Code Agent that retrieves dependencies from repositories and implements solutions. In our experiments, we evaluate a range of powerful non-reasoning and reasoning LLMs as foundation models. The best-performing LLM with Sci-Reproducer achieves only 39% execution accuracy, highlighting the benchmark's difficulty.
Figure 1: Overview of the SciReplicate-Bench benchmark.
Task Definition
SciReplicate-Bench focuses on repository-level code generation, where each task centers on implementing a specific function or class method. For code generation, the LLMs are provided with the target algorithm's LaTeX description together with its surrounding paper and code repository context as inputs.
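As a rough illustration of this task format, one benchmark instance can be thought of as bundling the algorithm description with its repository context, as in the sketch below. The field names are our own illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical sketch of one SciReplicate-Bench task instance.
# Field names are illustrative assumptions, not the benchmark's real schema.
task = {
    "paper_id": "example-2024-001",                   # source NLP paper
    "latex_description": r"\subsection{Method} ...",  # LaTeX source of the target algorithm
    "repository_path": "repos/example-2024-001/",     # code repository providing context
    "target_function": "model/attention.py::sparse_attention",  # function to implement
    "test_cases": ["tests/test_sparse_attention.py"],           # used for execution accuracy
}
```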
Metrics
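As a minimal sketch of how execution accuracy (the main metric named in the abstract) could be computed, the snippet below runs each generated implementation against its task's test cases and records whether they all pass. It assumes pytest-style test files and the illustrative task fields from the sketch above; it is not the benchmark's official evaluation harness, and the remaining metrics (reasoning graph accuracy, CodeBLEU, dependency/API recall) are defined in the paper.

```python
import subprocess

def execution_accuracy(tasks):
    """Fraction of tasks whose generated code passes all provided test cases.

    Minimal sketch: assumes each task dict carries 'repository_path' and
    'test_cases' (pytest files), as in the illustrative schema above.
    """
    passed = 0
    for task in tasks:
        result = subprocess.run(
            ["pytest", "-q", *task["test_cases"]],
            cwd=task["repository_path"],
            capture_output=True,
        )
        passed += result.returncode == 0  # exit code 0 means all tests passed
    return passed / len(tasks)
```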
Figure 2: Overview of the Sci-Reproducer framework.
Table 1: Predefined actions for the Paper Agent and the Code Agent.
To address this task, we introduce Sci-Reproducer, a dual-agent framework designed for scientific paper methodology replication. As illustrated in Figure 2, Sci-Reproducer comprises a Paper Agent and a Code Agent that collaboratively work to replicate algorithms described in a given paper. Both agents follow the ReAct strategy, and their predefined actions are outlined in Table 1.
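A minimal sketch of this two-stage flow is shown below: the Paper Agent first produces a report, which the Code Agent then combines with the LaTeX description and the repository to produce code. The agent interfaces and field names (reused from the illustrative task sketch above) are our assumptions, not the released implementation.

```python
def sci_reproducer(task, paper_agent, code_agent):
    """Hypothetical top-level flow of Sci-Reproducer (interfaces are assumptions).

    Each agent is assumed to run an internal ReAct loop
    (thought -> action -> observation) over its predefined actions from Table 1.
    """
    # Stage 1: the Paper Agent extracts missing algorithmic details from the paper.
    report = paper_agent.run(task["latex_description"])
    # Stage 2: the Code Agent implements the algorithm using the report and the repository.
    return code_agent.run(task["latex_description"], report, task["repository_path"])
```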
Paper Agent
The Paper Agent incrementally builds an understanding of the target algorithm by executing predefined actions to query the literature context. After concluding that all necessary information has been collected, it generates a comprehensive report comprising key findings that fill in the missing components of the target algorithm's LaTeX source.
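A rough sketch of this loop is given below, assuming a generic LLM wrapper and a mapping from the literature-search actions of Table 1 to callables; all names are hypothetical.

```python
def run_paper_agent(llm, latex, paper_tools, max_steps=20):
    """Illustrative ReAct-style loop for the Paper Agent (all interfaces are assumptions).

    `paper_tools` maps literature-search action names (see Table 1) to callables
    that return the relevant context from the paper.
    """
    context = [latex]
    for _ in range(max_steps):
        thought, action, argument = llm.next_step(context)   # assumed LLM wrapper
        if action == "Finish":
            # Enough information gathered: summarize the key findings as a report.
            return llm.write_report(context)
        context.append((thought, action, paper_tools[action](argument)))
    return llm.write_report(context)
```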
Code Agent
The Code Agent integrates the target algorithm's LaTeX code with the Paper Agent's report to comprehensively understand the algorithm. It leverages actions to search the code repository for necessary dependencies that aid implementation. The agent can also browse websites for additional information and use a compiler to test and iteratively debug the code, ensuring proper execution by identifying and fixing syntax errors.
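The code-related actions named in the analysis below ("SearchFile", "SearchCodeItem", "SearchWeb") plus a compile-and-test step suggest a loop like the sketch below; the dispatch structure and helper signatures are our assumptions, not the released code.

```python
def run_code_agent(llm, latex, report, repo_tools, compile_and_run, max_steps=20):
    """Illustrative Code Agent loop (all interfaces are assumptions).

    `repo_tools` maps code-related action names ("SearchFile", "SearchCodeItem",
    "SearchWeb") to callables; `compile_and_run` executes candidate code and
    returns an error message, or None on success, enabling iterative debugging.
    """
    context = [latex, report]
    code = None
    for _ in range(max_steps):
        thought, action, argument = llm.next_step(context)        # assumed LLM wrapper
        if action in repo_tools:
            # Retrieve dependencies or external information before coding.
            context.append((thought, action, repo_tools[action](argument)))
        else:  # "Finish": the argument is the candidate implementation
            code = argument
            error = compile_and_run(code)                          # test with the compiler
            if error is None:
                return code                                        # executes cleanly
            context.append((thought, "CompilerFeedback", error))   # debug and retry
    return code
```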
Video: Demo of the Sci-Reproducer framework.
Table 2: Performance evaluation on the SciReplicate-Bench. "Exe Acc" represents execution accuracy while "RG Acc" indicates reasoning graph accuracy.
Table 2 reports Sci-Reproducer's evaluation results and the contributions of the Paper Agent and Code Agent. The results offer the following key insights:
Table 3: A grouped bar chart illustrating the frequency of tool usage by different models. The x-axis represents various actions, while the y-axis indicates the total number of times each tool was used on this dataset.
Reasoning LLMs tend to overthink, which limits their improvement: As shown in Table 3, reasoning LLMs tend to rely more on internal deliberation than on retrieving external information. Regarding code-related actions, reasoning LLMs use “SearchFile”, “SearchCodeItem”, and “SearchWeb” an average of 25.0, 3.3, and 0.0 times, respectively. In comparison, non-reasoning LLMs use these actions significantly more often, with averages of 210.4, 68.2, and 16.8 times, respectively. Such over-reliance on internal reasoning hurts their overall performance: the execution accuracy of models such as o3-mini-high and o3-mini-low is comparable to that of gpt-4o-mini, despite their theoretical advantages.
Table 4: Examples of different missing-information categories.
Please check our paper and dataset for details. If you find our work helpful, please consider citing it using the following:
@article{xiang2025scireplicate,
  title={SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers},
  author={Xiang, Yanzheng and Yan, Hanqi and Ouyang, Shuyin and Gui, Lin and He, Yulan},
  journal={arXiv preprint arXiv:2504.00255},
  year={2025}
}
For questions and comments, please contact Yanzheng Xiang at: yanzheng.xiang[AT]kcl.ac.uk