
Cross-Model Automated Program Repair Benchmark
A benchmark for evaluating seven state-of-the-art LLMs on automated program repair across three bug categories: human-written bugs, LLM-generated bugs, and code containing both. Results reveal a self-repair bias (models fix their own bugs more readily than bugs introduced by other models) and expose the limits of static APR benchmarks contaminated by training data.
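To make the self-repair bias measurement concrete, here is a minimal sketch of how such a metric could be computed from per-attempt repair records. All names (`RepairResult`, its fields, `self_repair_bias`) are hypothetical illustrations, not the benchmark's actual schema or code.

```python
from dataclasses import dataclass

# Hypothetical record of one repair attempt; field names are illustrative.
@dataclass
class RepairResult:
    model: str       # model attempting the fix
    bug_origin: str  # "human", "llm", or "mixed"
    bug_author: str  # "human" or the model that introduced the bug
    fixed: bool      # did the patched code pass the test suite?

def fix_rate(results, predicate):
    """Fraction of attempts matching `predicate` that succeeded."""
    subset = [r for r in results if predicate(r)]
    return sum(r.fixed for r in subset) / len(subset) if subset else 0.0

def self_repair_bias(results, model):
    """Fix rate on the model's own bugs minus its rate on other LLMs' bugs.
    A positive value suggests the model repairs its own bugs more easily."""
    own = fix_rate(results, lambda r: r.model == model and r.bug_author == model)
    other = fix_rate(results, lambda r: r.model == model
                     and r.bug_origin == "llm" and r.bug_author != model)
    return own - other

# Toy data: model A fixes its own bug but not model B's.
results = [
    RepairResult("A", "llm", "A", True),
    RepairResult("A", "llm", "B", False),
]
print(self_repair_bias(results, "A"))  # 1.0
```

One design choice here is to compare against LLM-authored bugs only, so the bias score isolates model-vs-model effects rather than mixing in human-written bugs.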

