Ping Yu


I am a Senior Research Scientist at Meta FAIR in Seattle, working in Jason Weston’s research group. I focus on advancing the reasoning and alignment capabilities of large language models (LLMs).

My research spans several directions. I have explored how to distill System-2 reasoning into LLMs; designed alignment strategies, such as training models to act as judges and integrating heterogeneous reward signals to improve GRPO; and developed methods for LLM self-improvement through iterative refinement. I am also interested in data-centric approaches, including generating and curating high-quality synthetic data to enhance pre-training and fine-tuning.

My current research focuses on long-horizon credit assignment for LLM-based agents and RL-trained models, particularly designing more informative and structured reward signals for agentic tasks with tool use and multi-step planning.

📧 Contact: ping.yu.nlp@gmail.com


Publications

Improving LLM Reasoning Ability

Hybrid Reinforcement: When Reward Is Sparse, It’s Better to Be Dense
Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Sharon Li, Jason Weston, Ping Yu.
2025 [PDF]

Current research on reasoning tasks focuses mainly on verifiable rewards. We studied whether GRPO training on verifiable answers transfers to hard-to-verify reasoning tasks, and whether including hard-to-verify training data is necessary at all. We proposed combining reward-model signals with rule-based signals during training, which allows the reward to account for intermediate reasoning steps while avoiding the reward hacking that arises when relying solely on reward models.
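The core idea can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the function name, exact-match rule, and mixing weight `alpha` are all assumptions for exposition.

```python
def hybrid_reward(response: str, reference: str, rm_score: float,
                  alpha: float = 0.5) -> float:
    """Blend a rule-based verifiable reward with a reward-model score.

    The rule-based term is 1.0 when the final answer matches the
    reference, else 0.0; the reward-model term (rm_score, in [0, 1])
    covers hard-to-verify aspects such as intermediate reasoning quality.
    """
    rule_reward = 1.0 if response.strip() == reference.strip() else 0.0
    return alpha * rule_reward + (1.0 - alpha) * rm_score
```

Because the rule-based term anchors the signal to a verifiable outcome, the reward model alone cannot be fully exploited, which is the intuition behind mixing the two.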


Distilling System 2 into System 1
Ping Yu, Jing Xu, Jason Weston, Ilia Kulikov.
2024 [PDF]

Investigated self-supervised methods to "compile" (distill) higher-quality outputs from System 2 techniques back into LLM generations without intermediate reasoning token sequences, as this reasoning has been distilled into System 1.


Exploring Data Quality Improvement & Synthetic Data Generation

CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Jing Xu.
2025 [PDF]

Proposed CoT-Self-Instruct, a synthetic data generation method that uses Chain-of-Thought reasoning and automatic filtering to create high-quality training data. Achieves state-of-the-art results on verifiable reasoning benchmarks (MATH500, AMC23, AIME24, GPQA-Diamond) and surpasses human and standard Self-Instruct data on AlpacaEval 2.0 and Arena-Hard.


R.I.P.: Better Models by Survival of the Fittest Prompts
Ping Yu, Weizhe Yuan, Olga Golovneva, Tianhao Wu, Sainbayar Sukhbaatar, Jason Weston, Jing Xu.
ICML 2025 [PDF]

Proposed Rejecting Instruction Preferences (RIP), a data filtering method that evaluates training prompts via rejected response quality and reward gap between chosen/rejected outputs. Achieved large gains in model performance: +9.4% AlpacaEval2, +8.7% Arena-Hard, +9.9% WildBench (Llama-3.1-8B); and boosted Llama-3.3-70B Arena-Hard accuracy from 67.5→82.9 (18th→6th place).
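The filtering criterion can be sketched as follows. The threshold values and helper names here are illustrative assumptions, not the paper's settings: a prompt survives only if its rejected response is still reasonably good (a very poor rejection suggests a noisy or unanswerable prompt) and the chosen/rejected reward gap is informative.

```python
def keep_prompt(chosen_reward: float, rejected_reward: float,
                rejected_floor: float = 0.2, min_gap: float = 0.1) -> bool:
    """RIP-style prompt filter (thresholds are illustrative).

    Keep a training prompt when the rejected response is not too bad
    and the chosen/rejected reward gap carries a clear signal.
    """
    gap = chosen_reward - rejected_reward
    return rejected_reward >= rejected_floor and gap >= min_gap

# Filter a toy preference dataset down to high-signal prompts.
data = [
    {"prompt": "p1", "chosen_r": 0.9, "rejected_r": 0.5},
    {"prompt": "p2", "chosen_r": 0.4, "rejected_r": 0.05},  # rejected too poor
]
kept = [d for d in data if keep_prompt(d["chosen_r"], d["rejected_r"])]
```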


Self-Alignment with Instruction Backtranslation
Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Omer Levy, Luke Zettlemoyer, Jason Weston, Mike Lewis.
ICLR 2024 (Oral) [PDF]

Developed instruction backtranslation, a scalable method to train instruction-following LLMs by automatically generating and curating instruction–response pairs from web text.
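The two-stage loop can be sketched as below. The model-call helpers (`gen`, `score`) are hypothetical stubs standing in for seed-model calls; the threshold is an illustrative assumption.

```python
def instruction_backtranslation(web_texts, generate_instruction, score_pair,
                                threshold=0.8):
    """Simplified self-augmentation + self-curation loop.

    generate_instruction(text): predict an instruction for which `text`
    would be a good response (self-augmentation).
    score_pair(instr, text): rate pair quality; keep only high scorers
    (self-curation).
    """
    pairs = []
    for text in web_texts:
        instr = generate_instruction(text)
        if score_pair(instr, text) >= threshold:
            pairs.append({"instruction": instr, "response": text})
    return pairs

# Toy usage with stub "model calls":
texts = ["A detailed guide to brewing espresso at home.", "ok"]
gen = lambda t: "Write a response covering: " + t.split(".")[0].lower()
score = lambda instr, t: 0.9 if len(t) > 10 else 0.1  # crude quality proxy
curated = instruction_backtranslation(texts, gen, score)
```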


Self-Improvement

RESTRAIN: From Spurious Votes to Signals – Self-Driven RL with Self-Penalization
Zhaoning Yu, Will Su, Leitian Tao, Haozhu Wang, Aashu Singh, Hanchao Yu, Jianyu Wang, Hongyang Gao, Weizhe Yuan, Jason Weston, Ping Yu*, Jing Xu*.
2025 [PDF] (* Equal contribution)

Introduced RESTRAIN, a self-penalizing RL framework that transforms unlabeled data into training signals by penalizing overconfident or low-confidence rollouts while preserving promising reasoning chains. RESTRAIN integrates with policy optimization (e.g., GRPO) and achieves large gains on challenging reasoning benchmarks, narrowing the gap with supervised training.


Shepherd: A Critic for Language Model Generation
Tianlu Wang*, Ping Yu*, Xiaoqing Ellen Tan, Sean O’Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, Asli Celikyilmaz.
2023 [PDF] (* Equal contribution)

Introduced Shepherd, a 7B-parameter LLM tuned to critique and refine model outputs using a curated feedback dataset. Shepherd’s critiques match or surpass those of larger models, achieving 53–87% win rates against competitive alternatives and rivaling ChatGPT in human evaluation.


LLM as Judge

J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
Chenxi Whitehouse, Tianlu Wang, Ping Yu, Xian Li, Jason Weston, Ilia Kulikov, Swarnadeep Saha.
2025 [PDF]

Introduced J1, a reinforcement learning approach for training LLM-as-a-Judge with verifiable rewards that incentivize reasoning and reduce bias. J1 outperforms all existing 8B/70B models (including DeepSeek-R1 distillations), surpasses o1-mini, and even beats R1 on some benchmarks, demonstrating stronger judgment ability through improved chain-of-thought reasoning.


Self-Taught Evaluators
Tianlu Wang, Ilia Kulikov*, Olga Golovneva*, Ping Yu*, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li.
2024 [PDF] [Model] (* Equal contribution)

Proposed Self-Taught Evaluator, a synthetic-data approach for training LLM-as-a-Judge without human labels. Through iterative self-improvement with reasoning traces and judgments, improved LLaMA3-70B-Instruct from 75.4 → 88.3 on RewardBench, surpassing GPT-4 and matching top reward models trained with human preferences.


Others

Chameleon: Mixed-modal early-fusion foundation models
Chameleon team.
2024 [PDF] [GitHub]

Developed Chameleon, a family of early-fusion token-based multimodal models for unified image–text understanding and generation. Chameleon achieves state-of-the-art results in image captioning, surpasses LLaMA-2 on text tasks, competes with Mixtral 8x7B and Gemini-Pro, and matches/exceeds much larger models like GPT-4V in human evaluations of long-form mixed-modal generation.


OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
OPT team.
2022 [PDF] [Model-1.3B] [Model-30B]

Developed OPT-IML Bench, a large benchmark of 2,000 NLP tasks for studying instruction-tuning decisions (task diversity, sampling, demonstrations, objectives). Using this framework, trained OPT-IML 30B and 175B, showing improved generalization across unseen categories, tasks, and instances, and significantly outperforming OPT on multiple benchmarks.