Some Random Stuff 👋
I am Mingyuan Wu, a part-time researcher at Meta. I am currently a final-year PhD candidate in Computer Science at the University of Illinois Urbana-Champaign. I am fortunate to be advised by Prof. Klara Nahrstedt, and to work with Prof. Minjia Zhang and Prof. Chengxiang Zhai. Before starting my PhD, I spent wonderful years at UIUC and Shanghai Jiao Tong University.
I work on vision-language model agents and multimodal reasoning. In 2025, I have been cooking VLMs and LLMs with recipes of multi-turn reinforcement-learning fine-tuning and inference-time scaling to boost self-improvement, reasoning, and memory use. I also investigate VLM interpretability through circuit tracing.
My research goal is to build human-level agents that partner with people: humans set the intent with prompts; agents reason, execute, and suggest new possibilities via recommendations.
When I’m not teaching large models, I prototype augmented-reality systems for fun, hoping that, one day, agentic models will power these interactive, human-centered interfaces.
Industry Experience
- MRS AI, Meta
- Video Rec, Meta
- LLM Research Team, Capital Today
- Xin's Group, Adobe
- IOTG, Intel
Research
- Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?
- VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
- Cache-of-Thought: Master-Apprentice Framework for Cost-Effective Vision Language Model Reasoning
- Spatio-Temporal LLM: Reasoning about Environments and Actions
- RecoWorld: Building Simulated Environments for Agentic Recommender Systems
- UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models
- TraceNet: Segment One Thing Efficiently
- Anywhere Avatar: 3D Telepresence with Just a Phone and a Laptop
- Scene Graph Driven Hybrid Interactive VR Teleconferencing
- miVirtualSeat: A Next Generation Hybrid Telepresence System
- Seaware: Semantic-aware View Prediction System for 360-degree Video Streaming
- AquaVLM: Improving Underwater Situation Awareness with Mobile Vision Language Models
- ImmerScope: Multi-view Video Aggregation at Edge towards Immersive Content Services
- GaugeTracker: AI-Powered Cost-Effective Analog Gauge Monitoring System
- Vesper: Learning to Manage Uncertainty in Video Streaming
- I-Matting: Improved Trimap-Free Image Matting
- Interactive Scene Graph Analysis for Future Intelligent Teleconferencing Systems
- 360TripleView: 360-Degree Video View Management System Driven by Convergence Value of Viewing Preferences
- SAVG360: Saliency-aware Viewport-guidance-enabled 360-video Streaming System
- Video 360 Content Navigation for Mobile HMD Devices
- AnyLoc: Energy-efficient Visual Localization in Dynamic and Large-Scale Scenes without Pain
Events
- Date — Invited Talk at TBD Institute.
Acknowledgements
This simple website is adapted from Jiayi Pan's website, with help from Codex.