Jiashu Xu 徐家澍

Hi there 👋

I am currently a member of technical staff on the SpaceXAI Imagine team (try our model here!). Previously, I was a research scientist on the NVIDIA Cosmos team, working on generative model pre-/post-training and physical AI. Before joining NVIDIA, I earned a Master’s degree in Computer Science at Harvard University and completed a double major in Applied Mathematics and Computer Science at USC. I have also spent time as a research intern at Amazon Science, working on LLMs, and at Microsoft Research, working on synthetic data generation.

My current research interests are in VLMs/LLMs and multimodal generative models. In particular:

Post-training diffusion models [1, 2, 3, 4] and VLMs [5, 6] with SFT and RL [7]
Making models reliable against malicious attacks [8, 9, 10]
Excelling in low-resource regimes via synthetic data generation [11, 12, 13]

This is my girlfriend 😍 Carmen and me

Selected Publications

White Paper
Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA Cosmos Team: Jiashu Xu

Arxiv, 2026

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI – effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation’s OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.
@article{nvidia2026cosmos3, title = {Cosmos 3: Omnimodal World Models for Physical AI}, author = {Team, NVIDIA Cosmos}, author_display = {NVIDIA Cosmos Team: Jiashu Xu}, journal = {Arxiv}, project_page = {https://research.nvidia.com/labs/cosmos-lab/cosmos3/}, year = {2026} }
White Paper
Cosmos World Foundation Model Platform for Physical AI

NVIDIA Cosmos Team: Jiashu Xu

Arxiv, 2025 (CES’25 Best of AI, Best Overall)

Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make Cosmos open-source and our models open-weight with permissive licenses available via NVIDIA Cosmos-Predict1.
@article{agarwal2025cosmos, title = {Cosmos World Foundation Model Platform for Physical AI}, author = {Team, NVIDIA Cosmos}, author_display = {NVIDIA Cosmos Team: Jiashu Xu}, journal = {Arxiv}, project_page = {https://www.nvidia.com/en-us/ai/cosmos/}, year = {2025} }
White Paper
World Simulation with Video Foundation Models for Physical AI

NVIDIA Cosmos Team: Jiashu Xu

Arxiv, 2025

We introduce Cosmos-Predict2.5, the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, Cosmos-Predict2.5 unifies Text2World, Image2World, and Video2World generation in a single model and leverages Cosmos-Reason1, a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, Cosmos-Predict2.5 achieves substantial improvements over Cosmos-Predict1 in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with Cosmos-Transfer2.5, a control-net style framework for Sim2Real and Real2Real world translation. Despite being 3.5× smaller than Cosmos-Transfer1, it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish Cosmos-Predict2.5 and Cosmos-Transfer2.5 as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.
@article{ali2025world, title = {World Simulation with Video Foundation Models for Physical AI}, author = {Team, NVIDIA Cosmos}, project_page = {https://research.nvidia.com/labs/dir/cosmos-predict2.5/}, author_display = {NVIDIA Cosmos Team: Jiashu Xu}, journal = {Arxiv}, year = {2025} }
White Paper
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

NVIDIA Cosmos Team: Jiashu Xu

Arxiv, 2025

Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-7B and Cosmos-Reason1-56B. We curate data and train our models in two stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL). To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and RL bring significant improvements. To facilitate the development of Physical AI, we make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.
@article{azzolini2025cosmos, title = {Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning}, author = {Team, NVIDIA Cosmos}, author_display = {NVIDIA Cosmos Team: Jiashu Xu}, journal = {Arxiv}, project_page = {https://www.nvidia.com/en-us/ai/cosmos/}, year = {2025} }
Arxiv
Data-regularized Reinforcement Learning for Diffusion Models at Scale

Haotian Ye, Kaiwen Zheng, Jiashu Xu, Puheng Li, Huayu Chen, Jiaqi Han, Sheng Liu, Qinsheng Zhang, Hanzi Mao, Zekun Hao, Prithvijit Chattopadhyay, Dinghao Yang, Liang Feng, Maosheng Liao, Junjie Bai, Ming-Yu Liu, James Zou, and Stefano Ermon

Arxiv, 2025

Aligning generative diffusion models with human preferences via reinforcement learning (RL) is critical yet challenging. Most existing algorithms are often vulnerable to reward hacking, such as quality degradation, over-stylization, or reduced diversity. Our analysis demonstrates that this can be attributed to the inherent limitations of their regularization, which provides unreliable penalties. We introduce Data-regularized Diffusion Reinforcement Learning (DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution. Theoretically, DDRL enables robust, unbiased integration of RL with standard diffusion training. Empirically, this translates into a simple yet effective algorithm that combines reward maximization with diffusion loss minimization. With over a million GPU hours of experiments and ten thousand double-blind human evaluations, we demonstrate on high-resolution video generation tasks that DDRL significantly improves rewards while alleviating the reward hacking seen in baselines, achieving the highest human preference and establishing a robust and scalable paradigm for diffusion post-training.
@article{ye2025data, title = {Data-regularized Reinforcement Learning for Diffusion Models at Scale}, author = {Ye, Haotian and Zheng, Kaiwen and Xu, Jiashu and Li, Puheng and Chen, Huayu and Han, Jiaqi and Liu, Sheng and Zhang, Qinsheng and Mao, Hanzi and Hao, Zekun and Chattopadhyay, Prithvijit and Yang, Dinghao and Feng, Liang and Liao, Maosheng and Bai, Junjie and Liu, Ming-Yu and Zou, James and Ermon, Stefano}, journal = {Arxiv}, project_page = {https://research.nvidia.com/labs/dir/ddrl/}, year = {2025} }
NAACL Oral
Instructional Fingerprinting of Large Language Models

Jiashu Xu, Fei Wang*, Mingyu Derek Ma*, Pang Wei Koh, Chaowei Xiao, and Muhao Chen

In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024 (Oral)

The exorbitant cost of training large language models (LLMs) from scratch makes it essential to fingerprint the models to protect intellectual property via ownership authentication and to ensure downstream users and developers comply with their license terms (e.g., restricting commercial use). In this study, we present a pilot study on LLM fingerprinting as a form of very lightweight instruction tuning. The model publisher specifies a confidential private key and implants it as an instruction backdoor that causes the LLM to generate specific text when the key is present. Results on 11 popularly-used LLMs showed that this approach is lightweight and does not affect the normal behavior of the model. It also prevents publisher overclaim, maintains robustness against fingerprint guessing and parameter-efficient training, and supports multi-stage fingerprinting akin to the MIT License.
@inproceedings{xu2024instructional, title = {Instructional Fingerprinting of Large Language Models}, author = {Xu, Jiashu and Wang, Fei and Ma, Mingyu Derek and Koh, Pang Wei and Xiao, Chaowei and Chen, Muhao}, year = {2024}, booktitle = {Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}, abbr2 = {Oral}, project_page = {https://cnut1648.github.io/Model-Fingerprint/}, }
CVPR Highlight
BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

Yunhao Ge*, Yihe Tang*, Jiashu Xu*, Cem Gokmen*, Chengshu Li, Wensi Ai, Benjamin Jose Martinez, Arman Aydin, Mona Anvari, Ayush K Chakravarthy, Hong-Xing Yu, Josiah Wong, Sanjana Srivastava, Sharon Lee, Shengxin Zha, Laurent Itti, Yunzhu Li, Roberto Martin-Martin, Miao Liu, Pengchuan Zhang, Ruohan Zhang, Li Fei-Fei, and Jiajun Wu

In Conference on Computer Vision and Pattern Recognition (CVPR), 2024 (Highlight)

The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative, particularly for embodied AI tasks, they often fall short for computer vision tasks due to low asset and rendering quality, limited diversity, and unrealistic physical properties. We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models, based on the newly developed embodied AI benchmark, BEHAVIOR-1K. BVS supports a large number of adjustable parameters at the scene level (e.g., lighting, object placement), the object level (e.g., joint configuration, attributes such as "filled" and "folded"), and the camera level (e.g., field of view, focal length). Researchers can arbitrarily vary these parameters during data generation to perform controlled experiments. We showcase three example application scenarios: systematically evaluating the robustness of models across different continuous axes of domain shift, evaluating scene understanding models on the same set of images, and training and evaluating simulation-to-real transfer for a novel vision task: unary and binary state prediction.
@inproceedings{ge2024behavior, title = {BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation}, author = {Ge, Yunhao and Tang, Yihe and Xu, Jiashu and Gokmen, Cem and Li, Chengshu and Ai, Wensi and Martinez, Benjamin Jose and Aydin, Arman and Anvari, Mona and Chakravarthy, Ayush K and Yu, Hong-Xing and Wong, Josiah and Srivastava, Sanjana and Lee, Sharon and Zha, Shengxin and Itti, Laurent and Li, Yunzhu and Martin-Martin, Roberto and Liu, Miao and Zhang, Pengchuan and Zhang, Ruohan and Fei-Fei, Li and Wu, Jiajun}, booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)}, year = {2024}, abbr2 = {Highlight}, project_page = {https://behavior-vision-suite.github.io/}, }
SIGGRAPH Real-Time Live!
GenUSD: 3D Scene Generation Made Easy

NVIDIA Cosmos Team: Jiashu Xu

In ACM SIGGRAPH, 2024 (Real-Time Live!)

We introduce GenUSD, an end-to-end text-to-scene generation framework that transforms natural language queries into realistic 3D scenes, including 3D objects and layouts. The process involves two main steps: 1) A Large Language Model (LLM) generates a scene layout hierarchically. It first proposes a high-level plan to decompose the scene into multiple functionally and spatially distinct subscenes. Then, for each subscene, the LLM proposes objects with detailed positions, poses, sizes, and descriptions. To manage complex object relationships and intricate scenes, we introduce object layout design meta functions as tools for the LLM. 2) A novel text-to-3D model generates each 3D object with surface meshes and high-resolution texture maps based on the LLM’s descriptions. The assembled 3D assets form the final 3D scene, represented in the Universal Scene Description (USD) format. GenUSD ensures physical plausibility by incorporating functions to prevent collisions.
@incollection{lin2024genusd, title = {GenUSD: 3D Scene Generation Made Easy}, author = {Team, NVIDIA Cosmos}, author_display = {NVIDIA Cosmos Team: Jiashu Xu}, booktitle = {ACM SIGGRAPH}, abbr2 = {Real-Time Live!}, pages = {1--2}, project_page = {https://blogs.nvidia.com/blog/real-time-3d-generative-ai-research-siggraph-2024/?ncid=so-twit-353134-vt36&linkId=100000277255797}, year = {2024} }
ACL Oral
Can NLI Provide Proper Indirect Supervision for Low-resource Biomedical Relation Extraction?

Jiashu Xu, Mingyu Derek Ma, and Muhao Chen

In Association for Computational Linguistics (ACL), Jul 2023 (Oral)

Two key obstacles in biomedical relation extraction (RE) are the scarcity of annotations and the prevalence of instances without explicitly pre-defined labels due to low annotation coverage. Existing approaches, which treat biomedical RE as a multi-class classification task, often generalize poorly in low-resource settings and cannot make selective predictions on unknown cases, instead guessing among seen relations, which hinders their applicability. We present NBR, which reformulates biomedical RE as natural language inference through indirect supervision. By converting relations to natural language hypotheses, NBR is capable of exploiting semantic cues to alleviate annotation scarcity. By incorporating a ranking-based loss that implicitly calibrates abstinent instances, NBR learns a clearer decision boundary and is instructed to abstain on uncertain instances. Extensive experiments on three widely-used biomedical RE benchmarks, namely ChemProt, DDI and GAD, verify the effectiveness of NBR in both full-set and low-resource regimes. Our analysis demonstrates that indirect supervision benefits biomedical RE even when a domain gap exists, and combining NLI knowledge with biomedical knowledge leads to the best performance gains.
@inproceedings{xu-etal-2023-nli, title = {Can {NLI} Provide Proper Indirect Supervision for Low-resource Biomedical Relation Extraction?}, author = {Xu, Jiashu and Ma, Mingyu Derek and Chen, Muhao}, booktitle = {Association for Computational Linguistics (ACL)}, month = jul, abbr2 = {Oral}, year = {2023}, doi = {10.18653/v1/2023.acl-long.138}, address = {Toronto, Canada}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.acl-long.138}, pages = {2450--2467}, }