Research Direction

I am a Ph.D. student at the Institute of Artificial Intelligence, Peking University, advised by Prof. Yaodong Yang (a mentor and a friend). In 2024, I was honored to receive funding from the first batch of the National Natural Science Foundation Youth Student Basic Research Project for Ph.D. students, as the sole recipient from Peking University in the field of intelligence. Before this, I conducted research on safe reinforcement learning and won the championship of the NeurIPS 2022 MyoChallenge for robotic dexterous manipulation.

Jiaming Ji is a Ph.D. student at the Institute of Artificial Intelligence, Peking University, advised by Prof. Yaodong Yang. His research covers reinforcement learning and the safety and value alignment of large models. He has published more than ten papers (including oral and spotlight presentations) at top computer science conferences and journals, with over 2,200 Google Scholar citations; his open-source models have been downloaded over 5 million times, and his GitHub projects have earned over 20,000 stars. He received funding from the first batch of the National Natural Science Foundation Youth Student Basic Research Project for Ph.D. students (the sole recipient in the field of intelligence at Peking University in 2023), was named an Apple Scholar (one of only two recipients in China), won Peking University's President Scholarship (the university's highest doctoral research honor), and was selected for the inaugural CIE-Tencent Doctoral Research Incentive Project (17 recipients nationwide). He won the championship of the NeurIPS 2022 competition on robotic dexterous manipulation; his research and models have been cited by OpenAI and Meta and covered by MIT Technology Review.


Research

Currently, I focus on AI Safety and Alignment.

  • AI Alignment: Given the biases and discrimination that may exist in pre-training data, large models (LMs) may exhibit unintended behaviors. I am interested in alignment methods (e.g., Reinforcement Learning from Human Feedback (RLHF)) and post-hoc alignment methods that ensure the safety and trustworthiness of LLMs.
  • Theoretical Explanations and Mechanism Design for Alignment: Effectively aligning AI systems (e.g., LLMs) to ensure consistency with human intentions and values (though some views question whether universal values exist) is a significant current challenge. I am particularly interested in establishing the feasibility of alignment methods through both theoretical analysis and practical mechanism design.
  • Applications (LM + X): I am interested in the application of large models in various domains, such as healthcare and education, and in the rapid industry development and iteration that large models may bring about.
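The RLHF approach mentioned above first trains a reward model from human preference comparisons. As a minimal illustration (a generic sketch of the standard Bradley-Terry pairwise loss, not code from any of the projects listed here), the reward model is trained to score the human-preferred response above the rejected one:

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss used in RLHF reward modeling:
    -log sigmoid(r_chosen - r_rejected), which is minimized when the
    reward model scores the human-preferred response higher."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the margin between chosen and rejected grows.
print(round(reward_model_loss(2.0, 0.0), 4))  # 0.1269 — preference respected
print(round(reward_model_loss(0.0, 2.0), 4))  # 2.1269 — preference violated
```

In a full RLHF pipeline, this per-pair loss is summed over a preference dataset, and the fitted reward model then supplies the training signal for a policy-gradient step (e.g., PPO) on the language model.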
    News

    • 2024-09 Aligner has been accepted as an Oral presentation at NeurIPS 2024!
    • 2024-09 ProgressGym has been accepted as a Spotlight at the NeurIPS 2024 DB Track, and SafeSora has been accepted as a Poster.
    • 2024-09 Our framework OmniSafe has been accepted by JMLR 2024 (the most popular open-source Safe Reinforcement Learning framework).
    • 2024-06 We released the PKU-SafeRLHF dataset, the 2nd version of BeaverTails (800K+ total downloads).
    • 2024-05 We released Language Models Resist Alignment (exploring Hooke's Law in large models: a theoretical analysis of the fragility of alignment).
    • 2024-01 Two papers were accepted to ICLR 2024: Safe RLHF (Spotlight) and SafeDreamer.
    • 2023-10 Big News! We released AI Alignment: A Comprehensive Survey.

    Awards

    • 2025-03 Apple Scholars in AI/ML (one of only two recipients in China).
    • 2024-12 CIE-Tencent Doctoral Research Incentive Project (inaugural cohort; 17 recipients nationwide; 100,000 RMB research grant).
    • 2024-05 Peking University President Scholarship, the highest doctoral research honor.
    • 2024-05 National Natural Science Foundation Youth Student Basic Research Project for Ph.D. students (first batch; the sole recipient in the field of intelligence at Peking University).

    Publications
