Yixiao Ge

I am currently a senior researcher at Tencent ARC Lab and Tencent AI Lab, leading an effort on multimodal foundation models, open-world visual comprehension, and efficient AI. Previously, I received my Ph.D. from the Multimedia Lab (MMLab) at the Chinese University of Hong Kong.

I am actively looking for self-motivated interns to work on related research topics. Feel free to reach out if you are interested.

Projects

Multimodal Foundation Models:

Vision-language: We aim to develop foundational models that unify visual comprehension and generation tasks within one framework.

Given the great success of Large Language Models (LLMs), we took an initial step toward empowering off-the-shelf LLMs to perform visual tasks via plugins (GPT4Tools @NeurIPS23). Although this is a feasible solution, it is still far from exhibiting multimodal emergent abilities.

We are further devoted to developing an end-to-end framework that facilitates flexible input/output formats, transitioning and reasoning seamlessly between multimodal signals while acquiring knowledge from an inherently multimodal world. Check out our SEED for details.

Previously, we focused on pre-training vision-language representations and video-text retrieval, e.g., MCQ @CVPR22 (Oral) and All-in-One @CVPR23. We also built applications such as Tune-A-Video @ICCV23.
Omni-modal: A real AI agent (e.g., a smart robot) should be capable of sensing all modalities. This is non-trivial, especially for rare modalities. Check out our solution, ViT-Lens. Omni-modal representation also has great potential in emergent applications; see our DreamDiffusion.
Data-centric: High-quality, large-scale data is a prerequisite for training foundation models. For training data, we collect large-scale TV dramas (PTVD, with Tencent Video authorization) as well as memes (Sticker820K, with Tencent Search authorization). We also focus on properly evaluating multimodal LLMs, for which we propose SEED-Bench ([leaderboard]).

Open-world Visual Comprehension:

Visual representation: We are committed to improving image representation (e.g., mc-BEiT @ECCV22, ConMIM @ICLR23, RILS @CVPR23) and video representation (e.g., TVTS @CVPR23, TVTSv2) via large-scale pre-training.
Visual perception: We also tackle visual perception tasks such as detection and segmentation. Check out our MIMDet @ICCV23 and BoxSnake @ICCV23.

Efficient AI:

We introduced the topic of hot-refresh model upgrades (RACT @ICLR22) for large-scale retrieval systems, which is practical in industry but under-explored in academia. Beyond retrieval, upgrading the foundation models in current AI systems is also costly because all downstream modules need to be retrained to adapt. Check out our TaCA for a solution. We are also interested in model selection (SFDA @ECCV22, PED @ICCV23), binarization (BEBR @KDD23), etc.

Our algorithms helped Tencent effectively reduce costs and increase efficiency. We won the highest technical award within the company and the SZCCF Science and Technology Award.