About Me

I am a final-year Ph.D. candidate in Computer Science at Northeastern University. My research focuses on video understanding methods that address key challenges in Video-Language Models (VLMs) and Assistant AI. Specifically, it covers data-efficient video-text matching, improved computational efficiency, accurate long temporal modeling, and procedural video understanding. Currently, I am working on open-world video understanding and video data generation.

During my Ph.D. studies, I completed two memorable internships at Amazon AWS AI Lab and Microsoft Research. Before that, I worked as a Student Researcher at the Chinese Academy of Sciences and as an Architecture Summer Intern at NVIDIA. In 2019, I received dual B.S. degrees in Computer Science and Economics from NYU Shanghai, and was named an NYU University Honors Scholar and awarded the Undergraduate Scholarship of the University of Chinese Academy of Sciences.

I am actively seeking full-time research positions. Please reach out to me if you have an opportunity!

Recent Updates

Apr 2025 My internship paper from Microsoft was accepted to CVPR 2025.
Jun 2024 Started a Research Internship with the Microsoft Responsible and Open AI Research Team.
Apr 2024 Three papers accepted to CVPR 2024. My internship paper from Amazon was designated a "Highlight" paper.
Sep 2022 Started as an Applied Scientist Intern at Amazon's AWS AI Lab.
Apr 2022 One paper accepted to CVPR 2022.

Selected Publications

DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos

Zijia Lu, A S M Iftekhar, Gaurav Mittal, Tianjian Meng, Xiawei Wang, Cheng Zhao, Rohith Kukkala, Ehsan Elhamifar, Mei Chen

CVPR 2025

Project Paper Code

A Delegate-and-Conquer framework for efficient coarse-to-fine Long Video Temporal Grounding.
Efficient Video Encoder for end-to-end training on Ego4D with one A100.
SOTA accuracy with 47%-66% lower computation.
APPLICATION: improve VLLMs to efficiently handle hour-long videos.

FACT: Frame-Action Cross-Attention Temporal Modeling for Efficient Fully-Supervised Action Segmentation

Zijia Lu, Ehsan Elhamifar

CVPR 2024

Paper Code

New Long Temporal Reasoning paradigm with parallel modeling of frame details and temporal events/actions.
Action Tokens Design for dynamic video condensing and enabling text-data input.
SOTA accuracy with 3x inference speed.
APPLICATION: better Assistant AI; allow Multi-Modality Learning.

Error Detection in Egocentric Procedural Task Videos

Shih-Po Lee, Zijia Lu, Zekun Zhang, Minh Hoai, Ehsan Elhamifar

CVPR 2024

Paper Code & Dataset

A One-class Error Detection method in procedural videos.
First Egocentric Procedural Error Detection (EgoPER) dataset with extensive error types.
APPLICATION: error detection for Egocentric Understanding and Assistant AI.

Self-Supervised Multi-Object Tracking with Path Consistency

Zijia Lu, Bing Shuai, Yanbei Chen, Zhenlin Xu, Davide Modolo

CVPR 2024 Highlight

Paper Code

Self-Supervised Consistency Loss for robust Multi-Object Tracking without manual labels.
Superior or comparable to supervised methods on popular benchmarks.
APPLICATION: Robust object tracking for Egocentric Understanding and Scene Understanding.

Set-Supervised Action Learning in Procedural Task Videos via Pairwise Order Consistency

Zijia Lu, Ehsan Elhamifar

CVPR 2022

Paper Code

Video-Text Alignment between video frames and unordered sets of actions parsed from video narrations.
New differentiable Sequence Metric for weakly-supervised video-text alignment.
APPLICATION: learn video-text semantic space for VLLM or Video Generation.

Weakly-supervised Action Segmentation and Alignment via Transcript-Aware Union-of-subspaces Learning

Zijia Lu, Ehsan Elhamifar

ICCV 2021

Paper Code

New Union-of-Subspaces network for accurate Action Modeling and capturing complex action variations.
Contrastive Learning for weakly-supervised video-text alignment.
APPLICATION: Video-Text Alignment.

Dft-Net: Disentanglement of Face Deformation and Texture Synthesis for Expression Editing

Jinghui Wang, Jie Zhang, Zijia Lu, Shiguang Shan

ICIP 2019

Paper

Two-Branch GAN for facial expression editing.
Warping branch for expression transformation and Generative branch for refinement.
APPLICATION: Controlled Image Generation and Expression Editing.

Zero-Shot Facial Expression Recognition with Multi-Label Label Propagation

Zijia Lu, Jiabei Zeng, Shiguang Shan, Xilin Chen

ACCV 2018 Oral

Paper Code & Dataset

Transductive Label Propagation method for Zero-Shot facial expression recognition.
The first Open-Set Facial Expression Recognition dataset.
APPLICATION: Open-World affective computing for human-computer interaction.