About Me

I am a final-year Ph.D. candidate in Computer Science at Northeastern University. My research focuses on Video Understanding methods that address key challenges in Video-Language Models (VLMs) and Assistant AI, including data-efficient video-text matching, improved computational efficiency, accurate long-range temporal modeling, and procedural video understanding. Currently, I am working on open-world video understanding and video data generation.

During my Ph.D. studies, I have had the opportunity to complete two memorable internships, at the Amazon AWS AI Lab and at Microsoft Research. Previously, I worked as a Student Researcher at the Chinese Academy of Sciences and as an Architecture Summer Intern at NVIDIA. In 2019, I received dual B.S. degrees in Computer Science and Economics from NYU Shanghai, where I was named an NYU University Honors Scholar and received the Undergraduate Scholarship of the University of Chinese Academy of Sciences.

I am actively seeking full-time research positions. Please reach out to me if you have an opportunity!

Recent Updates

  • Apr 2025 My intern paper at Microsoft was accepted to CVPR 2025.
  • Jun 2024 Started a research internship with Microsoft's Responsible and Open AI Research team.
  • Apr 2024 Three papers accepted to CVPR 2024. My intern paper at Amazon was designated a "Highlight" paper.
  • Sep 2022 Started as an Applied Scientist Intern at Amazon's AWS AI Lab.
  • Apr 2022 One paper accepted to CVPR 2022.

Selected Publications

DeCafNet

DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos

Zijia Lu, A S M Iftekhar, Gaurav Mittal, Tianjian Meng, Xiawei Wang, Cheng Zhao, Rohith Kukkala, Ehsan Elhamifar, Mei Chen

  • A Delegate-and-Conquer framework for efficient coarse-to-fine Long Video Temporal Grounding.
  • An efficient Video Encoder that enables end-to-end training on Ego4D with a single A100 GPU.
  • SOTA accuracy with 47%-66% lower computation.
  • APPLICATION: enable VLLMs to handle hour-long videos efficiently.

FACT

FACT: Frame-Action Cross-Attention Temporal Modeling for Efficient Fully-Supervised Action Segmentation

Zijia Lu, Ehsan Elhamifar

  • A new Long Temporal Reasoning paradigm that models frame details and temporal events/actions in parallel.
  • An Action Token design for dynamic video condensing that also enables text inputs.
  • SOTA accuracy at 3x the inference speed.
  • APPLICATION: better Assistant AI; enables Multi-Modality Learning.

Error Detection

Error Detection in Egocentric Procedural Task Videos

Shih-Po Lee, Zijia Lu, Zekun Zhang, Minh Hoai, Ehsan Elhamifar

  • A one-class Error Detection method for procedural videos.
  • The first Egocentric Procedural Error Detection (EgoPER) dataset with extensive error types.
  • APPLICATION: error detection for Egocentric Understanding and Assistant AI.

Self-Supervised Multi-Object Tracking

Self-Supervised Multi-Object Tracking with Path Consistency

Zijia Lu, Bing Shuai, Yanbei Chen, Zhenlin Xu, Davide Modolo

  • Self-Supervised Consistency Loss for robust Multi-Object Tracking without manual labels.
  • Superior or comparable to supervised methods on popular benchmarks.
  • APPLICATION: robust object tracking for Egocentric Understanding and Scene Understanding.

Set-Supervised Action Learning

Set-Supervised Action Learning in Procedural Task Videos via Pairwise Order Consistency

Zijia Lu, Ehsan Elhamifar

  • Video-Text Alignment between video frames and unordered action sets parsed from video narrations.
  • New differentiable Sequence Metric for weakly-supervised video-text alignment.
  • APPLICATION: learn a video-text semantic space for VLLMs or Video Generation.

Weakly-supervised Action Segmentation

Weakly-supervised Action Segmentation and Alignment via Transcript-Aware Union-of-subspaces Learning

Zijia Lu, Ehsan Elhamifar

  • New Union-of-Subspaces network for accurate Action Modeling that captures complex action variations.
  • Contrastive Learning for weakly-supervised video-text alignment.
  • APPLICATION: Video-Text Alignment.

Expression Editing

Dft-Net: Disentanglement of Face Deformation and Texture Synthesis for Expression Editing

Jinghui Wang, Jie Zhang, Zijia Lu, Shiguang Shan

  • Two-Branch GAN for facial expression editing.
  • Warping branch for expression transformation and Generative branch for refinement.
  • APPLICATION: Controlled Image Generation and Expression Editing.

Zero-Shot Facial Expression Recognition

Zero-Shot Facial Expression Recognition with Multi-Label Label Propagation

Zijia Lu, Jiabei Zeng, Shiguang Shan, Xilin Chen

  • Transductive Label Propagation method for Zero-Shot facial expression recognition.
  • The first Open-Set Facial Expression Recognition dataset.
  • APPLICATION: Open-World affective computing for human-computer interaction.

Experience

Microsoft

Research Intern at Microsoft

06/2024 - 09/2024

Worked on efficient end-to-end models for text-based video temporal grounding.

Amazon

Applied Scientist Intern at Amazon AWS AI Lab

09/2022 - 03/2023

Worked on Self-Supervised and Robust Multi-Object Tracking.

Northeastern University

Graduate Research Assistant at Northeastern University

09/2019 - Present

Focusing on Video-Text Understanding, Action Segmentation, Egocentric Understanding and Video Data Generation.

Chinese Academy of Sciences

Student Researcher at Chinese Academy of Sciences

01/2018 - 12/2018

Worked on Zero-Shot Facial Expression Recognition and Expression Editing.

NVIDIA

Architecture Summer Intern at NVIDIA Shanghai

06/2016 - 09/2016

Developed cuDNN functions for the Volta GPU and a GPU performance simulator.

NYU Shanghai

B.S. in Computer Science and Economics at NYU Shanghai

2014 - 2019 (Gap Year 2018)

Graduated with a Major GPA of 3.98/4.0. Conducted research on Image Segmentation with Prof. Zhen Zhang and on Document QA with Prof. Kyunghyun Cho.