Thesis Defence: Unified Driving Maneuver Detection and Classification from Monocular Video

June 3 at 9:00 am - 1:00 pm

Zirui Lin, supervised by Dr. Zheng Liu, will defend their thesis titled “Unified Driving Maneuver Detection and Classification from Monocular Video” in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical Engineering.

An abstract for Zirui Lin’s thesis is included below.

Examinations are open to all members of the campus community as well as the general public. Please email zheng.liu@ubc.ca to receive the Zoom link for this defence.

Abstract

Autonomous driving systems must read the surrounding environment continuously, and they must do more than classify maneuvers: they also have to localize when each maneuver starts and ends. Most commercial systems and academic baselines treat temporal action detection (TAD) and action classification as separate tasks. This fragmented pipeline blocks feature sharing and lets errors propagate. Existing Transformer-based models also have quadratic complexity (O(N^2)) and rely on non-causal inference, which means high latency and reduced sensitivity on motion-sensitive categories such as curves and U-turns when running under causal constraints.

We propose a unified architecture with staged training that supports maneuver classification and temporal localization at the same time, from a single RGB video stream. The framework combines a Selective State Space Model (Mamba) for linear-time causal temporal modelling (O(L)) with a dual-pathway feature fusion mechanism that pairs a primary visual representation with an image-derived ego-trajectory branch. The visual pathway extracts spatio-temporal appearance features from a Vision Transformer backbone, while the auxiliary pathway predicts a compact latent motion descriptor from the same visual tokens through a lightweight MLP head. We do not require any external inertial or telemetry sensor at inference time. We fuse the two pathways through concatenation, which preserves the distinct motion-sensitive signals that cross-attention mechanisms tend to dilute.
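As a rough illustration of the dual-pathway idea described above, the following sketch derives a small motion descriptor from a per-frame visual feature through a two-layer MLP and then concatenates the two vectors. It is not the thesis implementation: the function names, the feature sizes, and the random weights are all assumptions made for illustration, and plain Python lists stand in for real backbone outputs.

```python
import random

def mlp_head(x, w1, w2):
    """Two-layer MLP with ReLU; a stand-in for a lightweight motion head."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def fuse_by_concatenation(visual, motion):
    """Place the two feature vectors side by side so neither is mixed away."""
    return visual + motion

def make_weights(rows, cols, rng):
    """Random weights, used here only so the sketch runs end to end."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
VISUAL_DIM, HIDDEN_DIM, MOTION_DIM = 8, 4, 2  # hypothetical sizes

w1 = make_weights(HIDDEN_DIM, VISUAL_DIM, rng)
w2 = make_weights(MOTION_DIM, HIDDEN_DIM, rng)

# Five frames of made-up visual features (in practice, backbone outputs).
frames = [[rng.random() for _ in range(VISUAL_DIM)] for _ in range(5)]

# Both pathways read the same visual feature; no external sensor is involved.
fused = [fuse_by_concatenation(f, mlp_head(f, w1, w2)) for f in frames]
```

Each fused vector has VISUAL_DIM + MOTION_DIM entries, and the visual part is carried through unchanged, which is the property the abstract attributes to concatenation.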

For temporal action detection, we use a four-stage pipeline: Mamba-based causal sequence processing, boundary-aware localization, proposal generation, and graph convolutional refinement. The Mamba backbone gives linear-time sequence modelling with a selection mechanism that adapts to the input, letting the model keep critical driving events while dropping irrelevant segments.
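The two properties named above, linear-time processing and input-dependent selection, can be seen in a toy recurrence. The sketch below is a heavily simplified scalar stand-in and not Mamba's actual parameterization: the gate form, the function name selective_scan, and the parameters w_gate and b_gate are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def selective_scan(xs, w_gate=2.0, b_gate=-2.0):
    """Single left-to-right pass: h_t = (1 - g_t) * h_{t-1} + g_t * x_t.

    The gate g_t is computed from the current input, so each step decides
    how much of the running state to keep and how much to overwrite.
    One step per element gives O(L) time, and step t never reads x_{t+1}.
    """
    h = 0.0
    outputs = []
    for x in xs:
        g = sigmoid(w_gate * abs(x) + b_gate)  # large inputs open the gate
        h = (1.0 - g) * h + g * x
        outputs.append(h)
    return outputs

# Made-up signal: small values with one large excursion at index 2.
signal = [0.1, -0.2, 3.0, 0.05]
states = selective_scan(signal)
```

In this toy, small inputs barely move the state while the large excursion is written into it almost entirely. Because the loop only looks backward, the outputs for the first k steps are unaffected by anything that arrives later, which is what a causal evaluation requires.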

Our experiments under both standard offline metrics and a strict 2-second causal-buffer protocol show the strength of the design. Under the causal-buffer protocol, our framework reaches the best average localization accuracy among all evaluated methods (84.1% average, beating the strongest baseline by 2.9 percentage points), with the largest per-class gain of +5.9 points on U-Turn over the nearest competing baseline. The Mamba-based detector runs at 24.5 FPS, about three times faster than the matched Transformer backbone in the ablation study, while keeping competitive offline detection performance. Ablation studies confirm that simple concatenation fusion gives the best overall balance across categories compared with cross-attention and weighted-sum alternatives, and that both the Mamba backbone and the graph convolutional component contribute to overall system performance.

Details

Date: June 3
Time: 9:00 am - 1:00 pm

Additional Info

Registration/RSVP Required: Yes (see event description)
Event Type: Thesis Defence
Topic: Research and Innovation, Science, Technology and Engineering
Audiences: Alumni, Community and public, Faculty, Staff, Family friendly, Partners and Industry, Students, Postdoctoral Fellows and Research Associates