Driver Anomaly Detection: A Dataset and Contrastive Learning Approach
- 0.9673 AUC
- code and pre-trained models are publicly available
contrastive
showing the difference between two things when compared
contrastive learning
uses similarity and dissimilarity so that models map comparable instances close together while separating dissimilar instances
a representative self-supervised technique that learns to place similar things close and different things far apart
Abstract
There are unboundedly many anomalous actions a driver can perform while driving, which leads to an ‘open set recognition’ problem
→ propose a contrastive learning approach to learn a metric to differentiate normal driving from anomalous driving
Introduction
- human factors are the main contributing cause in almost 90% of road accidents, with distraction as the main factor in around 68% of them
→ developing a reliable DMS that can supervise a driver’s performance, alertness, and driving intention is of utmost importance to prevent human-related road accidents
- there have been several datasets to facilitate video-based driver monitoring. However, all of these datasets are partitioned into a finite set of known classes, such as a normal driving class and several distraction classes
→ these datasets are designed for closed set recognition
closed set recognition
all samples in the test set belong to one of the K known classes that the networks are trained with
this raises a very important question: how would the system react if an unknown class is introduced to the network?
→ this ambiguity is a serious problem, since there can be unboundedly many distracting actions
- propose an open set recognition approach: a deep contrastive learning approach that learns a metric to distinguish normal driving from anomalous driving
- DAD(Driver Anomaly Detection) dataset
Related Work
Vision Based Driver Monitoring Datasets
Hand-focused datasets
CVRR-HANDS 3D, VIVA-Hands, DriverMHG
datasets that provide Eye-tracking annotations
DrivFace, DriveAHead, DD-Pose
- Drivers’ face and head information also provides very important cues to identify driver’s state such as head pose, gaze directions, fatigue and emotions
Body actions of the drivers
StateFarm, AUC Distracted Driver(AUC DD)
→ all of the datasets above are designed for closed set recognition
Contrastive Learning Approaches
these approaches learn representations by contrasting positive pairs against negative pairs
Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
- the full softmax distribution is approximated by Noise Contrastive Estimation (NCE)
Softmax Distribution: a softmax function turns the network’s scores over all possible classes into a probability distribution
Noise Contrastive Estimation (NCE): instead of computing the full distribution directly, the problem is converted into a binary classification task of distinguishing the correct sample (positive) from random noise (negative)
- a memory bank and Proximal Regularization are used to stabilize the learning process
Memory Bank: stores previously computed feature vectors of the data and reuses them
Proximal Regularization: stabilizes weight updates so that the model does not change abruptly during training
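The NCE idea above can be sketched as a binary discrimination between one true (positive) key and sampled noise keys. A minimal numpy sketch, not the paper's implementation; the temperature value 0.07 is illustrative:

```python
import numpy as np

def l2_normalize(v):
    """Project vectors onto the unit sphere (the comparison works on normalized features)."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def nce_positive_probability(query, positive, noise, temperature=0.07):
    """Probability that `positive` (rather than one of the `noise` samples)
    matches `query` -- the data-vs-noise discrimination behind NCE."""
    q, p, n = l2_normalize(query), l2_normalize(positive), l2_normalize(noise)
    pos_score = np.exp(np.dot(q, p) / temperature)   # similarity to the true key
    noise_scores = np.exp(n @ q / temperature)       # similarities to noise keys
    return pos_score / (pos_score + noise_scores.sum())
```

A matching pair should yield a probability close to 1 and a mismatched pair close to 0; training maximizes this probability for true pairs, so the full softmax over all instances never has to be computed.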
Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pages 6002–6012, 2019.
- instances that are close to each other in the embedding space are used as positive pairs, in addition to augmented versions of the original images
embedding: mapping data from a high-dimensional space to a lower-dimensional vector
embedding space: the space in which these vectors live, representing the data in compressed form while preserving its meaning
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
- a dynamic dictionary with a queue and a moving-average encoder is presented
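The two ideas in that line can be sketched directly: the dictionary is a fixed-size queue of encoded keys, and the key encoder trails the query encoder as a moving average. A sketch under assumed values; the momentum 0.999 is MoCo's reported default, and the array shapes are purely illustrative:

```python
from collections import deque
import numpy as np

def momentum_update(key_params, query_params, m=0.999):
    """Moving-average encoder: each key-encoder parameter trails the
    corresponding query-encoder parameter as an exponential moving average."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

# The dictionary is a queue of encoded keys: newest batch in, oldest batch out.
queue = deque(maxlen=4)
for step in range(6):
    key_batch = np.full(2, float(step))  # stand-in for the keys encoded at this step
    queue.append(key_batch)
# after 6 steps only the keys from the 4 most recent steps remain in the dictionary
```

Decoupling the dictionary size from the batch size this way is what lets the method use many negatives without a huge batch.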
Lightweight CNN Architectures
Since DMS applications need to be deployed in cars, a resource-efficient architecture is critical
→ lightweight CNN architectures
SqueezeNet: the first and most well-known lightweight architecture
- consists of fire modules to achieve AlexNet-level accuracy with 50x fewer parameters
MobileNet: built from depthwise separable convolutions, with a width multiplier parameter to make the network thinner or wider
MobileNetV2: adds inverted residual blocks and the ReLU6 activation function
ShuffleNet: proposes a channel shuffle operation together with pointwise group convolutions
ShuffleNetV2: upgrades it with several principles that are effective for designing lightweight architectures
NASNet, FBNet: networks found via Neural Architecture Search, which provide another direction for designing lightweight architectures
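As a concrete illustration of why MobileNet-style layers are cheap, the parameter count of a standard convolution can be compared with that of a depthwise separable one. The 256-channel, 3x3 example below is arbitrary, not a layer from any of the networks above:

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """One depthwise k x k filter per input channel, then a 1 x 1 pointwise
    convolution that mixes the channels."""
    return c_in * k * k + c_in * c_out

# e.g. mapping 256 channels to 256 channels with a 3x3 kernel
standard = conv_params(256, 256, 3)                  # 589,824 weights
separable = depthwise_separable_params(256, 256, 3)  # 67,840 weights (~8.7x fewer)
```

The saving grows with kernel size and channel count, which is exactly where standard convolutions are most expensive.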
Driver Anomaly Detection (DAD) Dataset
Modality
the type of data, or the way it is sensed
- characteristics of the dataset
- large enough to train deep neural network architectures
- multi-modal, containing depth and infrared modalities so that the system is operable under different lighting conditions
- depth is also used for 3D object detection, hand shape recognition, augmented reality, etc.
- infrared is also used for night surveillance, heat sensing, face recognition, etc.
- multi-view, containing front and top views
- recorded synchronously, and the views complement each other
- 45 fps, providing high temporal resolution, at 224 x 171 pixels
Depth Modality: distance information about objects, captured with a depth camera
Infrared Modality: can sense objects even in dark environments
- 31 subjects are asked to drive in a driving simulator, performing either normal driving or anomalous driving
Training Set
recordings of 25 subjects
- each subject has 6 normal driving and 8 anomalous driving video recordings
- each normal driving video lasts about 3.5 minutes and each anomalous driving video lasts about 30 seconds
- In total, there are around 550 mins of normal driving recordings
- 100 mins of anomalous driving recordings
Test Set
recordings of 6 subjects
- each subject has 6 video recordings lasting around 3.5 mins
- Anomalous actions occur randomly during the test video
- there are 16 distracting actions in the test set that are not available in the training set
- In total, 88 mins of normal driving
- 45 mins of anomalous driving
- around 17% of the complete DAD dataset, which is around 95GB
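The totals above can be sanity-checked with simple arithmetic; since the per-video durations are approximate ("about 3.5 minutes"), the products only roughly match the stated totals:

```python
# minutes of recording, from the per-subject counts listed above
normal_train = 25 * 6 * 3.5     # 525.0, consistent with the stated ~550 mins
anomalous_train = 25 * 8 * 0.5  # 100.0, matching the stated total
test_total = 6 * 6 * 3.5        # 126.0, roughly the stated 88 + 45 = 133 mins
```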
Methodology
Contrastive Learning Framework
- tries to maximize the similarity between normal driving samples and to minimize the similarity between normal driving and anomalous driving samples in the latent space, using a contrastive loss
video clip → distinct vector representation (embedding) → contrastive loss
key goal: anomalous driving clips are pushed far away from the normal clips
Three major components in the applied framework
- Base Encoder f_theta(·)
- input: video clips
- model: various CNN models, including a 3D-CNN based ResNet-18
- extracts vector representations of input clips
- refers to a 3D-CNN architecture with parameters theta
converts the video input into a high-dimensional vector
i.e., transforms the raw data into a vector representation in the feature space
- Projection Head g_beta(·)
- input: the output of the Base Encoder
- structure: MLP (Multi-Layer Perceptron) + ReLU activation
- keeping the embeddings at a constant norm stabilizes training
- maps the features into another latent space
the transformation from the feature space into the latent space
why apply a Projection Head?
in contrastive learning, applying the contrastive loss on the output of the projection head is more effective
L2 normalization is applied
- Contrastive Loss
- normal driving clips are trained to lie close to each other
- anomalous driving clips are trained to lie far from the normal clips
- imposes that normalized embeddings from the normal driving class are closer together than embeddings from the different anomalous action classes
learns to separate normal and anomalous driving in the embedding space
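The projection head and the normals-as-positives objective can be sketched end to end in numpy. This is a simplified illustration of the idea, not the paper's exact architecture or loss; the weight shapes and the temperature 0.1 are assumptions:

```python
import numpy as np

def projection_head(h, w1, w2):
    """MLP with ReLU, followed by L2 normalization of the output embedding."""
    z = np.maximum(w1 @ h, 0.0)
    z = w2 @ z
    return z / np.linalg.norm(z)

def contrastive_loss(embeddings, is_normal, temperature=0.1):
    """Normal-driving embeddings act as each other's positives; the anomalous
    clips only appear in the denominator, i.e. as negatives to push away."""
    sim = np.exp(embeddings @ embeddings.T / temperature)
    np.fill_diagonal(sim, 0.0)  # a clip is not its own positive
    normal_idx = np.flatnonzero(is_normal)
    losses = [-np.log(sim[i, normal_idx].sum() / sim[i].sum()) for i in normal_idx]
    return float(np.mean(losses))
```

When normal-clip embeddings cluster together the loss is low; when they are scattered among anomalous clips it is high, so minimizing it produces exactly the geometry the notes describe.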
Test Time Recognition
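At test time, clips are scored by how similar their embeddings are to normal driving learned during training. A common way to realize this, sketched here as an assumption rather than the paper's exact procedure, is to average the normalized normal-training embeddings into a template vector and threshold the cosine similarity of each test clip against it:

```python
import numpy as np

def normal_template(normal_embeddings):
    """Mean of the L2-normalized normal-driving embeddings, re-normalized."""
    t = normal_embeddings.mean(axis=0)
    return t / np.linalg.norm(t)

def is_normal_clip(clip_embedding, template, threshold=0.5):
    """Cosine similarity to the template; below the threshold => anomalous.
    The 0.5 threshold is illustrative; in practice it would be tuned."""
    v = clip_embedding / np.linalg.norm(clip_embedding)
    return float(v @ template) >= threshold
```

Because anomalous actions never have to be enumerated, anything far from the template is flagged, which is what makes the approach open-set.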
Fusion of Different Views and Modalities
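The per-view (front, top) and per-modality (depth, infrared) scores have to be combined into one decision; a straightforward scheme, assumed here for illustration, is simple score-level averaging:

```python
def fuse_scores(scores):
    """Average the per-view / per-modality similarity scores into one score,
    which is then thresholded like a single-stream score."""
    return sum(scores) / len(scores)

# e.g. front-depth, front-IR, top-depth, top-IR similarity scores for one clip
fused = fuse_scores([0.9, 0.8, 0.7, 0.6])  # 0.75
```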
Training Details
- train the model from scratch for 250 epochs using SGD with momentum 0.9 and an initial learning rate of 0.01
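One SGD-with-momentum update, using the hyperparameters quoted above, looks like this (a scalar sketch of the update rule, not the training loop):

```python
def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """SGD with momentum: the velocity accumulates past gradients, so
    consistent gradient directions produce accelerating updates."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

w, v = 0.0, 0.0
for _ in range(2):  # two steps with a constant gradient of 1.0
    w, v = sgd_momentum_step(w, 1.0, v)
# w ≈ -0.029: -0.01 from the first step, -0.019 from the second
```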
Experiments
Baseline Results
- Base Encoder: ResNet-18
- AUC of the ROC curve is used for the baseline evaluation
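AUC can be computed without plotting the ROC curve, as the probability that a randomly chosen anomalous clip scores higher than a randomly chosen normal one (assuming a higher score means more anomalous). A minimal sketch:

```python
def roc_auc(anomalous_scores, normal_scores):
    """Fraction of (anomalous, normal) pairs ranked correctly; ties count 0.5.
    Equivalent to the area under the ROC curve."""
    wins = 0.0
    for a in anomalous_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomalous_scores) * len(normal_scores))
```

Perfect separation gives 1.0 and random scoring gives about 0.5, so the paper's 0.9673 means almost every anomalous clip out-scores almost every normal clip.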