Driver Anomaly Detection: A Dataset and Contrastive Learning Approach
- 0.9673 AUC
- code and pre-trained models are publicly available
contrastive
showing the difference between two things when compared
contrastive learning
uses similarity and dissimilarity so that models map comparable instances close together while separating dissimilar instances
a representative self-supervised technique that learns to place similar things close and different things far apart
Abstract
There are unboundedly many anomalous actions a driver can perform while driving, which leads to an ‘open set recognition’ problem
→ propose a contrastive learning approach to learn a metric to differentiate normal driving from anomalous driving
Introduction
- human factors are the main contributing cause in almost 90% of road accidents, with distraction as the main factor in around 68% of them
→ developing a reliable DMS that can supervise a driver’s performance, alertness, and driving intention is of utmost importance to prevent human-related road accidents
- there have been several datasets to facilitate video-based driver monitoring. However, all of these datasets are partitioned into a finite set of known classes, such as a normal driving class and several distraction classes
→ these datasets are designed for closed set recognition
closed set recognition
all samples in the test set belong to one of the K known classes that the networks are trained with
this raises a very important question: how would the system react if an unknown class is introduced to the network?
→ this ambiguity is a serious problem, since there can be unboundedly many distracting actions
- propose an open set recognition approach: a deep contrastive learning approach that learns a metric to distinguish normal driving from anomalous driving
- DAD(Driver Anomaly Detection) dataset
Related Work
Vision Based Driver Monitoring Datasets
Hand-focused datasets
CVRR-HANDS 3D, VIVA-Hands, DriverMHG
datasets that provide Eye-tracking annotations
DrivFace, DriveAHead, DD-Pose
- Drivers’ face and head information also provides very important cues to identify driver’s state such as head pose, gaze directions, fatigue and emotions
Body actions of the drivers
StateFarm, AUC Distracted Driver(AUC DD)
→ all of the datasets above are designed for closed set recognition
Contrastive Learning Approaches
these approaches learn representations by contrasting positive pairs against negative pairs
Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
- the full softmax distribution is approximated by Noise Contrastive Estimation (NCE)
Softmax Distribution: a softmax function turns the network’s scores over all possible classes into a probability distribution
Noise Contrastive Estimation (NCE): instead of computing the full distribution directly, the problem is converted into a binary classification task of distinguishing the correct sample (positive) from random noise (negative)
- a memory bank and Proximal Regularization are used to stabilize the learning process
Memory Bank: stores previously computed feature vectors of the data and reuses them
Proximal Regularization: stabilizes weight updates so that the model does not change abruptly during training
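The NCE idea above can be sketched as a binary discrimination between one true (positive) key and sampled noise keys. A minimal numpy sketch, not the paper's implementation; the temperature value 0.07 is illustrative:

```python
import numpy as np

def l2_normalize(v):
    """Project vectors onto the unit sphere (the comparison works on normalized features)."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def nce_positive_probability(query, positive, noise, temperature=0.07):
    """Probability that `positive` (rather than one of the `noise` samples)
    matches `query` -- the data-vs-noise discrimination behind NCE."""
    q, p, n = l2_normalize(query), l2_normalize(positive), l2_normalize(noise)
    pos_score = np.exp(np.dot(q, p) / temperature)   # similarity to the true key
    noise_scores = np.exp(n @ q / temperature)       # similarities to noise keys
    return pos_score / (pos_score + noise_scores.sum())
```

A matching pair should yield a probability close to 1 and a mismatched pair close to 0; training maximizes this probability for true pairs, so the full softmax over all instances never has to be computed.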
Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pages 6002–6012, 2019.
- instances that are close to each other in the embedding space are used as positive pairs, in addition to augmented versions of the original images
embedding: mapping data from a high-dimensional space to a lower-dimensional vector
embedding space: the space in which these vectors live, representing the data in compressed form while preserving its meaning
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
- a dynamic dictionary with a queue and a moving-average encoder is presented
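The two ideas in that line can be sketched directly: the dictionary is a fixed-size queue of encoded keys, and the key encoder trails the query encoder as a moving average. A sketch under assumed values; the momentum 0.999 is MoCo's reported default, and the array shapes are purely illustrative:

```python
from collections import deque
import numpy as np

def momentum_update(key_params, query_params, m=0.999):
    """Moving-average encoder: each key-encoder parameter trails the
    corresponding query-encoder parameter as an exponential moving average."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

# The dictionary is a queue of encoded keys: newest batch in, oldest batch out.
queue = deque(maxlen=4)
for step in range(6):
    key_batch = np.full(2, float(step))  # stand-in for the keys encoded at this step
    queue.append(key_batch)
# after 6 steps only the keys from the 4 most recent steps remain in the dictionary
```

Decoupling the dictionary size from the batch size this way is what lets the method use many negatives without a huge batch.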
Lightweight CNN Architectures
Since DMS applications need to be deployed in cars, a resource-efficient architecture is critical
→ lightweight CNN architectures
SqueezeNet: the first and most well-known lightweight architecture
- consists of fire modules to achieve AlexNet-level accuracy with 50x fewer parameters
MobileNet: built from depthwise separable convolutions, with a width multiplier parameter to make the network thinner or wider
MobileNetV2: adds inverted residual blocks and the ReLU6 activation function
ShuffleNet: proposes a channel shuffle operation together with pointwise group convolutions
ShuffleNetV2: upgrades it with several principles that are effective for designing lightweight architectures
NASNet, FBNet: networks found via Neural Architecture Search, which provide another direction for designing lightweight architectures
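As a concrete illustration of why MobileNet-style layers are cheap, the parameter count of a standard convolution can be compared with that of a depthwise separable one. The 256-channel, 3x3 example below is arbitrary, not a layer from any of the networks above:

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """One depthwise k x k filter per input channel, then a 1 x 1 pointwise
    convolution that mixes the channels."""
    return c_in * k * k + c_in * c_out

# e.g. mapping 256 channels to 256 channels with a 3x3 kernel
standard = conv_params(256, 256, 3)                  # 589,824 weights
separable = depthwise_separable_params(256, 256, 3)  # 67,840 weights (~8.7x fewer)
```

The saving grows with kernel size and channel count, which is exactly where standard convolutions are most expensive.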
Driver Anomaly Detection (DAD) Dataset
Modality
the type of data, or the way it is sensed
- characteristics of the dataset
- large enough to train deep neural network architectures
- multi-modal, containing depth and infrared modalities so that the system is operable under different lighting conditions
- depth is also used for 3D object detection, hand shape recognition, augmented reality, etc.
- infrared is also used for night surveillance, heat sensing, face recognition, etc.
- multi-view, containing front and top views
- recorded synchronously, and the views complement each other
- 45 fps, providing high temporal resolution, at 224 x 171 pixels
Depth Modality: distance information about objects, captured with a depth camera
Infrared Modality: can sense objects even in dark environments
- 31 subjects are asked to drive in a driving simulator, performing either normal driving or anomalous driving
Training Set
recordings of 25 subjects
- each subject has 6 normal driving and 8 anomalous driving video recordings
- each normal driving video lasts about 3.5 minutes and each anomalous driving video lasts about 30 seconds
- In total, there are around 550 mins of normal driving recordings
- 100 mins of anomalous driving recordings
Test Set
recordings of 6 subjects
- each subject has 6 video recordings lasting around 3.5 mins
- Anomalous actions occur randomly during the test video
- there are 16 distracting actions in the test set that are not available in the training set
- In total, 88 mins of normal driving
- 45 mins of anomalous driving
- around 17% of the complete DAD dataset, which is around 95GB
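The totals above can be sanity-checked with simple arithmetic; since the per-video durations are approximate ("about 3.5 minutes"), the products only roughly match the stated totals:

```python
# minutes of recording, from the per-subject counts listed above
normal_train = 25 * 6 * 3.5     # 525.0, consistent with the stated ~550 mins
anomalous_train = 25 * 8 * 0.5  # 100.0, matching the stated total
test_total = 6 * 6 * 3.5        # 126.0, roughly the stated 88 + 45 = 133 mins
```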
Methodology
Contrastive Learning Framework
- tries to maximize the similarity between normal driving samples and to minimize the similarity between normal driving and anomalous driving samples in the latent space, using a contrastive loss
video clip → distinct vector representation (embedding) → contrastive loss
key goal: anomalous driving clips are pushed far away from the normal clips
Three major components in the applied framework
- Base Encoder f_theta(·)
- input: video clips
- model: various CNN models, including a 3D-CNN based ResNet-18
- extracts vector representations of input clips
- refers to a 3D-CNN architecture with parameters theta
converts the video input into a high-dimensional vector
i.e., transforms the raw data into a vector representation in the feature space
- Projection Head g_beta(·)
- input: the output of the Base Encoder
- structure: MLP (Multi-Layer Perceptron) + ReLU activation
- keeping the embeddings at a constant norm stabilizes training
- maps the features into another latent space
the transformation from the feature space into the latent space
why apply a Projection Head?
in contrastive learning, applying the contrastive loss on the output of the projection head is more effective
L2 normalization is applied
- Contrastive Loss
- normal driving clips are trained to lie close to each other
- anomalous driving clips are trained to lie far from the normal clips
- imposes that normalized embeddings from the normal driving class are closer together than embeddings from the different anomalous action classes
learns to separate normal and anomalous driving in the embedding space
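The projection head and the normals-as-positives objective can be sketched end to end in numpy. This is a simplified illustration of the idea, not the paper's exact architecture or loss; the weight shapes and the temperature 0.1 are assumptions:

```python
import numpy as np

def projection_head(h, w1, w2):
    """MLP with ReLU, followed by L2 normalization of the output embedding."""
    z = np.maximum(w1 @ h, 0.0)
    z = w2 @ z
    return z / np.linalg.norm(z)

def contrastive_loss(embeddings, is_normal, temperature=0.1):
    """Normal-driving embeddings act as each other's positives; the anomalous
    clips only appear in the denominator, i.e. as negatives to push away."""
    sim = np.exp(embeddings @ embeddings.T / temperature)
    np.fill_diagonal(sim, 0.0)  # a clip is not its own positive
    normal_idx = np.flatnonzero(is_normal)
    losses = [-np.log(sim[i, normal_idx].sum() / sim[i].sum()) for i in normal_idx]
    return float(np.mean(losses))
```

When normal-clip embeddings cluster together the loss is low; when they are scattered among anomalous clips it is high, so minimizing it produces exactly the geometry the notes describe.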
Test Time Recognition
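At test time, clips are scored by how similar their embeddings are to normal driving learned during training. A common way to realize this, sketched here as an assumption rather than the paper's exact procedure, is to average the normalized normal-training embeddings into a template vector and threshold the cosine similarity of each test clip against it:

```python
import numpy as np

def normal_template(normal_embeddings):
    """Mean of the L2-normalized normal-driving embeddings, re-normalized."""
    t = normal_embeddings.mean(axis=0)
    return t / np.linalg.norm(t)

def is_normal_clip(clip_embedding, template, threshold=0.5):
    """Cosine similarity to the template; below the threshold => anomalous.
    The 0.5 threshold is illustrative; in practice it would be tuned."""
    v = clip_embedding / np.linalg.norm(clip_embedding)
    return float(v @ template) >= threshold
```

Because anomalous actions never have to be enumerated, anything far from the template is flagged, which is what makes the approach open-set.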
Fusion of Different Views and Modalities
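The per-view (front, top) and per-modality (depth, infrared) scores have to be combined into one decision; a straightforward scheme, assumed here for illustration, is simple score-level averaging:

```python
def fuse_scores(scores):
    """Average the per-view / per-modality similarity scores into one score,
    which is then thresholded like a single-stream score."""
    return sum(scores) / len(scores)

# e.g. front-depth, front-IR, top-depth, top-IR similarity scores for one clip
fused = fuse_scores([0.9, 0.8, 0.7, 0.6])  # 0.75
```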
Training Details
- train the model from scratch for 250 epochs using SGD with momentum 0.9 and an initial learning rate of 0.01
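One SGD-with-momentum update, using the hyperparameters quoted above, looks like this (a scalar sketch of the update rule, not the training loop):

```python
def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """SGD with momentum: the velocity accumulates past gradients, so
    consistent gradient directions produce accelerating updates."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

w, v = 0.0, 0.0
for _ in range(2):  # two steps with a constant gradient of 1.0
    w, v = sgd_momentum_step(w, 1.0, v)
# w ≈ -0.029: -0.01 from the first step, -0.019 from the second
```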
Experiments
Baseline Results
- Base Encoder: ResNet-18
- AUC of the ROC curve is used for the baseline evaluation
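AUC can be computed without plotting the ROC curve, as the probability that a randomly chosen anomalous clip scores higher than a randomly chosen normal one (assuming a higher score means more anomalous). A minimal sketch:

```python
def roc_auc(anomalous_scores, normal_scores):
    """Fraction of (anomalous, normal) pairs ranked correctly; ties count 0.5.
    Equivalent to the area under the ROC curve."""
    wins = 0.0
    for a in anomalous_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomalous_scores) * len(normal_scores))
```

Perfect separation gives 1.0 and random scoring gives about 0.5, so the paper's 0.9673 means almost every anomalous clip out-scores almost every normal clip.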