
Driver Anomaly Detection: A Dataset and Contrastive Learning Approach

 
  • 0.9673 AUC
  • code and pre-trained models are publicly available
 

contrastive

showing the difference between two things when compared


contrastive learning

allows models to separate dissimilar instances while mapping comparable ones close together by utilizing similarity and dissimilarity

A representative self-supervised technique that learns to place similar samples close together and dissimilar samples far apart
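The pull-together / push-apart idea can be illustrated with cosine similarity. A minimal sketch using numpy; the 2-D vectors here are made up for illustration:

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity: 1 = same direction, -1 = opposite direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy 2-D embeddings: two "similar" samples and one "dissimilar" sample
normal_a = np.array([1.0, 0.1])
normal_b = np.array([0.9, 0.2])
anomalous = np.array([-0.8, 1.0])

sim_pos = cosine_sim(normal_a, normal_b)   # high: similar pair
sim_neg = cosine_sim(normal_a, anomalous)  # low: dissimilar pair
```

A contrastive objective trains the embedding function so that `sim_pos` grows and `sim_neg` shrinks.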

Abstract

There are unboundedly many anomalous actions that a driver can perform while driving, which leads to an ‘open set recognition’ problem

→ propose a contrastive learning approach to learn a metric to differentiate normal driving from anomalous driving

 

Introduction

  • human factors are the main contributing cause in almost 90% of road accidents, with distraction as the main factor in around 68% of them
    • → the development of a reliable DMS, which can supervise a driver’s performance, alertness, and driving intention, is of utmost importance for preventing human-related road accidents

  • there have been several datasets to facilitate video-based driver monitoring. However, all of these datasets are partitioned into a finite set of known classes, such as a normal driving class and several distraction classes
    • → these datasets are designed for closed set recognition

      closed set recognition

      all samples in their test set belong to one of the K known classes that the networks are trained with

      closed set recognition raises a very important question:

      How would the system react if an unknown class is introduced to the network?

      → This obscurity is a serious problem (because there can be unboundedly many distracting actions)

  • propose an open set recognition approach
    • propose a deep contrastive learning approach to learn a metric in order to distinguish normal driving from anomalous driving
  • DAD(Driver Anomaly Detection) dataset
 
 

Vision Based Driver Monitoring Datasets

 

Hand-focused datasets

CVRR-HANDS 3D, VIVA-Hands, DriverMHG
 
 

datasets that provide Eye-tracking annotations

DrivFace, DriveAHead, DD-Pose
  • Drivers’ face and head information also provides very important cues to identify driver’s state such as head pose, gaze directions, fatigue and emotions
 

Body actions of the drivers

StateFarm, AUC Distracted Driver(AUC DD)
 

→ all of the datasets above are designed for closed set recognition

 

Contrastive Learning Approaches

these approaches learn representations by contrasting positive pairs against negative pairs

Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin.
Unsupervised feature learning via non-parametric instance
discrimination. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 3733–
3742, 2018
  • The full softmax distribution is approximated by Noise Contrastive Estimation (NCE)
    • Softmax Distribution

      Neural networks use the softmax function to turn the scores over all possible classes into a probability distribution

      Noise Contrastive Estimation (NCE)

      Instead of computing the full distribution directly, the problem is converted into a binary classification task: distinguishing the correct sample (positive) from random noise (negative)

  • a memory bank and Proximal Regularization are used in order to stabilize the learning process
    • Memory Bank

      Stores previously computed feature vectors of the data and reuses them instead of recomputing

      Proximal Regularization

      Stabilizes weight updates so that the model does not change abruptly during training
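The memory-bank idea can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation: the bank sizes, the momentum value, and the number of noise samples are all made-up assumptions, and the "encoder output" is just a stored vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical sizes: 1000 instances with 8-dim feature vectors
N, D = 1000, 8
memory_bank = rng.normal(size=(N, D))
memory_bank /= np.linalg.norm(memory_bank, axis=1, keepdims=True)

def nce_logits(v, idx, num_noise=16, tau=0.07):
    # NCE view: positive = the instance's own stored feature,
    # negatives = randomly drawn "noise" entries from the bank
    pos = memory_bank[idx] @ v / tau
    noise_idx = rng.integers(0, N, size=num_noise)
    neg = memory_bank[noise_idx] @ v / tau
    return pos, neg

def update_bank(v, idx, momentum=0.5):
    # proximal-style update: blend the stored entry with the new
    # feature so individual entries never jump abruptly
    mixed = momentum * memory_bank[idx] + (1 - momentum) * v
    memory_bank[idx] = mixed / np.linalg.norm(mixed)

v = memory_bank[42]  # stand-in for a freshly encoded feature of instance 42
pos, neg = nce_logits(v, 42)
update_bank(v, 42)
```

The point of the bank is that negatives come from stored features rather than from re-encoding other images in the batch.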

Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference
on Computer Vision, pages 6002–6012, 2019.
  • instances that are close to each other in the embedding space are used as positive pairs, in addition to the augmented versions of the original images
    • embedding

      Transforming data from a high-dimensional space into a low-dimensional vector

      embedding space

      The space in which the embedded vectors live; it represents the data as compressed vectors while preserving their meaning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross
Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
9729–9738, 2020.
  • a dynamic dictionary with a queue and a moving-average encoder are presented
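The two MoCo ingredients can be sketched with toy stand-ins. Assumptions to be loud about: the "encoders" here are single weight matrices (not real networks), the queue capacity of 3 and the momentum of 0.999 are illustrative choices, and the "gradient step" is simulated with random noise.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
D = 4  # toy embedding size

# toy "encoders": one weight matrix each, stand-ins for real networks
query_weights = rng.normal(size=(D, D))
key_weights = query_weights.copy()  # key encoder starts as a copy

def momentum_update(m=0.999):
    # moving-average update: the key encoder slowly tracks the query encoder
    global key_weights
    key_weights = m * key_weights + (1 - m) * query_weights

# dynamic dictionary: a fixed-size queue of key embeddings
queue = deque(maxlen=3)
for step in range(5):
    # pretend a gradient step changed the query encoder
    query_weights += 0.01 * rng.normal(size=(D, D))
    x = rng.normal(size=D)
    queue.append(key_weights @ x)  # newest keys enqueue, oldest fall out
    momentum_update()
```

The queue decouples the dictionary size from the batch size, and the moving average keeps the keys in the queue consistent with each other.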
 
 

Lightweight CNN Architectures

Since DMS applications need to be deployed in the car, it is critical to have a resource-efficient architecture

→ lightweight CNN Architectures

 
SqueezeNet is one of the first and most well-known lightweight architectures
  • consists of fire modules to achieve AlexNet-level accuracy with 50x fewer parameters
MobileNet contains depthwise separable convolutions with a width multiplier parameter to achieve a thinner or wider network
MobileNetV2 contains inverted residuals blocks and ReLU6 activation function
ShuffleNet proposes to use channel shuffle operation together with pointwise group convolution
ShuffleNetV2 upgrades it with several principles, which are effective in designing lightweight architectures
NASNet, FBNet - networks designed using Neural Architecture Search provide another direction for designing lightweight architectures

 

Driver Anomaly Detection (DAD) Dataset


Modality

The type of data, or the way it is sensed

  • Key properties
    • large enough to train deep neural network architectures
    • multi-modal, containing depth and infrared modalities so that the system is operable under different lighting conditions
      • Depth Modality

        Acquires distance information of objects using a depth camera

        • used for 3D object detection, hand gesture recognition, augmented reality, etc.
        Infrared Modality

        Detects objects even in dark environments

        • used for night surveillance, heat sensing, face recognition, etc.
    • multi-view, containing front and top views
      • recorded synchronously; the two views complement each other
    • 45 fps providing high temporal resolution, 224 x 171 pixels
 
  • 31 subjects are asked to drive in a computer game, performing either normal driving or anomalous driving
 

Training Set

recordings of 25 subjects

  • each subject has 6 normal driving and 8 anomalous driving video recordings
  • each normal driving video lasts about 3.5 minutes and each anomalous driving video lasts about 30 seconds
  • In total, there are around 550 mins of normal driving recordings
    • and around 100 mins of anomalous driving recordings
 

Test Set

recordings of 6 subjects

  • each subject has 6 video recordings lasting around 3.5 mins
  • Anomalous actions occur randomly during the test video
  • there are 16 distracting actions in the test set that are not available in the training set
  • In total, 88 mins normal driving
    • 45 mins anomalous driving
  • 17% of the complete DAD dataset, which is around 95GB
 
 
 

Methodology

Contrastive Learning Framework

  • the model is trained to maximize the similarity between normal driving samples while minimizing the similarity between normal driving and anomalous driving samples in the latent space, using a contrastive loss
 
video clip → unique vector representation (embedding) → contrastive loss

Core goal: learn so that anomalous driving clips end up far away from normal driving clips

Three major components in the applied framework

  1. Base Encoder f_θ(·)
    • Converts the video input into a high-dimensional vector

      • input: a video clip
      • model: various 3D-CNN models, including a 3D ResNet-18

      i.e., transforms the raw data into a vector representation in feature space

      • extracts vector representations of the input clips
      • f_θ refers to a 3D-CNN architecture with parameters θ
      • h_i = f_θ(x_i)
  2. Projection Head g_β(·)
    • Maps from feature space into latent space

      • input: the output of the base encoder
      • structure: MLP (multi-layer perceptron) with ReLU activation

      Why apply a projection head?

      In contrastive learning, applying the contrastive loss on the output of the projection head is more effective

      L2 normalization is applied

      • the vectors then have constant norm, which stabilizes training
      • maps h_i into another latent space: v_i = g_β(h_i)
  3. Contrastive Loss
    • Trains the embedding space to separate normal driving from anomalous driving

      • normal driving clips are pulled close together
      • anomalous driving clips are pushed away from the normal clips
      • imposes that normalized embeddings from the normal driving class are closer together than embeddings from the different anomalous action classes
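The three components above can be sketched together in numpy. This is a hedged sketch, not the paper's exact formulation: the encoder and projection head are replaced by pre-computed random embeddings, the temperature `tau=0.1` is an assumed value, and the loss is a generic NCE-style objective that treats every other normal clip as the positive and all anomalous clips as negatives.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    # project embeddings onto the unit sphere (the L2 normalization step)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss(normal, anomalous, tau=0.1):
    # for each pair of normal clips, their similarity should dominate
    # the similarities between that normal clip and all anomalous clips
    v_n = l2_normalize(normal)     # (N, d) projected normal embeddings
    v_a = l2_normalize(anomalous)  # (M, d) projected anomalous embeddings
    loss, count = 0.0, 0
    for i in range(len(v_n)):
        for j in range(len(v_n)):
            if i == j:
                continue
            pos = np.exp(v_n[i] @ v_n[j] / tau)        # normal-normal pair
            neg = np.exp(v_n[i] @ v_a.T / tau).sum()   # normal-anomalous pairs
            loss += -np.log(pos / (pos + neg))
            count += 1
    return loss / count

normal = rng.normal(size=(4, 8))     # stand-ins for projected clip embeddings
anomalous = rng.normal(size=(3, 8))
print(contrastive_loss(normal, anomalous))
```

When the normal clips are tightly clustered and far from the anomalous ones, the loss approaches zero, which is exactly the geometry the framework aims for.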
 

Test Time Recognition

 

Fusion of Different Views and Modalities

 
 

Training Details

  • train the model from scratch for 250 epochs using SGD with momentum 0.9 and an initial learning rate of 0.01
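For reference, the SGD-with-momentum update rule used above looks like this. A minimal sketch on a toy one-parameter objective (f(w) = w², which is made up here; the 250 iterations just mirror the epoch count):

```python
def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    # classic SGD with momentum: velocity accumulates past gradients
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# minimize f(w) = w^2, whose gradient is 2w
w, v = 5.0, 0.0
for _ in range(250):
    w, v = sgd_momentum_step(w, 2 * w, v)
print(w)  # converges toward the minimum at 0
```

The momentum term smooths the update direction across steps, which is why it is the default choice for training CNNs from scratch.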
 

Experiments

Baseline Results

  • Base Encoder: ResNet-18
  • AUC of the ROC curve for baseline evaluation
 
 
 
