Friday, March 17, 2023

Advancements in Violence Detection: Leveraging Deep Learning and Transfer Learning


Written by: Sri Hari K C (1st year MCA)

ABSTRACT

The field of violence detection from video streams is of increasing importance due to its potential to enhance peace and security and to save lives by proactively identifying violent acts. In this study, we introduce a novel approach to address this issue, employing a Convolutional Neural Network (CNN) as both a classifier and feature extractor. Additionally, classical classifiers, specifically Support Vector Machines and Random Forests, are integrated to leverage the CNN's extracted features for violence detection in video streams. To facilitate input to the CNN, we introduce a unique data structure called "packets." Each "packet" comprises 15 sampled frames, equivalent to one second of video footage. This approach transforms the input into a format suitable for the CNN, ultimately framing the problem as binary classification. Our research incorporates four diverse datasets. One dataset is a combination of YouTube videos carefully collected and annotated by our team, amalgamated with a Kaggle dataset that underwent thorough curation by our researchers. Additionally, we integrate three benchmark datasets to ensure a comprehensive evaluation. Our model is trained through supervised learning on a dataset that includes both normal and violent videos. Subsequently, we subject it to rigorous testing using three distinct classifiers. The results generated by these classifiers are thoroughly compared and contrasted with existing state-of-the-art approaches in the field.

In the final phase of our study, we explore the potential of transfer learning by cross-validating models trained on one dataset and tested on another. This innovative approach to video stream violence detection exhibits promising results, underlining its potential to significantly contribute to security and life-saving applications.

INTRODUCTION

In today's increasingly interconnected world, violence, defined as "the use of physical force to injure, abuse, damage, or destroy" by the Merriam-Webster dictionary, has emerged as a pressing global concern. The proliferation of various forms of violence directed at individuals and communities has given rise to a critical need for the development of automatic violence detection systems. These systems are essential in safeguarding locations of paramount importance, including schools, universities, public parks, governmental institutions, hospitals, and other vital facilities. The consequences of delayed responses to violent incidents are profound, often resulting in tragic loss of life and property damage. Thus, the quest for a sophisticated and responsive violence detection solution has become an urgent priority, attracting extensive attention from the research community.

Despite the prevalence of surveillance cameras in many of these critical settings, there remains a notable gap in the timely response to violent incidents. Human operators tasked with monitoring these surveillance feeds frequently experience significant delays in identifying and responding to violence, which may lead to severe consequences. As such, the demand for intelligent violence detection systems capable of autonomously alerting authorities upon the detection of any violent act has never been greater. Such systems hold the promise of reducing response times, minimizing losses, and ultimately enhancing overall security. This imperative need for violence detection has ignited a burgeoning field of research, particularly within the computer vision community.

From a computational and artificial intelligence perspective, two principal methodologies have emerged to tackle this challenging problem: the first approach relies on handcrafted features, while the second leverages automatic feature extraction, primarily through the application of deep learning techniques. The former entails feature engineering, including the examination of various indicators such as the presence of blood, motion characteristics, audio data, and other manually crafted features. Notably, a significant portion of recent research in violence detection has gravitated towards this approach. In contrast, the second approach harnesses the power of deep neural networks to automatically extract features and detect violence. These deep learning models can process a wide range of input data, including time-series data from inertial sensors like accelerometers and gyroscopes, visual video streams, and static images. Detailed exploration of these two approaches is provided in Section II.

Beyond the computational realm, the practical applications of violence detection are remarkably diverse. In today's society, violence can manifest in various forms, from street altercations and physical confrontations to violent scenes depicted in movies, violent acts during sporting events like hockey matches, and even crowd-related violence during gatherings, protests, and other public events. Each of these application domains represents a unique but significant area for violence detection, illustrating the variety of contexts in which these systems can make a meaningful impact.

In this research endeavour, we embrace the second approach, focusing on automatic feature extraction using deep learning techniques. Our methodology revolves around the implementation of an end-to-end deep neural network architecture that can work with raw pixel data without extensive preprocessing.
Our contributions include the development of a novel data structure named "packets", the creation of a dedicated dataset, the formulation of a binary classification problem for detecting violent scenes, and the execution of cross-testing with multiple datasets to facilitate transfer learning. This paper is organised as follows: Section I serves as an introduction, offering insights into the urgency of violence detection in critical settings. Section II delves into related work, exploring techniques relevant to violence detection in video scenes. Section III provides a comprehensive overview of the datasets used in this study. Section IV elaborates on the proposed method, detailing our approach. The experiments conducted, along with their outcomes, are discussed in Section V, and the paper concludes in Section VI.

METHODOLOGY 

OVERVIEW

The methodology for violence detection in video streams entails feature extraction utilizing a Convolutional Neural Network (CNN) and the subsequent classification of normal and violent scenes using various classifiers. These include the CNN itself, Support Vector Machine (SVM), and Random Forest (RF). This section provides an in-depth elucidation of the approach, encompassing preprocessing steps, CNN-based feature extraction, classification employing diverse models, and the application of transfer learning.

A. Preprocessing:

1. Data Split: The video datasets are initially divided into training (70%), validation (15%), and testing (15%) sets, except for the PBL-2020 dataset, which follows a different distribution.

2. Frame Sampling: To mitigate computational load, video sequences are subsampled at 15 frames per second (FPS).

3. Image Resizing: The sampled frames are uniformly resized to a 50x50-pixel resolution.

4. Greyscale Conversion: All frames are converted to greyscale format, simplifying data representation.

5. Packet Formation: Sequential 15-frame segments are grouped and assigned either a "normal" or "violent" label. These packets form 3D tensors with dimensions (50, 50, 15). Overlapping packets are generated using a sliding window approach with a stride of 1, enriching the dataset for machine learning.
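As a rough illustration, this preprocessing pipeline could be sketched in Python with OpenCV and NumPy. The function names, the fixed-step subsampling, and the normalisation to [0, 1] are assumptions made for the sake of a self-contained example, not the exact implementation used in this study:

```python
import cv2
import numpy as np

PACKET_LEN = 15        # frames per packet (roughly one second of footage)
FRAME_SIZE = (50, 50)  # target resolution after resizing

def load_frames(video_path, target_fps=15):
    """Read a video, subsample it to about target_fps, resize to 50x50, and convert to greyscale."""
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(src_fps / target_fps)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            frames.append(cv2.resize(grey, FRAME_SIZE))
        idx += 1
    cap.release()
    return np.array(frames)

def make_packets(frames, label):
    """Group frames into overlapping (50, 50, 15) packets using a sliding window with stride 1."""
    packets, labels = [], []
    for start in range(len(frames) - PACKET_LEN + 1):
        clip = frames[start:start + PACKET_LEN]         # shape (15, 50, 50)
        packets.append(np.transpose(clip, (1, 2, 0)))   # shape (50, 50, 15)
        labels.append(label)
    return np.array(packets, dtype=np.float32) / 255.0, np.array(labels)
```

Packets produced this way can then be split 70/15/15 into training, validation, and test sets before being fed to the network.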

B. Feature Extraction (CNN):

At the core of the methodology lies a meticulously designed Convolutional Neural Network (CNN) for feature extraction. The CNN architecture comprises:

- Seven convolutional layers, each succeeded by MaxPooling and Batch Normalization layers.

- A flattening layer with dropout for regularisation.

- Four dense layers.

- Activation functions employing Rectified Linear Units (ReLU).

- Varied filter quantities in each convolutional layer, ranging from 128 to 1024, capturing intricate abstract features.

- 2x2 max-pooling and a 0.2 dropout rate for regularization. 

After passing through these layers, the feature maps are flattened and forwarded to a classical neural network consisting of three dense layers, each followed by a dropout layer with a 0.3 rate. ReLU activation functions introduce non-linearity. The final layer is a dense layer with a logistic sigmoid activation function, which maps the feature vector to a probability value within the [0, 1] range, interpreted as the probability that the scene is violent rather than normal.
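A minimal Keras sketch of such an architecture is shown below. The kernel sizes, the exact filter progression between 128 and 1024, the dense-layer widths, and the choice of optimizer are assumptions, since the text does not specify them; "same" padding is used on the pooling layers so that seven pooling stages fit a 50x50 input:

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape=(50, 50, 15)):
    """Approximate sketch of the feature-extraction CNN described above."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    # Seven convolutional blocks, each followed by max-pooling and batch normalization.
    for filters in (128, 128, 256, 256, 512, 512, 1024):
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.MaxPooling2D((2, 2), padding="same"))
        model.add(layers.BatchNormalization())
    model.add(layers.Flatten())
    model.add(layers.Dropout(0.2))
    # Three dense layers, each followed by dropout, plus the sigmoid output layer.
    for units in (512, 256, 128):
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.Dropout(0.3))
    model.add(layers.Dense(1, activation="sigmoid"))  # probability that the packet is violent
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```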

C. Classification:

To differentiate between normal and violent scenes, three distinct classifiers are harnessed, and their results are contrasted:

1. CNN Classifier: The CNN, operating not only as a feature extractor but also as a supervised classifier, processes the extracted feature vectors to categorize scenes.

2. Support Vector Machine (SVM): The feature vector from the first dense layer of the CNN serves as input for the SVM, which seeks a decision boundary that optimally separates the two classes.

3. Random Forest (RF): RF acts as an ensemble method, incorporating numerous decision trees to mitigate overfitting and inefficiencies. It is assessed using feature vectors extracted by the CNN.
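Continuing the sketch above, the feature vector produced by the first dense layer can be exposed with a truncated Keras model and handed to scikit-learn's SVM and Random Forest implementations. The variable names (`cnn`, `X_train`, `y_train`, and so on), the layer index, and the classifier hyperparameters are illustrative assumptions:

```python
from tensorflow.keras import models
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# `cnn` is assumed to be the trained network from the sketch above;
# X_train/X_test are packet tensors with binary labels y_train/y_test.
feature_extractor = models.Model(
    inputs=cnn.inputs,
    outputs=cnn.layers[-7].output,  # first dense layer in the sketch above (index is illustrative)
)

train_feats = feature_extractor.predict(X_train)
test_feats = feature_extractor.predict(X_test)

svm = SVC(kernel="rbf").fit(train_feats, y_train)
rf = RandomForestClassifier(n_estimators=100).fit(train_feats, y_train)

print("SVM accuracy:", svm.score(test_feats, y_test))
print("RF accuracy :", rf.score(test_feats, y_test))
```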

D. Transfer Learning:

Transfer learning, a potent technique in computer vision and deep learning, is deployed to expedite the training process. A pre-trained model, originating from a different source dataset (and potentially a distinct task), forms the base model. Fine-tuning with the target dataset is executed, leading to commendable results in significantly less time than starting from a blank slate. This transfer learning approach assesses the utility of employing a model trained on a different dataset to accelerate training and evaluate the quality of feature representation provided by the CNN.
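A hedged sketch of this fine-tuning step, reusing the build_cnn helper from the earlier sketch, is given below. The checkpoint file name, the decision to freeze only the convolutional blocks, the target-dataset variable names, and the training hyperparameters are assumptions rather than the authors' exact settings:

```python
from tensorflow.keras import layers

# Start from a model trained on a different source dataset (hypothetical checkpoint file).
base_model = build_cnn()
base_model.load_weights("source_dataset_weights.h5")

# Freeze the convolutional feature-extraction blocks; keep the dense head trainable.
for layer in base_model.layers:
    if isinstance(layer, (layers.Conv2D, layers.MaxPooling2D, layers.BatchNormalization)):
        layer.trainable = False

# Recompile after changing trainability, then fine-tune on the target dataset
# (X_target_train, y_target_train, etc. are assumed to come from the preprocessing step).
base_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
base_model.fit(X_target_train, y_target_train,
               validation_data=(X_target_val, y_target_val),
               epochs=10, batch_size=32)
```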

In synopsis, the methodology encompasses preprocessing video data, extracting features with a deep CNN, classifying scenes through diverse classifiers, and exploring the advantages of transfer learning. These steps are pivotal in the development of a robust violence detection system, promising to bolster security and mitigate harm across a spectrum of real-world scenarios. The ensuing sections will delve into specific details, experiments, and outcomes of this approach.

CONCLUSION

This article delved into the intricate domain of violence detection within video scenes, striving to create a versatile model capable of identifying violence across diverse scenarios. The innovative "packets" data structure, designed for efficient feature extraction using Convolutional Neural Networks (CNN), proved to be a pivotal component of our methodology. Our findings unveiled crucial insights, highlighting the potential of end-to-end solutions employing CNNs for feature extraction. Despite dataset limitations, this approach demonstrated remarkable promise, rivaling complex handcrafted feature engineering techniques. Deep learning's innate ability to autonomously discern relevant features emerged as a game-changer, obviating the need for labor-intensive manual feature engineering. Transfer learning, a key facet of our research, was instrumental in expediting training while maintaining robust model performance, underscoring its potential in addressing complex computer vision challenges.

Looking forward, potential areas for further exploration include methods to enhance model performance, especially in transfer learning and cross-testing across diverse datasets. The prospect of evolving the problem into a multi-class classification task, incorporating an intermediate class alongside "normal" and "violent," presents an exciting avenue for investigation. Additionally, the temporal nature of video data suggests the promise of sequence models, like transformer models, in violence detection, paving the way for future research.

In summary, this article advances the frontier of violence detection, offering a methodology harnessing deep learning, transfer learning, and innovative data structures. The implications span security enhancement across real-world scenarios, from safeguarding critical facilities to identifying violence in various settings. The field of violence detection promises continued evolution, building upon the strong foundation laid out in this article.
