[
  {
    "path": "MANet.md",
    "content": "# Fully Motion-Aware Network for Video Object Detection\n\n## Architecture\n\n![](imgs/MANet.png)\n\n## Summary\n\nSimilar with [FGFA](https://arxiv.org/abs/1703.10025), but in addtion to pixel-level feature calibration and aggregagtion, [MANet](http://openaccess.thecvf.com/content_ECCV_2018/html/Shiyao_Wang_Fully_Motion-Aware_Network_ECCV_2018_paper.html) proposes the **motion pattern reasoning module** to dynamically combine (learnable soft weights) **pixel-level** and **instance-level** calibration according to the motion (optical flow by [FlowNet](https://arxiv.org/abs/1504.06852)). **Instance-level calibration** is achieved by regressing relative movements $(\\Delta x , \\Delta y , \\Delta w , \\Delta h)$ on the optical flow estimation according to proposal positions of reference frame. Final feaure maps for detection network ([R-FCN](https://arxiv.org/abs/1605.06409)) are the aggregation of nearby (13 frames in total) calibrated feature maps. Pixel-level calibration achieves better improvements for non-rigid movements while instance-level calibration is better for rigid movements and occlusion cases.\n"
  },
  {
    "path": "README.md",
    "content": "# Awesome Video-Object-Detection\n\n![Intro](https://github.com/ZHANGHeng19931123/seq_nms_yolo/raw/master/doc/intro1.gif \"Intro\")\n\nThis is a list of awesome articles about **object detection from video**.\n\n## Datasets\n\n### ImageNet VID Challenge\n- **Site**: http://image-net.org/challenges/LSVRC/2017/#vid\n- **Kagge**: https://www.kaggle.com/account/login?returnUrl=%2Fc%2Fimagenet-object-detection-from-video-challenge\n\n### VisDrone Challenge\n- **Site**: http://aiskyeye.com/\n\n## Paper list\n\n### 2016\n\n#### Seq-NMS for Video Object Detection\n\n[[Arxiv]](https://arxiv.org/abs/1602.08465)\n\n- **Date**: Feb 2016\n- **Motivation**: Smoothing the final bounding box predictions across time.\n- **Summary**:  Constructing a temporal graph from overlapping bounding box detections across the adjacent frames, and using dynamic programming to select bounding box sequences with the highest overall detection score.\n\n#### T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos\n\n[[Arxiv]](https://arxiv.org/abs/1604.02532) [[Code]](https://github.com/myfavouritekk/T-CNN)\n\n- **Date**: Apr 2016\n- **Summary**:  Using a video object detection pipeline that involves predicting optical flow first, then propagating image level predictions according to the flow, and finally using a tracking algorithm to select temporally consistent high confidence detections.\n- **Performance**: 73.8% mAP on ImageNet VID validation.\n\n#### Object Detection from Video Tubelets with Convolutional Neural Networks\n\n[[Arxiv]](https://arxiv.org/abs/1604.04053) [[Code]](https://github.com/myfavouritekk/vdetlib)\n\n- **Date**: Apr 2016\n\n#### Deep Feature Flow for Video Recognition\n\n[[Arxiv]](https://arxiv.org/abs/1611.07715) [[Code]](https://github.com/msracver/Deep-Feature-Flow)\n\n- **Date**: Nov 2016\n- **Performance**: 73.0% mAP on ImageNet VID validation at 29 fps on a Titan X GPU.\n\n### 2017\n\n#### Object Detection in Videos with Tubelet Proposal Networks\n\n[[Arxiv]](https://arxiv.org/abs/1702.06355)\n\n- **Date**: Feb 2017\n\n#### Flow-Guided Feature Aggregation for Video Object Detection\n\n[[Arxiv]](https://arxiv.org/abs/1703.10025) [[Code]](https://github.com/msracver/Flow-Guided-Feature-Aggregation)\n\n- **Date**: Mar 2017\n- **Motivation**: Producing powerful spatiotemporal features.\n- **Performance**: 76.3% mAP at 1.4 fps or 78.4% (combined with [Seq-NMS](https://arxiv.org/abs/1602.08465)) at 1.1 fps on ImageNet VID validation on a Titan X GPU.\n\n#### Detect to Track and Track to Detect\n\n[[Arxiv]](https://arxiv.org/abs/1710.03958) [[Summary]](https://github.com/ZHANGHeng19931123/awesome-video-object-detection/blob/master/X.md) [[Code]](https://github.com/feichtenhofer/Detect-Track)\n\n- **Date**: Oct 2017\n- **Motivation**: Smoothing the final bounding box predictions across time.\n- **Summary**: Proposing a ConvNet architecture that solves detection and tracking problems jointly and applying a Viterbi algorithm to link the detections across time.\n- **Performance**: 79.8% mAP on ImageNet VID validation.\n\n#### Towards High Performance Video Object Detection\n\n[[Arxiv]](https://arxiv.org/abs/1711.11577)\n\n- **Date**: Nov 2017\n- **Motivation**: Producing powerful spatiotemporal features.\n- **Performance**: 78.6% mAP on ImageNet VID validation at 13 fps on a Titan X GPU.\n\n#### Video Object Detection with an Aligned Spatial-Temporal Memory\n\n[[Arxiv]](https://arxiv.org/abs/1712.06317) 
[[Summary]](https://github.com/ZHANGHeng19931123/awesome-video-object-detection/blob/master/STMN.md) [[Code]](http://fanyix.cs.ucdavis.edu/project/stmn/project.html) [[Demo]](https://www.youtube.com/watch?v=Vs3LqY1s9GY)\n\n- **Date**: Dec 2017\n- **Motivation**: Producing powerful spatiotemporal features.\n- **Performance**: 80.5% mAP on ImageNet VID validation.\n\n### 2018\n\n#### Object Detection in Videos by High Quality Object Linking\n\n[[Arxiv]](https://arxiv.org/abs/1801.09823)\n\n- **Date**: Jan 2018\n\n#### Towards High Performance Video Object Detection for Mobiles\n\n[[Arxiv]](https://arxiv.org/abs/1804.05830)\n\n- **Date**: Apr 2018\n- **Motivation**: Producing powerful spatiotemporal features.\n- **Performance**: 60.2% mAP on ImageNet VID validation at 25.6 fps on mobiles.\n\n#### Optimizing Video Object Detection via a Scale-Time Lattice\n\n[[Arxiv]](https://arxiv.org/abs/1804.05472) [[Summary]](https://github.com/ZHANGHeng19931123/awesome-video-object-detection/blob/master/X.md) [[Code]](https://github.com/hellock/scale-time-lattice)\n\n- **Date**: Apr 2018\n- **Performance**: 79.4% mAP at 20 fps or 79.0% at 62 fps on ImageNet VID validation on a Titan X GPU.\n\n#### Object Detection in Video with Spatiotemporal Sampling Networks\n\n[[Arxiv]](https://arxiv.org/abs/1803.05549) [[Summary]](https://github.com/ZHANGHeng19931123/awesome-video-object-detection/blob/master/STSN.md)\n\n- **Date**: Mar 2018\n- **Motivation**: Producing powerful spatiotemporal features.\n- **Performance**: 78.9% mAP or 80.4% (combined with [Seq-NMS](https://arxiv.org/abs/1602.08465)) on ImageNet VID validation.\n\n#### Fully Motion-Aware Network for Video Object Detection\n\n[[Paper]](http://openaccess.thecvf.com/content_ECCV_2018/html/Shiyao_Wang_Fully_Motion-Aware_Network_ECCV_2018_paper.html) [[Summary]](https://github.com/ZHANGHeng19931123/awesome-video-object-detection/blob/master/MANet.md)\n\n
- **Date**: Sep 2018\n- **Motivation**: Producing powerful spatiotemporal features.\n- **Performance**: 78.1% mAP or 80.3% (combined with [Seq-NMS](https://arxiv.org/abs/1602.08465)) on ImageNet VID validation.\n\n#### Integrated Object Detection and Tracking with Tracklet-Conditioned Detection\n\n[[Arxiv]](https://arxiv.org/abs/1811.11167) [[Summary]](https://github.com/ZHANGHeng19931123/awesome-video-object-detection/blob/master/X.md)\n\n- **Date**: Nov 2018\n- **Motivation**: Smoothing the final bounding box predictions across time.\n- **Performance**: 83.5% mAP with [FGFA](https://arxiv.org/abs/1703.10025) and [Deformable ConvNets v2](https://arxiv.org/abs/1811.11168) on ImageNet VID validation.\n\n### 2019\n\n#### AdaScale: Towards Real-time Video Object Detection Using Adaptive Scaling\n\n[[Arxiv]](https://arxiv.org/pdf/1902.02910.pdf)\n\n- **Date**: Feb 2019\n- **Motivation**: Adaptively rescaling the input image resolution to improve both accuracy and speed for video object detection.\n- **Performance**: 75.5% mAP on ImageNet VID validation with multi-scale training at 4 input resolutions (600, 480, 360, 240).\n\n#### Improving Video Object Detection by Seq-Bbox Matching\n\n[[pdf]](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=2ahUKEwjWwNWa95_iAhUMyoUKHR-GAJwQFjABegQIBBAC&url=http%3A%2F%2Fwww.insticc.org%2FPrimoris%2FResources%2FPaperPdf.ashx%3FidPaper%3D72600&usg=AOvVaw1dTqzUoybpNRVkCdkA1xg0)\n\n- **Date**: Feb 2019\n- **Motivation**: Smoothing the final bounding box predictions across time (box-level method).\n- **Performance**: 80.9% mAP (offline detection) and 78.2% mAP (online detection) on ImageNet VID validation, both at 38 fps on a Titan X GPU.\n\n## Comparison table\n\n| Paper | Date | Base detector | Backbone | Tracking? | Optical flow? | Online? | mAP(%) | FPS (Titan X) |\n| ---|---| ---|---|---|---|---|---|---|\n| [Seq-NMS](https://arxiv.org/abs/1602.08465) | Feb 2016 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | no | no | no | 76.8 | 2.3 |\n| [T-CNN](https://arxiv.org/abs/1604.02532) | Apr 2016 | RCNN | DeepIDNet+CRAFT | yes | no | no | 73.8 | - |\n| [DFF](https://arxiv.org/abs/1611.07715) | Nov 2016 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | no | yes | yes | 73.0 | 29 |\n| [TPN](https://arxiv.org/abs/1702.06355) | Feb 2017 | TPN | GoogLeNet | yes | no | no | 68.4 | - |\n| [FGFA](https://arxiv.org/abs/1703.10025) | Mar 2017 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | no | yes | yes | 76.3 | 1.4 |\n| [FGFA](https://arxiv.org/abs/1703.10025) + [Seq-NMS](https://arxiv.org/abs/1602.08465) | Mar 2017 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | no | yes | no | 78.4 | 1.14 |\n| [D&T](https://arxiv.org/abs/1710.03958) | Oct 2017 | [R-FCN](https://arxiv.org/abs/1605.06409) (15 anchors) | ResNet101 | yes | no | no | 79.8 | 7.09 |\n| [STMN](https://arxiv.org/abs/1712.06317) | Dec 2017 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | no | no | no | 80.5 | - |\n| [Scale-time-lattice](https://arxiv.org/abs/1804.05472) | Apr 2018 | [Faster RCNN](https://arxiv.org/abs/1506.01497) (15 anchors) | ResNet101 | no | no | no | 79.6 | 20 |\n| [Scale-time-lattice](https://arxiv.org/abs/1804.05472) | Apr 2018 | [Faster RCNN](https://arxiv.org/abs/1506.01497) (15 anchors) | ResNet101 | no | no | no | 79.0 | **62** |\n| [SSN](https://arxiv.org/abs/1803.05549) (per-frame baseline for STSN) | Mar 2018 | [R-FCN](https://arxiv.org/abs/1605.06409) | Deformable ResNet101 | no | no | yes | 76.0 | - |\n| [STSN](https://arxiv.org/abs/1803.05549) | Mar 2018 | 
[R-FCN](https://arxiv.org/abs/1605.06409) | Deformable ResNet101 | no | no | yes | 78.9 | - |\n| [STSN](https://arxiv.org/abs/1803.05549)+[Seq-NMS](https://arxiv.org/abs/1602.08465) | Mar 2018 | [R-FCN](https://arxiv.org/abs/1605.06409) | Deformable ResNet101 | no | no | no | 80.4 | - |\n| [MANet](http://openaccess.thecvf.com/content_ECCV_2018/html/Shiyao_Wang_Fully_Motion-Aware_Network_ECCV_2018_paper.html) | Sep 2018 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | no | yes | yes | 78.1 | 5 |\n| [MANet](http://openaccess.thecvf.com/content_ECCV_2018/html/Shiyao_Wang_Fully_Motion-Aware_Network_ECCV_2018_paper.html)+[Seq-NMS](https://arxiv.org/abs/1602.08465) | Sep 2018 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | no | yes | no | 80.3 | - |\n| [Tracklet-Conditioned Detection](https://arxiv.org/abs/1811.11167) | Nov 2018 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | yes | no | yes | 78.1 | - |\n| [Tracklet-Conditioned Detection](https://arxiv.org/abs/1811.11167)+[DCNv2](https://arxiv.org/abs/1811.11168) | Nov 2018 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | yes | no | yes | 82.0 | - |\n| [Tracklet-Conditioned Detection](https://arxiv.org/abs/1811.11167)+[DCNv2](https://arxiv.org/abs/1811.11168)+[FGFA](https://arxiv.org/abs/1703.10025) | Nov 2018 | [R-FCN](https://arxiv.org/abs/1605.06409) | ResNet101 | yes | no | yes | **83.5** | - |\n| [Seq-Bbox Matching](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=2ahUKEwjWwNWa95_iAhUMyoUKHR-GAJwQFjABegQIBBAC&url=http%3A%2F%2Fwww.insticc.org%2FPrimoris%2FResources%2FPaperPdf.ashx%3FidPaper%3D72600&usg=AOvVaw1dTqzUoybpNRVkCdkA1xg0) | Feb 2019 | [YOLOv3](https://pjreddie.com/media/files/papers/YOLOv3.pdf) | Darknet-53 | no | no | no | 80.9 | 38 |\n| [Seq-Bbox Matching](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=2ahUKEwjWwNWa95_iAhUMyoUKHR-GAJwQFjABegQIBBAC&url=http%3A%2F%2Fwww.insticc.org%2FPrimoris%2FResources%2FPaperPdf.ashx%3FidPaper%3D72600&usg=AOvVaw1dTqzUoybpNRVkCdkA1xg0) | Feb 2019 | [YOLOv3](https://pjreddie.com/media/files/papers/YOLOv3.pdf) | Darknet-53 | no | no | yes | 78.2 | 38 |\n"
  },
  {
    "path": "STMN.md",
    "content": "# Video Object Detection with an Aligned Spatial-Temporal Memory\n\n## Architecture\n\n![](imgs/STMN.png)\n\n## Summary\n\nProposing 1. a novel **Spatial-Temporal Memory module (STMM)** (as the recurrent computation unit) to **model** long-term temoral appearance and motion dynamicis; 2. a novel **MatchTrans module** to **align** the Spatial-Temporal Memory (feature maps) across frames.\nAssuming $F_t$ as the appearane feature for the current frame and $M^{\\rightarrow}_{t-1}$ as the the feature of all previous frames, the **STMM updates $M^{\\rightarrow}_{t}$ with the input $F_t$ and $M^{\\rightarrow}_{t-1}$**. Two STMMs are used to obtain feature maps from both directions and the final feature maps are the concatenation of $M^{\\rightarrow}_{t}$ and $M^{\\leftarrow}_{t}$. The **MatchTrans module** computes transformation coefficients $\\Gamma$ for position (x,y) from $M_{t-1}$ to $M^{'}_{t-1}$ (' means matched to $F_t$) by **measuing the similarity** between $F_t(x,y)$ and $F_{t-1}(x+i,y+j)$. The transformation coefficients are then used to synthesize $M^{'}_{t-1}$ by **interpolating** the corresponding $M_{t-1}$ feature vectors: $M^{'}_{t-1}(x,y)=\\sum_{i,j \\in \\{-k,...k\\}} \\Gamma_{x,y}(i,j) \\cdot M_{t-1}(x+i,y+j)$. Sequence length $T=7$ for training and and $T=11$ for testing. During testing, STMN detector and initial R-FCN detector detection results are ensembled.\n"
  },
  {
    "path": "STSN.md",
    "content": "# Object Detection in Video with Spatiotemporal Sampling Networks\n\n## Architecture\n\n![](imgs/STSN.png)\n\n## Summary\n\nUsing [deformable convolutions](https://arxiv.org/abs/1703.06211)  across space and time (instead of optical flow) to leverage temporal information for object detection in video, i.e., using **deformable convolutions** to sample relevant features from nearby frames (27 frames in total) and using **temporally aggregagtion** (per-pixel weighted summation) to generate final feature maps for detection network ([R-FCN](https://arxiv.org/abs/1605.06409)).\n"
  }
]