Repository: amusi/daily-paper-computer-vision
Branch: master
Commit: b61562b6f6ab
Files: 44
Total size: 586.2 KB

Directory structure:
gitextract_lcikzkm9/

├── 2018/
│   ├── 05/
│   │   ├── 16.md
│   │   ├── 19.md
│   │   ├── 22.md
│   │   ├── 24.md
│   │   └── 29.md
│   ├── 06/
│   │   ├── 06.md
│   │   ├── 08.md
│   │   ├── 11.md
│   │   ├── 13.md
│   │   ├── 15.md
│   │   ├── 19.md
│   │   ├── 23.md
│   │   └── 29.md
│   ├── 07/
│   │   ├── 02.md
│   │   ├── 05.md
│   │   ├── 06.md
│   │   ├── 07.md
│   │   ├── 19.md
│   │   ├── 23.md
│   │   ├── 27.md
│   │   └── 31.md
│   ├── 08/
│   │   ├── 03.md
│   │   ├── 07.md
│   │   ├── 11.md
│   │   ├── 15.md
│   │   └── 25.md
│   ├── 10/
│   │   ├── 12.md
│   │   └── 17.md
│   ├── 11/
│   │   ├── 05-09.md
│   │   ├── 19.md
│   │   └── 20.md
│   ├── 12/
│   │   ├── 10.md
│   │   ├── 17-21.md
│   │   ├── 24-28.md
│   │   └── 31.md
│   └── cvpr2018-paper-list.csv
├── 2018-Paper.md
├── 2019/
│   ├── 01/
│   │   └── 01-04.md
│   └── 03/
│       └── 12.md
├── 2019-Paper.md
├── 2020-Paper.md
├── 2021-Paper.md
├── 2023-Paper.md
└── README.md

================================================
FILE CONTENTS
================================================

================================================
FILE: 2018/05/16.md
================================================
**2018-05-16**

Summary: 4 paper briefs covering monocular depth estimation, 6-DoF tracking, image synthesis, and motion capture (including 1 CVPR 2018 paper and 1 ICRA 2018 paper).

# Depth Estimation


[1]《Dual CNN Models for Unsupervised Monocular Depth Estimation》

2018 arXiv

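Abstract: Considerable progress has been made on depth estimation in stereo vision. Supervised deep learning achieves very satisfactory depth-estimation performance, but it requires large amounts of calibrated ground-truth data and depth maps, which are time-consuming and laborious to prepare and are often unavailable in practice. Unsupervised depth estimation is therefore the recent approach for getting rid of depth-map ground truth by using binocular stereo images. In unsupervised depth computation, a CNN is trained with an image reconstruction loss based on epipolar geometry constraints to generate disparity images. Effective ways of using CNNs, and better losses for this problem, still need to be investigated. This paper proposes a dual-CNN model for unsupervised depth estimation with six losses (DNM6), using a single CNN per view to generate the corresponding disparity map. The proposed dual-CNN model is also extended to 12 losses (DNM12) by exploiting cross disparities. The DNM6 and DNM12 models are evaluated on the KITTI driving and Cityscapes urban datasets and compared with recent state-of-the-art unsupervised depth estimation results.
arXiv: https://arxiv.org/abs/1804.06324
github: https://github.com/ishmav16/Dual-CNN-Models-for-Unsupervised-Monocular-Depth-Estimation/tree/master/DNM6
Note: Unsupervised learning, wow!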



# 6-DoF Tracking

[2]《Egocentric 6-DoF Tracking of Small Handheld Objects》

2018 arXiv

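Abstract: Virtual and augmented reality technologies have grown significantly over the past few years. A key component of such systems is the ability to track the pose of head-mounted displays and controllers in 3D space. We tackle the problem of efficient 6-DoF tracking of a handheld controller from egocentric camera perspectives. We collected the HMD Controller dataset, which consists of over 540,000 stereo image pairs labelled with the full 6-DoF pose of the handheld controller. Our proposed SSD-AF-Stereo3D model achieves a mean average error of 33.5 millimeters in 3D keypoint prediction and is used in conjunction with an IMU sensor on the controller to enable 6-DoF tracking. We also present results on approaches for model-based full 6-DoF tracking. All our models operate under the strict constraints of real-time mobile CPU inference.
arXiv: https://arxiv.org/abs/1804.05870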



# Image Synthesis

[3]《Geometry-aware Deep Network for Single-Image Novel View Synthesis》
CVPR 2018
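Abstract: This paper tackles the problem of novel view synthesis from a single image. In particular, we target real-world scenes with rich geometric structure, a challenging task because the appearance of such scenes varies widely and simple 3D models to represent them are lacking. Modern learning-based approaches mostly focus on appearance to synthesize novel views and therefore tend to produce predictions that are inconsistent with the underlying scene structure. By contrast, in this paper we propose to exploit the 3D geometry of the scene to synthesize a novel view. Specifically, we approximate a real-world scene with a fixed number of planes and learn to predict a set of homographies and their corresponding region masks to transform the input image into a novel view. To this end, we develop a new region-aware geometric transform network that performs these multiple tasks in a common framework. Our results on the outdoor KITTI and indoor ScanNet datasets demonstrate the effectiveness of our network in generating high-quality synthetic views that respect the scene geometry, thus outperforming state-of-the-art methods.
arXiv: https://arxiv.org/abs/1804.06008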
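
The idea of warping the input view with a set of homographies and per-region masks can be illustrated with a minimal Python sketch. It is only a toy under stated assumptions: the helper names (`apply_homography`, `warp_pixel`), the hard selection of the plane with the largest mask weight, and the example matrices are hypothetical and are not taken from the paper.

```python
def apply_homography(H, x, y):
    """Map pixel (x, y) through a 3x3 homography H given as nested lists."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xh / w, yh / w


def warp_pixel(x, y, homographies, mask_weights):
    """Warp one pixel with the homography of its most likely plane.

    mask_weights[i] is the (assumed) mask value of plane i at this pixel.
    """
    best_plane = max(range(len(homographies)), key=lambda i: mask_weights[i])
    return apply_homography(homographies[best_plane], x, y)


# Two hypothetical planes: one left untouched, one translated by (2, 3).
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
SHIFT = [[1, 0, 2], [0, 1, 3], [0, 0, 1]]

warped = warp_pixel(1, 1, [IDENTITY, SHIFT], mask_weights=[0.2, 0.8])
print(warped)
```

A learned model would presumably blend planes with soft masks and operate on whole images; the hard per-pixel choice here just keeps the example short.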



# Motion Capture

[4]《Human Motion Capture Using a Drone》
ICRA 2018
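Abstract: Current motion capture (MoCap) systems generally require markers and multiple calibrated cameras, which can only be used in constrained environments. In this work we introduce a drone-based system for 3D human MoCap. The system only needs an autonomously flying drone with an on-board RGB camera and can be used in a variety of indoor and outdoor environments. A reconstruction algorithm is developed to recover full-body motion from the video recorded by the drone. We argue that, besides the ability to track a moving subject, a flying drone also provides rapidly changing viewpoints, which is beneficial for motion reconstruction. We evaluate the accuracy of the proposed system using our new DroCap dataset and also demonstrate its applicability in the wild using a consumer drone.
arXiv: https://arxiv.org/abs/1804.06112
Note: A really imaginative piece of research, very cool.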

================================================
FILE: 2018/05/19.md
================================================
**2018-05-19**

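Summary: This post has 4 paper briefs covering face recognition (survey), face detection, 3D object detection and pose estimation, and object detection (including 2 CVPR 2018 papers).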

# Face

[1]《Deep Face Recognition: A Survey》
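Abstract: Driven by graphics processing units (GPUs), large amounts of annotated data, and more advanced algorithms, deep learning has had a huge impact on computer vision and has greatly benefited real-world applications, including face recognition (FR). Deep FR methods leverage deep networks to learn more discriminative representations, significantly improving the state of the art and surpassing human performance (97.53%). In this paper we provide a comprehensive survey of deep FR methods, covering data, algorithms, and scenarios. First, we summarize the commonly used training and test datasets. Then, data preprocessing methods are grouped into two categories: "one-to-many augmentation" and "many-to-one normalization". Second, for algorithms, we summarize the different network architectures and loss functions used in state-of-the-art methods. Third, we review several scenarios in deep FR, such as video FR, 3D FR, and cross-age FR. Finally, some potential deficiencies of current methods and several future directions are highlighted.
arXiv: https://arxiv.org/abs/1804.06655
Note: A survey paper, well worth recommending!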

[2]《SFace: An Efficient Network for Face Detection in Large Scale Variations》
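Abstract: Face detection is a fundamental research topic for many applications such as face recognition. Impressive progress has been made, especially with the recent development of convolutional neural networks. However, the problem of large scale variations, which is widespread in high-resolution images and videos, has not been well addressed in the literature. In this paper we propose a new algorithm called SFace, which efficiently integrates anchor-based and anchor-free methods to address the scale problem. A new dataset called 4K-Face is also introduced to evaluate face detection performance under extreme scale variations. The SFace architecture shows promising results on the new 4K-Face benchmark. In addition, our method runs at 50 frames per second (fps) with an accuracy of 80% AP on the standard WIDER FACE dataset, nearly an order of magnitude faster than existing algorithms with comparable performance.
arXiv: https://arxiv.org/abs/1804.06559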



# 3D Object Detection and Pose Estimation

[3]《Falling Things: A Synthetic Dataset for 3D Object Detection and Pose Estimation》
CVPR 2018 Workshop on Real World Challenges and New Benchmarks for Deep Learning in Robotic Vision
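Abstract: This paper presents a new dataset called Falling Things (FAT) for advancing the state of the art in object detection and 3D pose estimation in the context of robotics. By synthetically combining object models and backgrounds of complex composition and high graphical quality, we are able to generate photorealistic images with accurate 3D pose annotations for all objects in all images. Our dataset contains 60k annotated photos of 21 household objects taken from the YCB dataset. For each image we provide the 3D poses, per-pixel class segmentation, and 2D/3D bounding box coordinates for all objects. To facilitate testing different input modalities, we provide mono and stereo RGB images along with registered dense depth images. We describe in detail the generation process and statistical analysis of the data.
arXiv: https://arxiv.org/abs/1804.06534
datasets: http://research.nvidia.com/publication/2018-06_Falling-Things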



# Object Detection

[4]《Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization》
CVPR 2018 Workshop on Autonomous Driving
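Abstract: We present a system for training deep neural networks for object detection using synthetic images. To handle the variability of real-world data, the system relies on domain randomization, in which the simulator's parameters (such as lighting, pose, and object textures) are randomized in non-realistic ways to force the neural network to learn the essential features of the object of interest. We explore the importance of these parameters and show that networks with compelling performance can be produced using only non-artistically-generated synthetic data. With additional fine-tuning on real data, the network performs better than with real data alone. This result opens up the possibility of training neural networks with inexpensive synthetic data while avoiding the need to collect large amounts of hand-annotated real-world data or to generate high-fidelity synthetic worlds, both of which are bottlenecks for many applications. The approach is evaluated on bounding-box detection of cars on the KITTI dataset.
arXiv: https://arxiv.org/abs/1804.06516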
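
Domain randomization as described above boils down to drawing each simulator parameter from a deliberately wide range for every rendered scene. The following Python sketch shows the sampling step only; the parameter names, their ranges, and the function `sample_scene_params` are illustrative assumptions, not the configuration used in the paper.

```python
import random


def sample_scene_params(rng):
    """Draw one random scene configuration (all names and ranges are hypothetical)."""
    return {
        "light_intensity": rng.uniform(0.2, 2.0),     # relative brightness
        "light_azimuth_deg": rng.uniform(0.0, 360.0),
        "object_yaw_deg": rng.uniform(0.0, 360.0),    # pose of the object of interest
        "texture_id": rng.randrange(100),             # index into a random texture bank
        "num_distractors": rng.randint(0, 10),        # extra shapes cluttering the scene
    }


rng = random.Random(0)  # fixed seed so the sampled scenes are reproducible
scenes = [sample_scene_params(rng) for _ in range(5)]
for scene in scenes:
    print(scene)
```

Each sampled dictionary would then be handed to a renderer to produce one synthetic training image together with its automatically known bounding boxes.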

================================================
FILE: 2018/05/22.md
================================================
**2018-05-22**

Summary: This post has 4 paper briefs covering image segmentation, video segmentation, object tracking, and anomaly detection.

# Image Segmentation

**[1]《Deep Object Co-Segmentation》**

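Abstract: This work presents a deep object co-segmentation (DOCS) approach for segmenting common objects of the same class within a pair of images. This means the method learns to ignore common or uncommon background content and to focus on objects. If multiple object classes are present in the image pair, they are jointly extracted as foreground. To address this task, we propose a CNN-based Siamese encoder-decoder architecture. The encoder extracts high-level semantic features of the foreground objects, a mutual correlation layer detects the common objects, and finally the decoder generates the output foreground mask for each image. To train our model, we compile a large object co-segmentation dataset consisting of image pairs from the PASCAL VOC dataset with common-object masks. We evaluate our approach on commonly used co-segmentation datasets and observe that it consistently outperforms competing methods for both seen and unseen object classes.
arXiv: https://arxiv.org/abs/1804.06423

Note: Co-segmentation, very cool!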



# Video Segmentation

**[2]《Superframes, A Temporal Video Segmentation》**

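Abstract: The goal of video segmentation is to turn video data into a set of concrete motion clusters that can be easily interpreted as the building blocks of the video. There are some works on similar topics, such as detecting scene cuts in a video, but there is little specific research on clustering video data into a desired number of compact segments. It would be more intuitive and more efficient to work with perceptually meaningful entities obtained from a low-level grouping process, which we call superframes. This paper presents a new, simple, and efficient technique to detect superframes of similar content patterns in videos. We compute the similarity of content motion to obtain the strength of change between consecutive frames. With the help of existing optical flow techniques using deep models, the proposed method is able to perform more accurate motion estimation efficiently. We also propose two criteria for measuring and comparing the performance of different algorithms on various databases. Experimental results on videos from benchmark databases demonstrate the effectiveness of the proposed method.

arXiv: https://arxiv.org/abs/1804.06642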



# Object Tracking

**[3]《Unveiling the Power of Deep Tracking》**

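Abstract: In the field of generic object tracking, numerous attempts have been made to exploit deep features. Despite all expectations, deep trackers have yet to reach an outstanding level of performance compared with methods based solely on handcrafted features. In this paper we investigate this key issue and propose an approach to unlock the true potential of deep features for tracking. We systematically study the characteristics of both deep and shallow features and their relation to tracking accuracy and robustness. We identify limited data and low spatial resolution as the main challenges, and propose strategies to counter these issues when integrating deep features for tracking. Furthermore, we propose a novel adaptive fusion approach that leverages the complementary properties of deep and shallow features to improve both robustness and accuracy. Extensive experiments are performed on four challenging datasets. On VOT2017, our approach significantly outperforms the top-performing tracker in terms of EAO, with a relative gain of 17%.

arXiv: https://arxiv.org/abs/1804.06833

Note: The deep features mentioned above are presumably features extracted by deep neural networks.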



# Anomaly Detection

**[4]《Temporal Unknown Incremental Clustering (TUIC) Model for Analysis of Traffic Surveillance Videos》**

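Abstract: An optimized scene representation is an important characteristic of a framework for detecting anomalies in live video. One of the challenges in detecting anomalies in live video is detecting objects in real time in a non-parametric way. Another challenge is efficiently representing the state of objects across frames. In this paper, a Gibbs-sampling-based heuristic model called Temporal Unknown Incremental Clustering (TUIC) is proposed to cluster pixels by motion. Pixel motion is first detected using optical flow, and a Bayesian algorithm is applied to associate pixels belonging to similar clusters in subsequent frames. The algorithm is fast and produces accurate results in Θ(kn) time, where k is the number of clusters and n is the number of pixels. Our experimental validation on publicly available datasets shows that the proposed framework has good potential to open up new opportunities for real-time traffic analysis.

arXiv: https://arxiv.org/abs/1804.06680

Note: For analysis of traffic surveillance videos.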

================================================
FILE: 2018/05/24.md
================================================
**2018-05-24**

Summary: This post has 5 paper briefs covering liveness detection, SfM, disparity estimation, zero-shot learning, and 3D shape.

# Liveness Detection

**[1]《Liveness Detection Using Implicit 3D Features》**

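Abstract: Spoofing attacks are a threat to modern face recognition systems. In this work we present a simple yet effective liveness detection approach to enhance 2D face recognition methods and make them robust against spoofing attacks. We show that the risk of spoofing attacks can be reduced by using an additional light source, such as a flash. From a pair of input images taken under different illumination, we define discriminative features that carry facial three-dimensional information. Furthermore, we show that when multiple light sources are considered, we are able to validate which one has been activated. This makes it possible to design a highly secure active-light authentication framework. Finally, further investigating how to use 3D features without 3D reconstruction, we introduce an approximate disparity-based implicit 3D feature obtained from an uncalibrated stereo pair of cameras. Evaluation experiments show that the proposed method yields state-of-the-art results in challenging scenarios with almost no feature-extraction latency.

arXiv: https://arxiv.org/abs/1804.06702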



# **Structure from Motion (SfM)**

**[2]《Structure from Recurrent Motion: From Rigidity to Recurrency》**

**CVPR 2018**

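Abstract: This paper proposes a new non-rigid structure-from-motion (NRSfM) method that works from a long monocular video sequence observing a non-rigid object performing recurrent and possibly repetitive dynamic actions. Departing from the traditional NRSfM approach of using linear low-order or low-rank shape models, our method exploits the property of shape recurrency (i.e., many deforming shapes tend to repeat themselves in time). We show that recurrency is in fact a generalized rigidity. Based on this, we reduce the NRSfM problem to a rigid one, provided that certain recurrency conditions are satisfied. Given this reduction, standard rigid-SfM techniques can be applied directly (without any change) to reconstruct non-rigid dynamic shapes. To turn this idea into a practical approach, the paper develops efficient algorithms for automatic recurrency detection, as well as for camera-view clustering via a rigidity check. Experiments on both simulated sequences and real data demonstrate the effectiveness of the method. Since this paper offers a novel perspective for rethinking structure from motion, we hope it will inspire other new problems in the field.

arXiv: https://arxiv.org/abs/1804.06510

Note: NRSfM, great research!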



# Disparity Estimation

**[3]《Variational Disparity Estimation Framework for Plenoptic Image》**

**ICME 2017**

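Abstract: This paper presents a computational framework for accurately estimating the disparity map of plenoptic images. The proposed framework is based on the variational principle and provides intrinsic sub-pixel precision. The light-field motion tensor introduced in this framework allows us to combine advanced robust data terms and to provide explicit treatment of the different color channels. A warping strategy is embedded in our framework to tackle the large-displacement problem. We also show that applying a simple regularization term and guided median filtering greatly improves the accuracy of the displacement field in occluded regions. We demonstrate the excellent performance of the proposed framework through in-depth comparisons with the Lytro software and on both synthetic and real-world datasets.

arXiv: https://arxiv.org/abs/1804.06633

github: https://github.com/hieuttcse/variational_plenoptic_disparity_estimation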



# Zero-shot Learning

**[4]《Zero-shot Learning with Complementary Attributes》**

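Abstract: Zero-shot learning (ZSL) aims to recognize unseen objects using disjoint seen objects via attributes, transferring semantic information from the training data to the test data. The generalization performance of ZSL is governed by the attributes, which represent the relatedness between the seen and unseen classes. In this paper we propose a novel ZSL method that uses complementary attributes as a supplement to the original attributes. We first expand the attributes with their complementary form, and then pre-train classifiers for both the original and the complementary attributes using the training data. After ranking the classes for each attribute, we use a rank aggregation framework to compute the optimized rank among the test classes, and the highest-ranked class is assigned as the label of the test sample. We empirically demonstrate that complementary attributes are an effective improvement for ZSL models. Experimental results show that our approach outperforms state-of-the-art methods on standard ZSL datasets.

arXiv: https://arxiv.org/abs/1804.06505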
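
The pipeline in this abstract (expand attributes with complements, rank classes per attribute, aggregate the rankings) can be sketched in a few lines of Python. This is a toy under loud assumptions: the complement of an attribute value `a` is taken to be `1 - a`, the per-attribute classifier outputs are made-up numbers, and a simple Borda count stands in for the paper's rank aggregation framework, which is not described here.

```python
def expand_with_complements(attrs):
    """Append the complementary form (assumed to be 1 - a) of every attribute."""
    return list(attrs) + [1.0 - a for a in attrs]


def rank_classes(attr_index, sample_scores, class_attrs):
    """Rank classes by how closely their attribute value matches the sample's score."""
    return sorted(
        class_attrs,
        key=lambda c: abs(class_attrs[c][attr_index] - sample_scores[attr_index]),
    )


def predict(sample_scores, class_attrs):
    """Aggregate the per-attribute rankings with a Borda count (a stand-in choice)."""
    n_classes = len(class_attrs)
    points = {c: 0 for c in class_attrs}
    for j in range(len(sample_scores)):
        for rank, c in enumerate(rank_classes(j, sample_scores, class_attrs)):
            points[c] += n_classes - 1 - rank
    return max(points, key=points.get)


# Hypothetical unseen classes described by two attributes: [striped, aquatic].
unseen = {"zebra": [1.0, 0.1], "horse": [0.2, 0.0], "whale": [0.0, 1.0]}
expanded = {c: expand_with_complements(a) for c, a in unseen.items()}

# Hypothetical outputs of four separately trained attribute classifiers:
# [striped, aquatic, not-striped, not-aquatic].
scores = [0.9, 0.1, 0.2, 0.8]
label = predict(scores, expanded)
print(label)
```

Because the complementary classifiers are trained separately, their scores need not equal one minus the original scores, which is where the extra information would come from.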



# 3D Shape

**[5]《Semi-Supervised Co-Analysis of 3D Shape Styles from Projected Lines》**

**ACM Transactions on Graphics 2018**

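Abstract: We present a semi-supervised co-analysis method for learning 3D shape styles from projected feature lines, achieving style patch localization with only weak supervision. Given a collection of 3D shapes spanning multiple object categories and styles, we perform style co-analysis over the projected feature lines of each 3D shape and then back-project the learned style features onto the 3D shapes. Our core analysis pipeline starts with mid-level patch sampling and pre-selection of candidate style patches. Projective features are then encoded via patch convolution. Multi-view feature integration and style clustering are carried out under the framework of partially shared latent factor (PSLF) learning, a multi-view feature learning scheme. PSLF achieves effective multi-view feature fusion by distilling and exploiting consistent and complementary feature information from multiple views, while also selecting style patches from the candidates. Our style analysis approach supports both unsupervised and semi-supervised analysis. For the latter, our method accepts user-specified shape labels and style-ranked triplets as clustering constraints. We demonstrate results from 3D shape style analysis and patch localization, as well as improvements over state-of-the-art methods. We also present several applications enabled by our style analysis.

arXiv: https://arxiv.org/abs/1804.06579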

================================================
FILE: 2018/05/29.md
================================================
**2018-05-29**

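Summary: This post has 4 paper briefs covering image classification, video classification, and semantic segmentation (including one ICLR 2018 paper and one CVPR 2018 paper).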

# Image Classification

**[1]《IamNN: Iterative and Adaptive Mobile Neural Network for Efficient Image Classification》**

ICLR 2018 Workshop track

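Abstract: Deep residual networks (ResNets) made a recent breakthrough in deep learning. The core idea of ResNets is to have shortcut connections between layers, which allow the network to be much deeper while remaining easy to optimize and avoiding vanishing gradients. These shortcut connections have interesting side effects that make ResNets behave differently from other typical network architectures. In this work we use these properties to design a network based on ResNet but with parameter sharing and adaptive computation time. The resulting network is much smaller than the original and can adapt its computational cost to the complexity of the input image.

arXiv: https://arxiv.org/abs/1804.10123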



**[2]《Progressive Neural Networks for Image Classification》**

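Abstract: The inference structures and computational complexity of existing deep neural networks, once trained, are fixed and remain the same for all test images. In practice, however, it is highly desirable to establish a progressive structure for deep neural networks that can adapt its inference process and complexity to images with different visual recognition complexity. In this work we develop a multi-stage progressive structure with integrated confidence analysis and decision policy learning for deep neural networks. This new framework consists of a set of network units that are activated sequentially, with progressively increasing complexity and visual recognition power. Our extensive experimental results on the CIFAR-10 and ImageNet datasets show that the proposed progressive deep neural network achieves more than 10-fold complexity scalability while delivering state-of-the-art performance with a single network model that satisfies different complexity-accuracy requirements.

arXiv: https://arxiv.org/abs/1804.09803

Note: Quite an interesting piece of research.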



# Video Classification

**[3]《Better and Faster: Knowledge Transfer from Multiple Self-supervised Learning Tasks via Graph Distillation for Video Classification》**

IJCAI 2018

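Abstract: Video representation learning is a vital problem for classification tasks. Recently, a promising unsupervised paradigm termed self-supervised learning has emerged, which explores the inherent supervisory signals implied in massive data for feature learning by solving auxiliary tasks. However, existing methods in this regard suffer from two limitations when extended to video classification. First, they focus on only a single task and ignore the complementarity among different task-specific features, resulting in suboptimal video representations. Second, high computation and memory costs hinder their application in real-world scenarios. In this paper we propose a graph-based distillation framework to address these problems: (1) we propose a logits graph and a representation graph to transfer knowledge from multiple self-supervised tasks, where the former distills classifier-level knowledge by solving a multi-distribution joint matching problem, and the latter distills internal feature knowledge from pairwise ensembled representations while tackling the challenge of heterogeneity among different features; (2) adopting a teacher-student framework effectively reduces the redundancy of the knowledge learned from the teachers, leading to a lighter student model that solves the classification task more efficiently. Experimental results on 3 video datasets validate that our proposal not only helps learn better video representations but also compresses the model for faster inference.

arXiv: https://arxiv.org/abs/1804.10069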



# Semantic Segmentation

**[4]《Fully Convolutional Adaptation Networks for Semantic Segmentation》**

CVPR 2018, Rank 1 in Segmentation Track of Visual Domain Adaptation Challenge 2017

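Abstract: Recent advances in deep neural networks have convincingly demonstrated a high capability for learning vision models on large datasets. Nevertheless, collecting expert-labeled datasets, especially with pixel-level annotations, is an extremely expensive process. An appealing alternative is to render synthetic data (e.g., from computer games) and generate ground truth automatically. However, simply applying models learned on synthetic images may lead to high generalization error on real images due to domain shift. In this paper we tackle this issue from the perspectives of both visual appearance-level and representation-level domain adaptation. The former adapts source-domain images to appear as if drawn from the "style" of the target domain, and the latter attempts to learn domain-invariant representations. Specifically, we present Fully Convolutional Adaptation Networks (FCAN), a novel deep architecture for semantic segmentation that combines an Appearance Adaptation Network (AAN) and a Representation Adaptation Network (RAN). AAN learns a transformation from one domain to the other in pixel space, and RAN is optimized in an adversarial learning manner to maximally fool the domain discriminator with the learned source and target representations. Extensive experiments are conducted on transfer from GTA5 (game videos) to Cityscapes (urban street scenes) for semantic segmentation, and our proposal achieves superior results compared with state-of-the-art unsupervised adaptation techniques. More remarkably, we obtain a new record: an mIoU of 47.5% on BDDS (drive-cam videos) in an unsupervised setting.

arXiv: https://arxiv.org/abs/1804.08286

Note: Proposes Fully Convolutional Adaptation Networks (FCAN), an improved version of FCN that combines Appearance Adaptation Networks (AAN) and Representation Adaptation Networks (RAN).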

================================================
FILE: 2018/06/06.md
================================================
**2018-06-06**

Summary: This post has 4 paper briefs covering object tracking, GANs, zero-shot learning, video classification, and person re-identification (including one IJCAI 2018 paper and one IROS 2018 submission).

# Object Tracking

**《Detection-Tracking for Efficient Person Analysis: The DetTA Pipeline》**
**IROS 2018 submission** 
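Abstract: In the past decade many robots have been deployed in the wild, and people detection and tracking is an important component of such deployments. On top of that, it is often necessary to run modules that analyze persons and extract higher-level attributes such as age and gender, or dynamic information such as gaze and pose. The latter are especially necessary for building reactive, social robot-person interaction.
In this paper we combine these components in a fully modular detection-tracking-analysis pipeline called DetTA. We investigate the benefits of such an integration, using the examples of head and skeleton pose, by applying temporal filtering to the analysis modules' observations with the consistent track IDs, showing a slight improvement in challenging real-world scenarios. We also study the potential of a so-called "free-flight" mode, where the analysis of a person attribute relies only on the filter's predictions for certain frames. Here our study shows that this greatly improves runtime while prediction quality remains stable. This insight is especially important for reducing power consumption and sharing precious (GPU) memory when running many analysis components on a mobile platform, particularly in the era of costly deep learning methods.
arXiv: https://arxiv.org/abs/1804.10134
github: https://github.com/sbreuers/detta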



# GAN & Zero-Shot Learning & Video Classification

**《Visual Data Synthesis via GAN for Zero-Shot Video Classification》**
**IJCAI 2018**
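Abstract: Zero-shot learning (ZSL) in video classification is a promising research direction that aims to tackle the challenge posed by the explosive growth of video categories. Most existing methods exploit seen-to-unseen correlations by learning a projection between the visual and semantic spaces. However, such a projection-based paradigm cannot fully utilize the discriminative information implied in the data distribution, and it commonly suffers from the information degradation issue caused by the "heterogeneity gap". In this paper we propose a visual data synthesis framework via GAN to address these problems. Specifically, both semantic knowledge and the visual distribution are leveraged to synthesize video features of unseen categories, and ZSL is turned into a typical supervised problem with the synthetic features. First, we propose multi-level semantic inference to boost video feature synthesis, which captures the discriminative information implied in the joint visual-semantic distribution via feature-level and label-level semantic inference. Second, we propose matching-aware mutual information correlation to overcome the information degradation issue, which captures seen-to-unseen correlations in matched and mismatched visual-semantic pairs by mutual information, providing the zero-shot synthesis procedure with robust guidance signals. Experimental results on four video datasets show that our approach significantly improves zero-shot video classification performance.
arXiv: https://arxiv.org/abs/1804.10073
Note: Speaking as a CV beginner, zero-shot learning has been getting a lot of exposure lately! And GANs remain as appealing as ever!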

# Generative Model

**《Generative Model for Heterogeneous Inference》**

**arXiv 2018**

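Abstract: Generative models (GMs) such as generative adversarial networks (GANs) and variational auto-encoders (VAEs) have thrived in recent years and achieve high-quality results in generating new samples. In computer vision in particular, GMs have been used for image inpainting, denoising, and similar tasks, which can be viewed as inference from observed pixels to corrupted pixels. However, images are hierarchically structured, which is quite different from many real-world inference scenarios with non-hierarchical features. These inference scenarios contain heterogeneous stochastic variables and irregular mutual dependencies. Traditionally they are modeled by Bayesian networks (BNs). However, learning and inference in BN models are NP-hard, so the number of stochastic variables in a BN is highly constrained. In this paper we adapt typical GMs to enable heterogeneous learning and inference in polynomial time. We also propose an extended autoregressive (EAR) model and an EAR with adversarial loss (EARA) model, and give theoretical results on their effectiveness. Experiments on several BN datasets show that our proposed EAR model achieves the best performance in most cases compared with other GMs. Besides black-box analysis, we also conduct a series of white-box analysis experiments on Markov border inference of GMs, and give theoretical results.
arXiv: https://arxiv.org/abs/1804.09858
Note: A seriously hard-core paper!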

# Re-ID

**《Domain Adaptation through Synthesis for Unsupervised Person Re-identification》**

**arXiv 2018**

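Abstract: Drastic variations in illumination across surveillance cameras make the person re-identification problem extremely challenging. Current large-scale re-identification datasets have a significant number of training subjects but lack diversity in lighting conditions. As a result, a trained model requires fine-tuning to become effective under unseen illumination conditions. To alleviate this problem, we introduce a new synthetic dataset that contains hundreds of illumination conditions. Specifically, we use 100 virtual humans illuminated with multiple HDR environment maps that accurately model realistic indoor and outdoor lighting. To achieve better accuracy under unseen illumination conditions, we propose a novel domain adaptation technique that takes advantage of our synthetic data and performs fine-tuning in a completely unsupervised way. Our approach yields significantly higher accuracy than semi-supervised and unsupervised state-of-the-art methods, and is very competitive with supervised techniques.
arXiv: https://arxiv.org/abs/1804.10094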

================================================
FILE: 2018/06/08.md
================================================
**2018-06-08**

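Summary: This post has 4 paper briefs covering capsule networks, transfer learning, CNN optimization, and finger detection (including one NIPS 2017 paper, one ICMR 2018 paper, and one VCIP 2017 paper).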

# Capsule Networks & Transfer Learning

**《Capsule networks for low-data transfer learning》**

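Abstract: We propose a capsule network-based framework for generalizing learning to new data from only a few examples. Using both generative and non-generative capsule networks with intermediate routing, we are able to generalize to new information 25 times faster than a similar convolutional neural network. We train the networks on the multiMNIST dataset with one digit missing. After the networks reach their maximum accuracy, we inject 1-100 samples of the missing digit into the training set and measure the number of batches needed to return to a comparable level of accuracy. We then discuss the improvement in low-data transfer learning that capsule networks bring, and propose future directions for capsule research.

arXiv: https://arxiv.org/abs/1804.10172

Note: It feels like capsule networks have not been as hot lately.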

# CNN

**《Competitive Learning Enriches Learning Representation and Accelerates the Fine-tuning of CNNs》**

**NIPS 2017**

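Abstract: In this study we propose integrating competitive learning into convolutional neural networks (CNNs) to improve representation learning and the efficiency of fine-tuning. Conventional CNNs use back-propagation learning, which enables powerful representation learning through a discrimination task. However, it requires a huge amount of labeled data, and labeled data are much harder to acquire than unlabeled data. Effective use of unlabeled data is therefore increasingly important for DNNs. To address this problem, we introduce unsupervised competitive learning into the convolutional layers and utilize unlabeled data for effective representation learning. The results of validation experiments using a toy model show that strong representation learning effectively extracts bases of images into the convolutional filters using unlabeled data, and accelerates the fine-tuning of the subsequent supervised back-propagation learning. The leverage is more apparent when the number of filters is sufficiently large, and in that case the error rate drops steeply in the initial phase of fine-tuning. Thus, the proposed method enlarges the number of filters in CNNs and enables a more detailed and generalized representation. It could open up the possibility of neural networks that are not only deep but also broad.

arXiv: https://arxiv.org/abs/1804.09859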

# Visual Estimation

**《Visual Estimation of Building Condition with Patch-level ConvNets》**

**ICMR 2018**

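Abstract: The condition of a building is an important factor in real estate valuation. Currently, the assessment made by real estate appraisers is subjective to some degree. We propose a novel vision-based approach for assessing building condition from exterior views of the building. To this end, we develop a multi-scale patch-based pattern extraction approach and combine it with convolutional neural networks to estimate building condition from visual clues. Our evaluation shows that visually estimated building condition can serve as a proxy for the condition estimates made by appraisers.

arXiv: https://arxiv.org/abs/1804.10113

Note: Computer vision meets real estate valuation! What an imaginative idea!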

# Finger Detection

**《Two-Stream Binocular Network: Accurate Near Field Finger Detection Based On Binocular Images》**

VCIP 2017

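Abstract: Fingertip detection plays an important role in human-computer interaction. Previous work transforms binocular images into depth images and then uses depth-based hand pose estimation methods to predict the 3D positions of the fingertips. Unlike previous work, we propose a new framework, named Two-Stream Binocular Network (TSBnet), to detect fingertips directly from binocular images. TSBnet first shares convolutional layers for the low-level features of the left and right images. It then extracts the high-level features in two separate convolutional streams. Furthermore, we add a new layer, the binocular distance measurement layer, to improve the performance of our model. To verify our scheme, we build a binocular hand image dataset containing about 117k image pairs in the training set and 10k image pairs in the test set. Our method achieves an average error of 10.9 mm on our test set, outperforming previous work by 5.9 mm (35.1% relative).

arXiv: https://arxiv.org/abs/1804.10160

IEEE: https://ieeexplore.ieee.org/abstract/document/8305146/

Datasets: https://sites.google.com/view/thuhand17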



================================================
FILE: 2018/06/11.md
================================================
**2018-06-11**

Summary: This post has 4 paper briefs covering CNN pruning, a new face detection dataset, forest tree classification, and traffic sign detection.

# CNN 

**《Accelerator-Aware Pruning for Convolutional Neural Networks》**

submitted to IEEE Transactions on Circuits and Systems for Video Technology

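Abstract: Convolutional neural networks have shown tremendous performance in computer vision tasks, but their excessive number of weights and operations prevents them from being adopted in embedded environments. One solution involves pruning, where some unimportant weights are forced to zero. Many pruning schemes have been proposed, but they have focused mainly on the number of pruned weights. Previous pruning schemes hardly consider ASIC or FPGA accelerator architectures. When a pruned network runs on an accelerator, the lack of architectural consideration causes some inefficiency problems, including internal buffer misalignment and load imbalance. This paper proposes a new pruning scheme that reflects accelerator architectures. In the proposed scheme, pruning is performed so that the same number of weights remains for each weight group corresponding to activations fetched simultaneously. In this way, the pruning scheme resolves the inefficiency problems. Even with this constraint, the proposed scheme reaches a pruning ratio similar to that of previous unconstrained pruning schemes, not only in AlexNet and VGG16 but also in state-of-the-art very deep networks such as ResNet. Furthermore, the proposed scheme shows a comparable pruning ratio in slimmed networks whose channels have already been pruned. In addition to improving the efficiency of previous sparse accelerators, it is also shown that the proposed pruning scheme can be used to reduce the logic complexity of sparse accelerators.

arXiv: https://arxiv.org/abs/1804.09862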
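
The constraint described above, keeping the same number of surviving weights in every weight group, can be sketched with a small Python function. It is a minimal illustration assuming that groups are consecutive slices of a flat weight list and that weights are ranked by magnitude; the function name `prune_groups` and both parameters are hypothetical, and the paper's actual grouping depends on the accelerator.

```python
def prune_groups(weights, group_size, keep):
    """Zero all but the `keep` largest-magnitude weights in each consecutive group."""
    pruned = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Indices of the `keep` weights with the largest absolute value in this group.
        by_magnitude = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)
        survivors = set(by_magnitude[:keep])
        pruned.extend(w if i in survivors else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
pruned = prune_groups(weights, group_size=4, keep=2)
print(pruned)
```

Because every group ends up with exactly `keep` non-zero entries, each processing element of a sparse accelerator would receive the same amount of work, which is the load-balancing effect the abstract refers to.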

# Face

**《Pushing the Limits of Unconstrained Face Detection: a Challenge Dataset and Baseline Results》**

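Abstract: Face detection has made immense progress in the last few years, with new milestones being surpassed every year. While many challenges such as large variations in scale, pose, and appearance have been successfully addressed, several issues remain that are not specifically captured by existing methods or datasets. In this work we identify the next set of challenges that require attention from the research community, and collect a new image dataset involving these issues, such as weather-based degradations, motion blur, and focus blur. We demonstrate that there is a considerable gap between the performance of state-of-the-art detectors and real-world requirements. Hence, to further fuel research in unconstrained face detection, we present a new annotated Unconstrained Face Detection Dataset (UFDD) with several challenges and benchmark recent methods on it. In addition, we provide an in-depth analysis of the results and failure cases of these methods. The dataset and baseline results will be made publicly available in due time.

arXiv: https://arxiv.org/abs/1804.10275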

# Image Classification

**《Automatic classification of trees using a UAV onboard camera and deep learning》**

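Abstract: Automatic classification of trees using remotely sensed data has long been a dream of many scientists and land use managers. Recently, unmanned aerial vehicles (UAVs) have come to be seen as an easy-to-use, cost-effective tool for remote sensing of forests, and deep learning has attracted attention for its ability in machine vision. In this study, using a commercially available UAV and a publicly available deep learning package, we constructed a machine vision system for the automatic classification of trees. In our method, we segment UAV photographs of the forest into individual tree crowns and carry out object-based deep learning. As a result, the system is able to classify 7 tree types with 89.0% accuracy. This performance is notable because we use only basic RGB images from a standard UAV. In contrast, most previous studies used expensive hardware such as multispectral imagers to improve performance. This result means that our method has the potential to classify individual trees in a cost-effective manner, which could be a useful tool for many forest researchers and managers.

arXiv: https://arxiv.org/abs/1804.10390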

# Traffic Sign Detection

**《Localized Traffic Sign Detection with Multi-scale Deconvolution Networks》**

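Abstract: Effective traffic sign detection via deep learning plays a vital role in autonomous driving. However, different countries have different sets of traffic signs, which makes training localized traffic sign recognition models a tedious and daunting task. To address the large amount of time consumed by computationally complex algorithms and the low detection rate for blurry and sub-pixel images of localized traffic signs, we propose Multi-Scale Deconvolution Networks (MDN), which combine a multi-scale convolutional neural network with a deconvolution sub-network, leading to efficient and reliable training of localized traffic sign recognition models. The proposed MDN is shown to be effective compared with classical algorithms on localized traffic sign benchmarks such as the Chinese Traffic Sign Dataset (CTSD) and the German Traffic Sign Benchmark (GTSRB).

arXiv: https://arxiv.org/abs/1804.10428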

================================================
FILE: 2018/06/13.md
================================================
**2018-06-13**

Summary: This post has 4 paper briefs, all on image segmentation; three of them improve on the U-Net architecture.

# Image Segmentation

**《dhSegment: A generic deep-learning approach for document segmentation》**

Abstract: In recent years there have been multiple successful attempts tackling document processing problems separately by designing task specific hand-tuned strategies. We argue that the diversity of historical document processing tasks prohibits to solve them one at a time and shows a need for designing generic approaches in order to handle the variability of historical series. In this paper, we address multiple tasks simultaneously such as page extraction, baseline extraction, layout analysis or multiple typologies of illustrations and photograph extraction. We propose an open-source implementation of a CNN-based pixel-wise predictor coupled with task dependent post-processing blocks. We show that a single CNN-architecture can be used across tasks with competitive results. Moreover most of the task-specific post-processing steps can be decomposed in a small number of simple and standard reusable operations, adding to the flexibility of our approach.

arXiv: https://arxiv.org/abs/1804.10371

Note: Left untranslated because... the original text already reads just fine.



**《Automatic Pixelwise Object Labeling for Aerial Imagery Using Stacked U-Nets》**

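Abstract: Automation of object labeling in aerial imagery is a computer vision task with many practical applications. Fields like energy exploration require an automated method to process a continuous stream of imagery on a daily basis. In this paper we propose a pipeline to tackle this problem using a stack of end-to-end convolutional neural networks (U-Net architecture). Each network works as a post-processor for the previous one. Our model outperforms the current state of the art on two different datasets, the Inria Aerial Image Labeling dataset and the Massachusetts Buildings dataset, each with different characteristics such as spatial resolution, object shapes, and scales. Moreover, we experimentally validate the computation time savings obtained by processing sub-sampled images and later upsampling the pixelwise labeling. These savings come at a negligible cost in segmentation quality. Although the experiments in this paper cover only aerial imagery, the technique presented is general and can handle other types of images.

arXiv: https://arxiv.org/abs/1803.04953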



**《Stacked U-Nets: A No-Frills Approach to Natural Image Segmentation》**

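Abstract: Many imaging tasks require global information about all pixels in an image. Conventional bottom-up classification networks globalize information by decreasing resolution; features are pooled and downsampled into a single output. But for semantic segmentation and object detection tasks, a network must provide higher-resolution pixel-level outputs. To globalize information while preserving resolution, many researchers have proposed including sophisticated auxiliary blocks, but these come at the cost of a considerable increase in network size and computational cost. This paper proposes stacked u-nets (SUNets), which iteratively combine features from different resolution scales while maintaining resolution. SUNets leverage the information globalization power of u-nets in a deeper network architecture that is capable of handling the complexity of natural images. SUNets perform extremely well on semantic segmentation tasks using a small number of parameters.

arXiv: https://arxiv.org/abs/1804.10343

code: https://github.com/shahsohil/sunets

Note: To be studied in depth.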



**《Stack-U-Net: Refinement Network for Image Segmentation on the Example of Optic Disc and Cup》**

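Abstract: In this work we propose a special cascade network for image segmentation, based on the U-Net network as a building block and on the idea of iterative refinement. The model was mainly applied to achieve higher recognition quality for the task of finding the borders of the optic disc and cup. Compared with a single U-Net and state-of-the-art methods, very high segmentation quality is achieved without increasing the volume of the datasets. Our experiments include comparison with the best-known methods on the publicly available databases DRIONS-DB, RIM-ONE v.3, and DRISHTI-GS, and evaluation on a private dataset collected in collaboration with the University of California San Francisco Medical School. An analysis of the architecture details is presented, and it is argued that the model can be employed for a broad range of image segmentation problems of a similar nature.

arXiv: https://arxiv.org/abs/1804.11294

Amusi's takeaway: U-Net really can do whatever it wants in image segmentation (especially in the medical domain)!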

================================================
FILE: 2018/06/15.md
================================================
**2018-06-15**

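Summary: This post has 4 paper briefs, all face-related, covering face recognition, face detection, and facial expression recognition. One of them is a CVPR 2018 paper.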

[TOC]

# Face Recognition

**《Scalable Angular Discriminative Deep Metric Learning for Face Recognition》**

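Abstract: With the development of deep learning, deep metric learning (DML) has achieved great improvements in face recognition. Specifically, the softmax loss widely used in the training process often brings large intra-class variations, and feature normalization is only exploited in the testing process to compute the pair similarities. To bridge the gap, we impose the intra-class cosine similarity between the features and weight vectors in softmax loss larger than a margin in the training step, and extend it from four aspects. First, we explore the effect of a hard sample mining strategy. To alleviate the human labor of adjusting the margin hyper-parameter, a self-adaptive margin updating strategy is proposed. Then, a normalized version is given to take full advantage of the cosine similarity constraint. Furthermore, we enhance the former constraint to force the intra-class cosine similarity to be larger than the mean inter-class cosine similarity with a margin in the exponential feature projection space. Extensive experiments on the Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and IARPA Janus Benchmark A (IJB-A) datasets demonstrate that the proposed methods outperform mainstream DML methods and approach state-of-the-art performance.

arXiv: https://arxiv.org/abs/1804.10899

Note: This paper feels very, very hard-core!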
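
To make the margin idea concrete, here is a small Python sketch of a generic additive cosine-margin softmax loss: the cosine similarity between a feature and its own class weight must exceed the others by a margin before the loss becomes small. This is a common textbook variant used purely for illustration; the function names, the `margin` and `scale` values, and the exact form of the loss are assumptions and are not claimed to match the paper's formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def margin_softmax_loss(feature, class_weights, target, margin=0.35, scale=10.0):
    """Cross-entropy over scaled cosines, with `margin` subtracted from the target class."""
    logits = []
    for k, w in enumerate(class_weights):
        c = cosine(feature, w)
        if k == target:
            c -= margin  # the target cosine has to win by at least this margin
        logits.append(scale * c)
    peak = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


feature = [1.0, 0.0]
class_weights = [[1.0, 0.0], [0.0, 1.0]]
plain = margin_softmax_loss(feature, class_weights, target=0, margin=0.0)
with_margin = margin_softmax_loss(feature, class_weights, target=0, margin=0.35)
print(plain, with_margin)
```

Even for a perfectly aligned feature, the loss with a margin stays larger than the plain softmax loss, so training keeps pushing same-class features closer together.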

# Facial Expression Recognition

**《Unsupervised Features for Facial Expression Intensity Estimation over Time》**

CVPR 2018

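Abstract: The diversity of facial shapes and motions among persons is one of the greatest challenges for the automatic analysis of facial expressions. In this paper we propose a feature that describes expression intensity over time while being invariant to the person and to the type of expression performed. Our feature is a weighted combination of the dynamics of multiple points, adapted to the overall expression trajectory. We evaluate our method on several tasks, all related to the temporal analysis of facial expressions. The proposed feature is compared with state-of-the-art methods for expression intensity estimation, which it outperforms. We use our proposed feature to temporally align multiple sequences of recorded 3D facial expressions. Furthermore, we show how our feature can be used to reveal person-specific differences in the performance of facial expressions. Additionally, we apply our feature to identify local changes in face video sequences based on action unit labels. In all experiments our feature proves robust against noise and outliers, making it applicable to a variety of facial motion analysis applications.

arXiv: https://arxiv.org/abs/1805.00780

Note: Wow, this feature is great!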

**《Local Learning with Deep and Handcrafted Features for Facial Expression Recognition》**
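Abstract: We present an approach that combines automatic features learned by convolutional neural networks (CNNs) with handcrafted features computed by the bag-of-visual-words (BOVW) model in order to achieve state-of-the-art results in facial expression recognition. To obtain the automatic features, we experiment with multiple CNN architectures, pre-trained models, and training procedures, e.g., Dense-Sparse-Dense. After fusing the two types of features, we employ a local learning framework to predict the class label for each test image. The local learning framework is based on three steps. First, a k-nearest-neighbors model is applied to select the nearest training samples for an input test image. Second, a one-versus-all support vector machine (SVM) classifier is trained on the selected training samples. Finally, the SVM classifier is used to predict the class label only for the test image it was trained for. Although local learning has been used before in combination with handcrafted features, to the best of our knowledge it has never been employed in combination with deep features. Experiments on the 2013 Facial Expression Recognition (FER) Challenge dataset and the FER+ dataset demonstrate that our approach achieves state-of-the-art results. With a top accuracy of 75.42% on the FER 2013 dataset and 86.71% on the FER+ dataset, we surpass all competitors by nearly 2% on both datasets.

arXiv: https://arxiv.org/abs/1804.10892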
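
The three-step local learning procedure above can be sketched as follows. To stay dependency-free, this Python toy replaces the SVM with a nearest-class-centroid classifier fitted on the selected neighbours; that swap, the function name `local_predict`, the 2-D features, and the labels are all illustrative assumptions rather than the paper's setup.

```python
import math


def local_predict(test_x, train_xs, train_ys, k=3):
    """Classify one test sample with a model fitted only on its k nearest neighbours."""
    # Step 1: select the k training samples closest to the test sample.
    nearest = sorted(range(len(train_xs)), key=lambda i: math.dist(test_x, train_xs[i]))[:k]

    # Step 2: fit a local classifier on those samples
    # (class centroids here, standing in for the SVM used in the paper).
    sums, counts = {}, {}
    for i in nearest:
        label = train_ys[i]
        acc = sums.setdefault(label, [0.0] * len(test_x))
        for d, value in enumerate(train_xs[i]):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    # Step 3: use the local classifier for this one test sample only.
    return min(centroids, key=lambda y: math.dist(test_x, centroids[y]))


train_xs = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_ys = ["neutral", "neutral", "neutral", "happy", "happy", "happy"]
print(local_predict((4.5, 5.0), train_xs, train_ys))
print(local_predict((0.2, 0.1), train_xs, train_ys))
```

The key design point is that a fresh classifier is built per test image, so its decision boundary only has to be good in the neighbourhood of that image.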

# Face Detection

**《Precise Box Score: Extract More Information from Datasets to Improve the Performance of Face Detection》**

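Abstract: When training a face detection network based on the R-CNN framework, anchors are assigned as positive samples if their intersection-over-union (IoU) with the ground truth is higher than a first threshold (e.g., 0.7), and as negative samples if their IoU is lower than a second threshold (e.g., 0.3). The face detection model is then trained with these labels. However, anchors whose IoU lies between the first and second thresholds are not used. We propose a novel training strategy, Precise Box Score (PBS), to train object detection models. The proposed strategy uses the anchors with an IoU between the first and second thresholds, which consistently improves face detection performance. Our training strategy extracts more information from the datasets and makes better use of existing data. In addition, we introduce a simple but effective model compression method (SEMCM), which further boosts the performance of face detectors. Experimental results show that the performance of face detection networks can be consistently improved with our proposed scheme.

arXiv: https://arxiv.org/abs/1804.10743

Note: Impressive. I wonder how well Precise Box Score would work if applied to generic object detection?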
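
The conventional anchor labeling rule that this abstract starts from is easy to write down. The Python sketch below computes IoU for axis-aligned boxes and assigns positive, negative, or ignored labels using the example thresholds 0.7 and 0.3 from the text; the box format `(x1, y1, x2, y2)`, the function names, and the label encoding (1, 0, -1) are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return 1 (positive), 0 (negative), or -1 (ignored by the conventional rule)."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > pos_thr:
        return 1
    if best < neg_thr:
        return 0
    return -1  # IoU between the two thresholds: the anchors PBS proposes to exploit


gt_boxes = [(0, 0, 10, 10)]
print(label_anchor((0, 0, 10, 10), gt_boxes))    # perfect overlap
print(label_anchor((0, 0, 10, 5), gt_boxes))     # half of the face box
print(label_anchor((20, 20, 30, 30), gt_boxes))  # no overlap
```

The anchors that come back as `-1` are exactly the ones the proposed training strategy puts to use instead of discarding.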

================================================
FILE: 2018/06/19.md
================================================
**2018-06-19**

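Summary: This post has 4 paper briefs, all on object detection, including pedestrian detection, vehicle detection, fingerprint detection, and object tracking.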

# Object Detection

**《Remote Detection of Idling Cars Using Infrared Imaging and Deep Networks》**

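Abstract: Idling vehicles waste energy and pollute the environment through exhaust emission. In some countries, idling a vehicle for more than a predefined duration is prohibited, and law enforcement agencies need to detect idling vehicles automatically. We propose the first automatic system to detect idling cars using infrared (IR) imaging and deep networks.

We rely on the differences in spatio-temporal heat signatures between idling and stopped cars, and monitor the car temperature with a long-wavelength IR camera. We formulate the idling car detection problem as spatio-temporal event detection in IR image sequences and employ deep networks for spatio-temporal modeling. We collected the first IR image sequence dataset for idling car detection. First, we detect cars in each IR image using a convolutional neural network that is pre-trained on regular RGB images and fine-tuned on IR images for higher accuracy. Then, we track the detected cars over time to identify the cars that are parked. Finally, we use the 3D spatio-temporal IR image volume of each parked car as input to convolutional and recurrent networks to classify it as idling or not. We carry out an extensive empirical evaluation of temporal and spatio-temporal modeling approaches with various convolutional and recurrent architectures. We present promising experimental results on our IR image sequence dataset.

arXiv: https://arxiv.org/abs/1804.10805

Note: An idling vehicle is, simply put, a vehicle whose engine is running while it stays in place.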

**《MV-YOLO: Motion Vector-aided Tracking by Semantic Object Detection》**

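Abstract: Object tracking is the cornerstone of many visual analytics systems. While considerable progress has been made in this area in recent years, robust, efficient, and accurate tracking in real-world video remains a challenge. In this paper we present a hybrid tracker that leverages motion information from the compressed video stream and a general-purpose semantic object detector acting on decoded frames to construct a fast and efficient tracking engine suitable for a number of visual analytics applications. The proposed approach is compared with several well-known recent trackers on the OTB tracking dataset. The results indicate advantages of the proposed method in terms of speed and accuracy. Another advantage of the proposed method over most existing trackers is its simplicity and deployment efficiency, which stems from reusing and re-purposing resources and information that may already exist in the system for other reasons.

arXiv: https://arxiv.org/abs/1805.00107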

**《Altered Fingerprints: Detection and Localization》**

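Abstract: Fingerprint alteration, also referred to as an obfuscation presentation attack, is the intentional tampering with or damaging of the real friction ridge patterns to avoid identification by an AFIS. This paper proposes a method for detecting and localizing fingerprint alterations. Our main contributions are: (i) designing and training CNN models on fingerprint images and on minutiae-centered local patches in the image to detect and localize regions of fingerprint alteration, and (ii) training a generative adversarial network (GAN) to synthesize altered fingerprints whose characteristics are similar to those of real altered fingerprints. A successfully trained GAN can alleviate the limited availability of altered fingerprint images for research. A database of 4,815 altered fingerprints from 270 subjects, and an equal number of rolled fingerprint images, is used to train and test our models. The proposed approach achieves a True Detection Rate (TDR) of 99.24% at a False Detection Rate (FDR) of 2%, outperforming published results. The altered fingerprint detection and localization model and code, along with the synthetically generated altered fingerprint dataset, will be open-sourced.

arXiv: https://arxiv.org/abs/1805.00911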

**《Real-Time Human Detection as an Edge Service Enabled by a Lightweight CNN》**

IEEE EDGE 2018

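Abstract: Edge computing allows more computing tasks to take place on decentralized nodes at the edge of the network. Today, many delay-sensitive, mission-critical applications can leverage these edge devices to reduce time delay, or even to enable real-time, online decision making thanks to their on-site presence. Human detection, behavior recognition, and prediction in smart surveillance fall into this category, where transferring a huge volume of video streaming data can cost valuable time and place a heavy burden on communication networks. It is widely recognized that video processing and object detection are computationally intensive and too expensive to be handled by resource-limited edge devices. Inspired by depthwise separable convolution and the Single Shot Multi-Box Detector (SSD), this paper introduces a lightweight convolutional neural network (LCNN). By narrowing down the classifier's search space to focus on human objects in surveillance video frames, the proposed LCNN algorithm is able to detect pedestrians with a computational workload that is affordable for an edge device. A prototype has been implemented on an edge node (Raspberry Pi 3) using the OpenCV libraries, and satisfactory performance is achieved using real-world surveillance video streams. The experimental study validates the design of LCNN and shows that it is a promising approach for computation-intensive applications at the edge.

arXiv: https://arxiv.org/abs/1805.00330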
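
Why depthwise separable convolution suits resource-limited edge devices can be seen from a simple parameter count. The Python sketch below compares a standard convolution layer with its depthwise separable counterpart; bias terms are ignored, and the layer size (3x3 kernel, 32 input channels, 64 output channels) is an arbitrary example, not a layer taken from the LCNN design.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: one kernel x kernel x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise (per-channel) convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


standard = standard_conv_params(3, 32, 64)
separable = depthwise_separable_params(3, 32, 64)
print(standard, separable, round(standard / separable, 2))
```

For this example layer the separable version needs roughly an eighth of the weights, and the multiply-accumulate count per output position shrinks by the same factor.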

================================================
FILE: 2018/06/23.md
================================================
**2018-06-23**

This post has 4 paper briefs, all CVPR 2018 papers, covering zero-shot learning, image synthesis, and image translation.

# Zero-Shot Learning

**《Sketch-a-Classifier: Sketch-based Photo Classifier Generation》**

**CVPR 2018 Spotlight**

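Abstract: Contemporary deep learning techniques have made image recognition a reasonably reliable technology. However, training effective photo classifiers typically requires a large number of examples, which limits the scalability of image recognition and its applicability to scenarios where images may not be available. This has motivated zero-shot learning, which addresses the issue via knowledge transfer from other modalities such as text. In this paper we investigate an alternative approach to synthesizing image classifiers: almost directly from a user's imagination, via free-hand sketch. This approach doesn't require the category to be nameable or describable via attributes as per zero-shot learning. We achieve this by training a model regression network to map from free-hand sketch space to the space of photo classifiers. It turns out that this mapping can be learned in a category-agnostic way, allowing photo classifiers for new categories to be synthesized by a user with no need for annotated training photos. We also demonstrate that this modality of classifier generation can be used to enhance the granularity of an existing photo classifier, or as a complement to name-based zero-shot learning.

arXiv: https://arxiv.org/abs/1804.11182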



# Image Synthesis

**《Conditional Image-to-Image Translation》**

**CVPR 2018 Poster**

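Abstract: Generative adversarial networks (GANs) and dual learning have been widely applied to image-to-image translation tasks. However, existing models lack the ability to control the translated results in the target domain, and their results usually lack diversity, in the sense that a fixed image usually leads to an (almost) deterministic translation result. In this paper we study a new problem, conditional image-to-image translation, which is to translate an image from the source domain to the target domain conditioned on a given image in the target domain. It requires that the generated image inherit some domain-specific features of the conditional image from the target domain. Changing the conditional image in the target domain will therefore lead to diverse translation results for a fixed input image from the source domain, so the conditional input image helps to control the translation results. We tackle this problem with unpaired data based on GANs and dual learning. We twist two conditional translation models (one from domain A to domain B, the other from domain B to domain A) together for input combination and reconstruction while preserving domain-independent features. We carry out experiments on translation between men's faces and women's faces, and on edges-to-shoes and edges-to-bags translation. The results demonstrate the effectiveness of our proposed method.

arXiv: https://arxiv.org/abs/1805.00251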



**《Semi-parametric Image Synthesis》**

**CVPR 2018 Oral**

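Abstract: We present a semi-parametric approach to photographic image synthesis from semantic layouts. The approach combines the complementary strengths of parametric and nonparametric techniques. The nonparametric component is a memory bank of image segments constructed from a training set of images. At test time, given a novel semantic layout, the memory bank is used to retrieve photographic references that are provided as source material to a deep network. The synthesis is performed by a deep network that draws on the provided photographic material. Experiments on multiple semantic segmentation datasets show that the presented approach yields considerably more realistic images than recent purely parametric techniques.

arXiv: https://arxiv.org/abs/1804.10992

github: https://github.com/xjqicuhk/SIMS

video: https://www.youtube.com/watch?v=U4Q98lenGLQ&feature=youtu.be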



**《Learning to Sketch with Shortcut Cycle Consistency》**

**CVPR 2018 Poster**

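Abstract: To see is to sketch: free-hand sketching naturally builds a tie between human and machine vision. In this paper we present a novel approach for translating an object photo into a sketch, mimicking the human sketching process. This is an extremely challenging task because the photo and sketch domains differ significantly. Furthermore, human sketches exhibit various levels of sophistication and abstraction even when depicting the same object instance in a reference photo. This means that even if photo-sketch pairs are available, they only provide a weak supervision signal for learning a translation model. Compared with existing supervised approaches that solve the problem of D(E(photo)) -> sketch, where E(⋅) and D(⋅) denote the encoder and decoder respectively, we take advantage of the inverse problem (e.g., D(E(sketch)) -> photo) and combine it with unsupervised learning tasks of within-domain reconstruction, all within a multi-task learning framework. Compared with existing unsupervised approaches based on cycle consistency (i.e., D(E(D(E(photo)))) -> photo), we introduce a shortcut consistency enforced at the encoder bottleneck (e.g., D(E(photo)) -> photo) to exploit the additional self-supervision. Both qualitative and quantitative results show that the proposed model is superior to a number of state-of-the-art alternatives. We also show that the synthetic sketches can be used to train a better fine-grained sketch-based image retrieval (FG-SBIR) model, effectively alleviating the problem of sketch data scarcity.

arXiv: https://arxiv.org/abs/1805.00247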



A classic paper is also worth mentioning here:
**《Image-to-Image Translation with Conditional Adversarial Networks》**

homepage: https://phillipi.github.io/pix2pix/

arXiv: https://arxiv.org/abs/1611.07004

github: https://github.com/phillipi/pix2pix



================================================
FILE: 2018/06/29.md
================================================
**2018-06-29**

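This post has 4 paper briefs, all face-related, covering face recognition, facial expression recognition, facial emotion classification, and facial attribute prediction. One of them is a CVPR 2018 workshop paper.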

**《Robust Face Recognition with Deeply Normalized Depth Images》**

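Abstract: Depth information has been proven useful for face recognition. However, existing depth-image-based face recognition methods still suffer from noisy depth values and varying poses and expressions. In this paper we propose a novel method for normalizing facial depth images to frontal pose and neutral expression and for extracting robust features from the normalized depth images. The method is implemented via two deep convolutional neural networks (DCNNs), a normalization network (NetN) and a feature extraction network (NetF). Given a facial depth image, NetN first converts it to an HHA image, from which the 3D face is reconstructed via a DCNN. NetN then generates a pose-and-expression normalized (PEN) depth image from the reconstructed 3D face. The PEN depth image is finally passed to NetF, which extracts a robust feature representation via another DCNN for face recognition. Our preliminary evaluation results demonstrate the superiority of the proposed method in recognizing faces of arbitrary poses and expressions with depth images.

arXiv: https://arxiv.org/abs/1805.00406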



**《Which Facial Expressions Can Reveal Your Gender? A Study With 3D Faces》**

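Abstract：Humans exhibit rich gender cues in both appearance and behavior. In computer vision, gender cues in facial appearance have been studied extensively, while research on gender recognition based on facial behavior remains rare. In this work, we first demonstrate that facial expressions influence the gender patterns presented in 3D faces, and that gender recognition performance increases when training and testing within the same expression. Further, we design experiments that directly extract the morphological changes resulting from facial expressions as features for expression-based gender recognition. Experimental results show that gender can be recognized with considerable accuracy in Happy and Disgust expressions, while Surprise and Sad expressions do not convey much gender-related information. This is the first work in the literature to investigate expression-based gender classification with 3D faces, and it reveals the strength of the gender patterns contained in different types of expressions, namely Happy, Disgust, Surprise, and Sad.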

arXiv：https://arxiv.org/abs/1805.00371



**《I Know How You Feel: Emotion Recognition with Facial Landmarks》**

CVPR WiCV workshop 2018

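Abstract：Classification of human emotions remains an important and challenging task for many computer vision algorithms, especially in the era of humanoid robots that coexist with humans in their everyday life. Currently proposed methods for emotion recognition solve this task using multi-layered convolutional networks that do not explicitly infer any facial features in the classification phase. In this work, we postulate a fundamentally different approach to the emotion recognition task that relies on incorporating facial landmarks as part of the classification loss function. To that end, we extend the recently proposed Deep Alignment Network (DAN), which achieved state-of-the-art results in a recent facial landmark recognition challenge, with a term related to facial features. Thanks to this simple modification, our model, called EmotionalDAN, is able to outperform state-of-the-art emotion classification methods on two challenging benchmark datasets by up to 5%.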

arXiv：https://arxiv.org/abs/1805.00326



**《A Deep Face Identification Network Enhanced by Facial Attributes Prediction》**

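Abstract：In this paper, we propose a new deep framework that predicts facial attributes and leverages them as a soft modality to improve face identification performance. Our model is an end-to-end framework consisting of a convolutional neural network (CNN) whose output is fanned out into two separate branches; the first branch predicts facial attributes while the second branch identifies face images. Contrary to existing multi-task methods that only use a shared CNN feature space to train these two tasks jointly, we fuse the predicted attributes with the features from the face modality in order to improve face identification performance. Experimental results show that our model benefits both face identification and facial attribute prediction, especially for identity-related facial attributes such as gender. We tested our model on two standard datasets annotated with identities and facial attributes. Experimental results indicate that the proposed model outperforms most existing face identification and attribute prediction methods.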

arXiv：https://arxiv.org/abs/1805.00324

================================================
FILE: 2018/07/02.md
================================================
**2018-07-02**

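This post covers 2 papers, both on image segmentation: semantic segmentation of motion capture data, and sclera segmentation combining FCN and GAN. One is from ACM SIGGRAPH 2018 and the other from BTAS 2018.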

**Image Segmentation**

**《Dilated Temporal Fully-Convolutional Network for Semantic Segmentation of Motion Capture Data》**

ACM SIGGRAPH 2018

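Abstract：Semantic segmentation of motion capture sequences plays a key part in many data-driven motion synthesis frameworks. It is a preprocessing step in which long recordings of motion capture sequences are partitioned into smaller segments. Afterwards, additional methods such as statistical modeling can be applied to each group of structurally similar segments to learn an abstract motion manifold. The segmentation task, however, often remains a manual task, which increases the effort and cost of generating large-scale motion databases. We therefore propose an automatic framework for semantic segmentation of motion capture data using a dilated temporal fully-convolutional network. Our model outperforms a state-of-the-art model in action segmentation, as well as three networks for sequence modeling. We further show that our model is robust against highly noisy training labels.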

arXiv：https://arxiv.org/abs/1806.09174



**《Fully Convolutional Networks and Generative Adversarial Networks Applied to Sclera Segmentation》**

BTAS 2018

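Abstract：Due to the world's demand for security systems, biometrics can be seen as an important topic of research in computer vision. One of the biometric forms that has been gaining attention is recognition based on the sclera. The initial and paramount step for this type of recognition is the segmentation of the region of interest, i.e., the sclera. In this context, this paper presents two approaches based on Fully Convolutional Networks (FCN) and Generative Adversarial Networks (GAN). An FCN is similar to a common convolutional neural network, except that the fully connected layers (i.e., the classification layers) are removed from the end of the network and the output is generated by combining the outputs of different convolutional layers. The GAN is based on game theory, where two networks compete with each other to generate the best segmentation. To allow a fair comparison with baselines and a quantitative and objective evaluation of the proposed approaches, we provide the scientific community with 1,300 new manually segmented images from two databases. The experiments are performed on the UBIRIS.v2 and MICHE databases, and the best-performing configurations of our proposal achieve F-scores of 87.48% and 88.32%, respectively.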

arXiv：https://arxiv.org/abs/1806.08722
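The F-scores quoted above are the harmonic mean of precision and recall over segmented pixels. As a minimal sketch (the flat binary-mask representation and the function name `f_score` are assumptions for illustration, not the paper's evaluation code), the metric can be computed like this:

```python
def f_score(pred, truth):
    """F-score (harmonic mean of precision and recall) for flat binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)        # sclera, predicted sclera
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)    # background, predicted sclera
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)    # sclera, predicted background
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy masks: 1 marks a sclera pixel, 0 marks background.
predicted = [1, 1, 1, 0, 0, 0]
reference = [1, 1, 0, 1, 0, 0]
print(f_score(predicted, reference))
```

Because the score ignores true negatives, a large background region does not inflate it, which suits a small target such as the sclera.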

================================================
FILE: 2018/07/05.md
================================================
**2018-07-05**

This post covers 4 papers, all on GANs, including text-to-image generation and multi-domain image generation. One is from IJCAI 2018.

# GAN

**《Text to Image Synthesis Using Generative Adversarial Networks》**

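Abstract：Generating images from natural language is one of the primary applications of recent conditional generative models. Besides testing our ability to model conditional, high-dimensional distributions, text-to-image synthesis has many exciting and practical applications such as photo editing or computer-aided content creation. Recent progress has been made using Generative Adversarial Networks (GANs). This paper starts with an introduction to these topics and discusses the existing state-of-the-art models. Moreover, it proposes Wasserstein GAN-CLS, a new model for conditional image generation based on the Wasserstein distance, which offers guarantees of stability. It then shows how the novel loss function of Wasserstein GAN-CLS can be used in a Conditional Progressive Growing GAN. In combination with the proposed loss, the model improves by 7.07% the best Inception Score (on the Caltech dataset) of the models that use only sentence-level visual semantics. The only model that performs better than the Conditional Wasserstein Progressive Growing GAN is the recently proposed AttnGAN, which uses word-level visual semantics.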

arXiv：https://arxiv.org/abs/1805.00676

Note: a real heavyweight, a full 72 pages!



**《Transferring GANs: generating images from limited data》**

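Abstract：Transferring the knowledge of pretrained networks to new domains by means of finetuning is a widely used practice for applications based on discriminative models. To the best of our knowledge, this practice has not been studied within the context of generative deep networks. Therefore, we study domain adaptation applied to image generation with generative adversarial networks. We evaluate several aspects of domain adaptation, including the impact of target domain size, the relative distance between source and target domain, and the initialization of conditional GANs. Our results show that using knowledge from pretrained networks can shorten the convergence time and can significantly improve the quality of the generated images, especially when the target data is limited. We show that these conclusions can also be drawn for conditional GANs even when the pretrained model was trained without conditioning. Our results also suggest that density may be more important than diversity, and a dataset with one or a few densely sampled classes may be a better source model than more diverse datasets such as ImageNet or Places.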

arXiv：https://arxiv.org/abs/1805.01677



**《MEGAN: Mixture of Experts of Generative Adversarial Networks for Multimodal Image Generation》**

IJCAI 2018

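Abstract：Recently, generative adversarial networks (GANs) have shown promising performance in generating realistic images. However, they often struggle to learn the complex underlying modalities in a given dataset, resulting in poor-quality generated images. To mitigate this problem, we propose a novel approach called mixture of experts GAN (MEGAN), an ensemble approach of multiple generator networks. Each generator network in MEGAN specializes in generating images with a particular subset of modalities, e.g., an image class. Instead of incorporating a separate step of handcrafted clustering of multiple modalities, our proposed model is trained through end-to-end learning of multiple generators via gating networks, which are responsible for choosing the appropriate generator network for a given condition. We adopt the categorical reparameterization trick so that the categorical decision of selecting a generator is made while maintaining the flow of gradients. We demonstrate that individual generators learn different and salient subparts of the data, and achieve a multiscale structural similarity (MS-SSIM) score of 0.2470 for CelebA and a competitive unsupervised inception score of 8.33 on CIFAR-10.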

arXiv：https://arxiv.org/abs/1805.02481v2
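The gating step in the MEGAN abstract can be illustrated with a small sketch. It assumes that the "categorical reparameterization trick" refers to the Gumbel-softmax relaxation, and the names `gumbel_softmax` and `mix_generators`, the logits, and the temperature are hypothetical choices for illustration, not the authors' code:

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Relaxed one-hot sample over generators (Gumbel-softmax relaxation)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    peak = max(noisy)                                   # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mix_generators(outputs, weights):
    """Weighted sum of the generators' (flattened) outputs."""
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(outputs[0]))]


# Hypothetical gating logits for three generators and their toy outputs.
weights = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=random.Random(0))
image = mix_generators([[1.0, 1.0], [0.0, 2.0], [3.0, 3.0]], weights)
print(weights, image)
```

Lowering the temperature pushes the weights toward a one-hot choice of a single generator, while the expression stays differentiable with respect to the logits in an autodiff framework.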



**《Unpaired Multi-Domain Image Generation via Regularized Conditional GANs》**

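Abstract：In this paper, we study the problem of multi-domain image generation, the goal of which is to generate pairs of corresponding images from different domains. With the recent development of generative models, image generation has achieved great progress and has been applied to various computer vision tasks. However, multi-domain image generation may not achieve the desired performance due to the difficulty of learning the correspondence between images of different domains, especially when the information of paired samples is not given. To tackle this problem, we propose Regularized Conditional GAN (RegCGAN), which is capable of learning to generate corresponding images in the absence of paired training data. RegCGAN is based on the conditional GAN, and we introduce two regularizers to guide the model to learn the corresponding semantics of different domains. We evaluate the proposed model on several tasks for which paired training data is not given, including the generation of edges and photos and the generation of faces with different attributes. The experimental results show that our model can successfully generate corresponding images for all these tasks, while outperforming the baseline methods. We also introduce an approach for applying RegCGAN to unsupervised domain adaptation.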

arXiv：https://arxiv.org/abs/1805.02456

================================================
FILE: 2018/07/06.md
================================================
**2018-07-06**

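This post covers 2 papers, both on object detection. One is RefineDet, which combines ideas from SSD, RPN, and FPN; the other is DES, which improves on the SSD network. Note that both are CVPR 2018 papers.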

# Object Detection

《Single-Shot Refinement Neural Network for Object Detection》

CVPR 2018

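Abstract：For object detection, two-stage approaches (e.g., Faster R-CNN) have been achieving the highest accuracy, whereas one-stage approaches (e.g., SSD) have the advantage of high efficiency. To inherit the merits of both while overcoming their disadvantages, in this paper we propose a novel single-shot based detector, called RefineDet, that achieves better accuracy than two-stage methods and maintains the efficiency of one-stage methods. RefineDet consists of two inter-connected modules, namely the anchor refinement module and the object detection module. Specifically, the former aims to (1) filter out negative anchors to reduce the search space for the classifier, and (2) coarsely adjust the locations and sizes of anchors to provide better initialization for the subsequent regressor. The latter module takes the refined anchors from the former as input to further improve the regression and predict multi-class labels. Meanwhile, we design a transfer connection block to transfer the features in the anchor refinement module to predict locations, sizes, and class labels of objects in the object detection module. The multi-task loss function enables us to train the whole network in an end-to-end way. Extensive experiments on PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO demonstrate that RefineDet achieves state-of-the-art detection accuracy with high efficiency.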

arXiv：https://arxiv.org/abs/1711.06897

github：https://github.com/sfzhang15/RefineDet

Note: a close-reading post on this paper will follow!

《Single-Shot Object Detection with Enriched Semantics》

CVPR 2018

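Abstract：We propose a novel single-shot object detection network named Detection with Enriched Semantics (DES). Our motivation is to enrich the semantics of object detection features within a typical deep detector, by a semantic segmentation branch and a global activation module. The segmentation branch is supervised by weak segmentation ground truth, i.e., no extra annotation is required. In conjunction with that, we employ a global activation module that learns the relationship between channels and object classes in a self-supervised manner. Comprehensive experimental results on both the PASCAL VOC and MS COCO detection datasets demonstrate the effectiveness of the proposed method. In particular, with a VGG16-based DES, we achieve an mAP of 81.7 on VOC2007 test and an mAP of 32.8 on COCO test-dev, with an inference speed of 31.5 milliseconds per image on a Titan Xp GPU. With a lower-resolution version, we achieve an mAP of 79.7 on VOC2007 with an inference speed of 13.0 milliseconds per image.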

arXiv：https://arxiv.org/abs/1712.00433

Note: a close-reading post on this paper will follow!

================================================
FILE: 2018/07/07.md
================================================
**2018-07-07**

This post covers 2 papers, both on image segmentation: one proposes the CCB-Cut cost and the other improves on FCNs. Note that both are CVPR 2018 papers.

# Image Segmentation


**《Compassionately Conservative Balanced Cuts for Image Segmentation》**

CVPR 2018

Abstract：The Normalized Cut (NCut) objective function, widely used in data clustering and image segmentation, quantifies the cost of graph partitioning in a way that biases clusters or segments that are balanced towards having lower values than unbalanced partitionings. However, this bias is so strong that it avoids any singleton partitions, even when vertices are very weakly connected to the rest of the graph. Motivated by the Bühler-Hein family of balanced cut costs, we propose the family of Compassionately Conservative Balanced (CCB) Cut costs, which are indexed by a parameter that can be used to strike a compromise between the desire to avoid too many singleton partitions and the notion that all partitions should be balanced. We show that CCB-Cut minimization can be relaxed into an orthogonally constrained ℓτ-minimization problem that coincides with the problem of computing Piecewise Flat Embeddings (PFE) for one particular index value, and we present an algorithm for solving the relaxed problem by iteratively minimizing a sequence of reweighted Rayleigh quotients (IRRQ). Using images from the BSDS500 database, we show that image segmentation based on CCB-Cut minimization provides better accuracy with respect to ground truth and greater variability in region size than NCut-based image segmentation.

arXiv：https://arxiv.org/abs/1803.09903
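The bias against singleton partitions described in the abstract can be checked numerically. The sketch below uses the standard two-way NCut cost, cut(A, B)/vol(A) + cut(A, B)/vol(B), on a small hypothetical graph; the graph, the helper names, and the edge weights are assumptions for illustration, and the CCB-Cut cost itself is not implemented here:

```python
def build_graph(n, edges):
    """Symmetric weight matrix from (i, j, weight) triples."""
    w = [[0.0] * n for _ in range(n)]
    for i, j, weight in edges:
        w[i][j] = w[j][i] = weight
    return w


def ncut_cost(weights, part_a):
    """Two-way Normalized Cut cost of splitting the vertices into A and its complement."""
    n = len(weights)
    a = set(part_a)
    b = set(range(n)) - a
    cut = sum(weights[i][j] for i in a for j in b)             # weight crossing the partition
    vol_a = sum(weights[i][j] for i in a for j in range(n))    # total degree inside A
    vol_b = sum(weights[i][j] for i in b for j in range(n))    # total degree inside B
    return cut / vol_a + cut / vol_b


# Two triangles joined by a weak edge, plus vertex 6 hanging on by a very weak edge.
graph = build_graph(7, [
    (0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
    (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
    (2, 3, 0.5), (5, 6, 0.01),
])
print(ncut_cost(graph, [6]))        # isolating the weakly attached vertex
print(ncut_cost(graph, [0, 1, 2]))  # splitting the two triangles
```

For a vertex with no self-loop, the singleton's cut equals its volume, so the first term is 1 regardless of how weak the connection is. That is why NCut prefers the balanced split here even though vertex 6 is barely attached.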



**《Quantization of Fully Convolutional Networks for Accurate Biomedical Image Segmentation》**

CVPR 2018

Abstract：With pervasive applications of medical imaging in health-care, biomedical image segmentation plays a central role in quantitative analysis, clinical diagnosis, and medical intervention. Since manual annotation suffers from limited reproducibility, arduous efforts, and excessive time, automatic segmentation is desired to process increasingly larger scale histopathological data. Recently, deep neural networks (DNNs), particularly fully convolutional networks (FCNs), have been widely applied to biomedical image segmentation, attaining much improved performance. At the same time, quantization of DNNs has become an active research topic, which aims to represent weights with less memory (precision) to considerably reduce memory and computation requirements of DNNs while maintaining acceptable accuracy. In this paper, we apply quantization techniques to FCNs for accurate biomedical image segmentation. Unlike existing literature on quantization which primarily targets memory and computation complexity reduction, we apply quantization as a method to reduce overfitting in FCNs for better accuracy. Specifically, we focus on a state-of-the-art segmentation framework, suggestive annotation [22], which judiciously extracts representative annotation samples from the original training dataset, obtaining an effective small-sized balanced training dataset. We develop two new quantization processes for this framework: (1) suggestive annotation with quantization for highly representative training samples, and (2) network training with quantization for high accuracy. Extensive experiments on the MICCAI Gland dataset show that both quantization processes can improve the segmentation performance, and our proposed method exceeds the current state-of-the-art performance by up to 1%. In addition, our method has a reduction of up to 6.4x on memory usage.

arXiv：https://arxiv.org/abs/1803.04907
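To make "representing weights with less memory (precision)" concrete, here is a minimal sketch of generic uniform quantization. The abstract does not specify the paper's actual quantization scheme, so the uniform grid, the function names, and the 32-bit baseline are all assumptions for illustration:

```python
def quantize_uniform(weights, num_bits):
    """Snap each weight to the nearest of 2**num_bits evenly spaced levels."""
    levels = 2 ** num_bits - 1
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)                      # nothing to quantize
    step = (hi - lo) / levels
    return [lo + round((w - lo) / step) * step for w in weights]


def compression_ratio(num_bits, full_precision_bits=32):
    """Memory saved by storing num_bits codes instead of full-precision values."""
    return full_precision_bits / num_bits


weights = [-0.8, -0.1, 0.0, 0.35, 0.7]
print(quantize_uniform(weights, 2))   # at most four distinct values remain
print(compression_ratio(8))
```

The rounding error per weight is bounded by half a step, which acts like a small perturbation of the weights; the abstract's claim is that this kind of constraint can also reduce overfitting.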

================================================
FILE: 2018/07/19.md
================================================
**2018-07-19**

This post covers 2 papers, both from ECCV 2018: one on semantic segmentation and the other on depth prediction.

# Semantic Segmentation

**《Effective Use of Synthetic Data for Urban Scene Semantic Segmentation》**

ECCV 2018

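Abstract：Training a deep network to perform semantic segmentation requires large amounts of labeled data. To alleviate the manual effort of annotating real images, researchers have investigated the use of synthetic data, which can be labeled automatically. Unfortunately, a network trained on synthetic data performs relatively poorly on real images. While this can be addressed by domain adaptation, existing methods all require access to real images during training. In this paper, we introduce a drastically different way to handle synthetic images that does not require seeing any real images at training time. Our approach builds on the observation that foreground and background classes are not affected in the same manner by the domain shift, and thus should be treated differently. In particular, the former should be handled in a detection-based manner to better account for the fact that, while their texture in synthetic images is not photo-realistic, their shape looks natural. Our experiments demonstrate the effectiveness of our approach on Cityscapes and CamVid with models trained on synthetic data only.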

arXiv：https://arxiv.org/abs/1807.06132

Note: domain adaptation is a very hot topic lately!

# Stereo

**《ActiveStereoNet: End-to-End Self-Supervised Learning for Active Stereo Systems》**

ECCV 2018

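Abstract：In this paper we present ActiveStereoNet, the first deep learning solution for active stereo systems. Due to the lack of ground truth, our method is fully self-supervised, yet it produces precise depth with a subpixel precision of 1/30th of a pixel; it does not suffer from the common over-smoothing issues; it preserves edges; and it explicitly handles occlusions. We introduce a novel reconstruction loss that is more robust to noise and textureless patches, and is invariant to illumination changes. The proposed loss is optimized using a window-based cost aggregation with an adaptive support weight scheme. This cost aggregation is edge-preserving and smooths the loss function, which is key to allowing the network to reach compelling results. Finally, we show how the task of predicting invalid regions, such as occlusions, can be trained end-to-end without ground truth. This component is crucial for reducing blur and particularly improves predictions along depth discontinuities. Extensive quantitative and qualitative evaluations on real and synthetic data demonstrate state-of-the-art results in many challenging scenes.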

arXiv：https://arxiv.org/abs/1807.06009

================================================
FILE: 2018/07/23.md
================================================
**2018-07-23**

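This post introduces two recent ECCV 2018 papers: one proposes the Convolutional Block Attention Module, which can be integrated seamlessly into any CNN architecture; the other uses a GAN for multi-view 3D reconstruction.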

# CNN

**《CBAM: Convolutional Block Attention Module》**

**ECCV 2018**

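Abstract：We propose the Convolutional Block Attention Module (CBAM), a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, our module sequentially infers attention maps along two separate dimensions, channel and spatial, and the attention maps are then multiplied with the input feature map for adaptive feature refinement. Because CBAM is a lightweight and general module, it can be integrated into any CNN architecture seamlessly with negligible overhead and is end-to-end trainable along with the base CNN. We validate CBAM through extensive experiments on the ImageNet-1K, MS COCO detection, and VOC 2007 detection datasets. Our experiments show improvements in classification and detection performance across various models, demonstrating the wide applicability of CBAM. The code and models will be made publicly available.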

arXiv：https://arxiv.org/abs/1807.06521

Note: a great paper; it should help a lot of students write (or pad out) their own papers.
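The "channel first, then spatial" refinement in the CBAM abstract can be sketched in plain Python. The pooling choices (average and max pooling, a shared two-layer MLP for the channel branch, a convolution over the pooled maps for the spatial branch) follow the CBAM paper as I understand it; the weights `w1`, `w2`, and `kernel` are hypothetical stand-ins for learned parameters, and the 3x3 kernel is used here only for brevity:

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(feat, w1, w2):
    """One weight per channel from avg- and max-pooled channel descriptors."""
    def shared_mlp(vec):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]  # ReLU layer
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    mx = [max(map(max, ch)) for ch in feat]
    return [sigmoid(a + m) for a, m in zip(shared_mlp(avg), shared_mlp(mx))]


def spatial_attention(feat, kernel):
    """One weight per location from channel-wise avg and max maps."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    avg = [[sum(feat[k][i][j] for k in range(c)) / c for j in range(w)] for i in range(h)]
    mx = [[max(feat[k][i][j] for k in range(c)) for j in range(w)] for i in range(h)]
    pad = len(kernel[0]) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for m, pooled in enumerate((avg, mx)):          # two input maps for the conv
                for di, krow in enumerate(kernel[m]):
                    for dj, kval in enumerate(krow):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= x < w:       # zero padding at the border
                            total += kval * pooled[y][x]
            out[i][j] = sigmoid(total)
    return out


def cbam(feat, w1, w2, kernel):
    """Apply channel attention, then spatial attention, to a C x H x W feature map."""
    ca = channel_attention(feat, w1, w2)
    feat = [[[ca[k] * v for v in row] for row in ch] for k, ch in enumerate(feat)]
    sa = spatial_attention(feat, kernel)
    return [[[sa[i][j] * v for j, v in enumerate(row)] for i, row in enumerate(ch)]
            for ch in feat]


features = [[[1.0, 2.0], [3.0, 4.0]], [[-1.0, 0.5], [0.0, 2.0]]]   # 2 channels, 2 x 2
refined = cbam(features,
               w1=[[0.5, -0.5]],                  # 2 channels -> 1 hidden unit
               w2=[[1.0], [-1.0]],                # 1 hidden unit -> 2 channels
               kernel=[[[0.1] * 3 for _ in range(3)] for _ in range(2)])
print(refined)
```

Both attention maps lie in (0, 1), so the module only rescales the existing features, which is consistent with it being a drop-in block that keeps the feature map's shape.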

# Multi-View Reconstruction

**《Specular-to-Diffuse Translation for Multi-View Reconstruction》**

**ECCV 2018** 

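Abstract：Most multi-view 3D reconstruction algorithms, especially when shape-from-shading cues are used, assume that object appearance is predominantly diffuse. To alleviate this restriction, we introduce S2Dnet, a generative adversarial network for transferring multiple views of objects with specular reflection into diffuse ones, so that multi-view reconstruction methods can be applied more effectively. Our network extends unsupervised image-to-image translation to multi-view "specular to diffuse" translation. To preserve object appearance across multiple views, we introduce a Multi-View Coherence loss (MVC) that evaluates the similarity and faithfulness of local patches after the view transformation. Our MVC loss ensures that the similarity of local correspondences among multi-view images is preserved under the image-to-image translation. As a result, our network yields significantly better results than several single-view baseline techniques. In addition, we carefully design and generate a large synthetic training dataset using physically-based rendering. During testing, our network takes only the raw glossy images as input, without extra information such as segmentation masks or lighting estimation. Results demonstrate that multi-view reconstruction can be significantly improved using the images filtered by our network. We also show strong performance on real-world training and testing data.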

arXiv：https://arxiv.org/abs/1807.05439

================================================
FILE: 2018/07/27.md
================================================
**2018-07-27**

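This post introduces two recent ECCV 2018 papers: one models the visual context around objects to augment object detection datasets; the other proposes an integrated Bayesian model that coherently reasons about the observed images, identities, partial knowledge about names, and the situational context of each observation.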

# Data Augmentation

**《Modeling Visual Context is Key to Augmenting Object Detection Datasets》**

ECCV 2018

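Abstract：Data augmentation for deep neural networks is well known to be important for training visual recognition systems. By artificially increasing the number of training examples, it helps reduce overfitting and improves generalization. For object detection, classical approaches to data augmentation consist of generating images obtained by basic geometrical transformations and color changes of the original training images. In this work, we go one step further and leverage segmentation annotations to increase the number of object instances present in the training data. For this approach to be successful, we show that appropriately modeling the visual context surrounding objects is crucial to place them in the right environment. Otherwise, we find that the previous strategy actually hurts. With our context model, we achieve significant mean average precision improvements when few labeled examples are available on the VOC'12 benchmark.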

arXiv：https://arxiv.org/abs/1807.07428

# Face Recognition

**《From Face Recognition to Models of Identity: A Bayesian Approach to Learning about Unknown Identities from Unsupervised Data》**

ECCV 2018

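Abstract：Current face recognition systems robustly recognize identities across a wide variety of imaging conditions. In these systems, recognition is performed via classification into known identities obtained from supervised identity annotations. There are two problems with this current paradigm: (1) current systems are unable to benefit from unlabelled data which may be available in large quantities; and (2) current systems equate successful recognition with labelling a given input image. Humans, on the other hand, regularly identify individuals in a completely unsupervised way, recognising the identity of someone they have seen before even without being able to name that individual. How can we go beyond the current classification paradigm towards a more human understanding of identities? We propose an integrated Bayesian model that coherently reasons about the observed images, identities, partial knowledge about names, and the situational context of each observation. While our model achieves good recognition performance against known identities, it can also discover new identities from unsupervised data and learns to associate identities with different contexts depending on which identities tend to be observed together. In addition, the proposed semi-supervised component is able to handle not only acquaintances, whose names are known, but also unlabelled familiar faces and complete strangers in a unified framework.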

arXiv：https://arxiv.org/abs/1807.07872

================================================
FILE: 2018/07/31.md
================================================
**2018-07-31**

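This post introduces two recent ECCV 2018 papers: one proposes semi-convolutional operators, among other contributions, to improve Mask RCNN; the other proposes CrossNet, an end-to-end, fully convolutional deep neural network that uses cross-scale warping for super-resolution.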

# Instance Segmentation

**《Semi-convolutional Operators for Instance Segmentation》**

ECCV 2018

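Abstract：Object detection and instance segmentation are dominated by region-based methods such as Mask RCNN. However, there is a growing interest in reducing these problems to pixel labeling tasks, as the latter could be more efficient, could be integrated seamlessly into the image-to-image network architectures used in many other tasks, and could be more accurate for objects that are not well approximated by bounding boxes. In this paper we show theoretically and empirically that constructing dense pixel embeddings that can separate object instances cannot be easily achieved using convolutional operators. At the same time, we show that simple modifications, which we call semi-convolutional, perform much better at this task. We show that these operators can also be used to improve approaches such as Mask RCNN, demonstrating better segmentation of complex biological shapes and PASCAL VOC categories than achievable by Mask RCNN alone.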

arXiv：https://arxiv.org/abs/1807.10712

# Super Resolution


**《CrossNet: An End-to-end Reference-based Super Resolution Network using Cross-scale Warping》**

ECCV 2018

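Abstract：Reference-based super-resolution (RefSR) super-resolves a low-resolution (LR) image given an external high-resolution (HR) reference image, where the reference image and the LR image share a similar viewpoint but with a significant resolution gap (8x). Existing RefSR methods work in a cascaded way, such as patch matching followed by a synthesis pipeline with two independently defined objective functions, leading to inter-patch misalignment, grid effects, and inefficient optimization. To resolve these issues, we present CrossNet, an end-to-end and fully convolutional deep neural network using cross-scale warping. Our network contains image encoders, cross-scale warping layers, and a fusion decoder: the encoders extract multi-scale features from both the LR and the reference images; the cross-scale warping layers spatially align the reference feature map with the LR feature map; the decoder finally aggregates feature maps from both domains to synthesize the HR output. Using cross-scale warping, our network is able to perform spatial alignment at the pixel level in an end-to-end fashion, which improves on existing schemes in both precision (around 2dB-4dB) and efficiency (more than 100 times faster).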

arXiv：https://arxiv.org/abs/1807.10547

================================================
FILE: 2018/08/03.md
================================================
**2018-08-03**

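This post introduces two recent ECCV 2018 papers: one proposes a new convolutional neural network (CNN) based density estimation approach to crowd counting in images; the other proposes StereoNet, an end-to-end deep architecture for real-time stereo matching that achieves depth prediction with sub-pixel matching precision.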

# Crowd Counting

**《Iterative Crowd Counting》**

**ECCV 2018**

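Abstract：In this work, we tackle the problem of crowd counting in images. We present a Convolutional Neural Network (CNN) based density estimation approach to solve this problem. Predicting a high-resolution density map in one go is a challenging task. Hence, we present a two-branch CNN architecture for generating high-resolution density maps, where the first branch generates a low-resolution density map, and the second branch incorporates the low-resolution prediction and feature maps from the first branch to generate a high-resolution density map. We also propose a multi-stage extension of our approach where each stage in the pipeline utilizes the predictions from all the previous stages. Empirical comparison with the previous state-of-the-art crowd counting methods shows that our method achieves the lowest mean absolute error on three challenging crowd counting benchmarks: the Shanghaitech, WorldExpo'10, and UCF datasets.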

arXiv：https://arxiv.org/abs/1807.09959
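In density-based crowd counting, the predicted count for an image is the sum of its density map, and the mean absolute error is taken over per-image counts. A minimal sketch with made-up numbers (the function names and the toy values are assumptions for illustration):

```python
def count_from_density(density_map):
    """Estimated number of people: the density map summed over all pixels."""
    return sum(sum(row) for row in density_map)


def mean_absolute_error(pred_counts, true_counts):
    """Average absolute difference between predicted and true per-image counts."""
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / len(true_counts)


# Two toy "images": each density map integrates to the estimated head count.
density_maps = [
    [[0.5, 0.5], [1.0, 0.0]],
    [[2.5, 2.5], [2.5, 2.5]],
]
predicted = [count_from_density(d) for d in density_maps]
print(predicted, mean_absolute_error(predicted, [3, 8]))
```

Because the count depends only on the sum, a low-resolution density map can already give a reasonable count; the higher-resolution branch mainly adds spatial detail.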

# Depth Prediction

**《StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction》**

**ECCV 2018**

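Abstract：This paper presents StereoNet, the first end-to-end deep architecture for real-time stereo matching that runs at 60 fps on an NVidia Titan X, producing high-quality, edge-preserved, quantization-free disparity maps. A key insight of this paper is that the network achieves a sub-pixel matching precision that is an order of magnitude higher than that of traditional stereo matching approaches. This allows us to achieve real-time performance by using a very low resolution cost volume that encodes all the information needed to achieve high disparity precision. Spatial precision is achieved by employing a learned edge-aware upsampling function. Our model uses a Siamese network to extract features from the left and right images. A first estimate of the disparity is computed in a very low resolution cost volume, and the model then hierarchically re-introduces high-frequency details through a learned upsampling function that uses compact pixel-to-pixel refinement networks. Leveraging color input as a guide, this function is capable of producing high-quality edge-aware outputs. We achieve the best results on multiple benchmarks.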

arXiv：https://arxiv.org/abs/1807.08865

Note: wow, real-time stereo matching!

================================================
FILE: 2018/08/07.md
================================================
**2018-08-07**

This post introduces two recent ECCV 2018 papers: one proposes a new convolutional mesh autoencoder for generating 3D faces; the other proposes RFNet for image captioning.

# 3D Face

**《Generating 3D faces using Convolutional Mesh Autoencoders》**

**ECCV 2018**

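Abstract：3D representations of human faces are useful for computer vision problems such as 3D face tracking and reconstruction from images, as well as graphics applications such as character generation and animation. Traditional models learn a latent representation of a face using linear subspaces or higher-order tensor generalizations. Due to this linearity, they cannot capture extreme deformations and non-linear expressions. To address this, we introduce a versatile model that learns a non-linear representation of a face using spectral convolutions on a mesh surface. We introduce mesh sampling operations that enable a hierarchical mesh representation that captures non-linear variations in shape and expression at multiple scales within the model. In a variational setting, our model samples diverse realistic 3D faces from a multivariate Gaussian distribution. Our training data consists of 20,466 meshes of extreme expressions captured over 12 different subjects. Despite limited training data, our trained model outperforms state-of-the-art face models with 50% lower reconstruction error, while using 75% fewer parameters. We also show that replacing the expression space of an existing state-of-the-art face model with our autoencoder achieves a lower reconstruction error.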

arXiv：https://arxiv.org/abs/1807.10267

github：https://github.com/anuragranj/coma

# Image Captioning

**《Recurrent Fusion Network for Image Captioning》**

**ECCV 2018** 

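Abstract：Recently, much progress has been made in image captioning, and an encoder-decoder framework has been adopted by all the state-of-the-art models. Under this framework, an input image is encoded by a convolutional neural network (CNN) and then translated into natural language with a recurrent neural network (RNN). The existing models relying on this framework employ only one kind of CNN, e.g., ResNet or Inception-X, which describes image contents from only one specific viewpoint. Thus, the semantic meaning of an input image cannot be comprehensively understood, which restricts captioning performance. In this paper, in order to exploit the complementary information from multiple encoders, we propose a novel recurrent fusion network (RFNet) for image captioning. The fusion process in our model can exploit the interactions among the outputs of the image encoders and then generate new compact yet informative representations for the decoder. Experiments on the MSCOCO dataset demonstrate the effectiveness of our proposed RFNet, which sets a new state of the art for image captioning.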

arXiv：https://arxiv.org/abs/1807.09986

Note: image captioning is quite interesting, a perfect marriage of CNNs and RNNs!


================================================
FILE: 2018/08/11.md
================================================
**2018-08-11**

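This post introduces two recent ECCV 2018 papers: one proposes a new network based on disentangled representations for image-to-image translation; the other proposes SPG masks, which effectively produce high-quality object localization maps.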

# Image to Image Translation

**《Diverse Image-to-Image Translation via Disentangled Representations》**

**ECCV 2018（oral）**

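Abstract：Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for many applications: 1) the lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representations for producing diverse outputs without paired training images. To achieve diversity, we propose to embed images into two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time. To handle unpaired training data, we introduce a novel cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative comparisons, we measure realism with a user study and diversity with a perceptual distance metric. We apply the proposed model to domain adaptation and show competitive performance compared with the state of the art on the MNIST-M and LineMod datasets.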

arXiv：https://arxiv.org/abs/1808.00948

homepage：http://vllab.ucmerced.edu/hylee/DRIT/

github：https://github.com/HsinYingLee/DRIT

# Object Localization

**《Self-produced Guidance for Weakly-supervised Object Localization》**

**ECCV 2018**

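Abstract：Weakly supervised methods usually generate localization results based on attention maps produced by classification networks. However, the attention maps highlight only the most discriminative parts of the object, which are small and sparse. We propose to generate Self-produced Guidance (SPG) masks, which separate the foreground, the object of interest, from the background to provide the classification networks with spatial correlation information of pixels. A stagewise approach is proposed to incorporate high-confidence object regions to learn the SPG masks. The high-confidence regions within attention maps are utilized to progressively learn the SPG masks. The masks are then used as auxiliary pixel-level supervision to facilitate the training of classification networks. Extensive experiments on ILSVRC demonstrate that SPG is effective in producing high-quality object localization maps. In particular, the proposed SPG achieves a Top-1 localization error rate of 43.83% on the ILSVRC validation set, which is a new state-of-the-art error rate.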

arXiv：https://arxiv.org/abs/1807.08902


================================================
FILE: 2018/08/15.md
================================================
**2018-08-15**

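This post introduces two recent ECCV 2018 papers: one proposes a novel Motion Transformation Variational Auto-Encoder (MT-VAE) for learning motion sequence generation; the other uses FiLM to condition image-based convolutional network computation on language for visual reasoning.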

# VAE

**《MT-VAE: Learning Motion Transformations to Generate Multimodal Human Dynamics》**

**ECCV 2018**

Abstract：Long-term human motion can be represented as a series of motion modes---motion sequences that capture short-term temporal dynamics---with transitions between them. We leverage this structure and present a novel Motion Transformation Variational Auto-Encoders (MT-VAE) for learning motion sequence generation. Our model jointly learns a feature embedding for motion modes (that the motion sequence can be reconstructed from) and a feature transformation that represents the transition of one motion mode to the next motion mode. Our model is able to generate multiple diverse and plausible motion sequences in the future from the same input. We apply our approach to both facial and full body motion, and demonstrate applications like analogy-based motion transfer and video synthesis.

arXiv：https://arxiv.org/abs/1808.04545

# Visual Reasoning

**《Visual Reasoning with Multi-hop Feature Modulation》**

**ECCV 2018**

Abstract：Recent breakthroughs in computer vision and natural language processing have spurred interest in challenging multi-modal tasks such as visual question-answering and visual dialogue. For such tasks, one successful approach is to condition image-based convolutional network computation on language via Feature-wise Linear Modulation (FiLM) layers, i.e., per-channel scaling and shifting. We propose to generate the parameters of FiLM layers going up the hierarchy of a convolutional network in a multi-hop fashion rather than all at once, as in prior work. By alternating between attending to the language input and generating FiLM layer parameters, this approach is better able to scale to settings with longer input sequences such as dialogue. We demonstrate that multi-hop FiLM generation achieves state-of-the-art for the short input sequence task ReferIt --- on-par with single-hop FiLM generation --- while also significantly outperforming prior state-of-the-art and single-hop FiLM generation on the GuessWhat?! visual dialogue task.

arXiv：https://arxiv.org/abs/1808.04446

Note: Amusi thinks that combining CV and NLP has great research significance and promise.
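The FiLM operation mentioned in the abstract is a per-channel scale and shift of a feature map. In the model, the scale (gamma) and shift (beta) values are generated from the language input; in the minimal sketch below they are hard-coded, hypothetical numbers, and the function name `film` is an assumption for illustration:

```python
def film(feature_maps, gammas, betas):
    """Feature-wise linear modulation: channel c becomes gamma_c * x + beta_c."""
    return [[[g * v + b for v in row] for row in channel]
            for channel, g, b in zip(feature_maps, gammas, betas)]


# Two channels of a 2 x 2 feature map, modulated with one (gamma, beta) pair each.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[1.0, 1.0], [1.0, 1.0]]]
modulated = film(features, gammas=[2.0, 0.0], betas=[1.0, -1.0])
print(modulated)
```

In the multi-hop variant the abstract describes, the (gamma, beta) pairs for successive layers would be produced one step at a time while attending to the language input, rather than all at once.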

================================================
FILE: 2018/08/25.md
================================================
**2018-08-21**

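This post introduces two recent ECCV 2018 papers: one proposes a weakly and semi-supervised framework for semantic segmentation with an unlimited number of labels; the other uses a stereo matching network as a proxy to learn depth from synthetic data, and uses the predicted stereo disparity maps to supervise a monocular depth estimation network.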

# Semantic Segmentation

**《Concept Mask: Large-Scale Segmentation from Semantic Concepts》**

**ECCV 2018**

Abstract：Existing works on semantic segmentation typically consider a small number of labels, ranging from tens to a few hundreds. With a large number of labels, training and evaluation of such task become extremely challenging due to correlation between labels and lack of datasets with complete annotations. We formulate semantic segmentation as a problem of image segmentation given a semantic concept, and propose a novel system which can potentially handle an unlimited number of concepts, including objects, parts, stuff, and attributes. We achieve this using a weakly and semi-supervised framework leveraging multiple datasets with different levels of supervision. We first train a deep neural network on a 6M stock image dataset with only image-level labels to learn visual-semantic embedding on 18K concepts. Then, we refine and extend the embedding network to predict an attention map, using a curated dataset with bounding box annotations on 750 concepts. Finally, we train an attention-driven class agnostic segmentation network using an 80-category fully annotated dataset. We perform extensive experiments to validate that the proposed system performs competitively to the state of the art on fully supervised concepts, and is capable of producing accurate segmentations for weakly learned and unseen concepts.

arXiv：https://arxiv.org/abs/1808.06032

# Monocular Depth Estimation

**《Learning Monocular Depth by Distilling Cross-domain Stereo Networks》**

**ECCV 2018**

Abstract：Monocular depth estimation aims at estimating a pixelwise depth map for a single image, which has wide applications in scene understanding and autonomous driving. Existing supervised and unsupervised methods face great challenges. Supervised methods require large amounts of depth measurement data, which are generally difficult to obtain, while unsupervised methods are usually limited in estimation accuracy. Synthetic data generated by graphics engines provide a possible solution for collecting large amounts of depth data. However, the large domain gaps between synthetic and realistic data make directly training with them challenging. In this paper, we propose to use the stereo matching network as a proxy to learn depth from synthetic data and use predicted stereo disparity maps for supervising the monocular depth estimation network. Cross-domain synthetic data could be fully utilized in this novel framework. Different strategies are proposed to ensure learned depth perception capability well transferred across different domains. Our extensive experiments show state-of-the-art results of monocular depth estimation on KITTI dataset.

arXiv：https://arxiv.org/abs/1808.06586

================================================
FILE: 2018/10/12.md
================================================
**2018-10-12**

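This post introduces two recent ECCV 2018 papers. The first proposes IoU-Net, which learns to predict the IoU between each detected bounding box and the matched ground truth. The network thereby acquires localization confidence, which improves NMS by preserving accurately localized bounding boxes. It also proposes an optimization-based bounding box refinement method in which the predicted IoU is formulated as the objective. The second proposes DetNet, a novel backbone network designed specifically for object detection.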

# Object Detection

**《Acquisition of Localization Confidence for Accurate Object Detection》**

Abstract：Modern CNN-based object detectors rely on bounding box regression and non-maximum suppression to localize objects. While the probabilities for class labels naturally reflect classification confidence, localization confidence is absent. This makes properly localized bounding boxes degenerate during iterative regression or even suppressed during NMS. In the paper **we propose IoU-Net learning to predict the IoU between each detected bounding box and the matched ground-truth**. The network acquires this confidence of localization, which improves the NMS procedure by preserving accurately localized bounding boxes. Furthermore, an optimization-based bounding box refinement method is proposed, where the predicted IoU is formulated as the objective. Extensive experiments on the MS-COCO dataset show the effectiveness of IoU-Net, as well as its compatibility with and adaptivity to several state-of-the-art object detectors.

arXiv：https://arxiv.org/abs/1807.11590

Note: source code not yet released.
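The quantity IoU-Net learns to predict, and the way a localization score can drive NMS, can be sketched as follows. This is plain greedy NMS with the ranking score swapped for a hypothetical predicted IoU; the box format `(x1, y1, x2, y2)`, the function names, and the numbers are assumptions for illustration, and the paper's own NMS variant and refinement step are not reproduced:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.5):
    """Greedy NMS: keep the best-scored box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0.0, 0.0, 2.0, 2.0), (0.1, 0.0, 2.1, 2.0), (5.0, 5.0, 6.0, 6.0)]
predicted_iou = [0.6, 0.9, 0.8]      # hypothetical localization confidences
print(nms(boxes, predicted_iou))     # indices of the boxes that survive
```

Ranking by a localization score rather than the class probability means that, among heavily overlapping candidates, the box believed to fit the object best is the one that survives.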

**《DetNet: A Backbone network for Object Detection》**

Abstract：Recent CNN based object detectors, no matter one-stage methods like YOLO, SSD, and RetinaNet or two-stage detectors like Faster R-CNN, R-FCN and FPN are usually trying to directly finetune from ImageNet pre-trained models designed for image classification. There has been little work discussing on the backbone feature extractor specifically designed for the object detection. More importantly, **there are several differences between the tasks of image classification and object detection**. 1. Recent object detectors like FPN and RetinaNet usually involve extra stages against the task of image classification to handle the objects with various scales. 2. Object detection not only needs to recognize the category of the object instances but also spatially locate the position. Large downsampling factor brings large valid receptive field, which is good for image classification but compromises the object location ability. Due to the gap between the image classification and object detection, we propose DetNet in this paper, which is a novel backbone network specifically designed for object detection. Moreover, DetNet includes the extra stages against traditional backbone network for image classification, while maintains high spatial resolution in deeper layers. Without any bells and whistles, state-of-the-art results have been obtained for both object detection and instance segmentation on the MSCOCO benchmark based on our DetNet (4.8G FLOPs) backbone. The code will be released for the reproduction.

arXiv：https://arxiv.org/abs/1804.06215

Note: source code not yet released.

================================================
FILE: 2018/10/17.md
================================================
**2018-10-17**

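This post introduces two recent ECCV 2018 papers on semantic segmentation: one proposes the Bilateral Segmentation Network (BiSeNet), which achieves real-time inference speed without sacrificing spatial resolution; the other proposes a UDA framework and a CBST framework, and introduces spatial priors to refine the generated labels.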

# Semantic Segmentation

**《BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation》**

Abstract：Semantic segmentation requires both rich spatial information and sizeable receptive field. However, modern approaches usually compromise spatial resolution to achieve real-time inference speed, which leads to poor performance. In this paper, we address this dilemma with a novel Bilateral Segmentation Network (BiSeNet). We first design a Spatial Path with a small stride to preserve the spatial information and generate high-resolution features. Meanwhile, a Context Path with a fast downsampling strategy is employed to obtain sufficient receptive field. On top of the two paths, we introduce a new Feature Fusion Module to combine features efficiently. The proposed architecture makes a right balance between the speed and segmentation performance on Cityscapes, CamVid, and COCO-Stuff datasets. Specifically, for a 2048x1024 input, we achieve 68.4% Mean IOU on the Cityscapes test dataset with speed of 105 FPS on one NVIDIA Titan XP card, which is significantly faster than the existing methods with comparable performance.

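Summary: Semantic segmentation needs both rich spatial information and a sizeable receptive field, but modern approaches usually sacrifice spatial resolution to reach real-time inference speed, which hurts accuracy. BiSeNet addresses this with two paths: a Spatial Path with a small stride that preserves spatial information and produces high-resolution features, and a Context Path with fast downsampling that obtains a sufficient receptive field. A Feature Fusion Module on top of the two paths combines their features efficiently. The architecture balances speed and segmentation performance on the Cityscapes, CamVid, and COCO-Stuff datasets: for a 2048x1024 input, it achieves 68.4% Mean IOU on the Cityscapes test set at 105 FPS on one NVIDIA Titan XP card, significantly faster than existing methods with comparable performance.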

arXiv：http://arxiv.org/abs/1808.00897

Note: source code not yet released
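
The Mean IOU figure quoted above averages the per-class intersection-over-union. As a minimal sketch of the metric on flattened label maps (benchmark tools such as the Cityscapes scripts accumulate counts over the whole dataset and handle ignore labels, which this toy version does not):

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both label maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example with six pixels and three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5 up to floating-point rounding.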

**《Unsupervised Domain Adaptation for Semantic Segmentation via Class-Balanced Self-Training》**

Abstract：Recent deep networks achieved state of the art performance on a variety of semantic segmentation tasks. Despite such progress, these models often face challenges in real world “wild tasks” where large difference between labeled training/source data and unseen test/target data exists. In particular, such difference is often referred to as “domain gap”, and could cause significantly decreased performance which cannot be easily remedied by further increasing the representation power. Unsupervised domain adaptation (UDA) seeks to overcome such problem without target domain labels. In this paper, we propose a novel UDA framework based on an iterative self-training (ST) procedure, where the problem is formulated as latent variable loss minimization, and can be solved by alternately generating pseudo labels on target data and re-training the model with these labels. On top of ST, we also propose a novel class-balanced self-training (CBST) framework to avoid the gradual dominance of large classes on pseudo-label generation, and introduce spatial priors to refine generated labels. Comprehensive experiments show that the proposed methods achieve state of the art semantic segmentation performance under multiple major UDA settings.

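Summary: Recent deep networks achieve state-of-the-art performance on a variety of semantic segmentation tasks, yet they often struggle in real-world "wild tasks" where labeled training/source data differ greatly from unseen test/target data. This difference, the "domain gap", can cause a significant performance drop that is not easily remedied by increasing representation power. Unsupervised domain adaptation (UDA) seeks to overcome the problem without target-domain labels. The paper proposes a UDA framework based on iterative self-training (ST), formulated as latent variable loss minimization and solved by alternately generating pseudo labels on target data and re-training the model with these labels. On top of ST, it proposes class-balanced self-training (CBST) to avoid the gradual dominance of large classes in pseudo-label generation, and introduces spatial priors to refine the generated labels. Comprehensive experiments show that the proposed methods achieve state-of-the-art semantic segmentation performance under multiple major UDA settings.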

paper：http://openaccess.thecvf.com/content_ECCV_2018/papers/Yang_Zou_Unsupervised_Domain_Adaptation_ECCV_2018_paper.pdf
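
To illustrate why class balancing matters in pseudo-label generation, here is a simplified sketch. It assumes a per-class confidence cut-off that keeps the top `portion` of each class's predictions; the function names and the toy probabilities are hypothetical, and the paper's actual formulation (a latent variable loss with class-wise normalization and spatial priors) is richer than this.

```python
def class_thresholds(probs, portion):
    """Per-class confidence cut-off keeping the top `portion` of each class's predictions."""
    by_class = {}
    for p in probs:
        c = max(range(len(p)), key=p.__getitem__)  # predicted class
        by_class.setdefault(c, []).append(p[c])
    thresholds = {}
    for c, confs in by_class.items():
        confs.sort(reverse=True)
        keep = max(1, int(len(confs) * portion))
        thresholds[c] = confs[keep - 1]
    return thresholds

def pseudo_labels(probs, portion):
    """Return one label per sample, or None where the prediction is not confident enough."""
    th = class_thresholds(probs, portion)
    labels = []
    for p in probs:
        c = max(range(len(p)), key=p.__getitem__)
        labels.append(c if p[c] >= th[c] else None)
    return labels

# Toy softmax outputs: class 0 is large and confident, class 1 is small and less confident.
probs = [
    [0.95, 0.05], [0.90, 0.10], [0.80, 0.20], [0.70, 0.30],
    [0.45, 0.55], [0.40, 0.60],
]
print(pseudo_labels(probs, portion=0.5))
```

With a single global threshold of, say, 0.8, no sample of class 1 would ever receive a pseudo label; the per-class cut-off keeps the most confident half of each class instead.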

================================================
FILE: 2018/11/05-09.md
================================================
**2018-11-05~2018-11-09**

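This post covers 43 papers on CNNs, image classification, data augmentation, faces, image segmentation, OCR, GANs, style transfer, object tracking, datasets, and pose estimation.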

# **Datasets**

**《The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale》**

IJCV

arXiv：https://arxiv.org/abs/1811.00982

Dataset website：https://storage.googleapis.com/openimages/web/index.html

Note: about 9.2 million images



**《Toward Driving Scene Understanding: A Dataset for Learning Driver Behavior and Causal Reasoning》**

CVPR 2018

arXiv：https://arxiv.org/abs/1811.02307

datasets：https://usa.honda-ri.com/hdd



# **CNN**

**《Invertible Residual Networks》**

arXiv：https://arxiv.org/abs/1811.00995



**《You Only Search Once: Single Shot Neural Architecture Search via Direct Sparse Optimization》**

ICLR2019 Submission

arXiv：https://arxiv.org/abs/1811.01567

Note: work by an intern at TuSimple; reported to outperform NAS



**《Bi-Real Net: Binarizing Deep Network Towards Real-Network Performance》**

Submitted to IJCV 2018

arXiv：https://arxiv.org/abs/1811.01335



**《Activation Functions: Comparison of trends in Practice and Research for Deep Learning》**

arXiv：https://arxiv.org/abs/1811.03378



**《Microscopic Nuclei Classification, Segmentation and Detection with improved Deep Convolutional Neural Network (DCNN) Approaches》**

arXiv：https://arxiv.org/abs/1811.03447



**《ColorUNet: A convolutional classification approach to colorization》**

arXiv：https://arxiv.org/abs/1811.03120



**《ExGate: Externally Controlled Gating for Feature-based Attention in Artificial Neural Networks》**

arXiv：https://arxiv.org/abs/1811.03403



# **Image Classification**

**《Learning from Large-scale Noisy Web Data with Ubiquitous Reweighting for Image Classification》**

arXiv：https://arxiv.org/abs/1811.00700



# **Data Augmentation**

**《Hide-and-Seek: A Data Augmentation Technique for Weakly-Supervised Localization and Beyond》**

TPAMI 

arXiv：https://arxiv.org/abs/1811.02545



# **Face**

**《Exposing DeepFake Videos By Detecting Face Warping Artifacts》**

arXiv：https://arxiv.org/abs/1811.00656



**《Exposing Deep Fakes Using Inconsistent Head Poses》**

arXiv：https://arxiv.org/abs/1811.00661



**《Fast Face Image Synthesis with Minimal Training》**

WACV 2019

arXiv：https://arxiv.org/abs/1811.01474

datasets：https://cvrl.nd.edu/projects/data/



**《Facial Landmark Detection for Manga Images》**

arXiv：https://arxiv.org/abs/1811.03214



# **Specific Object Detection**

**《Real-time Driver Drowsiness Detection for Android Application Using Deep Neural Networks Techniques》**

arXiv：https://arxiv.org/abs/1811.01627



**《Query-based Logo Segmentation》**

arXiv：https://arxiv.org/abs/1811.01395



# **Image Segmentation**

**《Prediction Error Meta Classification in Semantic Segmentation: Detection via Aggregated Dispersion Measures of Softmax Probabilities》**

arXiv：https://arxiv.org/abs/1811.00648



**《Unsupervised RGBD Video Object Segmentation Using GANs》**

ACCV workshop

arXiv：https://arxiv.org/abs/1811.01526



**《DUNet: A deformable network for retinal vessel segmentation》**

arXiv：https://arxiv.org/abs/1811.01206



**《Ischemic Stroke Lesion Segmentation in CT Perfusion Scans using Pyramid Pooling and Focal Loss》**

2018 MICCAI workshop

arXiv：https://arxiv.org/abs/1811.01085



**《An End-to-end Approach to Semantic Segmentation with 3D CNN and Posterior-CRF in Medical Images》**

NIPS 2018 Workshop

arXiv：https://arxiv.org/abs/1811.03549



**《Adaptive Semantic Segmentation with a Strategic Curriculum of Proxy Labels》**

arXiv：https://arxiv.org/abs/1811.03542



**《Deep Semantic Instance Segmentation of Tree-like Structures Using Synthetic Data》**

WACV 2019

arXiv：https://arxiv.org/abs/1811.03208



# **GAN**

**《Improving GAN with neighbors embedding and gradient matching》**

AAAI 2019

arXiv：https://arxiv.org/abs/1811.01333



**《A General Theory of Equivariant CNNs on Homogeneous Spaces》**

arXiv：https://arxiv.org/abs/1811.02017



**《Triple consistency loss for pairing distributions in GAN-based face synthesis》**

arXiv：https://arxiv.org/abs/1811.03492

github：https://github.com/ESanchezLozano/GANnotation

youtube：https://youtu.be/-8r7zexg4yg



# **OCR**

**《Auto-ML Deep Learning for Rashi Scripts OCR》**

arXiv：https://arxiv.org/abs/1811.01290



# **Irregular Text Recognition**

**《Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition》**

arXiv：https://arxiv.org/abs/1811.00751



# **Style Transfer**

**《Evolvement Constrained Adversarial Learning for Video Style Transfer》**

arXiv：https://arxiv.org/abs/1811.02476



# **Challenge Workshop**

**《Introduction to the 1st Place Winning Model of OpenImages Relationship Detection Challenge》**

arXiv：https://arxiv.org/abs/1811.00662



# **Pose Estimation**

**《Improving Multi-Person Pose Estimation using Label Correction》**

arXiv：https://arxiv.org/abs/1811.03331



# **Object Tracking**

**《High Speed Tracking With A Fourier Domain Kernelized Correlation Filter》**

arXiv：https://arxiv.org/abs/1811.03236



# **Zero-shot Learning**

**《Model Selection for Generalized Zero-shot Learning》**

arXiv：https://arxiv.org/abs/1811.03252



# **3D**

**《SPNet: Deep 3D Object Classification and Retrieval using Stereographic Projection》**

arXiv：https://arxiv.org/abs/1811.01571



# **Filtering**

**《Fast Adaptive Bilateral Filtering》**

TIP

arXiv：https://arxiv.org/abs/1811.02308



**《Fast High-Dimensional Bilateral and Nonlocal Means Filtering》**

TIP

arXiv：https://arxiv.org/abs/1811.02363



# **Other**

**《Continual Occlusions and Optical Flow Estimation》**

ACCV 2018

arXiv：https://arxiv.org/abs/1811.01602



**《Texture Synthesis Guided Deep Hashing for Texture Image Retrieval》**

arXiv：https://arxiv.org/abs/1811.01401



**《Semantic bottleneck for computer vision tasks》**

ACCV 2018

arXiv：https://arxiv.org/abs/1811.02234



**《3DCapsule: Extending the Capsule Architecture to Classify 3D Point Clouds》**

WACV 2019

arXiv：https://arxiv.org/abs/1811.02191



**《Automatic Thresholding of SIFT Descriptors》**

ICIP 2016

arXiv：https://arxiv.org/abs/1811.03173



**《DragonPaint: Rule based bootstrapping for small data with an application to cartoon coloring》**

arXiv：https://arxiv.org/abs/1811.03151

================================================
FILE: 2018/11/19.md
================================================
**2018-11-19**

This post covers 12 papers on CNNs, faces, 3D, OCR, GANs, and object detection.

# CNN

**《GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism》**

arXiv：https://arxiv.org/abs/1811.06965

**《Residual Convolutional Neural Network Revisited with Active Weighted Mapping》**

arXiv：https://arxiv.org/abs/1811.06878

Note: interesting, this adds weights to ResNet!

**《DropFilter: A Novel Regularization Method for Learning Convolutional Neural Networks》**

submitted to CVPR19

arXiv：https://arxiv.org/abs/1811.06783

# Face
**《Image Pre-processing Using OpenCV Library on MORPH-II Face Database》**

arXiv：https://arxiv.org/abs/1811.06934

# 3D
**《The Perfect Match: 3D Point Cloud Matching with Smoothed Densities》**

arXiv：https://arxiv.org/abs/1811.06879

# Object Detection

**《DeRPN: Taking a further step toward more general object detection》**

AAAI 2019

arXiv：https://arxiv.org/abs/1811.06700

github：https://github.com/HCIILAB/DeRPN

**《Improving Fingerprint Pore Detection with a Small FCN》**

arXiv：https://arxiv.org/abs/1811.06846

github：https://github.com/gdahia/fingerprint-pore-detection

Note: impressive, even fingerprint pores can be detected; not for the trypophobic!

**《Detecting The Objects on The Road Using Modular Lightweight Network》**

arXiv：https://arxiv.org/abs/1811.06641

# GAN
**《Conditional GANs for Multi-Illuminant Color Constancy: Revolution or Yet Another Approach?》**

arXiv：https://arxiv.org/abs/1811.06604

# Other

**《Automatic Paper Summary Generation from Visual and Textual Information》**

ICMV 2018

arXiv：https://arxiv.org/abs/1811.06943

github：https://cvpaperchallenge.github.io/AutoPaperSummaryGen/

Note: automatic paper summary generation! Impressive!

**《Anomaly Detection using Deep Learning based Image Completion》**

ICMLA 2018

arXiv：https://arxiv.org/abs/1811.06861

**《Ground Plane Polling for 6DoF Pose Estimation of Objects on the Road》**

arXiv：https://arxiv.org/abs/1811.06666

================================================
FILE: 2018/11/20.md
================================================
**2018-11-20**

This post covers 46 papers on CNNs, faces, image classification, object detection, image segmentation, GANs, Re-ID, SLAM, and transfer learning.

# CNN

**《Deeper Interpretability of Deep Networks》**

arXiv：https://arxiv.org/abs/1811.07807

> Deep Convolutional Neural Networks (CNNs) have been one of the most influential recent developments in computer vision, particularly for categorization. There is an increasing demand for explainable AI as these systems are deployed in the real world. However, understanding the information represented and processed in CNNs remains in most cases challenging. Within this paper, we explore the use of new information theoretic techniques developed in the field of neuroscience to enable novel understanding of how a CNN represents information. We trained a 10-layer ResNet architecture to identify 2,000 face identities from 26M images generated using a rigorously controlled 3D face rendering model that produced variations of intrinsic (i.e. face morphology, gender, age, expression and ethnicity) and extrinsic factors (i.e. 3D pose, illumination, scale and 2D translation). With our methodology, we demonstrate that, unlike humans, the network overgeneralizes face identities even with extreme changes of face shape, but it is more sensitive to changes of texture. To understand the processing of information underlying these counterintuitive properties, we visualize the features of shape and texture that the network processes to identify faces. Then, we shed light on the inner workings of the black box and reveal how hidden layers represent these features and whether the representations are invariant to pose. We hope that our methodology will provide an additional valuable tool for interpretability of CNNs.

**《Deep Shape-from-Template: Wide-Baseline, Dense and Fast Registration and Deformable Reconstruction from a Single Image》**

arXiv：https://arxiv.org/abs/1811.07791

> We present Deep Shape-from-Template (DeepSfT), a novel Deep Neural Network (DNN) method for solving real-time automatic registration and 3D reconstruction of a deformable object viewed in a single monocular image. DeepSfT advances the state-of-the-art in various aspects. Compared to existing DNN SfT methods, it is the first fully convolutional real-time approach that handles an arbitrary object geometry, topology and surface representation. It also does not require ground truth registration with real data and scales well to very complex object models with large numbers of elements. Compared to previous non-DNN SfT methods, it does not involve numerical optimization at run-time, and is a dense, wide-baseline solution that does not demand, and does not suffer from, feature-based matching. It is able to process a single image with significant deformation and viewpoint changes, and handles well the core challenges of occlusions, weak texture and blur. DeepSfT is based on residual encoder-decoder structures and refining blocks. It is trained end-to-end with a novel combination of supervised learning from simulated renderings of the object model and semi-supervised automatic fine-tuning using real data captured with a standard RGB-D camera. The cameras used for fine-tuning and run-time can be different, making DeepSfT practical for real-world use. We show that DeepSfT significantly outperforms state-of-the-art wide-baseline approaches for non-trivial templates, with quantitative and qualitative evaluation.

**《Do Normalization Layers in a Deep ConvNet Really Need to Be Distinct?》**

arXiv：https://arxiv.org/abs/1811.07727

> Yes, they do. This work investigates a perspective for deep learning: whether different normalization layers in a ConvNet require different normalizers. This is the first step towards understanding this phenomenon. We allow each convolutional layer to be stacked before a switchable normalization (SN) that learns to choose a normalizer from a pool of normalization methods. Through systematic experiments in ImageNet, COCO, Cityscapes, and ADE20K, we answer three questions: (a) Is it useful to allow each normalization layer to select its own normalizer? (b) What impacts the choices of normalizers? (c) Do different tasks and datasets prefer different normalizers? Our results suggest that (1) using distinct normalizers improves both learning and generalization of a ConvNet; (2) the choices of normalizers are more related to depth and batch size, but less relevant to parameter initialization, learning rate decay, and solver; (3) different tasks and datasets have different behaviors when learning to select normalizers.
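
The idea of a layer that "learns to choose a normalizer from a pool" can be sketched as a softmax-weighted mix of statistics. The sketch below assumes the pool is instance, layer, and batch normalization, shares one set of weights between means and variances, and omits the learnable scale and shift; these are simplifying assumptions for illustration, not details taken from the abstract.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mean_var(values):
    mu = sum(values) / len(values)
    return mu, sum((v - mu) ** 2 for v in values) / len(values)

def switchable_norm(x, logits, eps=1e-5):
    """x[n][c] is a list of spatial values; logits are 3 scores for (IN, LN, BN)."""
    w_in, w_ln, w_bn = softmax(logits)
    N, C = len(x), len(x[0])
    # Layer-norm statistics: one (mean, var) per sample, over channels and positions.
    ln = [mean_var([v for c in range(C) for v in x[n][c]]) for n in range(N)]
    # Batch-norm statistics: one (mean, var) per channel, over samples and positions.
    bn = [mean_var([v for n in range(N) for v in x[n][c]]) for c in range(C)]
    out = []
    for n in range(N):
        row = []
        for c in range(C):
            mu_in, var_in = mean_var(x[n][c])  # instance-norm statistics
            mu = w_in * mu_in + w_ln * ln[n][0] + w_bn * bn[c][0]
            var = w_in * var_in + w_ln * ln[n][1] + w_bn * bn[c][1]
            row.append([(v - mu) / math.sqrt(var + eps) for v in x[n][c]])
        out.append(row)
    return out

# Toy input: batch of 2, 2 channels, 3 spatial positions.
x = [[[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]], [[0.0, 0.0, 3.0], [1.0, 1.0, 2.0]]]
y = switchable_norm(x, logits=[0.0, 0.0, 0.0])  # equal weight on all three normalizers
```

In training, the three logits would be learned per layer by backpropagation, so different layers can settle on different mixtures.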

**《Self-Referenced Deep Learning》**

arXiv：https://arxiv.org/abs/1811.07598

> Knowledge distillation is an effective approach to transferring knowledge from a teacher neural network to a student target network for satisfying the low-memory and fast running requirements in practice use. Whilst being able to create stronger target networks compared to the vanilla non-teacher based learning strategy, this scheme needs to train additionally a large teacher model with expensive computational cost. In this work, we present a Self-Referenced Deep Learning (SRDL) strategy. Unlike both vanilla optimisation and existing knowledge distillation, SRDL distils the knowledge discovered by the in-training target model back to itself to regularise the subsequent learning procedure therefore eliminating the need for training a large teacher model. SRDL improves the model generalisation performance compared to vanilla learning and conventional knowledge distillation approaches with negligible extra computational cost. Extensive evaluations show that a variety of deep networks benefit from SRDL resulting in enhanced deployment performance on both coarse-grained object categorisation tasks (CIFAR10, CIFAR100, Tiny ImageNet, and ImageNet) and fine-grained person instance identification tasks (Market-1501).
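
A rough sketch of a self-referenced training loss follows. It assumes the common distillation form (cross-entropy plus a temperature-scaled KL term) with the model's own earlier predictions standing in for the teacher; the temperature, the weight, and the function names are hypothetical, and the abstract does not specify SRDL's exact schedule or loss.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def self_referenced_loss(logits, earlier_logits, label, temperature=2.0, weight=0.5):
    """Cross-entropy on the true label plus KL to the model's own earlier soft predictions."""
    ce = -math.log(softmax(logits)[label])
    soft_ref = softmax(earlier_logits, temperature)  # treated as a fixed target
    soft_now = softmax(logits, temperature)
    return ce + weight * temperature ** 2 * kl_divergence(soft_ref, soft_now)

# Current logits versus logits recorded from the same model earlier in training.
print(self_referenced_loss([2.0, 1.0, 0.1], [1.5, 1.2, 0.3], label=0))
```

No separate teacher network appears anywhere in the loss, which is the cost saving the abstract emphasizes.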

**《Multimodal Densenet》**

arXiv：https://arxiv.org/abs/1811.07407

> Humans make accurate decisions by interpreting complex data from multiple sources. Medical diagnostics, in particular, often hinge on human interpretation of multi-modal information. In order for artificial intelligence to make progress in automated, objective, and accurate diagnosis and prognosis, methods to fuse information from multiple medical imaging modalities are required. However, combining information from multiple data sources has several challenges, as current deep learning architectures lack the ability to extract useful representations from multimodal information, and often simple concatenation is used to fuse such information. In this work, we propose Multimodal DenseNet, a novel architecture for fusing multimodal data. Instead of focusing on concatenation or early and late fusion, our proposed architectures fuses information over several layers and gives the model flexibility in how it combines information from multiple sources. We apply this architecture to the challenge of polyp characterization and landmark identification in endoscopy. Features from white light images are fused with features from narrow band imaging or depth maps. This study demonstrates that Multimodal DenseNet outperforms monomodal classification as well as other multimodal fusion techniques by a significant margin on two different datasets.

**《RePr: Improved Training of Convolutional Filters》**

arXiv：https://arxiv.org/abs/1811.07275

> A well-trained Convolutional Neural Network can easily be pruned without significant loss of performance. This is because of unnecessary overlap in the features captured by the network's filters. Innovations in network architecture such as skip/dense connections and Inception units have mitigated this problem to some extent, but these improvements come with increased computation and memory requirements at run-time. We attempt to address this problem from another angle - not by changing the network structure but by altering the training method. We show that by temporarily pruning and then restoring a subset of the model's filters, and repeating this process cyclically, overlap in the learned features is reduced, producing improved generalization. We show that the existing model-pruning criteria are not optimal for selecting filters to prune in this context and introduce inter-filter orthogonality as the ranking criteria to determine under-expressive filters. Our method is applicable both to vanilla convolutional networks and more complex modern architectures, and improves the performance across a variety of tasks, especially when applied to smaller networks.
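
The inter-filter orthogonality criterion can be illustrated with a small ranking routine. This sketch assumes the score of a filter is its average absolute cosine overlap with the other filters in the same layer, and that filters are non-zero; the function names and the toy weights are hypothetical, and the paper's exact definition and cross-layer ranking may differ.

```python
import math

def orthogonality_scores(filters):
    """filters: flattened filter weights from one layer. Higher score = more overlap."""
    unit = []
    for f in filters:
        norm = math.sqrt(sum(v * v for v in f))
        unit.append([v / norm for v in f])
    scores = []
    for i, fi in enumerate(unit):
        overlap = sum(abs(sum(a * b for a, b in zip(fi, fj)))
                      for j, fj in enumerate(unit) if j != i)
        scores.append(overlap / len(unit))
    return scores

def filters_to_drop(filters, fraction):
    """Indices of the least orthogonal filters, to be pruned temporarily."""
    scores = orthogonality_scores(filters)
    k = int(len(filters) * fraction)
    ranked = sorted(range(len(filters)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Filter 2 is nearly a copy of filter 0, so it carries the most redundant information.
filters = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(filters_to_drop(filters, fraction=0.25))
```

In the training scheme described above, the selected filters would be removed for a few epochs, then re-initialized and restored, and the cycle repeated.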

**《PydMobileNet: Improved Version of MobileNets with Pyramid Depthwise Separable Convolution》**

arXiv：https://arxiv.org/abs/1811.07083

> Convolutional neural networks (CNNs) have shown remarkable performance in various computer vision tasks in recent years. However, the increasing model size has raised challenges in adopting them in real-time applications as well as mobile and embedded vision applications. Many works try to build networks as small as possible while still having acceptable performance. The state-of-the-art architecture is MobileNets. They use Depthwise Separable Convolution (DWConvolution) in place of standard Convolution to reduce the size of networks. This paper describes an improved version of MobileNet, called Pyramid Mobile Network. Instead of using just a 3×3 kernel size for DWConvolution like in MobileNet, the proposed network uses a pyramid kernel size to capture more spatial information. The proposed architecture is evaluated on two highly competitive object recognition benchmark datasets (CIFAR-10, CIFAR-100). The experiments demonstrate that the proposed network achieves better performance compared with MobileNet as well as other state-of-the-art networks. Additionally, it is more flexible in fine-tuning the trade-off between accuracy, latency and model size than MobileNets.
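
The size savings of depthwise separable convolution, and the extra cost of a pyramid of kernel sizes, come down to simple parameter counting. In this sketch the kernel sizes (3, 5, 7), the channel counts, and the choice to concatenate the branches before the 1x1 convolution are assumptions made for illustration; biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """One k x k filter per input channel, then a 1x1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

def pyramid_depthwise_params(kernel_sizes, c_in, c_out):
    """One depthwise branch per kernel size, concatenated, then a 1x1 convolution."""
    depthwise = sum(k * k * c_in for k in kernel_sizes)
    pointwise = len(kernel_sizes) * c_in * c_out
    return depthwise + pointwise

c_in, c_out = 64, 128
print(standard_conv_params(3, c_in, c_out))
print(depthwise_separable_params(3, c_in, c_out))
print(pyramid_depthwise_params((3, 5, 7), c_in, c_out))
```

Under these assumptions the pyramid block is larger than a single 3x3 depthwise separable block but still well below a standard 3x3 convolution with the same channel counts.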

# Face

**《Aff-Wild2: Extending the Aff-Wild Database for Affect Recognition》**

arXiv：https://arxiv.org/abs/1811.07770

> Automatic understanding of human affect using visual signals is a problem that has attracted significant interest over the past 20 years. However, human emotional states are quite complex. To appraise such states displayed in real-world settings, we need expressive emotional descriptors that are capable of capturing and describing this complexity. The circumplex model of affect, which is described in terms of valence (i.e., how positive or negative is an emotion) and arousal (i.e., power of the activation of the emotion), can be used for this purpose. Recent progress in the emotion recognition domain has been achieved through the development of deep neural architectures and the availability of very large training databases. To this end, Aff-Wild has been the first large-scale "in-the-wild" database, containing around 1,200,000 frames. In this paper, we build upon this database, extending it with 260 more subjects and 1,413,000 new video frames. We call the union of Aff-Wild with the additional data, Aff-Wild2. The videos are downloaded from Youtube and have large variations in pose, age, illumination conditions, ethnicity and profession. Both database-specific as well as cross-database experiments are performed in this paper, by utilizing the Aff-Wild2, along with the RECOLA database. The developed deep neural architectures are based on the joint training of state-of-the-art convolutional and recurrent neural networks with attention mechanism; thus exploiting both the invariant properties of convolutional features, while modeling temporal dynamics that arise in human behaviour via the recurrent layers. The obtained results show premise for utilization of the extended Aff-Wild, as well as of the developed deep neural architectures for visual analysis of human behaviour in terms of continuous emotion dimensions.

# Image Classification

**《High Order Neural Networks for Video Classification》**

arXiv：https://arxiv.org/abs/1811.07519

> Capturing spatiotemporal correlations is an essential topic in video classification. In this paper, we present high order operations as a generic family of building blocks for capturing high order correlations from high dimensional input video space. We prove that several successful architectures for visual classification tasks are in the family of high order neural networks; theoretical and experimental analysis demonstrates that their underlying mechanism is high order. We also propose a new LEarnable hiGh Order (LEGO) block, whose goal is to capture spatiotemporal correlation in a feedforward manner. Specifically, LEGO blocks implicitly learn the relation expressions for spatiotemporal features and use the learned relations to weight input features. This building block can be plugged into many neural network architectures, achieving evident improvement without introducing much overhead. On the task of video classification, even using RGB only without fine-tuning with other video datasets, our high order models can achieve results on par with or better than the existing state-of-the-art methods on both Something-Something (V1 and V2) and Charades datasets.

**《DeepConsensus: using the consensus of features from multiple layers to attain robust image classification》**

arXiv：https://arxiv.org/abs/1811.07266

> We consider a classifier whose test set is exposed to various perturbations that are not present in the training set. These test samples still contain enough features to map them to the same class as their unperturbed counterpart. Current architectures exhibit rapid degradation of accuracy when trained on standard datasets but then used to classify perturbed samples of that data. To address this, we present a novel architecture named DeepConsensus that significantly improves generalization to these test-time perturbations. Our key insight is that deep neural networks should directly consider summaries of low and high level features when making classifications. Existing convolutional neural networks can be augmented with DeepConsensus, leading to improved resistance against large and small perturbations on MNIST, EMNIST, FashionMNIST, CIFAR10 and SVHN datasets.


# Object Detection

**《Weakly Supervised Soft-detection-based Aggregation Method for Image Retrieval》**

arXiv：https://arxiv.org/abs/1811.07619

> In recent years, compact representations based on activations of Convolutional Neural Networks (CNNs) have achieved remarkable performance in image retrieval. However, the object of interest often takes up only a small part of the whole image. Therefore, it is important to extract discriminative representations that contain regional information about the pivotal small object. In this paper, we propose a novel weakly supervised soft-detection-based aggregation (SDA) method free from bounding box annotations for image retrieval. In order to highlight the discriminative patterns of objects and suppress background noise, we employ trainable soft region proposals that indicate the probability of the object of interest and reflect the significance of candidate regions. We conduct comprehensive experiments on standard image retrieval datasets. Our weakly supervised SDA method achieves state-of-the-art performance on most benchmarks. The results demonstrate that the proposed SDA method is effective for image retrieval.

**《Fast Efficient Object Detection Using Selective Attention》**

arXiv：https://arxiv.org/abs/1811.07502

> Deep learning object detectors achieve state-of-the-art accuracy at the expense of high computational overheads, impeding their utilization on embedded systems such as drones. A primary source of these overheads is the exhaustive classification of typically 10^4-10^5 regions per image. Given that most of these regions contain uninformative background, the detector designs seem extremely superfluous and inefficient. In contrast, biological vision systems leverage selective attention for fast and efficient object detection. Recent neuroscientific findings shedding new light on the mechanism behind selective attention allowed us to formulate a new hypothesis of object detection efficiency and subsequently introduce a new object detection paradigm. To that end, we leverage this knowledge to design a novel region proposal network and empirically show that it achieves high object detection performance on the COCO dataset. Moreover, the model uses two to three orders of magnitude fewer computations than state-of-the-art models and consequently achieves inference speeds exceeding 500 frames/s, thereby making it possible to achieve object detection on embedded systems.

**《FotonNet: A HW-Efficient Object Detection System Using 3D-Depth Segmentation and 2D-DNN Classifier》**

arXiv：https://arxiv.org/abs/1811.07493

> Object detection and classification is one of the most important computer vision problems. Ever since the introduction of deep learning (Krizhevsky et al., 2012), we have witnessed a dramatic increase in the accuracy of this object detection problem. However, most of these improvements have occurred using conventional 2D image processing. Recently, low-cost 3D-image sensors, such as the Microsoft Kinect (Time-of-Flight) or the Apple FaceID (Structured-Light), can provide 3D-depth or point cloud data that can be added to a convolutional neural network, acting as an extra set of dimensions. In our proposed approach, we introduce a new 2D + 3D system that takes the 3D-data to determine the object region followed by any conventional 2D-DNN, such as AlexNet. In this method, our approach can easily dissociate the information collection from the Point Cloud and 2D-Image data and combine both operations later. Hence, our system can use any existing trained 2D network on a large image dataset, and does not require a large 3D-depth dataset for new training. Experimental object detection results across 30 images show an accuracy of 0.67, versus 0.54 and 0.51 for RCNN and YOLO, respectively.

**《R2CNN++: Multi-Dimensional Attention Based Rotation Invariant Detector with Robust Anchor Strategy》**

arXiv：https://arxiv.org/abs/1811.07126

> Object detection plays a vital role in natural scene and aerial scene and is full of challenges. Although many advanced algorithms have succeeded in the natural scene, the progress in the aerial scene has been slow due to the complexity of the aerial image and the large degree of freedom of remote sensing objects in scale, orientation, and density. In this paper, a novel multi-category rotation detector is proposed, which can efficiently detect small objects, arbitrary direction objects, and dense objects in complex remote sensing images. Specifically, the proposed model adopts a targeted feature fusion strategy called inception fusion network, which fully considers factors such as feature fusion, anchor sampling, and receptive field to improve the ability to handle small objects. Then we combine the pixel attention network and the channel attention network to weaken the noise information and highlight the objects feature. Finally, the rotational object detection algorithm is realized by redefining the rotating bounding box. Experiments on public datasets including DOTA, NWPU VHR-10 demonstrate that the proposed algorithm significantly outperforms state-of-the-art methods. The code and models will be available at https://github.com/DetectionTeamUCAS/R2CNN-Plus-Plus_Tensorflow.


# Saliency Detection

**《Global and Local Sensitivity Guided Key Salient Object Re-augmentation for Video Saliency Detection》**

arXiv：https://arxiv.org/abs/1811.07480

> The existing still-static deep learning based saliency researches do not consider the weighting and highlighting of extracted features from different layers; all features contribute equally to the final saliency decision-making. Such methods always evenly detect all "potentially significant regions" and are unable to highlight the key salient object, resulting in detection failure in dynamic scenes. In this paper, based on the fact that salient areas in videos are relatively small and concentrated, we propose a **key salient object re-augmentation method (KSORA) using top-down semantic knowledge and bottom-up feature guidance** to improve detection accuracy in video scenes. KSORA includes two sub-modules (WFE and KOS): WFE processes local salient feature selection using a bottom-up strategy, while KOS ranks each object in a global fashion by top-down statistical knowledge, and chooses the most critical object area for local enhancement. The proposed KSORA can not only strengthen the saliency value of the local key salient object but also ensure global saliency consistency. Results on three benchmark datasets suggest that our model has the capability of improving the detection accuracy on complex scenes. The significant performance of KSORA, with a speed of 17FPS on modern GPUs, has been verified by comparisons with ten other state-of-the-art algorithms.

# Scene Text Detection

**《Pixel-Anchor: A Fast Oriented Scene Text Detector with Combined Networks》**

arXiv：https://arxiv.org/abs/1811.07432

> Recently, semantic segmentation and general object detection frameworks have been widely adopted by scene text detecting tasks. However, both of them alone have obvious shortcomings in practice. In this paper, we propose a novel end-to-end trainable deep neural network framework, named Pixel-Anchor, which combines semantic segmentation and SSD in one network by feature sharing and anchor-level attention mechanism to detect oriented scene text. To deal with scene text which has large variances in size and aspect ratio, we combine FPN and ASPP operation as our encoder-decoder structure in the semantic segmentation part, and propose a novel Adaptive Predictor Layer in the SSD. Pixel-Anchor detects scene text in a single network forward pass, no complex post-processing other than an efficient fusion Non-Maximum Suppression is involved. We have benchmarked the proposed Pixel-Anchor on the public datasets. Pixel-Anchor outperforms the competing methods in terms of text localization accuracy and run speed, more specifically, on the ICDAR 2015 dataset, the proposed algorithm achieves an F-score of 0.8768 at 10 FPS for 960 x 1728 resolution images.
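
For readers unfamiliar with the post-processing step mentioned above, the sketch below shows standard greedy Non-Maximum Suppression on axis-aligned boxes. It is not the paper's fusion NMS, which operates on oriented text boxes from two branches; the box format, the threshold, and the assumption that every box has positive area are choices made for this example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with positive area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

The first two boxes overlap with an IoU of 81/119, about 0.68, so the lower-scoring one is suppressed and indices 0 and 2 are kept.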

**《Improving Rotated Text Detection with Rotation Region Proposal Networks》**

arXiv：https://arxiv.org/abs/1811.07031

> A significant number of images shared on social media platforms such as Facebook and Instagram contain text in various forms. It's increasingly becoming commonplace for bad actors to share misinformation, hate speech or other kinds of harmful content as text overlaid on images on such platforms. A scene-text understanding system should hence be able to handle text in various orientations that the adversary might use. Moreover, such a system can be incorporated into screen readers used to aid the visually impaired. In this work, we extend the scene-text extraction system at Facebook, Rosetta, to efficiently handle text in various orientations. Specifically, we incorporate the Rotation Region Proposal Networks (RRPN) in our text extraction pipeline and offer practical suggestions for building and deploying a model for detecting and recognizing text in arbitrary orientations efficiently. Experimental results show a significant improvement on detecting rotated text.


# Image Segmentation

**《OrthoSeg: A Deep Multimodal Convolutional Neural Network for Semantic Segmentation of Orthoimagery》**

arXiv：https://arxiv.org/abs/1811.07859

> This paper addresses the task of semantic segmentation of orthoimagery using multimodal data e.g. optical RGB, infrared and digital surface model. We propose a deep convolutional neural network architecture termed OrthoSeg for semantic segmentation using multimodal, orthorectified and coregistered data. We also propose a training procedure for supervised training of OrthoSeg. The training procedure complements the inherent architectural characteristics of OrthoSeg for preventing complex co-adaptations of learned features, which may arise due to probable high dimensionality and spatial correlation in multimodal and/or multispectral coregistered data. OrthoSeg consists of parallel encoding networks for independent encoding of multimodal feature maps and a decoder designed for efficiently fusing independently encoded multimodal feature maps. A softmax layer at the end of the network uses the features generated by the decoder for pixel-wise classification. The decoder fuses feature maps from the parallel encoders locally as well as contextually at multiple scales to generate per-pixel feature maps for final pixel-wise classification resulting in segmented output. We experimentally show the merits of OrthoSeg by demonstrating state-of-the-art accuracy on the ISPRS Potsdam 2D Semantic Segmentation dataset. Adaptability is one of the key motivations behind OrthoSeg so that it serves as a useful architectural option for a wide range of problems involving the task of semantic segmentation of coregistered multimodal and/or multispectral imagery. Hence, OrthoSeg is designed to enable independent scaling of parallel encoder networks and decoder network to better match application requirements, such as the number of input channels, the effective field-of-view, and model capacity.

**《M2U-Net: Effective and Efficient Retinal Vessel Segmentation for Resource-Constrained Environments》**

arXiv：https://arxiv.org/abs/1811.07738

> In this paper, we present a novel neural network architecture for retinal vessel segmentation that improves over the state of the art on two benchmark datasets, is the first to run in real time on high resolution images, and its small memory and processing requirements make it deployable in mobile and embedded systems. 
The M2U-Net has a new encoder-decoder architecture that is inspired by the U-Net. It adds pretrained components of MobileNetV2 in the encoder part and novel contractive bottleneck blocks in the decoder part that, combined with bilinear upsampling, drastically reduce the parameter count to 0.55M compared to 31.03M in the original U-Net. 
We have evaluated its performance against a wide body of previously published results on three public datasets. On two of them, the M2U-Net achieves new state-of-the-art performance by a considerable margin. When implemented on a GPU, our method is the first to achieve real-time inference speeds on high-resolution fundus images. We also implemented our proposed network on an ARM-based embedded system where it segments images in between 0.6 and 15 sec, depending on the resolution. Thus, the M2U-Net enables a number of applications of retinal vessel structure extraction, such as early diagnosis of eye diseases, retinal biometric authentication systems, and robot assisted microsurgery.

# Object Tracking

**《Robust Visual Tracking using Multi-Frame Multi-Feature Joint Modeling》**

arXiv：https://arxiv.org/abs/1811.07498

> It remains a huge challenge to design effective and efficient trackers under complex scenarios, including occlusions, illumination changes and pose variations. To cope with this problem, a promising solution is to integrate the temporal consistency across consecutive frames and multiple feature cues in a unified model. Motivated by this idea, we propose a novel correlation filter-based tracker in this work, in which the temporal relatedness is reconciled under a multi-task learning framework and the multiple feature cues are modeled using a multi-view learning approach. We demonstrate the resulting regression model can be efficiently learned by exploiting the structure of blockwise diagonal matrix. A fast blockwise diagonal matrix inversion algorithm is developed thereafter for efficient online tracking. Meanwhile, we incorporate an adaptive scale estimation mechanism to strengthen the stability of scale variation tracking. We implement our tracker using two types of features and test it on two benchmark datasets. Experimental results demonstrate the superiority of our proposed approach when compared with other state-of-the-art trackers. project homepage：http://bmal.hust.edu.cn/project/KMF2JMTtracking.html

**《Deep Siamese Networks with Bayesian non-Parametrics for Video Object Tracking》**

arXiv：https://arxiv.org/abs/1811.07386

> We present a novel algorithm utilizing a deep Siamese neural network as a general object similarity function in combination with a Bayesian optimization (BO) framework to encode spatio-temporal information for efficient object tracking in video. In particular, we treat the video tracking problem as a dynamic (i.e. temporally-evolving) optimization problem. Using Gaussian Process priors, we model a dynamic objective function representing the location of a tracked object in each frame. By exploiting temporal correlations, the proposed method queries the search space in a statistically principled and efficient way, offering several benefits over current state of the art video tracking methods.

**《Exploit the Connectivity: Multi-Object Tracking with TrackletNet》**

arXiv：https://arxiv.org/abs/1811.07258

> Multi-object tracking (MOT) is an important and practical task related to both surveillance systems and moving camera applications, such as autonomous driving and robotic vision. However, due to unreliable detection, occlusion and fast camera motion, tracked targets can be easily lost, which makes MOT very challenging. Most recent works treat tracking as a re-identification (Re-ID) task, but how to combine appearance and temporal features is still not well addressed. In this paper, we propose an innovative and effective tracking method called TrackletNet Tracker (TNT) that combines temporal and appearance information together as a unified framework. First, we define a graph model which treats each tracklet as a vertex. The tracklets are generated by appearance similarity with CNN features and intersection-over-union (IOU) with epipolar constraints to compensate camera movement between adjacent frames. Then, for every pair of two tracklets, the similarity is measured by our designed multi-scale TrackletNet. Afterwards, the tracklets are clustered into groups which represent individual object IDs. Our proposed TNT has the ability to handle most of the challenges in MOT, and achieve promising results on MOT16 and MOT17 benchmark datasets compared with other state-of-the-art methods.
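
The abstract above mentions intersection-over-union (IOU) as one of the cues used to build tracklets. As a minimal sketch of that measure only (not the authors' implementation), the function below computes IoU for two axis-aligned boxes. The `(x1, y1, x2, y2)` box format and the name `iou` are assumptions made for illustration; the epipolar compensation described in the abstract is not modeled here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are assumed to be (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0
```

A detection in frame t+1 would typically be linked to a detection in frame t when this value is high, though the threshold and linking rule are design choices not given in the abstract.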


# GAN

**《Injecting and removing malignant features in mammography with CycleGAN: Investigation of an automated adversarial attack using neural networks》**

arXiv：https://arxiv.org/abs/1811.07767

> Purpose To train a cycle-consistent generative adversarial network (CycleGAN) on mammographic data to inject or remove features of malignancy, and to determine whether these AI-mediated attacks can be detected by radiologists. Material and Methods From the two publicly available datasets, BCDR and INbreast, we selected images from cancer patients and healthy controls. An internal dataset served as test data, withheld during training. We ran two experiments training CycleGAN on low and higher resolution images (256×256 px and 512×408 px). Three radiologists read the images and rated the likelihood of malignancy on a scale from 1-5 and the likelihood of the image being manipulated. The readout was evaluated by ROC analysis (Area under the ROC curve = AUC). Results At the lower resolution, only one radiologist exhibited markedly lower detection of cancer (AUC=0.85 vs 0.63, p=0.06), while the other two were unaffected (0.67 vs. 0.69 and 0.75 vs. 0.77, p=0.55). Only one radiologist could discriminate between original and modified images slightly better than guessing/chance (0.66, p=0.008). At the higher resolution, all radiologists showed significantly lower detection rate of cancer in the modified images (0.77-0.84 vs. 0.59-0.69, p=0.008), however, they were now able to reliably detect modified images due to better visibility of artifacts (0.92, 0.92 and 0.97). Conclusion A CycleGAN can implicitly learn malignant features and inject or remove them so that a substantial proportion of small mammographic images would consequently be misdiagnosed. At higher resolutions, however, the method is currently limited and has a clear trade-off between manipulation of images and introduction of artifacts.

**《SEIGAN: Towards Compositional Image Generation by Simultaneously Learning to Segment, Enhance, and Inpaint》**

arXiv：https://arxiv.org/abs/1811.07630

> We present a novel approach to image manipulation and understanding by simultaneously learning to segment object masks, paste objects to another background image, and remove them from original images. For this purpose, we develop a novel generative model for compositional image generation, SEIGAN (Segment-Enhance-Inpaint Generative Adversarial Network), which learns these three operations together in an adversarial architecture with additional cycle consistency losses. To train, SEIGAN needs only bounding box supervision and does not require pairing or ground truth masks. SEIGAN produces better generated images (evaluated by human assessors) than other approaches and produces high-quality segmentation masks, improving over other adversarially trained approaches and getting closer to the results of fully supervised training.

**《GAN-QP: A Novel GAN Framework without Gradient Vanishing and Lipschitz Constraint》**

arXiv：https://arxiv.org/abs/1811.07296

> We know SGAN may have a risk of gradient vanishing. A significant improvement is WGAN, with the help of a 1-Lipschitz constraint on the discriminator to prevent gradient vanishing. Is there any GAN having no gradient vanishing and no 1-Lipschitz constraint on the discriminator? We do find one, called GAN-QP.
Constructing a new Generative Adversarial Network (GAN) framework usually involves three steps: 1. choose a probability divergence; 2. convert it into a dual form; 3. play a min-max game. In this article, we demonstrate that the first step is not necessary. We can analyse the properties of a divergence and even construct a new divergence in the dual space directly. As a reward, we obtain a simpler alternative to WGAN: GAN-QP. We demonstrate that GAN-QP has better performance than WGAN in theory and practice.

# 3D

**《Modeling Local Geometric Structure of 3D Point Clouds using Geo-CNN》**

arXiv：https://arxiv.org/abs/1811.07782

> Recent advances in deep convolutional neural networks (CNNs) have motivated researchers to adapt CNNs to directly model points in 3D point clouds. Modeling local structure has been proven to be important for the success of convolutional architectures, and researchers exploited the modeling of local point sets in the feature extraction hierarchy. However, limited attention has been paid to explicitly model the geometric structure amongst points in a local region. To address this problem, we propose Geo-CNN, which applies a generic convolution-like operation dubbed as GeoConv to each point and its local neighborhood. Local geometric relationships among points are captured when extracting edge features between the center and its neighboring points. We first decompose the edge feature extraction process onto three orthogonal bases, and then aggregate the extracted features based on the angles between the edge vector and the bases. This encourages the network to preserve the geometric structure in Euclidean space throughout the feature extraction hierarchy. GeoConv is a generic and efficient operation that can be easily integrated into 3D point cloud analysis pipelines for multiple applications. We evaluate Geo-CNN on ModelNet40 and KITTI and achieve state-of-the-art performance.
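
The decomposition described above (project the edge onto three orthogonal bases, then aggregate by the angles to those bases) can be illustrated with a small sketch. This is not the GeoConv operator itself: it assumes the bases are the fixed x, y, z axes, uses squared cosines as the aggregation coefficients, and takes hypothetical per-axis weight matrices `axis_weights` as input.

```python
import numpy as np

def angle_weights(edge):
    """Squared cosine of the angle between `edge` and each of the x, y, z axes.

    For a non-zero vector these three values sum to 1.
    """
    norm_sq = float(np.dot(edge, edge))
    if norm_sq == 0.0:
        # Degenerate edge (neighbor coincides with center): fall back to
        # uniform coefficients. This choice is an assumption of the sketch.
        return np.full(3, 1.0 / 3.0)
    return edge ** 2 / norm_sq

def edge_feature(center, neighbor, neighbor_feat, axis_weights):
    """Aggregate per-axis transformed features using angle-based coefficients.

    axis_weights: three hypothetical weight matrices, one per axis.
    """
    coeffs = angle_weights(neighbor - center)
    per_axis = [w @ neighbor_feat for w in axis_weights]
    return sum(c * f for c, f in zip(coeffs, per_axis))
```

Because the coefficients depend only on the direction of the edge vector, a neighbor lying along one axis is processed entirely by that axis's weight matrix, which is one way to read the abstract's claim about preserving geometric structure.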

**《PointConv: Deep Convolutional Networks on 3D Point Clouds》**

arXiv：https://arxiv.org/abs/1811.07246

> Unlike images which are represented in regular dense grids, 3D point clouds are irregular and unordered, hence applying convolution on them can be difficult. In this paper, we extend the dynamic filter to a new convolution operation, named PointConv. PointConv can be applied on point clouds to build deep convolutional networks. We treat convolution kernels as nonlinear functions of the local coordinates of 3D points comprised of weight and density functions. With respect to a given point, the weight functions are learned with multi-layer perceptron networks and the density functions through kernel density estimation. A novel reformulation is proposed for efficiently computing the weight functions, which allowed us to dramatically scale up the network and significantly improve its performance. The learned convolution kernel can be used to compute translation-invariant and permutation-invariant convolution on any point set in the 3D space. Besides, PointConv can also be used as deconvolution operators to propagate features from a subsampled point cloud back to its original resolution. Experiments on ModelNet40, ShapeNet, and ScanNet show that deep convolutional neural networks built on PointConv are able to achieve state-of-the-art on challenging semantic segmentation benchmarks on 3D point clouds. Besides, our experiments converting CIFAR-10 into a point cloud showed that networks built on PointConv can match the performance of convolutional networks in 2D images of a similar structure.
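
The abstract states that density functions are obtained through kernel density estimation. The sketch below shows one simple way to estimate a per-point density for a small point cloud. The Gaussian kernel, the `bandwidth` value, and the omission of the kernel's normalization constant are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def gaussian_kde_density(points, bandwidth=0.5):
    """Unnormalized Gaussian kernel density estimate at every point.

    points: array of shape (N, 3).
    Returns an array of shape (N,); larger values mean denser neighborhoods.
    The Gaussian normalization constant is omitted, so only relative
    values are meaningful.
    """
    # Pairwise squared distances, shape (N, N).
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)

    kernel = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    # Average kernel response over all points (including the point itself).
    return kernel.mean(axis=1)
```

This version builds the full N x N distance matrix, so it is only suitable for small N; a practical pipeline would restrict the estimate to local neighborhoods.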

**《Topology-Aware Non-Rigid Point Cloud Registration》**

arXiv：https://arxiv.org/abs/1811.07014

> In this paper, we introduce a non-rigid registration pipeline for pairs of unorganized point clouds that may be topologically different. Standard warp field estimation algorithms, even under robust, discontinuity-preserving regularization, tend to produce erratic motion estimates on boundaries associated with "close-to-open" topology changes. We overcome this limitation by exploiting backward motion: in the opposite motion direction, a "close-to-open" event becomes "open-to-close", which is by default handled correctly. At the core of our approach lies a general, topology-agnostic warp field estimation algorithm, similar to those employed in recently introduced dynamic reconstruction systems from RGB-D input. We improve motion estimation on boundaries associated with topology changes in an efficient post-processing phase. Based on both forward and (inverted) backward warp hypotheses, we explicitly detect regions of the deformed geometry that undergo topological changes by means of local deformation criteria and broadly classify them as "contacts" or "separations". Subsequently, the two motion hypotheses are seamlessly blended on a local basis, according to the type and proximity of detected events. Our method achieves state-of-the-art motion estimation accuracy on the MPI Sintel dataset. Experiments on a custom dataset with topological event annotations demonstrate the effectiveness of our pipeline in estimating motion on event boundaries, as well as promising performance in explicit topological event detection.


# Re-ID

**《Past, Present, and Future Approaches Using Computer Vision for Animal Re-Identification from Camera Trap Data》**

arXiv：https://arxiv.org/abs/1811.07749

> The ability of a researcher to re-identify (re-ID) an individual animal upon re-encounter is fundamental for addressing a broad range of questions in the study of ecosystem function, community and population dynamics, and behavioural ecology. In this review, we describe a brief history of camera traps for re-ID, present a collection of computer vision feature engineering methodologies previously used for animal re-ID, provide an introduction to the underlying mechanisms of deep learning relevant to animal re-ID, highlight the success of deep learning methods for human re-ID, describe the few ecological studies currently utilizing deep learning for camera trap analyses, and offer our predictions for near-future methodologies based on the rapid development of deep learning methods. By utilizing novel deep learning methods for object detection and similarity comparisons, ecologists can extract animals from image/video data and train deep learning classifiers to re-ID animal individuals beyond the capabilities of a human observer. This methodology will allow ecologists with camera/video trap data to re-identify individuals that exit and re-enter the camera frame. Our expectation is that this is just the beginning of a major trend that could stand to revolutionize the analysis of camera trap data and, ultimately, our approach to animal ecology.

**《CA3Net: Contextual-Attentional Attribute-Appearance Network for Person Re-Identification》**

arXiv：https://arxiv.org/abs/1811.07544

> Person re-identification aims to identify the same pedestrian across non-overlapping camera views. Deep learning techniques have been applied for person re-identification recently, towards learning representation of pedestrian appearance. This paper presents a novel Contextual-Attentional Attribute-Appearance Network (CA3Net) for person re-identification. The CA3Net simultaneously exploits the complementarity between semantic attributes and visual appearance, the semantic context among attributes, visual attention on attributes as well as spatial dependencies among body parts, leading to discriminative and robust pedestrian representation. Specifically, an attribute network within CA3Net is designed with an Attention-LSTM module. It concentrates the network on latent image regions related to each attribute as well as exploits the semantic context among attributes by a LSTM module. An appearance network is developed to learn appearance features from the full body, horizontal and vertical body parts of pedestrians with spatial dependencies among body parts. The CA3Net jointly learns the attribute and appearance features in a multi-task learning manner, generating comprehensive representation of pedestrians. Extensive experiments on two challenging benchmarks, i.e., Market-1501 and DukeMTMC-reID datasets, have demonstrated the effectiveness of the proposed approach.

**《Re-Identification with Consistent Attentive Siamese Networks》**

arXiv：https://arxiv.org/abs/1811.07487

> We propose a new deep architecture for person re-identification (re-id). While re-id has seen much recent progress, spatial localization and view-invariant representation learning for robust cross-view matching remain key, unsolved problems. We address these questions by means of a new attention-driven Siamese learning architecture, called the Consistent Attentive Siamese Network. Our key innovations compared to existing, competing methods include (a) a flexible framework design that produces attention with only identity labels as supervision, (b) explicit mechanisms to enforce attention consistency among images of the same person, and (c) a new Siamese framework that integrates attention and attention consistency, producing principled supervisory signals as well as the first mechanism that can explain the reasoning behind the Siamese framework's predictions. We conduct extensive evaluations on the CUHK03-NP, DukeMTMC-ReID, and Market-1501 datasets, and establish a new state of the art, with our proposed method resulting in mAP performance improvements of 6.4%, 4.2%, and 1.2% respectively.


# SLAM

**《Collaborative Dense SLAM》**

arXiv：https://arxiv.org/abs/1811.07632

> In this paper, we present a new system for live collaborative dense surface reconstruction. Cooperative robotics, multi participant augmented reality and human-robot interaction are all examples of situations where collaborative mapping can be leveraged for greater agent autonomy. Our system builds on ElasticFusion to allow a number of cameras starting with unknown initial relative positions to maintain local maps utilising the original algorithm. Carrying out visual place recognition across these local maps the system can identify when two maps overlap in space, providing an inter-map constraint from which the system can derive the relative poses of the two maps. Using these resulting pose constraints, our system performs map merging, allowing multiple cameras to fuse their measurements into a single shared reconstruction. The advantage of this approach is that it avoids replication of structures subsequent to loop closures, where multiple cameras traverse the same regions of the environment. Furthermore, it allows cameras to directly exploit and update regions of the environment previously mapped by other cameras within the system. We provide both quantitative and qualitative analyses using the synthetic ICL-NUIM dataset and the real-world Freiburg dataset including the impact of multi-camera mapping on surface reconstruction accuracy, camera pose estimation accuracy and overall processing time. We also include qualitative results in the form of sample reconstructions of room sized environments with up to 3 cameras undergoing intersecting and loopy trajectories.
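
The abstract describes deriving the relative pose of two maps from an inter-map constraint. A minimal sketch of that geometric step, under stated assumptions: if place recognition yields the same camera pose expressed in both map A's frame and map B's frame, the transform taking map-B coordinates to map-A coordinates follows by composition. The 4x4 homogeneous representation and the helper names (`make_pose`, `invert_pose`, `relative_map_transform`) are hypothetical and not taken from the system described.

```python
import numpy as np

def make_pose(yaw, translation):
    """Build a 4x4 rigid transform from a yaw angle (radians) and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose

def invert_pose(pose):
    """Invert a rigid transform using R^T and -R^T t instead of a general inverse."""
    rot, trans = pose[:3, :3], pose[:3, 3]
    inv = np.eye(4)
    inv[:3, :3] = rot.T
    inv[:3, 3] = -rot.T @ trans
    return inv

def relative_map_transform(pose_in_a, pose_in_b):
    """Transform from map-B coordinates to map-A coordinates.

    Assumes pose_in_a and pose_in_b describe the same physical camera pose,
    expressed in map A's frame and map B's frame respectively.
    """
    return pose_in_a @ invert_pose(pose_in_b)
```

In practice a single match is noisy, so a real system would refine or validate this estimate before merging; the abstract does not specify how that is done.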

# Transfer Learning

**《An Efficient Transfer Learning Technique by Using Final Fully-Connected Layer Output Features of Deep Networks》**

arXiv：https://arxiv.org/abs/1811.07459

> In this paper, we propose a computationally efficient transfer learning approach using the output vector of final fully-connected layer of deep convolutional neural networks for classification. Our proposed technique uses a single layer perceptron classifier designed with hyper-parameters to focus on improving computational efficiency without adversely affecting the performance of classification compared to the baseline technique. Our investigations show that our technique converges much faster than baseline yielding very competitive classification results. We execute thorough experiments to understand the impact of similarity between pre-trained and new classes, similarity among new classes, number of training samples in the performance of classification using transfer learning of the final fully-connected layer's output features.
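
The idea of training a single-layer classifier on fixed feature vectors can be sketched as follows. This is an illustration only: it assumes the features have already been extracted from a pretrained network's final fully-connected layer, and it uses plain softmax regression with full-batch gradient descent. The function names, learning rate, and epoch count are assumptions, not the paper's settings.

```python
import numpy as np

def train_single_layer(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit a softmax classifier on precomputed feature vectors.

    features: (N, D) array of fixed features; labels: (N,) integer class ids.
    Returns the weight matrix (D, num_classes) and bias vector (num_classes,).
    """
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]

    for _ in range(epochs):
        logits = features @ weights + bias
        # Subtract the row-wise max for numerical stability before exponentiating.
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)

        # Gradient of the mean cross-entropy loss with respect to the logits.
        grad = (probs - onehot) / n
        weights -= lr * features.T @ grad
        bias -= lr * grad.sum(axis=0)
    return weights, bias

def predict(features, weights, bias):
    """Return the most likely class id for each feature vector."""
    return np.argmax(features @ weights + bias, axis=1)
```

Only the small weight matrix is trained here while the backbone stays frozen, which is the source of the computational savings this style of transfer learning aims for.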

**《Transfer Learning with Deep CNNs for Gender Recognition and Age Estimation》**

arXiv：https://arxiv.org/abs/1811.07344

> In this project, competition-winning deep neural networks with pretrained weights are used for image-based gender recognition and age estimation. Transfer learning is explored using both VGG19 and VGGFace pretrained models by testing the effects of changes in various design schemes and training parameters in order to improve prediction accuracy. Training techniques such as input standardization, data augmentation, and label distribution age encoding are compared. Finally, a hierarchy of deep CNNs is tested that first classifies subjects by gender, and then uses separate male and female age models to predict age. A gender recognition accuracy of 98.7% and an MAE of 4.1 years is achieved. This paper shows that, with proper training techniques, good results can be obtained by retasking existing convolutional filters towards a new purpose.

# Style Transfer

**《GLStyleNet: Higher Quality Style Transfer Combining Global and Local Pyramid Features》**

arXiv：https://arxiv.org/abs/1811.07260

> Recent studies using deep neural networks have shown remarkable success in style transfer especially for artistic and photo-realistic images. However, the approaches using global feature correlations fail to capture small, intricate textures and maintain correct texture scales of the artworks, and the approaches based on local patches are defective on global effect. In this paper, we present a novel feature pyramid fusion neural network, dubbed GLStyleNet, which sufficiently takes into consideration multi-scale and multi-level pyramid features by best aggregating layers across a VGG network, and performs style transfer hierarchically with multiple losses of different scales. Our proposed method retains high-frequency pixel information and low frequency construct information of images from two aspects: loss function constraint and feature fusion. Our approach is not only flexible to adjust the trade-off between content and style, but also controllable between global and local. Compared to state-of-the-art methods, our method can transfer not just large-scale, obvious style cues but also subtle, exquisite ones, and dramatically improves the quality of style transfer. We demonstrate the effectiveness of our approach on portrait style transfer, artistic style transfer, photo-realistic style transfer and Chinese ancient painting style transfer tasks. Experimental results indicate that our unified approach improves image style transfer quality over previous state-of-the-art methods, while also accelerating the whole process to a certain extent. Our code is available at https://github.com/EndyWon/GLStyleNet.


# Image Caption

**《Intention Oriented Image Captions with Guiding Objects》**

arXiv：https://arxiv.org/abs/1811.07662

> Although existing image caption models can produce promising results using recurrent neural networks (RNNs), it is difficult to guarantee that an object we care about is contained in generated descriptions, for example in the case that the object is inconspicuous in image. Problems become even harder when these objects did not appear in training stage. In this paper, we propose a novel approach for generating image captions with guiding objects (CGO). The CGO constrains the model to involve a human-concerned object, when the object is in the image, in the generated description while maintaining fluency. Instead of generating the sequence from left to right, we start description with a selected object and generate other parts of the sequence based on this object. To achieve this, we design a novel framework combining two LSTMs in opposite directions. We demonstrate the characteristics of our method on MSCOCO to generate descriptions for each detected object in images. With CGO, we can extend the ability of description to the objects being neglected in image caption labels and provide a set of more comprehensive and diverse descriptions for an image. CGO shows obvious advantages when applied to the task of describing novel objects. We show experiment results on both MSCOCO and ImageNet datasets. Evaluations show that our method outperforms the state-of-the-art models in the task with average F1 75.8, leading to better descriptions in terms of both content accuracy and fluency.

# Few-Shot Learning

**《Deep Comparison: Relation Columns for Few-Shot Learning》**

arXiv：https://arxiv.org/abs/1811.07100

> Few-shot deep learning is a topical challenge area for scaling visual recognition to open-ended growth in the space of categories to recognise. A promising line of work towards realising this vision is deep networks that learn to match queries with stored training images. However, methods in this paradigm usually train a deep embedding followed by a single linear classifier. Our insight is that effective general-purpose matching requires discrimination with regards to features at multiple abstraction levels. We therefore propose a new framework termed Deep Comparison Network (DCN) that decomposes embedding learning into a sequence of modules, and pairs each with a relation module. The relation modules compute a non-linear metric to score the match using the corresponding embedding module's representation. To ensure that all embedding modules' features are used, the relation modules are deeply supervised. Finally, generalisation is further improved by a learned noise regulariser. The resulting network achieves state of the art performance on both miniImageNet and tieredImageNet, while retaining the appealing simplicity and efficiency of deep metric learning approaches.


# Datasets

**《iQIYI-VID: A Large Dataset for Multi-modal Person Identification》**

arXiv：https://arxiv.org/abs/1811.07548

> Person identification in the wild is very challenging due to great variation in poses, face quality, clothes, makeup and so on. Traditional research, such as face recognition, person re-identification, and speaker recognition, often focuses on a single modality of information, which is inadequate to handle all the situations in practice. Multi-modal person identification is a more promising way in which we can jointly utilize face, head, body, audio features, and so on. In this paper, we introduce iQIYI-VID, the largest video dataset for multi-modal person identification. It is composed of 600K video clips of 5,000 celebrities. These video clips are extracted from 400K hours of online videos of various types, ranging from movies, variety shows, TV series, to news broadcasting. All video clips pass through a careful human annotation process, and the error rate of labels is lower than 0.2%. We evaluated the state-of-the-art models of face recognition, person re-identification, and speaker recognition on the iQIYI-VID dataset. Experimental results show that these models are still far from being perfect for the task of person identification in the wild. We further demonstrate that a simple fusion of multi-modal features can improve person identification considerably. We have released the dataset online to promote multi-modal person identification research.


# Other

**《Addressing the Invisible: Street Address Generation for Developing Countries with Deep Learning》**

NIPS 2018 Workshop

arXiv：https://arxiv.org/abs/1811.07769

> More than half of the world's roads lack adequate street addressing systems. Lack of addresses is even more visible in daily lives of people in developing countries. We would like to object to the assumption that having an address is a luxury, by proposing a generative address design that maps the world in accordance with streets. The addressing scheme is designed considering several traditional street addressing methodologies employed in the urban development scenarios around the world. Our algorithm applies deep learning to extract roads from satellite images, converts the road pixel confidences into a road network, partitions the road network to find neighborhoods, and labels the regions, roads, and address units using graph- and proximity-based algorithms. We present our results on a sample US city, and several developing cities, compare travel times of users using current ad hoc and new complete addresses, and contrast our addressing solution to current industrial and open geocoding alternatives.

**《Handwriting Recognition of Historical Documents with few labeled data》**

arXiv：https://arxiv.org/abs/1811.07768

> Historical documents present many challenges for offline handwriting recognition systems, among them, the segmentation and labeling steps. Carefully annotated textlines are needed to train an HTR system. In some scenarios, transcripts are only available at the paragraph level with no text-line information. In this work, we demonstrate how to train an HTR system with few labeled data. Specifically, we train a deep convolutional recurrent neural network (CRNN) system on only 10% of manually labeled text-line data from a dataset and propose an incremental training procedure that covers the rest of the data. Performance is further increased by augmenting the training set with specially crafted multiscale data. We also propose a model-based normalization scheme which considers the variability in the writing scale at the recognition phase. We apply this approach to the publicly available READ dataset. Our system achieved the second best result during the ICDAR2017 competition.

**《GroundNet: Segmentation-Aware Monocular Ground Plane Estimation with Geometric Consistency》**

arXiv：https://arxiv.org/abs/1811.07222

> We focus on the problem of estimating the orientation of the ground plane with respect to a mobile monocular camera platform (e.g., ground robot, wearable camera, assistive robotic platform). To address this problem, we formulate the ground plane estimation problem as an inter-mingled multi-task prediction problem by jointly optimizing for point-wise surface normal direction, 2D ground segmentation, and depth estimates. Our proposed model -- GroundNet -- estimates the ground normal in two streams separately and then a consistency loss is applied on top of the two streams to enforce geometric consistency. A semantic segmentation stream is used to isolate the ground regions and are used to selectively back-propagate parameter updates only through the ground regions in the image. Our experiments on KITTI and ApolloScape datasets verify that the GroundNet is able to predict consistent depth and normal within the ground region. It also achieves top performance on ground plane normal estimation and horizon line detection.

**《Image-to-GPS Verification Through A Bottom-Up Pattern Matching Network》**

arXiv：https://arxiv.org/abs/1811.07288

> The image-to-GPS verification problem asks whether a given image is taken at a claimed GPS location. In this paper, we treat it as an image verification problem -- whether a query image is taken at the same place as a reference image retrieved at the claimed GPS location. We make three major contributions: 1) we propose a novel custom bottom-up pattern matching (BUPM) deep neural network solution; 2) we demonstrate that the verification can be directly done by cross-checking a perspective-looking query image and a panorama reference image, and 3) we collect and clean a dataset of 30K pairs query and reference. Our experimental results show that the proposed BUPM solution outperforms the state-of-the-art solutions in terms of both verification and localization.

**《Matching RGB Images to CAD Models for Object Pose Estimation》**

arXiv：https://arxiv.org/abs/1811.07249

> We propose a novel method for 3D object pose estimation in RGB images, which does not require pose annotations of objects in images in the training stage. We tackle the pose estimation problem by learning how to establish correspondences between RGB images and rendered depth images of CAD models. During training, our approach only requires textureless CAD models and aligned RGB-D frames of a subset of object instances, without explicitly requiring pose annotations for the RGB images. We employ a deep quadruplet convolutional neural network for joint learning of suitable keypoints and their associated descriptors in pairs of rendered depth images which can be matched across modalities with aligned RGB-D views. During testing, keypoints are extracted from a query RGB image and matched to keypoints extracted from rendered depth images, followed by establishing 2D-3D correspondences. The object's pose is then estimated using the RANSAC and PnP algorithms. We conduct experiments on the recently introduced Pix3D dataset and demonstrate the efficacy of our proposed approach in object pose estimation as well as generalization to object instances not seen during training.

**《Optical Flow Dataset and Benchmark for Visual Crowd Analysis》**

arXiv：https://arxiv.org/abs/1811.07170

> The performance of optical flow algorithms greatly depends on the specifics of the content and the application for which it is used. Existing and well established optical flow datasets are limited to rather particular contents from which none is close to crowd behavior analysis; whereas such applications heavily utilize optical flow. We introduce a new optical flow dataset exploiting the possibilities of a recent video engine to generate sequences with ground-truth optical flow for large crowds in different scenarios. We break with the development of the last decade of introducing ever increasing displacements to pose new difficulties. Instead we focus on real-world surveillance scenarios where numerous small, partly independent, non rigidly moving objects observed over a long temporal range pose a challenge. By evaluating different optical flow algorithms, we find that results of established datasets can not be transferred to these new challenges. In exhaustive experiments we are able to provide new insight into optical flow for crowd analysis. Finally, the results have been validated on the real-world UCF crowd tracking benchmark while achieving competitive results compared to more sophisticated state-of-the-art crowd tracking approaches.

**《Simulating LIDAR Point Cloud for Autonomous Driving using Real-world Scenes and Traffic Flows》**

arXiv：https://arxiv.org/abs/1811.07112

> We present a LIDAR simulation framework that can automatically generate 3D point cloud based on LIDAR type and placement. The point cloud, annotated with ground truth semantic labels, is to be used as training data to improve environmental perception capabilities for autonomous driving vehicles. Different from previous simulators, we generate the point cloud based on real environment and real traffic flow. More specifically we employ a mobile LIDAR scanner with cameras to capture real world scenes. The input to our simulation framework includes dense 3D point cloud and registered color images. Moving objects (such as cars, pedestrians, bicyclists) are automatically identified and recorded. These objects are then removed from the input point cloud to restore a static background (e.g., environment without movable objects). With that we can insert synthetic models of various obstacles, such as vehicles and pedestrians in the static background to create various traffic scenes. A novel LIDAR renderer takes the composite scene to generate new realistic LIDAR points that are already annotated at point level for synthetic objects. Experimental results show that our system is able to close the performance gap between simulation and real data to be 1 ~ 6% in different applications, and for model fine tuning, only 10% ~ 20% extra real data could help to outperform the original model trained with full real dataset.

**《DSCnet: Replicating Lidar Point Clouds with Deep Sensor Cloning》**

arXiv：https://arxiv.org/abs/1811.07070

> Convolutional neural networks (CNNs) have become increasingly popular for solving a variety of computer vision tasks, ranging from image classification to image segmentation. Recently, autonomous vehicles have created a demand for depth information, which is often obtained using hardware sensors such as Light detection and ranging (LIDAR). Although it can provide precise distance measurements, most LIDARs are still far too expensive to sell in mass-produced consumer vehicles, which has motivated methods to generate depth information from commodity automotive sensors like cameras. 
In this paper, we propose an approach called Deep Sensor Cloning (DSC). The idea is to use Convolutional Neural Networks in conjunction with inexpensive sensors to replicate the 3D point-clouds that are created by expensive LIDARs. To accomplish this, we develop a new dataset (DSDepth) and a new family of CNN architectures (DSCnets). While previous tasks such as KITTI depth prediction use an interpolated RGB-D images as ground-truth for training, we instead use DSCnets to directly predict LIDAR point-clouds. When we compare the output of our models to a $75,000 LIDAR, we find that our most accurate DSCnet achieves a relative error of 5.77% using a single camera and 4.69% using stereo cameras.

================================================
FILE: 2018/12/10.md
================================================
**[Computer Vision Paper Digest] 2018-12-10**

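This post shares 12 papers in total, covering image classification, object detection, image segmentation, GANs, 3D reconstruction, and other topics.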

[TOC]

# Image Classification

**《Variational Saccading: Efficient Inference for Large Resolution Images》**

NIPS 2018 Bayesian Deep Learning Workshop

arXiv：https://arxiv.org/abs/1812.03170

> Image classification with deep neural networks is typically restricted to images of small dimensionality such as 224x244 in Resnet models. This limitation excludes the 4000x3000 dimensional images that are taken by modern smartphone cameras and smart devices. In this work, we aim to mitigate the prohibitive inferential and memory costs of operating in such large dimensional spaces. To sample from the high-resolution original input distribution, we propose using a smaller proxy distribution to learn the co-ordinates that correspond to regions of interest in the high-dimensional space. We introduce a new principled variational lower bound that captures the relationship of the proxy distribution's posterior and the original image's co-ordinate space in a way that maximizes the conditional classification likelihood. We empirically demonstrate on one synthetic benchmark and one real world large resolution DSLR camera image dataset that our method produces comparable results with 10x faster inference and lower memory consumption than a model that utilizes the entire original input distribution.

**《LNEMLC: Label Network Embeddings for Multi-Label Classification》**

arXiv：https://arxiv.org/abs/1812.02956

> Multi-label classification aims to classify instances with discrete non-exclusive labels. Most approaches on multi-label classification focus on effective adaptation or transformation of existing binary and multi-class learning approaches but fail in modelling the joint probability of labels or do not preserve generalization abilities for unseen label combinations. To address these issues we propose a new multi-label classification scheme, LNEMLC - Label Network Embedding for Multi-Label Classification, that embeds the label network and uses it to extend input space in learning and inference of any base multi-label classifier. The approach allows capturing of labels' joint probability at low computational complexity providing results comparable to the best methods reported in the literature. We demonstrate how the method reveals statistically significant improvements over the simple kNN baseline classifier. We also provide hints for selecting the robust configuration that works satisfactorily across data domains.


# Object Detection

**《ROI-10D: Monocular Lifting of 2D Detection to 6D Pose and Metric Shape》**

arXiv：https://arxiv.org/abs/1812.02781

> We present a deep learning method for end-to-end monocular 3D object detection and metric shape retrieval. We propose a novel loss formulation by lifting 2D detection, orientation, and scale estimation into 3D space. Instead of optimizing these quantities separately, the 3D instantiation allows to properly measure the metric misalignment of boxes. We experimentally show that our 10D lifting of sparse 2D Regions of Interests (RoIs) achieves great results both for 6D pose and recovery of the textured metric geometry of instances. This further enables 3D synthetic data augmentation via inpainting recovered meshes directly onto the 2D scenes. We evaluate on KITTI3D against other strong monocular methods and demonstrate that our approach doubles the AP on the 3D pose metrics on the official test set, defining the new state of the art.


# Image Segmentation

**《Scale-aware multi-level guidance for interactive instance segmentation》**

arXiv：https://arxiv.org/abs/1812.02967

> In interactive instance segmentation, users give feedback to iteratively refine segmentation masks. The user-provided clicks are transformed into guidance maps which provide the network with necessary cues on the whereabouts of the object of interest. Guidance maps used in current systems are purely distance-based and are either too localized or non-informative. We propose a novel transformation of user clicks to generate scale-aware guidance maps that leverage the hierarchical structural information present in an image. Using our guidance maps, even the most basic FCNs are able to outperform existing approaches that require state-of-the-art segmentation networks pre-trained on large scale segmentation datasets. We demonstrate the effectiveness of our proposed transformation strategy through comprehensive experimentation in which we significantly raise state-of-the-art on four standard interactive segmentation benchmarks.

**《A High-Order Scheme for Image Segmentation via a modified Level-Set method》**

arXiv：https://arxiv.org/abs/1812.03026

> The method is based on an adaptive "filtered" scheme recently introduced by the authors. The main feature of the scheme is the possibility to stabilize an a priori unstable high-order scheme via a filter function which allows to combine a high-order scheme in the regularity regions and a monotone scheme elsewhere, in presence of singularities. The filtered scheme considered in this paper uses the local Lax-Friedrichs scheme as monotone scheme and the Lax-Wendroff scheme as high-order scheme but other couplings are possible. Moreover, we introduce also a modified velocity function for the level-set model used in segmentation, this velocity allows to obtain more accurate results with respect to other velocities proposed in the literature. Some numerical tests on synthetic and real images confirm the accuracy of the proposed method and the advantages given by the new velocity.

# GAN

**《Color Constancy by GANs: An Experimental Survey》**

arXiv：https://arxiv.org/abs/1812.03085

> In this paper, we formulate the color constancy task as an image-to-image translation problem using GANs. By conducting a large set of experiments on different datasets, an experimental survey is provided on the use of different types of GANs to solve for color constancy i.e. CC-GANs (Color Constancy GANs). Based on the experimental review, recommendations are given for the design of CC-GAN architectures based on different criteria, circumstances and datasets.

**《StoryGAN: A Sequential Conditional GAN for Story Visualization》**

arXiv：https://arxiv.org/abs/1812.02784

> In this work we propose a new task called Story Visualization. Given a multi-sentence paragraph, the story is visualized by generating a sequence of images, one for each sentence. In contrast to video generation, story visualization focuses less on the continuity in generated images (frames), but more on the global consistency across dynamic scenes and characters -- a challenge that has not been addressed by any single-image or video generation methods. Therefore, we propose a new story-to-image-sequence generation model, StoryGAN, based on the sequential conditional GAN framework. Our model is unique in that it consists of a deep Context Encoder that dynamically tracks the story flow, and two discriminators at the story and image levels, respectively, to enhance the image quality and the consistency of the generated sequences. To evaluate the model, we modified existing datasets to create the CLEVR-SV and Pororo-SV datasets. Empirically, StoryGAN outperformed state-of-the-art models in image quality, contextual consistency metrics, and human evaluation.


# 3D Reconstruction

**《Real-time Indoor Scene Reconstruction with RGBD and Inertia Input》**

arXiv：https://arxiv.org/abs/1812.03015

> Camera motion estimation is a key technique for 3D scene reconstruction and Simultaneous localization and mapping (SLAM). To make it be feasibly achieved, previous works usually assume slow camera motions, which limits its usage in many real cases. We propose an end-to-end 3D reconstruction system which combines color, depth and inertial measurements to achieve robust reconstruction with fast sensor motions. Our framework extends Kalman filter to fuse the three kinds of information and involve an iterative method to jointly optimize feature correspondences, camera poses and scene geometry. We also propose a novel geometry-aware patch deformation technique to adapt the feature appearance in image domain, leading to a more accurate feature matching under fast camera motions. Experiments show that our patch deformation method improves the accuracy of feature tracking, and our 3D reconstruction outperforms the state-of-the-art solutions under fast camera motions.
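
The abstract says the framework extends a Kalman filter to fuse color, depth, and inertial information, but it does not give the filter equations. As a rough illustration of the underlying idea only, here is a minimal scalar Kalman measurement update that fuses two noisy readings of one quantity. The function name, the 1D state, and all numbers are assumptions made for this sketch, not the paper's formulation.

```python
def kalman_update(mean, var, measurement, meas_var):
    """One scalar Kalman measurement update; returns the posterior (mean, var)."""
    gain = var / (var + meas_var)  # Kalman gain: how much to trust the measurement
    new_mean = mean + gain * (measurement - mean)
    new_var = (1.0 - gain) * var
    return new_mean, new_var


# Hypothetical 1D position: a vague prior refined by two sensors of different quality.
mean, var = 0.0, 100.0
for z, r in [(1.2, 0.5), (0.9, 0.1)]:
    mean, var = kalman_update(mean, var, z, r)
print(mean, var)
```

The fused variance ends up smaller than either sensor's variance, which is the reason for combining modalities in the first place.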

**《SeFM: A Sequential Feature Point Matching Algorithm for Object 3D Reconstruction》**

arXiv：https://arxiv.org/abs/1812.02925

> 3D reconstruction is a fundamental issue in many applications and the feature point matching problem is a key step while reconstructing target objects. Conventional algorithms can only find a small number of feature points from two images which is quite insufficient for reconstruction. To overcome this problem, we propose SeFM a sequential feature point matching algorithm. We first utilize the epipolar geometry to find the epipole of each image. Rotating along the epipole, we generate a set of the epipolar lines and reserve those intersecting with the input image. Next, a rough matching phase, followed by a dense matching phase, is applied to find the matching dot-pairs using dynamic programming. Furthermore, we also remove wrong matching dot-pairs by calculating the validity. Experimental results illustrate that SeFM can achieve around 1,000 to 10,000 times matching dot-pairs, depending on individual image, compared to conventional algorithms and the object reconstruction with only two images is semantically visible. Moreover, it outperforms conventional algorithms, such as SIFT and SURF, regarding precision and recall.



# Re-ID

**《Optimizing Speed/Accuracy Trade-Off for Person Re-identification via Knowledge Distillation》**

arXiv：https://arxiv.org/abs/1812.02937

> Finding a person across a camera network plays an important role in video surveillance. For a real-world person re-identification application, in order to guarantee an optimal time response, it is crucial to find the balance between accuracy and speed. We analyse this trade-off, comparing a classical method, that comprises hand-crafted feature description and metric learning, in particular, LOMO and XQDA, with state-of-the-art deep learning techniques, using image classification networks, ResNet and MobileNets. Additionally, we propose and analyse network distillation as a learning strategy to reduce the computational cost of the deep learning approach at test time. We evaluate both methods on the Market-1501 and DukeMTMC-reID large-scale datasets.
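
The abstract names network distillation as the strategy for cutting test-time cost but does not state the loss. The sketch below shows one common form of knowledge distillation (temperature-softened teacher and student distributions compared with a KL divergence), which may differ from what the authors used. The function names, the temperature value, and the example logits are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [6.0, 2.0, -1.0]  # hypothetical logits from a large teacher network
print(distillation_loss(teacher, teacher))          # student matches the teacher
print(distillation_loss([1.0, 3.0, 0.0], teacher))  # student disagrees
```

In practice this term is usually combined with the ordinary classification loss on the ground-truth identity labels.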


# Other

**《Graph Cut Segmentation Methods Revisited with a Quantum Algorithm》**

arXiv：https://arxiv.org/abs/1812.03050

> The design and performance of computer vision algorithms are greatly influenced by the hardware on which they are implemented. CPUs, multi-core CPUs, FPGAs and GPUs have inspired new algorithms and enabled existing ideas to be realized. This is notably the case with GPUs, which has significantly changed the landscape of computer vision research through deep learning. As the end of Moores law approaches, researchers and hardware manufacturers are exploring alternative hardware computing paradigms. Quantum computers are a very promising alternative and offer polynomial or even exponential speed-ups over conventional computing for some problems. This paper presents a novel approach to image segmentation that uses new quantum computing hardware. Segmentation is formulated as a graph cut problem that can be mapped to the quantum approximation optimization algorithm (QAOA). This algorithm can be implemented on current and near-term quantum computers. Encouraging results are presented on artificial and medical imaging data. This represents an important, practical step towards leveraging quantum computers for computer vision.
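
To make the "segmentation as a graph cut problem" formulation concrete, the sketch below writes out a typical binary segmentation energy (per-pixel label costs plus a penalty for every edge whose endpoints get different labels) and minimizes it by exhaustive search on a tiny graph. This is a classical stand-in, not QAOA and not the paper's implementation; the cost values and function names are assumptions.

```python
from itertools import product


def cut_energy(labels, unary, pairwise):
    """Energy of a labeling: label costs plus weights of edges whose ends disagree."""
    data = sum(unary[i][lab] for i, lab in enumerate(labels))
    smooth = sum(w for (i, j), w in pairwise.items() if labels[i] != labels[j])
    return data + smooth


def brute_force_segmentation(unary, pairwise):
    """Try every binary labeling; feasible only for very small graphs."""
    n = len(unary)
    return min(product((0, 1), repeat=n),
               key=lambda labels: cut_energy(labels, unary, pairwise))


# Hypothetical 4-pixel 1D "image": unary[i] = (cost of background, cost of foreground).
unary = [(0.1, 0.9), (0.4, 0.6), (0.8, 0.2), (0.9, 0.1)]
pairwise = {(0, 1): 0.3, (1, 2): 0.3, (2, 3): 0.3}
best = brute_force_segmentation(unary, pairwise)
print(best, cut_energy(best, unary, pairwise))
```

The search space doubles with every pixel, which is why an optimizer for this kind of objective (classical min-cut, or QAOA as the abstract proposes) is needed beyond toy sizes.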

**《Neural Image Decompression: Learning to Render Better Image Previews》**

arXiv：https://arxiv.org/abs/1812.02831

> A rapidly increasing portion of Internet traffic is dominated by requests from mobile devices with limited- and metered-bandwidth constraints. To satisfy these requests, it has become standard practice for websites to transmit small and extremely compressed image previews as part of the initial page-load process. Recent work, based on an adaptive triangulation of the target image, has shown the ability to generate thumbnails of full images at extreme compression rates: 200 bytes or less with impressive gains (in terms of PSNR and SSIM) over both JPEG and WebP standards. However, qualitative assessments and preservation of semantic content can be less favorable. We present a novel method to significantly improve the reconstruction quality of the original image with no changes to the encoded information. Our neural-based decoding not only achieves higher PSNR and SSIM scores than the original methods, but also yields a substantial increase in semantic-level content preservation. In addition, by keeping the same encoding stream, our solution is completely inter-operable with the original decoder. The end result is suitable for a range of small-device deployments, as it involves only a single forward-pass through a small, scalable network.


================================================
FILE: 2018/12/17-21.md
================================================
[Computer Vision Paper Digest] 2018-12-17~12-21

- [ ] 2018-12-17
- [ ] 2018-12-18
- [ ] 2018-12-19
- [ ] 2018-12-20
- [x] 2018-12-21

This post shares 9 papers in total, covering CNNs, GANs, pose estimation, meta-learning, and other topics.

[TOC]

# CNN

**《DAC: Data-free Automatic Acceleration of Convolutional Networks》**

WACV 2019

arXiv：https://arxiv.org/abs/1812.08374

> Deploying a deep learning model on mobile/IoT devices is a challenging task. The difficulty lies in the trade-off between computation speed and accuracy. A complex deep learning model with high accuracy runs slowly on resource-limited devices, while a light-weight model that runs much faster loses accuracy. In this paper, we propose a novel decomposition method, namely DAC, that is capable of factorizing an ordinary convolutional layer into two layers with much fewer parameters. DAC computes the corresponding weights for the newly generated layers directly from the weights of the original convolutional layer. Thus, no training (or fine-tuning) or any data is needed. The experimental results show that DAC reduces a large number of floating-point operations (FLOPs) while maintaining high accuracy of a pre-trained model. If 2% accuracy drop is acceptable, DAC saves 53% FLOPs of VGG16 image classification model on ImageNet dataset, 29% FLOPS of SSD300 object detection model on PASCAL VOC2007 dataset, and 46% FLOPS of a multi-person pose estimation model on Microsoft COCO dataset. Compared to other existing decomposition methods, DAC achieves better performance.
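
The abstract states that a convolutional layer is factorized into two layers with far fewer parameters, without spelling out the shapes of those layers. The arithmetic below assumes one plausible scheme (a depthwise k×k layer with `rank` filters per input channel, followed by a 1×1 pointwise layer) purely to show how such a split can shrink the parameter count. The scheme, the function names, and the layer sizes are assumptions, not details confirmed by the abstract.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def factorized_params(k, c_in, c_out, rank):
    """Weights in the assumed two-layer replacement (biases ignored)."""
    depthwise = k * k * c_in * rank  # `rank` spatial filters per input channel
    pointwise = c_in * rank * c_out  # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


original = conv_params(3, 256, 256)
for rank in (1, 3, 5):
    reduced = factorized_params(3, 256, 256, rank)
    print(rank, reduced, round(reduced / original, 3))
```

Under this assumption the saving shrinks as `rank` grows, so the rank acts as the knob between speed and fidelity to the original layer.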

# GAN

**《RankGAN: A Maximum Margin Ranking GAN for Generating Faces》**

ACCV 2018 Best Student Paper Award

arXiv：https://arxiv.org/abs/1812.08196

> We present a new stage-wise learning paradigm for training generative adversarial networks (GANs). The goal of our work is to progressively strengthen the discriminator and thus, the generators, with each subsequent stage without changing the network architecture. We call this proposed method the RankGAN. We first propose a margin-based loss for the GAN discriminator. We then extend it to a margin-based ranking loss to train the multiple stages of RankGAN. We focus on face images from the CelebA dataset in our work and show visual as well as quantitative improvements in face generation and completion tasks over other GAN approaches, including WGAN and LSGAN.
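
The abstract mentions a margin-based ranking loss but does not give its formula. As an illustration of the general idea, the sketch below implements a generic hinge-style ranking loss that is zero once the preferred sample outscores the other by at least a margin. It is not claimed to be RankGAN's exact objective; the function names, margin, and scores are assumptions.

```python
def margin_ranking_loss(score_high, score_low, margin=1.0):
    """Hinge-style ranking loss: zero once score_high beats score_low by `margin`."""
    return max(0.0, margin - (score_high - score_low))


def batch_loss(high_scores, low_scores, margin=1.0):
    """Mean ranking loss over paired scores (e.g. real vs. generated samples)."""
    pairs = list(zip(high_scores, low_scores))
    return sum(margin_ranking_loss(h, l, margin) for h, l in pairs) / len(pairs)


# Hypothetical discriminator scores for real images and generated images.
real = [2.0, 0.5]
fake = [0.5, 0.25]
print(batch_loss(real, fake))
```

Only pairs that violate the margin contribute, so already well-separated samples stop pulling on the discriminator.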


# Pose Estimation

**《OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields》**

arXiv：https://arxiv.org/abs/1812.08008

arXiv(old)：https://arxiv.org/abs/1611.08050

datasets：https://cmu-perceptual-computing-lab.github.io/foot_keypoint_dataset/

> Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that a PAF-only refinement rather than both PAF and body part location refinement results in a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an internal annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.

# Image Caption

**《Nocaps: novel object captioning at scale》**

arXiv：https://arxiv.org/abs/1812.08658

homepage：https://nocaps.org/

> Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task. Dubbed 'nocaps', for novel object captioning at scale, our benchmark consists of 166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets. The associated training data consists of COCO image-caption pairs, plus Open Images image-level labels and object bounding boxes. Since Open Images contains many more classes than COCO, more than 500 object classes seen in test images have no training captions (hence, nocaps). We evaluate several existing approaches to novel object captioning on our challenging benchmark. In automatic evaluations these approaches show modest improvements over a strong baseline trained only on image-caption data. However, even when using ground-truth object detections, the results are significantly weaker than our human baseline - indicating substantial room for improvement.


# Medical Image Analysis

**《Unconstrained Iris Segmentation using Convolutional Neural Networks》**

arXiv：https://arxiv.org/abs/1812.08245

> The extraction of consistent and identifiable features from an image of the human iris is known as iris recognition. Identifying which pixels belong to the iris, known as segmentation, is the first stage of iris recognition. Errors in segmentation propagate to later stages. Current segmentation approaches are tuned to specific environments. We propose using a convolution neural network for iris segmentation. Our algorithm is accurate when trained in a single environment and tested in multiple environments. Our network builds on the Mask R-CNN framework (He et al., ICCV 2017). Our approach segments faster than previous approaches including the Mask R-CNN network. Our network is accurate when trained on a single environment and tested with a different sensors (either visible light or near-infrared). Its accuracy degrades when trained with a visible light sensor and tested with a near-infrared sensor (and vice versa). A small amount of retraining of the visible light model (using a few samples from a near-infrared dataset) yields a tuned network accurate in both settings. For training and testing, this work uses the Casia v4 Interval, Notre Dame 0405, Ubiris v2, and IITD datasets.

# Meta-Learning

**《Unsupervised Meta-learning of Figure-Ground Segmentation via Imitating Visual Effects》**

arXiv：https://arxiv.org/abs/1812.08442

> This paper presents a "learning to learn" approach to figure-ground image segmentation. By exploring webly-abundant images of specific visual effects, our method can effectively learn the visual-effect internal representations in an unsupervised manner and uses this knowledge to differentiate the figure from the ground in an image. Specifically, we formulate the meta-learning process as a compositional image editing task that learns to imitate a certain visual effect and derive the corresponding internal representation. Such a generative process can help instantiate the underlying figure-ground notion and enables the system to accomplish the intended image segmentation. Whereas existing generative methods are mostly tailored to image synthesis or style transfer, our approach offers a flexible learning mechanism to model a general concept of figure-ground segmentation from unorganized images that have no explicit pixel-level annotations. We validate our approach via extensive experiments on six datasets to demonstrate that the proposed model can be end-to-end trained without ground-truth pixel labeling yet outperforms the existing methods of unsupervised segmentation tasks.


# Other

**《Explanatory Graphs for CNNs》**

arXiv：https://arxiv.org/abs/1812.07997

> This paper introduces a graphical model, namely an explanatory graph, which reveals the knowledge hierarchy hidden inside conv-layers of a pre-trained CNN. Each filter in a conv-layer of a CNN for object classification usually represents a mixture of object parts. We develop a simple yet effective method to disentangle object-part pattern components from each filter. We construct an explanatory graph to organize the mined part patterns, where a node represents a part pattern, and each edge encodes co-activation relationships and spatial relationships between patterns. More crucially, given a pre-trained CNN, the explanatory graph is learned without a need of annotating object parts. Experiments show that each graph node consistently represented the same object part through different images, which boosted the transferability of CNN features. We transferred part patterns in the explanatory graph to the task of part localization, and our method significantly outperformed other approaches.

**《Deep Paper Gestalt》**

arXiv：https://arxiv.org/abs/1812.08775

github：https://github.com/vt-vl-lab/paper-gestalt

> Recent years have witnessed a significant increase in the number of paper submissions to computer vision conferences. The sheer volume of paper submissions and the insufficient number of competent reviewers cause a considerable burden for the current peer review system. In this paper, we learn a classifier to predict whether a paper should be accepted or rejected based solely on the visual appearance of the paper (i.e., the gestalt of a paper). Experimental results show that our classifier can safely reject 50% of the bad papers while wrongly reject only 0.4% of the good papers, and thus dramatically reduce the workload of the reviewers. We also provide tools for providing suggestions to authors so that they can improve the gestalt of their papers.

**《One-Class Feature Learning Using Intra-Class Splitting》**

arXiv：https://arxiv.org/abs/1812.08468

> This paper proposes a novel generic one-class feature learning method which is based on intra-class splitting. In one-class classification, feature learning is challenging, because only samples of one class are available during training. Hence, state-of-the-art methods require reference multi-class datasets to pretrain feature extractors. In contrast, the proposed method realizes feature learning by splitting the given normal class into typical and atypical normal samples. By introducing closeness loss and dispersion loss, an intra-class joint training procedure between the two subsets after splitting enables the extraction of valuable features for one-class classification. Various experiments on three well-known image classification datasets demonstrate the effectiveness of our method which outperformed other baseline models in 25 of 30 experiments.


================================================
FILE: 2018/12/24-28.md
================================================
[Computer Vision Paper Express] 2018-12-24~12-28

- [x] 2018-12-24
- [ ] 2018-12-25
- [ ] 2018-12-26
- [x] 2018-12-27
- [ ] 2018-12-28

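This post shares 39 papers in total, covering image classification, object detection, semantic segmentation, GAN, pose estimation, SLAM, salient object detection, and zero-shot learning.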

[TOC]

# CNN

**《Slimmable Neural Networks》**

ICLR 2019

arXiv：https://arxiv.org/abs/1812.08928

github：https://github.com/JiahuiYu/slimmable_networks

> We present a simple and general method to train a single neural network executable at different widths (number of channels in a layer), permitting instant and adaptive accuracy-efficiency trade-offs at runtime. Instead of training individual networks with different width configurations, we train a shared network with switchable batch normalization. At runtime, the network can adjust its width on the fly according to on-device benchmarks and resource constraints, rather than downloading and offloading different models. Our trained networks, named slimmable neural networks, achieve similar (and in many cases better) ImageNet classification accuracy than individually trained models of MobileNet v1, MobileNet v2, ShuffleNet and ResNet-50 at different widths respectively. We also demonstrate better performance of slimmable models compared with individual ones across a wide range of applications including COCO bounding-box object detection, instance segmentation and person keypoint detection without tuning hyper-parameters. Lastly we visualize and discuss the learned features of slimmable networks. Code and models are available at: https://github.com/JiahuiYu/slimmable_networks
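
The idea of one shared network with switchable batch normalization can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the class name `SlimmableLinear`, the width ratios, and the toy data are hypothetical, and the normalization uses one scalar mean/variance per width instead of per-channel statistics. The weights are shared (narrower widths reuse the first output channels), while each width keeps its own normalization statistics.

```python
import math


class SlimmableLinear:
    """Toy layer: shared weights, one set of normalization stats per width.

    Hypothetical illustration only; names and structure are assumptions.
    """

    def __init__(self, weights, width_ratios):
        self.weights = weights  # [out_channels][in_features], shared by all widths
        self.width_ratios = width_ratios
        # Switchable normalization: a separate (mean, var) pair per width ratio.
        self.stats = {r: (0.0, 1.0) for r in width_ratios}

    def active_channels(self, ratio):
        # Narrower widths reuse the first k output channels of the shared weights.
        return max(1, int(round(len(self.weights) * ratio)))

    def linear(self, x, ratio):
        k = self.active_channels(ratio)
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights[:k]]

    def calibrate(self, inputs, ratio):
        # Estimate this width's own statistics from sample inputs.
        outs = [o for x in inputs for o in self.linear(x, ratio)]
        mean = sum(outs) / len(outs)
        var = sum((o - mean) ** 2 for o in outs) / len(outs)
        self.stats[ratio] = (mean, var)

    def forward(self, x, ratio, eps=1e-5):
        mean, var = self.stats[ratio]  # select the stats that match the width
        return [(o - mean) / math.sqrt(var + eps) for o in self.linear(x, ratio)]


layer = SlimmableLinear(
    weights=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]],
    width_ratios=[0.5, 1.0],
)
samples = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0]]
for r in layer.width_ratios:
    layer.calibrate(samples, r)

half = layer.forward([1.0, 2.0], 0.5)  # uses 2 of the 4 channels
full = layer.forward([1.0, 2.0], 1.0)  # uses all 4 channels
```

Switching the width at runtime only changes which slice of the weights and which statistics are used, so no second model has to be stored.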

**《ChamNet: Towards Efficient Network Design through Platform-Aware Model Adaptation》**

arXiv：https://arxiv.org/abs/1812.08934

> This paper proposes an efficient neural network (NN) architecture design methodology called Chameleon that honors given resource constraints. Instead of developing new building blocks or using computationally-intensive reinforcement learning algorithms, our approach leverages existing efficient network building blocks and focuses on exploiting hardware traits and adapting computation resources to fit target latency and/or energy constraints. We formulate platform-aware NN architecture search in an optimization framework and propose a novel algorithm to search for optimal architectures aided by efficient accuracy and resource (latency and/or energy) predictors. At the core of our algorithm lies an accuracy predictor built atop Gaussian Process with Bayesian optimization for iterative sampling. With a one-time building cost for the predictors, our algorithm produces state-of-the-art model architectures on different platforms under given constraints in just minutes. Our results show that adapting computation resources to building blocks is critical to model performance. Without the addition of any bells and whistles, our models achieve significant accuracy improvements against state-of-the-art hand-crafted and automatically designed architectures. We achieve 73.8% and 75.3% top-1 accuracy on ImageNet at 20ms latency on a mobile CPU and DSP. At reduced latency, our models achieve up to 8.5% (4.8%) and 6.6% (9.3%) absolute top-1 accuracy improvements compared to MobileNetV2 and MnasNet, respectively, on a mobile CPU (DSP), and 2.7% (4.6%) and 5.6% (2.6%) accuracy gains over ResNet-101 and ResNet-152, respectively, on an Nvidia GPU (Intel CPU).


**《Learning from Web Data: the Benefit of Unsupervised Object Localization》**

IEEE TIP 2019

arXiv：https://arxiv.org/abs/1812.09232

> Annotating a large number of training images is very time-consuming. In this background, this paper focuses on learning from easy-to-acquire web data and utilizes the learned model for fine-grained image classification in labeled datasets. Currently, the performance gain from training with web data is incremental, like a common saying "better than nothing, but not by much". Conventionally, the community looks to correcting the noisy web labels to select informative samples. In this work, we first systematically study the built-in gap between the web and standard datasets, i.e. different data distributions between the two kinds of data. Then, in addition to using web labels, we present an unsupervised object localization method, which provides critical insights into the object density and scale in web images. Specifically, we design two constraints on web data to substantially reduce the difference of data distributions for the web and standard datasets. First, we present a method to control the scale, localization and number of objects in the detected region. Second, we propose to select the regions containing objects that are consistent with the web tag. Based on the two constraints, we are able to process web images to reduce the gap, and the processed web data is used to better assist the standard dataset to train CNNs. Experiments on several fine-grained image classification datasets confirm that our method performs favorably against the state-of-the-art methods.

# Image Classification

**《Chinese Herbal Recognition based on Competitive Attentional Fusion of Multi-hierarchies Pyramid Features》**

arXiv：https://arxiv.org/abs/1812.09648

github：https://github.com/scut-aitcm/Chinese-Herbs-Dataset

Note: Chinese herbal medicine recognition; quite interesting.

> Convolutional neural networks (CNNs) have been successfully applied to image recognition tasks. In this study, we explore automatic herbal recognition with CNNs and first build standard Chinese herbs datasets. According to the characteristics of herbal images, we propose competitive attentional fusion pyramid networks to model the features of herbal images, which model the relationship of feature maps from different levels and re-weight multi-level channels with a channel-wise attention mechanism. In this way, we can dynamically adjust the weight of feature maps from various layers, according to the visual characteristics of each herbal image. Moreover, we also introduce spatial attention to recalibrate the misaligned features caused by sampling in feature amalgamation. Extensive experiments are conducted on our proposed datasets and validate the superior performance of our proposed models. The Chinese herbs datasets will be released upon acceptance to facilitate the research of Chinese herbal recognition.


# Face

**《A Survey to Deep Facial Attribute Analysis》**

arXiv：https://arxiv.org/abs/1812.10265

Note: a good survey.

> Facial attribute analysis has received considerable attention with the development of deep neural networks in the past few years. Facial attribute analysis contains two crucial issues: Facial Attribute Estimation (FAE), which recognizes whether facial attributes are present in given images, and Facial Attribute Manipulation (FAM), which synthesizes or removes desired facial attributes. In this paper, we provide a comprehensive survey on deep facial attribute analysis covering FAE and FAM. First, we present the basic knowledge of the two stages (i.e., data pre-processing and model construction) in the general deep facial attribute analysis pipeline. Second, we summarize the commonly used datasets and performance metrics. Third, we create a taxonomy of the state-of-the-arts and review detailed algorithms in FAE and FAM, respectively. Furthermore, we introduce several additional facial attribute related issues and applications. Finally, the possible challenges and future research directions are discussed.


**《A Smart Security System with Face Recognition》**

arXiv：https://arxiv.org/abs/1812.09127

> Web-based technology has improved drastically in the past decade. As a result, security technology has become a major help in protecting our daily life. In this paper, we propose a robust security system based on face recognition (SoF). In particular, we develop this system to give authenticated users access to a home. The classifier is trained by using a new adaptive learning method. The training data are initially collected from social networks. The accuracy of the classifier is incrementally improved as the user starts using the system. A novel method has been introduced to improve the classifier model by human interaction and social media. By using a deep learning framework - TensorFlow, it will be easy to reuse the framework and adapt it to many devices and applications.

# Object Detection

**《3D multirater RCNN for multimodal multiclass detection and characterisation of extremely small objects》**

MIDL 2019 submission

arXiv：https://arxiv.org/abs/1812.09046

> Extremely small objects (ESO) have become observable on clinical routine magnetic resonance imaging acquisitions, thanks to a reduction in acquisition time at higher resolution. Despite their small size (usually <10 voxels per object for an image of more than 10^6 voxels), these markers reflect tissue damage and need to be accounted for to investigate the complete phenotype of complex pathological pathways. In addition to their very small size, variability in shape and appearance leads to high labelling variability across human raters, resulting in a very noisy gold standard. Such objects are notably present in the context of cerebral small vessel disease where enlarged perivascular spaces and lacunes, commonly observed in the ageing population, are thought to be associated with acceleration of cognitive decline and risk of dementia onset. In this work, we redesign the RCNN model to scale to 3D data, and to jointly detect and characterise these important markers of age-related neurovascular changes. We also propose training strategies enforcing the detection of extremely small objects, ensuring a tractable and stable training process.

**《Detection of distal radius fractures trained by a small set of X-ray images and Faster R-CNN》**

arXiv：https://arxiv.org/abs/1812.09025

> Distal radius fractures are the most common fractures of the upper extremity in humans. As such, they account for a significant portion of the injuries that present to emergency rooms and clinics throughout the world. We trained a Faster R-CNN, a machine vision neural network for object detection, to identify and locate distal radius fractures in anteroposterior X-ray images. We achieved an accuracy of 96% in identifying fractures and mean Average Precision, mAP, of 0.866. This is significantly more accurate than the detection achieved by physicians and radiologists. These results were obtained by training the deep learning network with only 38 original anteroposterior hand X-ray images with fractures. This opens the possibility to detect with this type of neural network rare diseases or rare symptoms of common diseases, where only a small set of diagnosed X-ray images could be collected for each disease.

**《Practical Adversarial Attack Against Object Detector》**

arXiv：https://arxiv.org/abs/1812.10217

Note: a very interesting study.

> In this paper, we propose the first practical adversarial attacks against object detectors in realistic situations: the adversarial examples are placed at different angles and distances, especially at long distances (over 20m) and wide angles (120 degrees). To improve the robustness of adversarial examples, we propose nested adversarial examples and introduce image transformation techniques. The transformation methods aim to simulate variance factors such as distances, angles, illuminations, etc., in the physical world. Two kinds of attacks were implemented on YOLO V3, a state-of-the-art real-time object detector: a hiding attack that makes the detector fail to recognize the object, and an appearing attack that fools the detector into recognizing a non-existent object. The adversarial examples are evaluated in three environments: indoor lab, outdoor environment, and the real road, and are demonstrated to achieve a success rate of up to 92.4% over the distance range from 1m to 25m. In particular, real road testing of the hiding attack on a straight road and a crossing road produced success rates of 75% and 64% respectively, and the appearing attack obtained success rates of 63% and 81% respectively, which, we believe, should catch the attention of the autonomous driving community.


# Semantic Segmentation

**《Curriculum Domain Adaptation for Semantic Segmentation of Urban Scenes》**

arXiv：https://arxiv.org/abs/1812.09953

> During the last half decade, convolutional neural networks (CNNs) have triumphed over semantic segmentation, which is one of the core tasks in many applications such as autonomous driving and augmented reality. However, to train CNNs requires a considerable amount of data, which is difficult to collect and laborious to annotate. Recent advances in computer graphics make it possible to train CNNs on photo-realistic synthetic imagery with computer-generated annotations. Despite this, the domain mismatch between the real images and the synthetic data hinders the models' performance. Hence, we propose a curriculum-style learning approach to minimizing the domain gap in urban scene semantic segmentation. The curriculum domain adaptation solves easy tasks first to infer necessary properties about the target domain; in particular, the first task is to learn global label distributions over images and local distributions over landmark superpixels. These are easy to estimate because images of urban scenes have strong idiosyncrasies (e.g., the size and spatial relations of buildings, streets, cars, etc.). We then train a segmentation network, while regularizing its predictions in the target domain to follow those inferred properties. In experiments, our method outperforms the baselines on two datasets and two backbone networks. We also report extensive ablation studies about our approach.
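
The step of regularizing predictions to follow an inferred label distribution can be illustrated with a small sketch. It assumes the regularizer compares an image-level label distribution (the average of per-pixel class probabilities) with the distribution inferred in the easy task, and it uses KL divergence as the distance. The function names and the choice of KL are assumptions for illustration, and the paper's exact loss may differ.

```python
import math


def label_distribution(pixel_probs):
    """Average per-pixel class probabilities into one image-level distribution."""
    n = len(pixel_probs)
    num_classes = len(pixel_probs[0])
    return [sum(p[k] for p in pixel_probs) / n for k in range(num_classes)]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); zero-probability target classes contribute nothing."""
    return sum(
        t * math.log((t + eps) / (q + eps))
        for t, q in zip(target, predicted)
        if t > 0
    )


# Toy image with two pixels and two classes (e.g. road, building).
pixel_probs = [[0.8, 0.2], [0.6, 0.4]]
predicted_dist = label_distribution(pixel_probs)

matching_penalty = kl_divergence([0.7, 0.3], predicted_dist)
mismatch_penalty = kl_divergence([0.2, 0.8], predicted_dist)
```

The penalty is near zero when the network's aggregate prediction already matches the inferred distribution and grows as the two diverge, which is what lets it act as a regularizer on unlabeled target-domain images.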

# GAN

**《Improving MMD-GAN Training with Repulsive Loss Function》**

ICLR 2019

arXiv：https://arxiv.org/abs/1812.09916

> Generative adversarial nets (GANs) are widely used to learn the data sampling process and their performance may heavily depend on the loss functions, given a limited computational budget. This study revisits MMD-GAN that uses the maximum mean discrepancy (MMD) as the loss function for GAN and makes two contributions. First, we argue that the existing MMD loss function may discourage the learning of fine details in data as it attempts to contract the discriminator outputs of real data. To address this issue, we propose a repulsive loss function to actively learn the difference among the real data by simply rearranging the terms in MMD. Second, inspired by the hinge loss, we propose a bounded Gaussian kernel to stabilize the training of MMD-GAN with the repulsive loss function. The proposed methods are applied to the unsupervised image generation tasks on CIFAR-10, STL-10, CelebA, and LSUN bedroom datasets. Results show that the repulsive loss function significantly improves over the MMD loss at no additional computational cost and outperforms other representative loss functions. The proposed methods achieve an FID score of 16.21 on the CIFAR-10 dataset using a single DCGAN network and spectral normalization.
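
For reference, the squared MMD with a Gaussian kernel can be estimated from two sample sets as below. The estimator is the standard biased one. The `repulsive_d_loss` function is only an assumed reading of "rearranging the terms in MMD" (real-real kernel term minus generated-generated term), so treat it as a sketch and check the paper for the exact form. Inputs stand for discriminator outputs, and all function names are hypothetical.

```python
import math


def gaussian_kernel(a, b, sigma=1.0):
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma=1.0):
    # Average kernel value over all pairs (biased estimator, includes i == j).
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2(real, fake, sigma=1.0):
    # Squared MMD: E[k(x, x')] - 2 E[k(x, y)] + E[k(y, y')].
    return (
        mean_kernel(real, real, sigma)
        - 2.0 * mean_kernel(real, fake, sigma)
        + mean_kernel(fake, fake, sigma)
    )


def attractive_d_loss(real, fake, sigma=1.0):
    # The discriminator maximizes MMD^2, i.e. minimizes its negative.
    return -mmd2(real, fake, sigma)


def repulsive_d_loss(real, fake, sigma=1.0):
    # ASSUMPTION: rearranged terms, real-real minus generated-generated.
    return mean_kernel(real, real, sigma) - mean_kernel(fake, fake, sigma)


real = [[0.0], [1.0], [2.0]]
near = [[0.1], [1.1], [1.9]]
far = [[5.0], [6.0], [7.0]]

same_mmd = mmd2(real, real)
near_mmd = mmd2(real, near)
far_mmd = mmd2(real, far)
```

Under this reading, minimizing the repulsive loss lowers the average similarity among real-sample outputs (spreading them apart) while raising it among generated-sample outputs, which matches the abstract's description of actively learning differences among the real data.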


# Visual Tracking

**《Saliency Guided Hierarchical Robust Visual Tracking》**

arXiv：https://arxiv.org/abs/1812.08973

> A saliency guided hierarchical visual tracking (SHT) algorithm containing global and local search phases is proposed in this paper. In global search, a top-down saliency model is novelly developed to handle abrupt motion and appearance variation problems. Nineteen feature maps are extracted first and combined with online learnt weights to produce the final saliency map and estimated target locations. After the evaluation of integration mechanism, the optimum candidate patch is passed to the local search. In local search, a superpixel based HSV histogram matching is performed jointly with an L2-RLS tracker to take both color distribution and holistic appearance feature of the object into consideration. Furthermore, a linear refinement search process with fast iterative solver is implemented to attenuate the possible negative influence of dominant particles. Both qualitative and quantitative experiments are conducted on a series of challenging image sequences. The superior performance of the proposed method over other state-of-the-art algorithms is demonstrated by comparative study.

# 3D

**《Perceptually-based single-image depth super-resolution》**

arXiv：https://arxiv.org/abs/1812.09874

> RGBD images, combining high-resolution color and lower-resolution depth from various types of depth sensors, are increasingly common. One can significantly improve the resolution of depth images by taking advantage of color information; deep learning methods make combining color and depth information particularly easy. However, fusing these two sources of data may lead to a variety of artifacts. If depth maps are used to reconstruct 3D shapes, e.g., for virtual reality applications, the visual quality of upsampled images is particularly important. To achieve high-quality results, visual metrics need to be taken into account. The main idea of our approach is to measure the quality of depth map upsampling using renderings of resulting 3D surfaces. We demonstrate that a simple visual appearance-based loss, when used with either a trained CNN or simply a deep prior, yields significantly improved 3D shapes, as measured by a number of existing perceptual metrics. We compare this approach with a number of existing optimization and learning-based techniques.

**《A Survey on Non-rigid 3D Shape Analysis》**

arXiv：https://arxiv.org/abs/1812.10111

> Shape is an important physical property of natural and manmade 3D objects that characterizes their external appearances. Understanding differences between shapes and modeling the variability within and across shape classes, hereinafter referred to as *shape analysis*, are fundamental problems to many applications, ranging from computer vision and computer graphics to biology and medicine. This chapter provides an overview of some of the recent techniques that studied the shape of 3D objects that undergo non-rigid deformations including bending and stretching. Recent surveys that covered some aspects such as classification, retrieval, recognition, and rigid or nonrigid registration, focused on methods that use shape descriptors. Descriptors, however, provide abstract representations that do not enable the exploration of shape variability. In this chapter, we focus on recent techniques that treated the shape of 3D objects as points in some high dimensional space where paths describe deformations. Equipping the space with a suitable metric enables the quantification of the range of deformations of a given shape, which in turn enables (1) comparing and classifying 3D objects based on their shape, (2) computing smooth deformations, i.e. geodesics, between pairs of objects, and (3) modeling and exploring continuous shape variability in a collection of 3D models. This article surveys and classifies recent developments in this field, outlines fundamental issues, discusses their potential applications in computer vision and graphics, and highlights opportunities for future research. Our primary goal is to bridge the gap between various techniques that have been often independently proposed by different communities including mathematics and statistics, computer vision and graphics, and medical image analysis.


# Pose Estimation

**《Learning a Disentangled Embedding for Monocular 3D Shape Retrieval and Pose Estimation》**

arXiv：https://arxiv.org/abs/1812.09899

> We propose a novel approach to jointly perform 3D object retrieval and pose estimation from monocular images. In order to make the method robust to real world scene variations in the images, e.g. texture, lighting and background, we learn an embedding space from 3D data that only includes the relevant information, namely the shape and pose. Our method can then be trained for robustness under real world scene variations without having to render a large training set simulating these variations. Our learned embedding explicitly disentangles a shape vector and a pose vector, which alleviates both pose bias for 3D shape retrieval and categorical bias for pose estimation. Having the learned disentangled embedding, we train a CNN to map the images to the embedding space, and then retrieve the closest 3D shape from the database and estimate the 6D pose of the object using the embedding vectors. Our method achieves 10.8 median error for pose estimation and 0.514 top-1-accuracy for category agnostic 3D object retrieval on the Pascal3D+ dataset. It therefore outperforms the previous state-of-the-art methods on both tasks.

**《Structure-Aware 3D Hourglass Network for Hand Pose Estimation from Single Depth Image》**

BMVC 2018

arXiv：https://arxiv.org/abs/1812.10320

> In this paper, we propose a novel structure-aware 3D hourglass network for hand pose estimation from a single depth image, which achieves state-of-the-art results on MSRA and NYU datasets. Compared to existing works that perform image-to-coordination regression, our network takes 3D voxel as input and directly regresses 3D heatmap for each joint. To be specific, we use hourglass network as our backbone network and modify it into 3D form. We explicitly model tree-like finger bone into the network as well as in the loss function in an end-to-end manner, in order to take the skeleton constraints into consideration. Final estimation can then be easily obtained from voxel density map with simple post-processing. Experimental results show that the proposed structure-aware 3D hourglass network is able to achieve a mean joint error of 7.4 mm in MSRA and 8.9 mm in NYU datasets, respectively.
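
The "simple post-processing" from a voxel heatmap to a joint position can be pictured as follows. This sketch assumes the post-processing is an argmax over the 3D heatmap followed by a voxel-to-millimetre conversion. The abstract does not spell this out, so the functions, the grid origin, and the voxel size are hypothetical. The mean joint error is computed as the average Euclidean distance over joints, which is the usual meaning of that metric.

```python
import math


def argmax_voxel(heatmap):
    """Return the (x, y, z) index of the largest value in a nested-list 3D heatmap."""
    best_value, best_index = float("-inf"), None
    for x, plane in enumerate(heatmap):
        for y, row in enumerate(plane):
            for z, value in enumerate(row):
                if value > best_value:
                    best_value, best_index = value, (x, y, z)
    return best_index


def voxel_to_mm(index, origin_mm, voxel_size_mm):
    # Map the voxel centre to metric coordinates (assumed axis-aligned grid).
    return tuple(o + (i + 0.5) * voxel_size_mm for i, o in zip(index, origin_mm))


def mean_joint_error(predicted, ground_truth):
    # Mean Euclidean distance between predicted and true joint positions.
    return sum(math.dist(p, g) for p, g in zip(predicted, ground_truth)) / len(predicted)


# Toy 2x2x2 heatmap for a single joint; the peak is at index (1, 0, 1).
heatmap = [
    [[0.1, 0.2], [0.0, 0.3]],
    [[0.1, 0.9], [0.2, 0.4]],
]
peak = argmax_voxel(heatmap)
joint_mm = voxel_to_mm(peak, origin_mm=(-10.0, -10.0, 100.0), voxel_size_mm=10.0)
error = mean_joint_error([(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],
                         [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)])
```

A plain argmax limits precision to the voxel size; sub-voxel refinements exist but are outside what the abstract states.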


# Text

**《TextNet: Irregular Text Reading from Images with an End-to-End Trainable Network》**

ACCV 2018 oral

Note: a Baidu production, so it is bound to be high quality. Bookmarked!

arXiv：https://arxiv.org/abs/1812.09900

> Reading text from images remains challenging due to multi-orientation, perspective distortion and especially the curved nature of irregular text. Most of existing approaches attempt to solve the problem in two or multiple stages, which is considered to be the bottleneck to optimize the overall performance. To address this issue, we propose an end-to-end trainable network architecture, named TextNet, which is able to simultaneously localize and recognize irregular text from images. Specifically, we develop a scale-aware attention mechanism to learn multi-scale image features as a backbone network, sharing fully convolutional features and computation for localization and recognition. In text detection branch, we directly generate text proposals in quadrangles, covering oriented, perspective and curved text regions. To preserve text features for recognition, we introduce a perspective RoI transform layer, which can align quadrangle proposals into small feature maps. Furthermore, in order to extract effective features for recognition, we propose to encode the aligned RoI features by RNN into context information, combining spatial attention mechanism to generate text sequences. This overall pipeline is capable of handling both regular and irregular cases. Finally, text localization and recognition tasks can be jointly trained in an end-to-end fashion with designed multi-task loss. Experiments on standard benchmarks show that the proposed TextNet can achieve state-of-the-art performance, and outperform existing approaches on irregular datasets by a large margin.


# Saliency

**《SMILER: Saliency Model Implementation Library for Experimental Research》**

arXiv：https://arxiv.org/abs/1812.08848

github：https://github.com/TsotsosLab/SMILER

> The Saliency Model Implementation Library for Experimental Research (SMILER) is a new software package which provides an open, standardized, and extensible framework for maintaining and executing computational saliency models. This work drastically reduces the human effort required to apply saliency algorithms to new tasks and datasets, while also ensuring consistency and procedural correctness for results and conclusions produced by different parties. At its launch SMILER already includes twenty three saliency models (fourteen models based in MATLAB and nine supported through containerization), and the open design of SMILER encourages this number to grow with future contributions from the community. The project may be downloaded and contributed to through its GitHub page: https://github.com/TsotsosLab/SMILER

# SLAM

**《A Unified Framework for Mutual Improvement of SLAM and Semantic Segmentation》**

arXiv：https://arxiv.org/abs/1812.10016

> This paper presents a novel framework for simultaneously implementing localization and segmentation, which are two of the most important vision-based tasks for robotics. While the goals and techniques used for them were considered to be different previously, we show that by making use of the intermediate results of the two modules, their performance can be enhanced at the same time. Our framework is able to handle both the instantaneous motion and long-term changes of instances in localization with the help of the segmentation result, which also benefits from the refined 3D pose information. We conduct experiments on various datasets, and prove that our framework works effectively on improving the precision and robustness of the two tasks and outperforms existing localization and segmentation algorithms.

# Salient Object Detection

**《Selectivity or Invariance: Boundary-aware Salient Object Detection》**

arXiv：https://arxiv.org/abs/1812.10066

> Typically, a salient object detection (SOD) model faces opposite requirements in processing object interiors and boundaries. The features of interiors should be invariant to strong appearance change so as to pop-out the salient object as a whole, while the features of boundaries should be selective to slight appearance change to distinguish salient objects and background. To address this selectivity-invariance dilemma, we propose a novel boundary-aware network with successive dilation for image-based SOD. In this network, the feature selectivity at boundaries is enhanced by incorporating a boundary localization stream, while the feature invariance at interiors is guaranteed with a complex interior perception stream. Moreover, a transition compensation stream is adopted to amend the probable failures in transitional regions between interiors and boundaries. In particular, an integrated successive dilation module is proposed to enhance the feature invariance at interiors and transitional regions. Extensive experiments on four datasets show that the proposed approach outperforms 11 state-of-the-art methods.

# Re-ID

**《3D PersonVLAD: Learning Deep Global Representations for Video-based Person Re-identification》**

IEEE Transactions on Neural Networks and Learning Systems

arXiv：https://arxiv.org/abs/1812.10222

> In this paper, we introduce a global video representation to video-based person re-identification (re-ID) that aggregates local 3D features across the entire video extent. Most of the existing methods rely on 2D convolutional networks (ConvNets) to extract frame-wise deep features which are pooled temporally to generate the video-level representations. However, 2D ConvNets lose temporal input information immediately after the convolution, and a separate temporal pooling is limited in capturing human motion in shorter sequences. To this end, we present a *global* video representation (3D PersonVLAD), complementary to 3D ConvNets as a novel layer to capture the appearance and motion dynamics in full-length videos. However, encoding each video frame in its entirety and computing an aggregate global representation across all frames is tremendously challenging due to occlusions and misalignments. To resolve this, our proposed network is further augmented with 3D part alignment module to learn local features through soft-attention module. These attended features are statistically aggregated to yield identity-discriminative representations. Our global 3D features are demonstrated to achieve state-of-the-art results on three benchmark datasets: MARS, iLIDS-VID, and PRID 2011.

**《Spatial and Temporal Mutual Promotion for Video-based Person Re-identification》**

AAAI 2019

arXiv：https://arxiv.org/abs/1812.10305

> Video-based person re-identification is a crucial task of matching video sequences of a person across multiple camera views. Generally, features directly extracted from a single frame suffer from occlusion, blur, illumination and posture changes. This leads to false activation or missing activation in some regions, which corrupts the appearance and motion representation. How to explore the abundant spatial-temporal information in video sequences is the key to solving this problem. To this end, we propose a Refining Recurrent Unit (RRU) that recovers the missing parts and suppresses noisy parts of the current frame's features by referring to historical frames. With RRU, the quality of each frame's appearance representation is improved. Then we use the Spatial-Temporal clues Integration Module (STIM) to mine the spatial-temporal information from those upgraded features. Meanwhile, the multi-level training objective is used to enhance the capability of RRU and STIM. Through the cooperation of those modules, the spatial and temporal features mutually promote each other and the final spatial-temporal feature representation is more discriminative and robust. Extensive experiments are conducted on three challenging datasets, i.e., iLIDS-VID, PRID-2011 and MARS. The experimental results demonstrate that our approach outperforms existing state-of-the-art methods of video-based person re-identification on iLIDS-VID and MARS and achieves favorable results on PRID-2011.

**《Cluster Loss for Person Re-Identification》**

ICVGIP 2018

arXiv：https://arxiv.org/abs/1812.10325

> Person re-identification (ReID) is an important problem in computer vision, especially for video surveillance applications. The problem focuses on identifying people across different cameras or across different frames of the same camera. The main challenge lies in identifying the similarity of the same person against large appearance and structure variations, while differentiating between individuals. Recently, deep learning networks with triplet loss have become a common framework for person ReID. However, triplet loss focuses on obtaining correct orders on the training set. We demonstrate that it performs inferior in a clustering task. In this paper, we design a cluster loss, which can lead to the model output with a larger inter-class variation and a smaller intra-class variation compared to the triplet loss. As a result, our model has a better generalization ability and can achieve higher accuracy on the test set especially for a clustering task. We also introduce a batch hard training mechanism for improving the results and faster convergence of training.
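
As context for the baseline the abstract compares against, here is a minimal sketch of the standard triplet loss with batch-hard mining: for each anchor, the farthest same-identity sample and the closest different-identity sample are used. This is the common baseline formulation, not the paper's cluster loss, whose exact definition is not given in the abstract. The margin value and the toy embeddings are assumptions.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Mean over anchors of max(0, hardest_positive - hardest_negative + margin)."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [
            euclidean(anchor, e)
            for j, (e, l) in enumerate(zip(embeddings, labels))
            if l == label and j != i
        ]
        negatives = [
            euclidean(anchor, e) for e, l in zip(embeddings, labels) if l != label
        ]
        if not positives or not negatives:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, max(positives) - min(negatives) + margin))
    return sum(losses) / len(losses) if losses else 0.0


labels = [0, 0, 1, 1]
well_separated = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
interleaved = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]]

loss_separated = batch_hard_triplet_loss(well_separated, labels)
loss_interleaved = batch_hard_triplet_loss(interleaved, labels)
```

The loss only constrains the relative order of distances within each triplet, which is the property the abstract points to when arguing that a triplet-trained model can still cluster poorly.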

**《EgoReID: Person re-identification in Egocentric Videos Acquired by Mobile Devices with First-Person Point-of-View》**

arXiv：https://arxiv.org/abs/1812.09570

> Widespread use of wearable cameras and recording devices such as cellphones have opened the door to a lot of interesting research in first-person Point-of-view (POV) videos (egocentric videos). In recent years, we have seen the performance of video-based person Re-Identification (ReID) methods improve considerably. However, with the influx of varying video domains, such as egocentric videos, it has become apparent that there are still many open challenges to be faced. These challenges are a result of factors such as poor video quality due to ego-motion, blurriness, severe changes in lighting conditions and perspective distortions. To facilitate the research towards conquering these challenges, this paper contributes a new, first-of-its-kind dataset called EgoReID. The dataset is captured using 3 mobile cellphones with non-overlapping field-of-view. It contains 900 IDs and around 10,200 tracks with a total of 176,000 detections. Moreover, for each video we also provide 12-sensor meta data. Directly applying current approaches to our dataset results in poor performance. Considering the unique nature of our dataset, we propose a new framework which takes advantage of both visual and sensor meta data to successfully perform Person ReID. In this paper, we propose to adopt human body region parsing to extract local features from different body regions and then employ 3D convolution to better encode temporal information of each sequence of body parts. In addition, we also employ sensor meta data to determine target's next camera and their estimated time of arrival, such that the search is only performed among tracks present in the predicted next camera around the estimated time. This considerably improves our ReID performance as it significantly reduces our search space.


# Super-resolution

**《3DSRnet: Video Super-resolution using 3D Convolutional Neural Networks》**

arXiv：https://arxiv.org/abs/1812.09079

> In video super-resolution, the spatio-temporal coherence between, and among the frames must be exploited appropriately for accurate prediction of the high resolution frames. Although 2D convolutional neural networks (CNNs) are powerful in modelling images, 3D-CNNs are more suitable for spatio-temporal feature extraction as they can preserve temporal information. To this end, we propose an effective 3D-CNN for video super-resolution, called the 3DSRnet that does not require motion alignment as preprocessing. Our 3DSRnet maintains the temporal depth of spatio-temporal feature maps to maximally capture the temporally nonlinear characteristics between low and high resolution frames, and adopts residual learning in conjunction with the sub-pixel outputs. It outperforms the most state-of-the-art method with average 0.45 and 0.36 dB higher in PSNR for scales 3 and 4, respectively, in the Vidset4 benchmark. Our 3DSRnet first deals with the performance drop due to scene change, which is important in practice but has not been previously considered.
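
To put the reported PSNR gains in context, the metric itself can be sketched as below. This is a generic definition of PSNR over flattened pixel values, assuming an 8-bit range; the helper names are hypothetical and nothing here comes from the 3DSRnet code.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images.

    max_value=255 assumes 8-bit pixels; that is an assumption of this sketch.
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR improves by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


ratio_scale3 = mse_ratio_for_gain(0.45)  # gain quoted for scale 3
ratio_scale4 = mse_ratio_for_gain(0.36)  # gain quoted for scale 4
```

Because PSNR is logarithmic in the mean squared error, a gain of 0.45 dB corresponds to an MSE that is roughly 10% lower.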


# Image Denoising

**《A Multiscale Image Denoising Algorithm Based On Dilated Residual Convolution Network》**

arXiv：https://arxiv.org/abs/1812.09131

> Image denoising is a classical problem in low level computer vision. Model-based optimization methods and deep learning approaches have been the two main strategies for solving the problem. Model-based optimization methods are flexible for handling different inverse problems but are usually time-consuming. In contrast, deep learning methods have fast testing speed but the performance of these CNNs is still inferior. To address this issue, here we propose a novel deep residual learning model that combines the dilated residual convolution and multi-scale convolution groups. Due to the complex patterns and structures of inside an image, the multiscale convolution group is utilized to learn those patterns and enlarge the receptive field. Specifically, the residual connection and batch normalization are utilized to speed up the training process and maintain the denoising performance. In order to decrease the gridding artifacts, we integrate the hybrid dilated convolution design into our model. To this end, this paper aims to train a lightweight and effective denoiser based on multiscale convolution group. Experimental results have demonstrated that the enhanced denoiser can not only achieve promising denoising results, but also become a strong competitor in practical application.
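
The link between dilation rates, receptive field, and gridding artifacts can be illustrated with a small 1-D sketch. It assumes stride-1 convolutions with an odd kernel size, and the dilation rates used below are illustrative choices, not the configuration reported in the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in pixels, 1-D) of stacked stride-1 dilated convolutions."""
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


def covered_offsets(kernel_size, dilations):
    """Input offsets that actually contribute to one output pixel (1-D).

    Assumes an odd kernel size so the kernel is centred on the output pixel.
    """
    half = kernel_size // 2
    offsets = {0}
    for dilation in dilations:
        offsets = {
            offset + dilation * tap
            for offset in offsets
            for tap in range(-half, half + 1)
        }
    return offsets


# Repeating the same rate leaves holes inside the receptive field (gridding).
same_rate = covered_offsets(3, [2, 2, 2])
# Mixed rates in the spirit of a hybrid dilated design fill the field densely.
mixed_rate = covered_offsets(3, [1, 2, 5])
```

With three 3-tap layers at rate 2, the receptive field spans 13 pixels but only the 7 even offsets are sampled. With rates 1, 2 and 5 the field grows to 17 pixels and every offset inside it is covered, which is the motivation for mixing rates.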

# Zero-Shot Learning

**《Domain-Aware Generalized Zero-Shot Learning》**

arXiv：https://arxiv.org/abs/1812.09903

> Generalized zero-shot learning (GZSL) is the problem of learning a classifier where some classes have samples, and others are learned from side information, like semantic attributes or text description, in a zero-shot learning fashion (ZSL). A major challenge in GZSL is to learn consistently for those two different domains. Here we describe a probabilistic approach that breaks the model into three modular components, and then combines them in a consistent way. Specifically, our model consists of three classifiers: A "gating" model that softly decides if a sample is from a "seen" class and two experts: a ZSL expert, and an expert model for seen classes. We address two main difficulties in this approach: How to provide an accurate estimate of the gating probability without any training samples for unseen classes; and how to use an expert's predictions when it observes samples outside of its domain.
>
> The key insight in our approach is to pass information between the three models to improve each other's accuracy, while keeping the modular structure. We test our approach, Domain-Aware GZSL (DAZL) on three standard GZSL benchmark datasets (AWA, CUB, SUN), and find that it largely outperforms state-of-the-art GZSL models. DAZL is also the first model that closes the gap and surpasses the performance of generative models for GZSL, even though it is a light-weight model that is much easier to train and tune.
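
The basic idea of softly gating between a seen-class expert and a ZSL expert can be sketched with the law of total probability. This is only the plain mixture, under the assumption that each expert outputs a normalized distribution over its own classes; the paper additionally passes information between the three models, which is not reproduced here. The function name and class names are hypothetical.

```python
def combine_experts(p_seen, seen_probs, unseen_probs):
    """Mix two expert distributions with a soft gate.

    p_seen       -- gating probability that the sample is from a seen class
    seen_probs   -- dict: seen class -> probability from the seen-class expert
    unseen_probs -- dict: unseen class -> probability from the ZSL expert
    Returns p(class | x) = p(class | seen, x) * p_seen
                         + p(class | unseen, x) * (1 - p_seen).
    """
    if not 0.0 <= p_seen <= 1.0:
        raise ValueError("p_seen must lie in [0, 1]")
    combined = {}
    for cls, prob in seen_probs.items():
        combined[cls] = p_seen * prob
    for cls, prob in unseen_probs.items():
        combined[cls] = combined.get(cls, 0.0) + (1.0 - p_seen) * prob
    return combined


# Hypothetical example: the gate believes the sample is probably a seen class.
mixed = combine_experts(
    p_seen=0.8,
    seen_probs={"cat": 0.9, "dog": 0.1},
    unseen_probs={"zebra": 1.0},
)
prediction = max(mixed, key=mixed.get)
```

The sketch makes the two difficulties named in the abstract concrete: the result is only as good as the estimate of `p_seen`, and each expert is still queried on samples that may lie outside its domain.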

# Few-Shot

**《Learning Compositional Representations for Few-Shot Recognition》**

arXiv：https://arxiv.org/abs/1812.09213

> One of the key limitations of modern deep learning based approaches lies in the amount of data required to train them. Humans, on the other hand, can learn to recognize novel categories from just a few examples. Instrumental to this rapid learning ability is the compositional structure of concept representations in the human brain - something that deep learning models are lacking. In this work we make a step towards bridging this gap between human and machine learning by introducing a simple regularization technique that allows the learned representation to be decomposable into parts. We evaluate the proposed approach on three datasets: CUB-200-2011, SUN397, and ImageNet, and demonstrate that our compositional representations require fewer examples to learn classifiers for novel categories, outperforming state-of-the-art few-shot learning approaches by a significant margin.

**《Similarity R-C3D for Few-shot Temporal Activity Detection》**

arXiv：https://arxiv.org/abs/1812.10000

> Many activities of interest are rare events, with only a few labeled examples available. Therefore models f


