Repository: amusi/daily-paper-computer-vision
Branch: master
Commit: b61562b6f6ab
Files: 44
Total size: 586.2 KB

Directory structure:
gitextract_lcikzkm9/

├── 2018/
│   ├── 05/
│   │   ├── 16.md
│   │   ├── 19.md
│   │   ├── 22.md
│   │   ├── 24.md
│   │   └── 29.md
│   ├── 06/
│   │   ├── 06.md
│   │   ├── 08.md
│   │   ├── 11.md
│   │   ├── 13.md
│   │   ├── 15.md
│   │   ├── 19.md
│   │   ├── 23.md
│   │   └── 29.md
│   ├── 07/
│   │   ├── 02.md
│   │   ├── 05.md
│   │   ├── 06.md
│   │   ├── 07.md
│   │   ├── 19.md
│   │   ├── 23.md
│   │   ├── 27.md
│   │   └── 31.md
│   ├── 08/
│   │   ├── 03.md
│   │   ├── 07.md
│   │   ├── 11.md
│   │   ├── 15.md
│   │   └── 25.md
│   ├── 10/
│   │   ├── 12.md
│   │   └── 17.md
│   ├── 11/
│   │   ├── 05-09.md
│   │   ├── 19.md
│   │   └── 20.md
│   ├── 12/
│   │   ├── 10.md
│   │   ├── 17-21.md
│   │   ├── 24-28.md
│   │   └── 31.md
│   └── cvpr2018-paper-list.csv
├── 2018-Paper.md
├── 2019/
│   ├── 01/
│   │   └── 01-04.md
│   └── 03/
│       └── 12.md
├── 2019-Paper.md
├── 2020-Paper.md
├── 2021-Paper.md
├── 2023-Paper.md
└── README.md

================================================
FILE CONTENTS
================================================

================================================
FILE: 2018/05/16.md
================================================
**2018-05-16**

Summary: Four paper briefs covering monocular depth estimation, 6-DoF tracking, image synthesis, and motion capture (including one CVPR 2018 paper and one ICRA 2018 paper).

# Depth Estimation


[1]《Dual CNN Models for Unsupervised Monocular Depth Estimation》

2018 arXiv

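Abstract: Depth estimation in stereo vision has seen considerable progress. Supervised deep learning delivers very satisfactory depth estimates, but it needs large amounts of ground-truth depth maps, which are laborious to prepare and often unavailable in practice. Unsupervised depth estimation is the recent approach that uses binocular stereo images to do away with depth-map ground truth. In unsupervised depth computation, a CNN is trained with an image reconstruction loss based on epipolar geometry constraints to produce disparity images. How to use the CNN effectively, and which losses work better, remain open questions. This paper proposes a dual-CNN model for unsupervised depth estimation with 6 losses (DNM6) and a single CNN per view that generates the corresponding disparity map. The dual-CNN model is further extended to 12 losses (DNM12) by exploiting cross disparities. DNM6 and DNM12 are evaluated on the KITTI driving and Cityscapes urban datasets and compared with recent state-of-the-art unsupervised depth estimation results.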
arXiv：https://arxiv.org/abs/1804.06324
github：https://github.com/ishmav16/Dual-CNN-Models-for-Unsupervised-Monocular-Depth-Estimation/tree/master/DNM6
Note: Unsupervised learning, wow!


# 6-DoF Tracking

[2]《Egocentric 6-DoF Tracking of Small Handheld Objects》

2018 arXiv

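Abstract: Virtual and augmented reality technologies have grown significantly over the past few years. A key component of such systems is the ability to track the pose of head-mounted displays and controllers in 3D space. We tackle efficient 6-DoF tracking of a handheld controller from egocentric camera perspectives. We collected the HMD Controller dataset, which consists of over 540,000 stereo image pairs labeled with the full 6-DoF pose of the handheld controller. Our proposed SSD-AF-Stereo3D model achieves a mean average error of 33.5 millimeters in 3D keypoint prediction and is used in conjunction with an IMU sensor on the controller to enable 6-DoF tracking. We also present results on approaches for model-based full 6-DoF tracking. All our models operate under the strict constraint of real-time mobile CPU inference.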
arXiv：https://arxiv.org/abs/1804.05870


# Image Synthesis

[3]《Geometry-aware Deep Network for Single-Image Novel View Synthesis》
CVPR 2018
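Abstract: This paper tackles novel view synthesis from a single image. In particular, we target real-world scenes with rich geometric structure, a challenging task because the appearance of such scenes varies widely and there are no simple 3D models to represent them. Modern learning-based approaches mainly focus on appearance to synthesize novel views and therefore tend to produce predictions that are inconsistent with the underlying scene structure. By contrast, we propose to exploit the 3D geometry of the scene to synthesize a novel view. Specifically, we approximate a real-world scene with a fixed number of planes and learn to predict a set of homographies and their corresponding region masks to transform the input image into a novel view. To this end, we develop a new region-aware geometric transform network that performs these multiple tasks in a common framework. Our results on the outdoor KITTI and indoor ScanNet datasets demonstrate the effectiveness of our network in generating high-quality synthetic views that respect the scene geometry, outperforming state-of-the-art methods.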
arXiv：https://arxiv.org/abs/1804.06008


# Motion Capture

[4]《Human Motion Capture Using a Drone》
ICRA 2018
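Abstract: Current motion capture (MoCap) systems generally require markers and multiple calibrated cameras, which can be used only in constrained environments. In this work, we introduce a drone-based system for 3D human MoCap. The system needs only an autonomously flying drone with an on-board RGB camera and works in a variety of indoor and outdoor environments. A reconstruction algorithm is developed to recover full-body motion from the video recorded by the drone. We argue that, besides the ability to track a moving subject, a flying drone also provides rapidly changing viewpoints, which benefits motion reconstruction. We evaluate the accuracy of the proposed system on our new DroCap dataset and demonstrate its applicability in the wild using a consumer drone.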
arXiv：https://arxiv.org/abs/1804.06112
Note: A wildly imaginative study, very cool.

================================================
FILE: 2018/05/19.md
================================================
**2018-05-19**

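Summary: This post has four paper briefs covering face recognition (a survey), face detection, 3D object detection and pose estimation, and object detection (including two CVPR 2018 papers).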

# Face

[1]《Deep Face Recognition: A Survey》
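Abstract: Driven by graphics processing units (GPUs), large amounts of annotated data, and more advanced algorithms, deep learning has reshaped the computer vision community and has greatly benefited real-world applications, including face recognition (FR). Deep FR methods use deep networks to learn more discriminative representations, significantly improving the state of the art and surpassing human performance (97.53%). In this paper, we provide a comprehensive survey of deep FR methods, covering data, algorithms, and scenes. First, we summarize the commonly used training and testing datasets. Then, data preprocessing methods are divided into two categories: "one-to-many augmentation" and "many-to-one normalization". Second, for algorithms, we summarize the different network architectures and loss functions used in state-of-the-art methods. Third, we review several scenes in deep FR, such as video FR, 3D FR, and cross-age FR. Finally, some potential deficiencies of current methods and several future directions are highlighted.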
arXiv：https://arxiv.org/abs/1804.06655
Note: A survey paper, highly recommended!

[2]《SFace: An Efficient Network for Face Detection in Large Scale Variations》
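Abstract: Face detection is a fundamental research topic for many applications such as face recognition. Recent advances in convolutional neural networks have brought impressive progress. However, the large scale variations that are common in high-resolution images and videos have not been well addressed in the literature. In this paper, we propose a new algorithm called SFace, which efficiently integrates anchor-based and anchor-free methods to address the scale problem. A new dataset called 4K-Face is also introduced to evaluate face detection under extreme scale variations. The SFace architecture shows promising results on the new 4K-Face benchmark. In addition, our method runs at 50 frames per second (fps) with 80% AP on the standard WIDER FACE dataset, nearly an order of magnitude faster than existing algorithms at comparable performance.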
arXiv：https://arxiv.org/abs/1804.06559


# 3D Object Detection and Pose Estimation

[3]《Falling Things: A Synthetic Dataset for 3D Object Detection and Pose Estimation》
CVPR 2018 Workshop on Real World Challenges and New Benchmarks for Deep Learning in Robotic Vision
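Abstract: We present a new dataset, called Falling Things (FAT), for advancing the state of the art in object detection and 3D pose estimation in the context of robotics. By synthetically combining object models and backgrounds of complex composition and high graphical quality, we are able to generate photorealistic images with accurate 3D pose annotations for all objects in all images. Our dataset contains 60k annotated photos of 21 household objects taken from the YCB dataset. For each image, we provide the 3D poses, per-pixel class segmentation, and 2D/3D bounding box coordinates for all objects. To facilitate testing different input modalities, we provide mono and stereo RGB images, along with registered dense depth images. We describe in detail the generation process and statistical analysis of the data.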
arXiv：https://arxiv.org/abs/1804.06534
datasets：http://research.nvidia.com/publication/2018-06_Falling-Things


# Object Detection

[4]《Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization》
CVPR 2018 Workshop on Autonomous Driving
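Abstract: We present a system for training deep neural networks for object detection using synthetic images. To handle the variability of real-world data, the system relies on domain randomization, in which the parameters of the simulator (such as lighting, pose, and object textures) are randomized in non-realistic ways to force the neural network to learn the essential features of the object of interest. We explore the importance of these parameters, showing that it is possible to produce a network with compelling performance using only non-artistically-generated synthetic data. With additional fine-tuning on real data, the network performs better than one trained on real data alone. This result opens up the possibility of training neural networks with inexpensive synthetic data while avoiding the need to collect large amounts of hand-annotated real-world data or to generate high-fidelity synthetic worlds, both of which remain bottlenecks for many applications. The approach is evaluated on bounding box detection of cars on the KITTI dataset.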
arXiv：https://arxiv.org/abs/1804.06516
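
The domain randomization idea in the abstract above (randomizing simulator parameters such as lighting, pose, and textures) can be sketched roughly as below. This is only an illustration under stated assumptions: the parameter names, their ranges, and the `sample_scene_params` helper are hypothetical and do not come from the paper or from any real simulator API.

```python
import random


def sample_scene_params(rng):
    """Draw one random scene configuration.

    Every parameter name and range here is an illustrative assumption,
    not a value taken from the paper.
    """
    return {
        "num_lights": rng.randint(1, 4),          # inclusive on both ends
        "light_intensity": rng.uniform(0.2, 2.0),
        "object_yaw_deg": rng.uniform(0.0, 360.0),
        "texture_id": rng.randrange(1000),        # index into a texture bank
        "camera_height_m": rng.uniform(0.5, 3.0),
    }


rng = random.Random(0)  # seeded so a run can be reproduced
scenes = [sample_scene_params(rng) for _ in range(3)]
for scene in scenes:
    print(scene)
```

Each sampled dictionary would be handed to a renderer to produce one synthetic training image; the point of the sketch is only that the parameters are drawn independently and without regard to realism.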

================================================
FILE: 2018/05/22.md
================================================
**2018-05-22**

Summary: This post has four paper briefs covering image segmentation, video segmentation, object tracking, and anomaly detection.

# Image Segmentation

**[1]《Deep Object Co-Segmentation》**

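Abstract: This work presents a deep object co-segmentation (DOCS) approach for segmenting common objects of the same class within a pair of images. This means the method learns to ignore common or uncommon background content and to focus on objects. If multiple object classes are present in the image pair, they are jointly extracted as foreground. To address this task, we propose a CNN-based Siamese encoder-decoder architecture. The encoder extracts high-level semantic features of the foreground objects, a mutual correlation layer detects the common objects, and finally the decoder generates the output foreground mask for each image. To train our model, we compile a large object co-segmentation dataset consisting of image pairs from the PASCAL VOC dataset with common-object masks. We evaluate our approach on commonly used co-segmentation datasets and observe that it consistently outperforms competing methods on both seen and unseen object classes.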
arXiv：https://arxiv.org/abs/1804.06423

Note: Co-segmentation, very cool!


# Video Segmentation

**[2]《Superframes, A Temporal Video Segmentation》**

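Abstract: The goal of video segmentation is to turn video data into a set of concrete motion clusters that can be easily interpreted as building blocks of the video. There is some work on similar topics, such as detecting scene cuts in a video, but very little specific research on clustering video data into a desired number of compact segments. It is more intuitive and more efficient to work with perceptually meaningful entities obtained from a low-level grouping process, which we call superframes. This paper presents a new, simple, and effective technique to detect superframes of similar content patterns in videos. We compute the similarity of content motion to obtain the strength of change between consecutive frames. With the help of an existing optical flow technique based on deep models, the proposed method performs more accurate motion estimation efficiently. We also propose two criteria for measuring and comparing the performance of different algorithms on various databases. Experimental results on videos from benchmark databases demonstrate the effectiveness of the proposed method.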


arXiv：https://arxiv.org/abs/1804.06642


# Object Tracking

**[3]《Unveiling the Power of Deep Tracking》**

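Abstract: Numerous attempts have been made to exploit deep features in generic object tracking. Despite high expectations, deep trackers have yet to reach an outstanding level of performance compared with methods based solely on handcrafted features. In this paper, we investigate this key issue and propose an approach to unlock the true potential of deep features for tracking. We systematically study the characteristics of both deep and shallow features and their relation to tracking accuracy and robustness. We identify limited data and low spatial resolution as the main challenges, and propose strategies to counter these issues when integrating deep features for tracking. Furthermore, we propose a novel adaptive fusion approach that leverages the complementary properties of deep and shallow features to improve both robustness and accuracy. Extensive experiments are performed on four challenging datasets. On VOT2017, our approach significantly outperforms the top-performing tracker in terms of EAO, with a relative gain of 17%.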

arXiv：https://arxiv.org/abs/1804.06833

Note: The deep features mentioned above are presumably features extracted by deep neural networks.


# Anomaly Detection

**[4]《Temporal Unknown Incremental Clustering (TUIC) Model for Analysis of Traffic Surveillance Videos》**

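Abstract: Optimized scene representation is an important characteristic of a framework for detecting abnormalities in live video. One of the challenges in detecting abnormalities in live video is detecting objects in real time in a non-parametric way. Another challenge is to efficiently represent the state of objects across frames. In this paper, a Gibbs-sampling-based heuristic model referred to as Temporal Unknown Incremental Clustering (TUIC) is proposed to cluster pixels with motion. Pixel motion is first detected using optical flow, and a Bayesian algorithm is then applied to associate pixels belonging to similar clusters in subsequent frames. The algorithm is fast and produces accurate results in Θ(kn) time, where k is the number of clusters and n the number of pixels. Our experimental validation on publicly available datasets shows that the proposed framework has good potential to open up new opportunities for real-time traffic analysis.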

arXiv：https://arxiv.org/abs/1804.06680

Note: For analysis of traffic surveillance videos.

================================================
FILE: 2018/05/24.md
================================================
**2018-05-24**

Summary: This post has five paper briefs covering liveness detection, SfM, disparity estimation, zero-shot learning, and 3D shape.

# Liveness Detection

**[1]《Liveness Detection Using Implicit 3D Features》**

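Abstract: Spoofing attacks are a threat to modern face recognition systems. In this work, we present a simple yet effective liveness detection approach to enhance 2D face recognition methods and make them robust against spoofing attacks. We show that the risk of spoofing attacks can be reduced by using an additional source of light, for example a flash. From a pair of input images taken under different illumination, we define discriminative features that carry 3D information about the face. Furthermore, we show that when multiple sources of light are considered, we are able to verify which one has been activated. This makes it possible to design a highly secure active-light authentication framework. Finally, further investigating how to use 3D features without 3D reconstruction, we introduce an approximate disparity-based implicit 3D feature obtained from an uncalibrated stereo pair of cameras. Evaluation experiments show that the proposed method produces state-of-the-art results in challenging scenarios with almost no feature extraction latency.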

arXiv：https://arxiv.org/abs/1804.06702


# **Structure from Motion (SfM)**

**[2]《Structure from Recurrent Motion: From Rigidity to Recurrency》**

**CVPR 2018**

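Abstract: This paper proposes a new method for Non-Rigid Structure-from-Motion (NRSfM) from a long monocular video sequence observing a non-rigid object performing recurrent and possibly repetitive dynamic actions. Departing from the traditional NRSfM idea of using a linear low-order or low-rank shape model, our method exploits the property of shape recurrency (that is, many deforming shapes tend to repeat themselves in time). We show that recurrency is in fact a generalized form of rigidity. Based on this, we reduce the NRSfM problem to a rigid one, provided that certain recurrency conditions are satisfied. Given such a reduction, standard rigid SfM techniques are directly applicable (without any change) to reconstructing non-rigid dynamic shapes. To turn this idea into a practical approach, the paper develops efficient algorithms for automatic recurrency detection, as well as camera view clustering via a rigidity check. Experiments on both simulated sequences and real data demonstrate the effectiveness of the method. Since this paper offers a novel perspective for rethinking structure from motion, we hope it will inspire other new problems in the field.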

arXiv：https://arxiv.org/abs/1804.06510

Note: NRSfM, great research!


# Disparity Estimation

**[3]《Variational Disparity Estimation Framework for Plenoptic Image》**

**ICME 2017**

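Abstract: This paper presents a computational framework for accurately estimating the disparity map of plenoptic images. The proposed framework is based on the variational principle and provides intrinsic sub-pixel precision. The light-field motion tensor introduced in the framework allows us to combine advanced robust data terms and to treat different color channels explicitly. A warping strategy is embedded in our framework to handle the large-displacement problem. We also show that applying a simple regularization term and a guided median filter greatly improves the accuracy of the displacement field in occluded areas. We demonstrate the excellent performance of the proposed framework through in-depth comparisons with the Lytro software on both synthetic and real-world datasets.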

arXiv：https://arxiv.org/abs/1804.06633

github：https://github.com/hieuttcse/variational_plenoptic_disparity_estimation


# Zero-shot Learning

**[4]《Zero-shot Learning with Complementary Attributes》**

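Abstract: Zero-shot learning (ZSL) aims to recognize unseen objects using disjoint seen objects via attributes, transferring semantic information from training data to testing data. The generalization performance of ZSL is governed by the attributes, which represent the relatedness between seen and unseen classes. In this paper, we propose a novel ZSL method that uses complementary attributes as a supplement to the original attributes. We first expand the attributes with their complementary forms, then pre-train classifiers for both the original and complementary attributes on the training data. After ranking for each attribute, we use a rank aggregation framework to compute the optimized rank among testing classes, and the highest-ranked class is assigned as the label of the testing sample. We empirically demonstrate that complementary attributes bring an effective improvement to ZSL models. Experimental results show that our approach outperforms state-of-the-art methods on standard ZSL datasets.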

arXiv：https://arxiv.org/abs/1804.06505
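
The first step in the abstract above, expanding attributes with their complementary forms, might look like the sketch below. It assumes attribute scores lie in [0, 1] and that the complement of a score `a` is `1 - a`; that definition and the `add_complementary_attributes` helper are assumptions made for illustration, not details confirmed by the abstract.

```python
def add_complementary_attributes(attributes):
    """Append the complement (1 - a) of every attribute score a in [0, 1].

    Assumption for illustration: the complementary form of an attribute
    is simply one minus its score.
    """
    for a in attributes:
        if not 0.0 <= a <= 1.0:
            raise ValueError(f"attribute score out of range: {a}")
    return list(attributes) + [1.0 - a for a in attributes]


# Hypothetical class described by three attribute scores.
class_attributes = [1.0, 0.25, 0.0]
print(add_complementary_attributes(class_attributes))
# [1.0, 0.25, 0.0, 0.0, 0.75, 1.0]
```

The expanded vector is twice as long, so any per-attribute classifier or ranking step downstream sees both "has the attribute" and "lacks the attribute" as explicit signals.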


# 3D Shape

**[5]《Semi-Supervised Co-Analysis of 3D Shape Styles from Projected Lines》**

**ACM Transactions on Graphics 2018**

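Abstract: We present a semi-supervised co-analysis method for learning 3D shape styles from projected feature lines, achieving style patch localization with only weak supervision. Given a collection of 3D shapes spanning multiple object categories and styles, we perform style co-analysis over the projected feature lines of each 3D shape and then back-project the learned style features onto the 3D shapes. Our core analysis pipeline starts with mid-level patch sampling and pre-selection of candidate style patches. Projected features are then encoded via patch convolution. Multi-view feature integration and style clustering are carried out under the framework of partially shared latent factor (PSLF) learning, a multi-view feature learning scheme. PSLF achieves effective multi-view feature fusion by distilling consistent and complementary feature information from multiple views, while also selecting style patches from the candidates. Our style analysis approach supports both unsupervised and semi-supervised analysis. For the latter, our method accepts user-specified shape labels and style-ranked triplets as clustering constraints. We demonstrate results of 3D shape style analysis and patch localization, as well as improvements over state-of-the-art methods. We also present several applications enabled by our style analysis.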

arXiv：https://arxiv.org/abs/1804.06579

================================================
FILE: 2018/05/29.md
================================================
**2018-05-29**

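Summary: This post has four paper briefs covering image classification, video classification, and semantic segmentation (including one ICLR 2018 paper and one CVPR 2018 paper).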

# Image Classification

**[1]《IamNN: Iterative and Adaptive Mobile Neural Network for Efficient Image Classification》**

ICLR 2018 Workshop track

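Abstract: Deep residual networks (ResNets) marked a recent breakthrough in deep learning. The core idea of ResNets is to add shortcut connections between layers, allowing the network to be much deeper while remaining easy to optimize and avoiding vanishing gradients. These shortcut connections have interesting side effects that make ResNets behave differently from other typical network architectures. In this work, we use these properties to design a network based on a ResNet but with parameter sharing and adaptive computation time. The resulting network is much smaller than the original network and can adapt its computational cost to the complexity of the input image.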

arXiv：https://arxiv.org/abs/1804.10123


**[2]《Progressive Neural Networks for Image Classification》**

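Abstract: The inference structures and computational complexity of existing deep neural networks are fixed once trained and remain the same for all test images. In practice, however, it is highly desirable to establish a progressive structure for deep neural networks that can adapt its inference process and complexity to images of different visual recognition complexity. In this work, we develop a multi-stage progressive structure with integrated confidence analysis and decision policy learning for deep neural networks. This new framework consists of a set of network units that are activated sequentially, with progressively increasing complexity and visual recognition power. Our extensive experimental results on the CIFAR-10 and ImageNet datasets show that the proposed progressive deep neural network achieves more than 10-fold complexity scalability while delivering state-of-the-art performance with a single network model that satisfies different complexity-accuracy requirements.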

arXiv：https://arxiv.org/abs/1804.09803

Note: A very interesting study.
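
The sequential activation described in the abstract above (units run in order until the prediction is trusted) can be sketched as an early-exit loop. This is a simplification under stated assumptions: a fixed confidence threshold stands in for the paper's learned decision policy, and `make_unit`, the "difficulty" input, and the capacities are all hypothetical stand-ins for real network units.

```python
def progressive_predict(sample, units, confidence_threshold=0.9):
    """Run units in order and stop at the first sufficiently confident one.

    Returns (label, confidence, number_of_units_used).
    """
    label, confidence, used = None, 0.0, 0
    for unit in units:
        label, confidence = unit(sample)
        used += 1
        if confidence >= confidence_threshold:
            break  # easy input: skip the remaining, costlier units
    return label, confidence, used


def make_unit(capacity):
    """Hypothetical unit: confidence drops as difficulty nears its capacity."""
    def unit(difficulty):
        confidence = max(0.0, min(1.0, 1.0 - difficulty / capacity))
        return "cat", confidence
    return unit


# Three stand-in units of increasing capacity (and, implicitly, cost).
units = [make_unit(1.0), make_unit(4.0), make_unit(16.0)]
for difficulty in (0.05, 0.3, 1.0):
    print(difficulty, progressive_predict(difficulty, units))
```

Easy inputs exit after the first unit while hard ones pay for all three, which is where the complexity scalability mentioned in the abstract would come from.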


# Video Classification

**[3]《Better and Faster: Knowledge Transfer from Multiple Self-supervised Learning Tasks via Graph Distillation for Video Classification》**

IJCAI 2018

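Abstract: Video representation learning is a vital problem for classification tasks. Recently, a promising unsupervised paradigm termed self-supervised learning has emerged, which explores the inherent supervisory signals implied in massive data for feature learning by solving auxiliary tasks. However, when extended to video classification, existing methods in this line suffer from two limitations. First, they focus on a single task only, ignoring the complementarity among different task-specific features, which leads to suboptimal video representations. Second, high computational and memory costs hinder their application in real-world scenarios. In this paper, we propose a graph-based distillation framework to address these problems: (1) We propose a logits graph and a representation graph to transfer knowledge from multiple self-supervised tasks; the former distills classifier-level knowledge by solving a multi-distribution joint matching problem, and the latter distills internal feature knowledge from pairwise ensembled representations while tackling the challenge of heterogeneity among different features. (2) Adopting a teacher-student framework significantly reduces the redundancy of the knowledge learned from the teachers, yielding a lighter student model that solves the classification task more efficiently. Experimental results on 3 video datasets validate that our proposal not only helps learn better video representations but also compresses the model for faster inference.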

arXiv：https://arxiv.org/abs/1804.10069


# Semantic Segmentation

**[4]《Fully Convolutional Adaptation Networks for Semantic Segmentation》**

CVPR 2018, Rank 1 in Segmentation Track of Visual Domain Adaptation Challenge 2017

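Abstract: Recent advances in deep neural networks have convincingly demonstrated a high capability for learning vision models on large datasets. Nevertheless, collecting expert-labeled datasets, especially with pixel-level annotations, is an extremely expensive process. An appealing alternative is to render synthetic data (for example, from computer games) and generate ground truth automatically. However, simply applying models learned on synthetic images may lead to high generalization error on real images because of domain shift. In this paper, we approach this issue from the perspectives of both visual appearance-level and representation-level domain adaptation. The former adapts source-domain images so that they appear as if drawn from the "style" of the target domain, and the latter attempts to learn domain-invariant representations. Specifically, we present Fully Convolutional Adaptation Networks (FCAN), a novel deep architecture for semantic segmentation that combines Appearance Adaptation Networks (AAN) and Representation Adaptation Networks (RAN). AAN learns a transformation from one domain to the other in pixel space, and RAN is optimized in an adversarial learning manner to maximally fool a domain discriminator with the learned source and target representations. Extensive experiments are conducted on transfer from GTA5 (game videos) to Cityscapes (urban street scenes) for semantic segmentation, and our proposal achieves superior results compared with state-of-the-art unsupervised adaptation techniques. More remarkably, we obtain a new record: 47.5% mIoU on BDDS (drive-cam videos) in an unsupervised setting.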

arXiv：https://arxiv.org/abs/1804.08286

Note: Proposes Fully Convolutional Adaptation Networks (FCAN), an improved version of FCN, which combines Appearance Adaptation Networks (AAN) and Representation Adaptation Networks (RAN).

================================================
FILE: 2018/06/06.md
================================================
**2018-06-06**

Summary: This post has four paper briefs covering object tracking, GAN, zero-shot learning, video classification, and person re-identification (including one IJCAI 2018 paper and one IROS 2018 submission).

# Object Tracking

**《Detection-Tracking for Efficient Person Analysis: The DetTA Pipeline》**
**IROS 2018 submission** 
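Abstract: In the past decade many robots have been deployed in the wild, and person detection and tracking is an important component of such deployments. On top of that, it is often necessary to run modules that analyze persons and extract higher-level attributes such as age and gender, or dynamic information such as gaze and pose. The latter is especially necessary for building reactive, social robot-person interaction.
In this paper, we combine these components in a fully modular detection-tracking-analysis pipeline called DetTA. We investigate the benefits of this integration on the examples of head and skeleton pose, by using consistent track IDs for temporal filtering of the analysis modules' observations, showing a slight improvement in a challenging real-world scenario. We also study the potential of a so-called "free-flight" mode, in which the analysis of a person attribute relies solely on the filter's predictions for certain frames. Here, our study shows that this boosts runtime performance dramatically while prediction quality remains stable. This insight is especially important for reducing power consumption and sharing precious (GPU) memory when running many analysis components on a mobile platform, particularly in the era of expensive deep learning methods.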
arXiv：https://arxiv.org/abs/1804.10134
github：https://github.com/sbreuers/detta


# GAN & Zero-Shot Learning & Video Classification

**《Visual Data Synthesis via GAN for Zero-Shot Video Classification》**
**IJCAI 2018**
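Abstract: Zero-shot learning (ZSL) in video classification is a promising research direction that aims to tackle the challenge posed by the explosive growth of video categories. Most existing methods exploit seen-to-unseen correlation by learning a projection between the visual and semantic spaces. However, such projection-based paradigms cannot fully utilize the discriminative information implied in the data distribution, and they commonly suffer from information degradation caused by the "heterogeneity gap". In this paper, we propose a visual data synthesis framework via GAN to address these problems. Specifically, both semantic knowledge and the visual distribution are leveraged to synthesize video features of unseen categories, and ZSL is turned into a typical supervised problem with the synthetic features. First, we propose multi-level semantic inference to boost video feature synthesis, which captures the discriminative information implied in the joint visual-semantic distribution via feature-level and label-level semantic inference. Second, we propose matching-aware mutual information correlation to overcome the information degradation issue; it captures seen-to-unseen correlation in matched and mismatched visual-semantic pairs through mutual information, providing the zero-shot synthesis procedure with robust guidance signals. Experimental results on four video datasets demonstrate that our approach can significantly improve zero-shot video classification performance.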
arXiv：https://arxiv.org/abs/1804.10073
Note: Speaking as a CV beginner, zero-shot learning has been getting a lot of exposure lately! And GANs are as appealing as ever!

# Generative Model

**《Generative Model for Heterogeneous Inference》**

**arXiv 2018**

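Abstract: Generative models (GMs) such as the Generative Adversarial Network (GAN) and the Variational Auto-Encoder (VAE) have thrived in recent years and achieved high-quality results in generating new samples. In computer vision especially, GMs have been used for image inpainting, denoising, and similar tasks, which can be viewed as inference from observed pixels to corrupted pixels. However, images are hierarchically structured, which makes them quite different from many real-world inference scenarios with non-hierarchical features. Such inference scenarios contain heterogeneous stochastic variables and irregular mutual dependencies. Traditionally they are modeled by Bayesian Networks (BNs). However, learning and inference in BN models are NP-hard, so the number of stochastic variables in a BN is highly constrained. In this paper, we adapt typical GMs to enable heterogeneous learning and inference in polynomial time. We also propose an extended autoregressive (EAR) model and an EAR with adversarial loss (EARA) model, and give theoretical results on their effectiveness. Experiments on several BN datasets show that our proposed EAR model achieves the best performance in most cases compared with other GMs. Beyond black-box analysis, we also conduct a series of white-box analysis experiments on Markov border inference of GMs and give theoretical results.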
arXiv：https://arxiv.org/abs/1804.09858
Note: A really hardcore paper!

# Re-ID

**《Domain Adaptation through Synthesis for Unsupervised Person Re-identification》**

**arXiv 2018**

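Abstract: Drastic variations in illumination across surveillance cameras make the person re-identification problem extremely challenging. Current large-scale re-identification datasets have a significant number of training subjects but lack diversity in lighting conditions. As a result, a trained model requires fine-tuning to become effective under unseen illumination conditions. To alleviate this problem, we introduce a new synthetic dataset that contains hundreds of illumination conditions. Specifically, we use 100 virtual humans illuminated with multiple HDR environment maps that accurately model realistic indoor and outdoor lighting. To achieve better accuracy in unseen illumination conditions, we propose a novel domain adaptation technique that takes advantage of our synthetic data and performs fine-tuning in a completely unsupervised way. Our approach yields higher accuracy than semi-supervised and unsupervised state-of-the-art methods, and is very competitive with supervised techniques.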
arXiv：https://arxiv.org/abs/1804.10094

================================================
FILE: 2018/06/08.md
================================================
**2018-06-08**

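Summary: This post has four paper briefs covering capsule networks, transfer learning, CNN optimization, and finger detection (including one NIPS 2017 paper, one ICMR 2018 paper, and one VCIP 2017 paper).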

# Capsule Networks & Transfer Learning

**《Capsule networks for low-data transfer learning》**

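Abstract: We propose a capsule network-based architecture for generalizing learning to new data with few examples. Using both generative and non-generative capsule networks with intermediate routing, we are able to generalize to new information 25 times faster than a similar convolutional neural network. We train the networks on the multiMNIST dataset with one digit missing. After the networks reach their maximum accuracy, we inject 1-100 examples of the missing digit into the training set and measure the number of batches needed to return to a comparable level of accuracy. We then discuss the improvement in low-data transfer learning that capsule networks bring and propose future directions for capsule research.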

arXiv：https://arxiv.org/abs/1804.10172

Note: Capsule networks seem to have cooled off lately.

# CNN

**《Competitive Learning Enriches Learning Representation and Accelerates the Fine-tuning of CNNs》**

**NIPS 2017**

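Abstract: In this study, we propose integrating competitive learning into convolutional neural networks (CNNs) to improve representation learning and the efficiency of fine-tuning. Conventional CNNs use back-propagation learning, which enables powerful representation learning through a discrimination task. However, it requires a huge amount of labeled data, and labeled data is much harder to acquire than unlabeled data. Efficient use of unlabeled data is therefore becoming crucial for DNNs. To address this problem, we introduce unsupervised competitive learning into the convolutional layer and utilize unlabeled data for effective representation learning. The results of validation experiments with a toy model show that strong representation learning effectively extracts the bases of images into the convolutional filters using unlabeled data, and accelerates the subsequent supervised back-propagation fine-tuning. The leverage is more apparent when the number of filters is sufficiently large, in which case the error rate drops steeply in the initial phase of fine-tuning. Thus, the proposed method enlarges the number of filters in CNNs and enables a more detailed and generalized representation. It could open up the possibility of neural networks that are not only deep but also broad.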

arXiv：https://arxiv.org/abs/1804.09859

# Visual Estimation

**《Visual Estimation of Building Condition with Patch-level ConvNets》**

**ICMR 2018**

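Abstract: The condition of a building is an important factor in real estate valuation. Currently, appraisals by real estate appraisers are subjective to some degree. We propose a novel vision-based approach for assessing building condition from exterior views of the building. To this end, we develop a multi-scale patch-based pattern extraction approach and combine it with convolutional neural networks to estimate building condition from visual clues. Our evaluation shows that visually estimated building condition can serve as a proxy for the condition estimates of appraisers.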

arXiv：https://arxiv.org/abs/1804.10113

Note: Computer vision has a date with real estate appraisal! Quite the imagination!

# Finger Detection

**《Two-Stream Binocular Network: Accurate Near Field Finger Detection Based On Binocular Images》**

VCIP 2017

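Abstract: Fingertip detection plays an important role in human-computer interaction. Previous work converts binocular images into depth images, then uses depth-based hand pose estimation methods to predict the 3D positions of fingertips. Unlike previous work, we propose a new framework, named Two-Stream Binocular Network (TSBnet), that detects fingertips directly from binocular images. TSBnet first shares convolutional layers for low-level features of the left and right images, then extracts high-level features in two separate convolutional streams. In addition, we add a new layer, the binocular distance measurement layer, to improve the performance of our model. To verify our scheme, we construct a binocular hand image dataset containing about 117k image pairs in the training set and 10k image pairs in the test set. Our method achieves an average error of 10.9 mm on our test set, outperforming previous work by 5.9 mm (a relative improvement of 35.1%).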

arXiv：https://arxiv.org/abs/1804.10160

IEEE：https://ieeexplore.ieee.org/abstract/document/8305146/

Datasets：https://sites.google.com/view/thuhand17


================================================
FILE: 2018/06/11.md
================================================
**2018-06-11**

Summary: This post has four paper briefs covering CNN pruning, a new face detection dataset, forest tree classification, and traffic sign detection.

# CNN 

**《Accelerator-Aware Pruning for Convolutional Neural Networks》**

submitted to IEEE Transactions on Circuits and Systems for Video Technology

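Abstract: Convolutional neural networks have shown tremendous performance in computer vision tasks, but their excessive number of weights and operations prevents them from being adopted in embedded environments. One solution is pruning, in which some unimportant weights are forced to zero. Many pruning schemes have been proposed, but they focus mainly on the number of pruned weights. Previous pruning schemes hardly consider ASIC or FPGA accelerator architectures. When a pruned network runs on an accelerator, this lack of architectural consideration causes inefficiencies, including internal buffer misalignment and load imbalance. This paper proposes a new pruning scheme that reflects accelerator architectures. In the proposed scheme, pruning is performed so that the same number of weights remain in each weight group corresponding to activations fetched simultaneously. In this way, the pruning scheme resolves the inefficiency problems. Even with this constraint, the proposed scheme reaches a pruning ratio similar to that of previous unconstrained pruning schemes, not only on AlexNet and VGG16 but also on state-of-the-art very deep networks such as ResNet. Furthermore, the proposed scheme shows a comparable pruning ratio on slimmed networks whose channels have already been pruned. In addition to improving the efficiency of previous sparse accelerators, it is also shown that the proposed pruning scheme can be used to reduce the logic complexity of sparse accelerators.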

arXiv：https://arxiv.org/abs/1804.09862
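
The constraint in the abstract above, keeping the same number of weights in every weight group, can be illustrated with a minimal sketch. It assumes groups are contiguous chunks of a flat weight list and that the weights to keep are chosen by magnitude; how groups map to simultaneously fetched activations on a real accelerator, and the paper's actual selection criterion, are not given in the abstract, so `prune_groups` is a hypothetical helper.

```python
def prune_groups(weights, group_size, keep_per_group):
    """Zero all but the `keep_per_group` largest-magnitude weights per group.

    Assumption for illustration: groups are contiguous chunks of
    `group_size` weights in a flat list.
    """
    if len(weights) % group_size != 0:
        raise ValueError("weights must split evenly into groups")
    if not 0 <= keep_per_group <= group_size:
        raise ValueError("keep_per_group must be between 0 and group_size")
    pruned = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Positions of the largest-magnitude weights within this group.
        ranked = sorted(range(group_size), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:keep_per_group])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.02, 0.03, -0.01, 0.04]
print(prune_groups(weights, group_size=4, keep_per_group=2))
# [0.9, 0.0, 0.0, -0.7, 0.0, 0.03, 0.0, 0.04]
```

Note that the second group keeps 0.03 and 0.04 even though both are smaller than the pruned -0.1 in the first group: a global magnitude threshold would not give each group the same number of survivors, which is the load-balance property the abstract is after.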

# Face

**《Pushing the Limits of Unconstrained Face Detection: a Challenge Dataset and Baseline Results》**

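Abstract: Face detection has made immense progress in the last few years, with new milestones surpassed every year. While many challenges such as large variations in scale, pose, and appearance have been tackled successfully, several issues remain that are not specifically captured by existing methods or datasets. In this work, we identify the next set of challenges that require attention from the research community and collect a new dataset of face images that involve these issues, such as weather-based degradations, motion blur, and focus blur. We demonstrate that there is a considerable gap between the performance of state-of-the-art detectors and real-world requirements. Hence, to fuel further research in unconstrained face detection, we present a new annotated Unconstrained Face Detection Dataset (UFDD) with several challenges and benchmark recent methods on it. In addition, we provide an in-depth analysis of the results and failure cases of these methods. The dataset and baseline results will be made publicly available in due time.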

arXiv：https://arxiv.org/abs/1804.10275

# Image Classification

**《Automatic classification of trees using a UAV onboard camera and deep learning》**

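Abstract: Automatic classification of trees using remotely sensed data has been a dream of many scientists and land use managers. Recently, unmanned aerial vehicles (UAVs) have come to be seen as an easy-to-use, cost-effective tool for remote sensing of forests, and deep learning has attracted attention for its capabilities in machine vision. In this study, using a commercially available UAV and a publicly available package for deep learning, we constructed a machine vision system for the automatic classification of trees. In our method, we segmented a UAV photograph of a forest into individual tree crowns and carried out object-based deep learning. As a result, the system was able to classify 7 tree types with 89.0% accuracy. This performance is notable because we used only basic RGB images from a standard UAV. In contrast, most previous studies used expensive hardware such as multispectral imagers to improve performance. This result means that our method has the potential to classify individual trees in a cost-effective manner. This can be a useful tool for many forest researchers and managers.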

arXiv：https://arxiv.org/abs/1804.10390

# Traffic Sign Detection

**《Localized Traffic Sign Detection with Multi-scale Deconvolution Networks》**

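Abstract: Effective traffic sign detection with deep learning plays a critical role in autonomous driving. However, different countries have different sets of traffic signs, which makes training localized traffic sign recognition models a tedious and daunting task. To address the long run time of computationally complex algorithms and the low detection rate on blurred and sub-pixel images of localized traffic signs, we propose Multi-Scale Deconvolution Networks (MDN), which combine a multi-scale convolutional neural network with a deconvolution sub-network, leading to efficient and reliable training of localized traffic sign recognition models. The proposed MDN is shown to be effective compared with classical algorithms on localized traffic sign benchmarks such as the Chinese Traffic Sign Dataset (CTSD) and the German Traffic Sign Benchmark (GTSRB).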

arXiv：https://arxiv.org/abs/1804.10428

================================================
FILE: 2018/06/13.md
================================================
**2018-06-13**

Summary: This post has four paper briefs, all on image segmentation; three of them improve on the U-Net architecture.

# Image Segmentation

**《dhSegment: A generic deep-learning approach for document segmentation》**

Abstract: In recent years there have been multiple successful attempts tackling document processing problems separately by designing task specific hand-tuned strategies. We argue that the diversity of historical document processing tasks prohibits to solve them one at a time and shows a need for designing generic approaches in order to handle the variability of historical series. In this paper, we address multiple tasks simultaneously such as page extraction, baseline extraction, layout analysis or multiple typologies of illustrations and photograph extraction. We propose an open-source implementation of a CNN-based pixel-wise predictor coupled with task dependent post-processing blocks. We show that a single CNN-architecture can be used across tasks with competitive results. Moreover most of the task-specific post-processing steps can be decomposed in a small number of simple and standard reusable operations, adding to the flexibility of our approach.

arXiv：https://arxiv.org/abs/1804.10371

Note: Why no translation? Because... the original reads just fine.


**《Automatic Pixelwise Object Labeling for Aerial Imagery Using Stacked U-Nets》**

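Abstract: Automating the labeling of objects in aerial imagery is a computer vision task with many practical applications. Fields such as energy exploration require an automated method to process a continuous stream of imagery on a daily basis. In this paper, we propose a pipeline to tackle this problem using a stack of end-to-end convolutional neural networks (U-Net architecture). Each network works as a post-processor to the previous one. Our model outperforms the current state of the art on two different datasets, the Inria Aerial Image Labeling dataset and the Massachusetts Buildings dataset, each with different characteristics such as spatial resolution, object shapes, and scales. Moreover, we experimentally validate computation time savings obtained by processing sub-sampled images and later upsampling the pixelwise labeling. These savings come at a negligible cost in segmentation quality. Although the experiments in this paper cover only aerial imagery, the technique presented is general and can handle other types of images.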

arXiv：https://arxiv.org/abs/1803.04953


**《Stacked U-Nets: A No-Frills Approach to Natural Image Segmentation》**

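Abstract: Many imaging tasks require global information about all pixels in an image. Conventional bottom-up classification networks globalize information by decreasing resolution; features are pooled and downsampled into a single output. But for semantic segmentation and object detection tasks, a network must provide higher-resolution pixel-level outputs. To globalize information while preserving resolution, many researchers have proposed adding sophisticated auxiliary blocks, but these come at the cost of a considerable increase in network size and computational cost. This paper proposes stacked u-nets (SUNets), which iteratively combine features from different resolution scales while maintaining resolution. SUNets leverage the information globalization power of u-nets in a deeper network architecture that is capable of handling the complexity of natural images. SUNets perform extremely well on semantic segmentation tasks using a small number of parameters.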

arXiv：https://arxiv.org/abs/1804.10343

code：https://github.com/shahsohil/sunets

Note: Worth a closer look.


**《Stack-U-Net: Refinement Network for Image Segmentation on the Example of Optic Disc and Cup》**

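Abstract: In this work, we propose a special cascade network for image segmentation, based on the U-Net network as a building block and on the idea of iterative refinement. The model is mainly used to achieve higher recognition quality for the task of finding the borders of the optic disc and cup. Compared with a single U-Net and state-of-the-art methods, very high segmentation quality is achieved without increasing the volume of the datasets. Our experiments include comparison with the best-known methods on the publicly available databases DRIONS-DB, RIM-ONE v.3, and DRISHTI-GS, and evaluation on a private dataset collected in collaboration with the University of California San Francisco Medical School. An analysis of the architecture details is presented, and it is argued that the model can be employed for a broad range of image segmentation problems of a similar nature.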

arXiv：https://arxiv.org/abs/1804.11294

Amusi's takeaway: U-Net really can do whatever it wants in image segmentation (especially in medical imaging)!

================================================
FILE: 2018/06/15.md
================================================
**2018-06-15**

Summary: This post has four paper briefs, all on faces: face recognition, face detection, and facial expression recognition. One of them is a CVPR 2018 paper.

[TOC]

# Face Recognition

**《Scalable Angular Discriminative Deep Metric Learning for Face Recognition》**

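Abstract: With the development of deep learning, Deep Metric Learning (DML) has achieved great improvements in face recognition. Specifically, the widely used softmax loss often brings large intra-class variations during training, and feature normalization is exploited only at test time to compute pair similarities. To bridge this gap, we require the intra-class cosine similarity between the features and weight vectors in the softmax loss to be larger than a margin during training, and extend this in four aspects. First, we explore the effect of a hard sample mining strategy. To reduce the human labor of tuning the margin hyper-parameter, a self-adaptive margin updating strategy is proposed. Then, a normalized version is given to take full advantage of the cosine similarity constraint. Furthermore, we strengthen the former constraint to force the intra-class cosine similarity to be larger than the mean inter-class cosine similarity with a margin in the exponential feature projection space. Extensive experiments on the Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and IARPA Janus Benchmark A (IJB-A) datasets demonstrate that the proposed methods outperform mainstream DML methods and approach state-of-the-art performance.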

arXiv：https://arxiv.org/abs/1804.10899

Note: This paper feels really, really hardcore!
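
To make the margin constraint in the abstract above a bit more concrete, here is a rough sketch of a hinge-style penalty that becomes zero once the cosine similarity between a feature and its own class weight vector reaches the margin. The hinge form, the `margin_penalty` helper, and the default margin are assumptions for illustration; the abstract does not specify how the paper actually folds the constraint into the softmax loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_penalty(feature, class_weights, label, margin=0.5):
    """Penalty that is zero once cos(feature, w_label) >= margin.

    Assumed hinge form, shown only to illustrate the constraint.
    """
    cos_target = cosine(feature, class_weights[label])
    return max(0.0, margin - cos_target)


class_weights = [[1.0, 0.0], [0.0, 1.0]]  # one toy weight vector per class
feature = [1.0, 0.0]
print(margin_penalty(feature, class_weights, label=0))  # 0.0, aligned with class 0
print(margin_penalty(feature, class_weights, label=1))  # 0.5, orthogonal to class 1
```

Because cosine similarity ignores vector length, the same normalized comparison is used in training and at test time, which is the gap the abstract says the method is trying to close.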

# Facial Expression Recognition

**《Unsupervised Features for Facial Expression Intensity Estimation over Time》**

CVPR 2018

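Abstract: The diversity of facial shapes and motions among persons is one of the greatest challenges for the automatic analysis of facial expressions. In this paper, we propose a feature that describes expression intensity over time while being invariant to the person and to the type of expression performed. Our feature is a weighted combination of the dynamics of multiple points, adapted to the overall expression trajectory. We evaluate our method on several tasks, all related to the temporal analysis of facial expressions. The proposed feature is compared with a state-of-the-art method for expression intensity estimation, which it outperforms. We use our proposed feature to temporally align multiple sequences of recorded 3D facial expressions. Furthermore, we show how our feature can be used to reveal person-specific differences in the performance of facial expressions. Additionally, we apply our feature to identify local changes in face video sequences based on action unit labels. In all experiments, our feature proves robust against noise and outliers, making it applicable to a variety of facial motion analysis applications.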

arXiv：https://arxiv.org/abs/1805.00780

Note: Wow, this feature is great!

**《Local Learning with Deep and Handcrafted Features for Facial Expression Recognition》**
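Abstract: We present an approach that combines automatic features learned by convolutional neural networks (CNNs) with handcrafted features computed by the bag-of-visual-words (BOVW) model to achieve state-of-the-art results in facial expression recognition. To obtain automatic features, we experiment with multiple CNN architectures, pre-trained models, and training procedures, for example Dense-Sparse-Dense. After fusing the two types of features, we employ a local learning framework to predict the class label of each test image. The local learning framework is based on three steps. First, a k-nearest neighbors model is applied to select the training samples nearest to the input test image. Second, a one-versus-all Support Vector Machines (SVM) classifier is trained on the selected training samples. Finally, the SVM classifier is used to predict the class label only for the test image it was trained for. Although local learning has been used before in combination with handcrafted features, to the best of our knowledge it has never been employed in combination with deep features. Experiments on the 2013 Facial Expression Recognition (FER) Challenge dataset and the FER+ dataset demonstrate that our approach achieves state-of-the-art results. With a top accuracy of 75.42% on the FER 2013 dataset and 86.71% on the FER+ dataset, we surpass all competitors by nearly 2% on both datasets.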

arXiv：https://arxiv.org/abs/1804.10892

# Face Detection

**《Precise Box Score: Extract More Information from Datasets to Improve the Performance of Face Detection》**

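Abstract: When training face detection networks based on the R-CNN framework, anchors are assigned as positive samples if their intersection-over-union (IoU) with the ground truth is higher than a first threshold (such as 0.7), and as negative samples if their IoU is lower than a second threshold (such as 0.3). The face detection model is then trained on these labels. However, anchors with an IoU between the first and second thresholds go unused. We propose a novel training strategy, Precise Box Score (PBS), to train object detection models. The proposed training strategy uses the anchors with an IoU between the first and second thresholds, which consistently improves face detection performance. Our training strategy extracts more information from the datasets and makes better use of existing data. In addition, we introduce a simple but effective model compression method (SEMCM), which can further boost the performance of face detectors. Experimental results show that face detection networks can be consistently improved with our proposed scheme.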

arXiv：https://arxiv.org/abs/1804.10743

Note: Impressive. I wonder how Precise Box Score would perform on generic object detection?
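
The conventional labeling rule that the abstract above starts from (positive above 0.7 IoU, negative below 0.3, everything in between unused) can be sketched as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and `label_anchor` is a hypothetical helper. The sketch only identifies the in-between anchors; how PBS actually scores them is not described in the abstract and is not shown here.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Label an anchor by its best IoU against the ground-truth boxes."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > pos_thresh:
        return "positive", best
    if best < neg_thresh:
        return "negative", best
    return "in-between", best  # unused by the conventional scheme


gt_boxes = [(0, 0, 10, 10)]
for anchor in [(0, 0, 10, 8), (0, 0, 10, 5), (20, 20, 30, 30)]:
    print(anchor, label_anchor(anchor, gt_boxes))
```

With the toy boxes above, the three anchors have IoUs of 0.8, 0.5, and 0.0, so the middle one lands in the band that the conventional scheme throws away and that PBS aims to exploit.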

================================================
FILE: 2018/06/19.md
================================================
**2018-06-19**

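Summary: This post has four paper briefs, all on object detection, including pedestrian detection, vehicle detection, fingerprint detection, and object tracking.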

# Object Detection

**《Remote Detection of Idling Cars Using Infrared Imaging and Deep Networks》**

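Abstract: Idling vehicles waste energy and pollute the environment through exhaust emissions. In some countries, idling a vehicle for more than a predefined duration is prohibited, and law enforcement agencies need to detect idling vehicles automatically. We propose the first automatic system to detect idling cars, using infrared (IR) imaging and deep networks.

We rely on the differences in spatio-temporal heat signatures between idling and stopped cars, and monitor car temperature with a long-wavelength IR camera. We formulate idling car detection as spatio-temporal event detection in IR image sequences and employ deep networks for spatio-temporal modeling. We collected the first IR image sequence dataset for idling car detection. First, we detect cars in each IR image using a convolutional neural network that is pre-trained on regular RGB images and fine-tuned on IR images for higher accuracy. Then, we track the detected cars over time to identify the cars that are parked. Finally, we use the 3D spatio-temporal IR image volume of each parked car as input to convolutional and recurrent networks to classify it as idling or not. We carried out an extensive empirical evaluation of temporal and spatio-temporal modeling approaches with various convolutional and recurrent architectures. We present promising experimental results on our IR image sequence dataset.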

arXiv：https://arxiv.org/abs/1804.10805

Note: An idling vehicle is, simply put, one whose engine is running while the vehicle stays in place.

**《MV-YOLO: Motion Vector-aided Tracking by Semantic Object Detection》**

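Abstract: Object tracking is the cornerstone of many visual analytics systems. Although considerable progress has been made in this area in recent years, robust, efficient, and accurate tracking in real-world video remains a challenge. In this paper, we present a hybrid tracker that leverages motion information from the compressed video stream and a general-purpose semantic object detector acting on decoded frames to construct a fast and efficient tracking engine suitable for a number of visual analytics applications. The proposed approach is compared with several well-known recent trackers on the OTB tracking dataset. The results indicate advantages of the proposed method in terms of speed and accuracy. Another advantage of the proposed method over most existing trackers is its simplicity and deployment efficiency, which stems from reusing and re-purposing resources and information that may already exist in the system for other reasons.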

arXiv：https://arxiv.org/abs/1805.00107

**《Altered Fingerprints: Detection and Localization》**

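Abstract: Fingerprint alteration, also referred to as obfuscation presentation attack, is the intentional tampering with or damaging of the real friction ridge patterns to avoid identification by an AFIS. This paper proposes a method for detecting and localizing fingerprint alterations. Our main contributions are: (i) designing and training CNN models on fingerprint images and on minutiae-centered local patches in the image to detect and localize regions of fingerprint alteration, and (ii) training a Generative Adversarial Network (GAN) to synthesize altered fingerprints whose characteristics are similar to those of real altered fingerprints. A successfully trained GAN can alleviate the limited availability of altered fingerprint images for research. A database of 4,815 altered fingerprints from 270 subjects, and an equal number of rolled fingerprint images, are used to train and test our models. The proposed approach achieves a True Detection Rate (TDR) of 99.24% at a False Detection Rate (FDR) of 2%, outperforming published results. The altered fingerprint detection and localization model and code, as well as the synthetically generated altered fingerprint dataset, will be open-sourced.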

arXiv：https://arxiv.org/abs/1805.00911

**《Real-Time Human Detection as an Edge Service Enabled by a Lightweight CNN》**

IEEE EDGE 2018

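Abstract: Edge computing allows more computing tasks to take place on decentralized nodes at the edge of networks. Today many delay-sensitive, mission-critical applications can leverage these edge devices to reduce time delay or even to enable real-time, online decision making thanks to their on-site presence. Human detection, behavior recognition, and prediction in smart surveillance fall into this category, where transferring a huge volume of video streaming data can cost valuable time and place a heavy burden on communication networks. It is widely recognized that video processing and object detection are computing intensive and too expensive to be handled by resource-limited edge devices. Inspired by depthwise separable convolution and the Single Shot Multi-Box Detector (SSD), this paper introduces a lightweight Convolutional Neural Network (LCNN). By narrowing down the classifier's search space to focus on human objects in surveillance video frames, the proposed LCNN algorithm is able to detect pedestrians with a computation workload that is affordable for an edge device. A prototype has been implemented on an edge node (Raspberry PI 3) using the OpenCV libraries, and satisfactory performance is achieved on real-world surveillance video streams. The experimental study validates the design of LCNN and shows that it is a promising approach to computing-intensive applications at the edge.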

arXiv：https://arxiv.org/abs/1805.00330
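
To see why depthwise separable convolution helps on a device like the one described above, the weight counts of a standard convolution and a depthwise separable one can be compared directly. The sketch below uses the standard formulas (biases ignored); the layer sizes are arbitrary examples, not values from the paper.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise conv followed by a 1x1 pointwise conv (no bias)."""
    depthwise = kernel * kernel * c_in  # one kernel x kernel filter per input channel
    pointwise = c_in * c_out            # 1x1 conv that mixes channels
    return depthwise + pointwise


k, c_in, c_out = 3, 32, 64  # example layer sizes, chosen arbitrarily
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep, round(std / sep, 2))  # 18432 2336 7.89
```

For this example layer the separable form needs roughly an eighth of the weights, and the multiply-accumulate count per output position shrinks by the same factor.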

================================================
FILE: 2018/06/23.md
================================================
**2018-06-23**

This post has four paper briefs, all CVPR 2018 papers, covering zero-shot learning, image synthesis, and image translation.

# Zero-Shot Learning

**《Sketch-a-Classifier: Sketch-based Photo Classifier Generation》**

**CVPR 2018 Spotlight**

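Abstract: Contemporary deep learning techniques have made image recognition a reasonably reliable technology. However, training effective photo classifiers typically requires numerous examples, which limits the scalability of image recognition and its applicability to cases where images may not be available. This has motivated zero-shot learning, which addresses the issue via knowledge transfer from other modalities such as text. In this paper, we investigate an alternative approach to synthesizing image classifiers: almost directly from a user's imagination, via free-hand sketch. This approach doesn't require the category to be nameable or describable via attributes as per zero-shot learning. We achieve this by training a model regression network to map from free-hand sketch space to the space of photo classifiers. It turns out that this mapping can be learned in a category-agnostic way, allowing photo classifiers for new categories to be synthesized by the user with no need for annotated training photos. We also demonstrate that this modality of classifier generation can be used to enhance the granularity of an existing photo classifier, or as a complement to name-based zero-shot learning.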

arXiv：https://arxiv.org/abs/1804.11182


# Image Synthesis

**《Conditional Image-to-Image Translation》**

**CVPR 2018 Poster**

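Abstract: Generative adversarial networks (GANs) and dual learning have been widely applied to image-to-image translation tasks. However, existing models lack the ability to control the translated results in the target domain, and their results usually lack diversity, in the sense that a fixed image usually leads to (almost) deterministic translation results. In this paper, we study a new problem, conditional image-to-image translation, which is to translate an image from the source domain to the target domain conditioned on a given image in the target domain. It requires that the generated image inherit some domain-specific features of the conditional image from the target domain. Therefore, changing the conditional image in the target domain leads to diverse translation results for a fixed input image from the source domain, and the conditional input image thus helps control the translation results. We tackle this problem with unpaired data based on GANs and dual learning. We twist two conditional translation models (one from domain A to domain B, the other from domain B to domain A) together for input combination and reconstruction while preserving domain-independent features. We carry out experiments on translation between men's faces and women's faces and on edges-to-shoes and edges-to-bags translation. The results demonstrate the effectiveness of our proposed method.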

arXiv：https://arxiv.org/abs/1805.00251


**《Semi-parametric Image Synthesis》**

**CVPR 2018 Oral**

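Abstract: We present a semi-parametric approach to photographic image synthesis from semantic layouts. The approach combines the complementary strengths of parametric and nonparametric techniques. The nonparametric component is a memory bank of image segments constructed from a training set of images. At test time, given a novel semantic layout, the memory bank is used to retrieve photographic references that are provided as source material to a deep network. The synthesis is performed by a deep network that draws on the provided photographic material. Experiments on multiple semantic segmentation datasets show that the presented approach yields more realistic images than recent purely parametric techniques.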

arXiv：https://arxiv.org/abs/1804.10992

github：https://github.com/xjqicuhk/SIMS

video：https://www.youtube.com/watch?v=U4Q98lenGLQ&feature=youtu.be


**《Learning to Sketch with Shortcut Cycle Consistency》**

**CVPR 2018 Poster**

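Abstract: To see is to sketch: free-hand sketching naturally builds ties between human and machine vision. In this paper, we present a novel approach for translating an object photo into a sketch, mimicking the human sketching process. This is an extremely challenging task because the photo and sketch domains differ significantly. Furthermore, human sketches exhibit various levels of sophistication and abstraction even when depicting the same object instance in a reference photo. This means that even if photo-sketch pairs are available, they provide only a weak supervision signal for learning a translation model. Compared with existing supervised approaches that solve the problem of D(E(photo)) -> sketch, where E(·) and D(·) denote the encoder and decoder respectively, we take advantage of the inverse problem (e.g., D(E(sketch)) -> photo) and combine it with the unsupervised learning tasks of within-domain reconstruction, all within a multi-task learning framework. Compared with existing unsupervised approaches based on cycle consistency (i.e., D(E(D(E(photo)))) -> photo), we introduce a shortcut consistency enforced at the encoder bottleneck (e.g., D(E(photo)) -> photo) to exploit the additional self-supervision. Both qualitative and quantitative results show that the proposed model is superior to a number of state-of-the-art alternatives. We also show that the synthetic sketches can be used to train a better fine-grained sketch-based image retrieval (FG-SBIR) model, effectively alleviating the problem of sketch data scarcity.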

arXiv：https://arxiv.org/abs/1805.00247


A classic paper worth mentioning here:
**《Image-to-Image Translation with Conditional Adversarial Networks》**

homepage：https://phillipi.github.io/pix2pix/

arXiv：https://arxiv.org/abs/1611.07004

github：https://github.com/phillipi/pix2pix


================================================
FILE: 2018/06/29.md
================================================
**2018-06-29**

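This post has four paper briefs, all on faces: face recognition, facial expression recognition, facial emotion classification, and facial attribute prediction. One of them is a CVPR 2018 workshop paper.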

**《Robust Face Recognition with Deeply Normalized Depth Images》**

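Abstract: Depth information has been shown to be useful for face recognition. However, existing depth-image-based face recognition methods still suffer from noisy depth values and varying poses and expressions. In this paper, we propose a novel method for normalizing facial depth images to frontal pose and neutral expression and for extracting robust features from the normalized depth images. The method is implemented via two deep convolutional neural networks (DCNNs), a normalization network (NetN) and a feature extraction network (NetF). Given a facial depth image, NetN first converts it to an HHA image, from which the 3D face is reconstructed via a DCNN. NetN then generates a pose-and-expression normalized (PEN) depth image from the reconstructed 3D face. The PEN depth image is finally passed to NetF, which extracts a robust feature representation via another DCNN for face recognition. Our preliminary evaluation results demonstrate the superiority of the proposed method in recognizing faces of arbitrary poses and expressions from depth images.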

arXiv：https://arxiv.org/abs/1805.00406


**《Which Facial Expressions Can Reveal Your Gender? A Study With 3D Faces》**

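Abstract：Humans exhibit rich gender cues in both appearance and behavior. In computer vision, gender cues from facial appearance have been studied extensively, whereas gender recognition based on facial behavior remains rare. In this work, we first demonstrate that facial expressions influence the gender patterns presented in 3D faces, and that gender recognition performance increases when training and testing within the same expression. Further, we design experiments that directly extract the morphological changes resulting from facial expressions as features for expression-based gender recognition. Experimental results show that gender can be recognized with considerable accuracy in Happy and Disgust expressions, while Surprise and Sad expressions do not convey much gender-related information. This is the first work in the literature to investigate expression-based gender classification with 3D faces, and it reveals the strength of the gender patterns contained in different types of expressions, namely the Happy, Disgust, Surprise and Sad expressions.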

arXiv：https://arxiv.org/abs/1805.00371


**《I Know How You Feel: Emotion Recognition with Facial Landmarks》**

CVPR WiCV workshop 2018

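Abstract：Classification of human emotions remains an important and challenging task for many computer vision algorithms, especially in an era where humanoid robots coexist with humans in their everyday life. Currently proposed methods for emotion recognition solve this task using multi-layered convolutional networks that do not explicitly infer any facial features in the classification phase. In this work, we postulate a fundamentally different approach to emotion recognition that relies on incorporating facial landmarks as part of the classification loss function. To that end, we extend the recently proposed Deep Alignment Network (DAN), which achieved state-of-the-art results in a recent facial landmark recognition challenge, with a term related to facial features. Thanks to this simple modification, our model, called EmotionalDAN, is able to outperform state-of-the-art emotion classification methods on two challenging benchmark datasets by up to 5%.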

arXiv：https://arxiv.org/abs/1805.00326


**《A Deep Face Identification Network Enhanced by Facial Attributes Prediction》**

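Abstract：In this paper, we propose a new deep framework that predicts facial attributes and leverages them as a soft modality to improve face identification performance. Our model is an end-to-end framework consisting of a convolutional neural network (CNN) whose output is fanned out into two separate branches; the first branch predicts facial attributes while the second branch identifies face images. Contrary to existing multi-task methods that only use a shared CNN feature space to train these two tasks jointly, we fuse the predicted attributes with the features from the face modality in order to improve face identification performance. Experimental results show that the model benefits both face identification and facial attribute prediction, especially for identity-related facial attributes such as gender. We tested our model on two standard datasets annotated with identities and facial attributes. Experimental results indicate that the proposed model outperforms most existing face identification and attribute prediction methods.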

arXiv：https://arxiv.org/abs/1805.00324

================================================
FILE: 2018/07/02.md
================================================
**2018-07-02**

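This post covers 2 papers, both on image segmentation: semantic segmentation of motion capture data, and sclera segmentation combining FCN and GAN. One is from ACM SIGGRAPH 2018 and the other from BTAS 2018.

**Image Segmentation**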

**《Dilated Temporal Fully-Convolutional Network for Semantic Segmentation of Motion Capture Data》**

ACM SIGGRAPH 2018

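Abstract：Semantic segmentation of motion capture sequences plays a key part in many data-driven motion synthesis frameworks. It is a preprocessing step in which long recordings of motion capture sequences are partitioned into smaller segments. Afterwards, additional methods such as statistical modeling can be applied to each group of structurally similar segments to learn an abstract motion manifold. The segmentation task, however, often remains a manual task, which increases the effort and cost of generating large-scale motion databases. We therefore propose an automatic framework for semantic segmentation of motion capture data using a dilated temporal fully-convolutional network. Our model outperforms a state-of-the-art model in action segmentation, as well as three networks for sequence modeling. We further show that our model is robust against highly noisy training labels.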

arXiv：https://arxiv.org/abs/1806.09174


**《Fully Connected Networks and Generative Neural Networks Applied to Sclera Segmentation》**

BTAS 2018

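Abstract：Due to the world's demand for security systems, biometrics can be seen as an important topic of research in computer vision. One form of biometrics that has been gaining attention is recognition based on the sclera. The initial and paramount step for this type of recognition is segmenting the region of interest, i.e. the sclera. In this context, this paper presents two approaches based on the Fully Convolutional Network (FCN) and the Generative Adversarial Network (GAN). An FCN is similar to a common convolutional neural network, but the fully connected layers (i.e., the classification layers) are removed from the end of the network and the output is generated by combining the outputs of different convolutional layers. The GAN is based on game theory, where two networks compete with each other to produce the best segmentation. To allow a fair comparison with baselines and a quantitative and objective evaluation of the proposed approaches, we provide the scientific community with 1,300 new manually segmented images from two databases. The experiments are performed on the UBIRIS.v2 and MICHE databases, where the best-performing configurations of our approaches achieve F-scores of 87.48% and 88.32%, respectively.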

arXiv：https://arxiv.org/abs/1806.08722
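
The abstract above reports segmentation quality as an F-score. As a reminder of what that number measures, here is a minimal sketch of the standard F-score between a predicted binary mask and a ground-truth mask. The function name `f_score` and the toy masks are illustrative assumptions, not code from the paper.

```python
def f_score(pred, truth, beta=1.0):
    """F-score between two binary masks given as nested lists of 0/1."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1          # pixel correctly marked as sclera
            elif p and not t:
                fp += 1          # pixel wrongly marked as sclera
            elif t and not p:
                fn += 1          # sclera pixel that was missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy 2x3 masks (hypothetical data).
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
score = f_score(pred, truth)
```

With `beta=1.0` this is the harmonic mean of precision and recall; whether the paper averages per image or over all pixels is not stated in the abstract.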

================================================
FILE: 2018/07/05.md
================================================
**2018-07-05**

This post covers 4 papers, all on GANs, including text-to-image generation and multi-domain image generation. One of them is an IJCAI 2018 paper.

# GAN

**《Text to Image Synthesis Using Generative Adversarial Networks》**

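Abstract：Generating images from natural language is one of the primary applications of recent conditional generative models. Besides testing our ability to model conditional, high-dimensional distributions, text-to-image synthesis has many exciting and practical applications such as photo editing or computer-aided content creation. Recent progress has been made using Generative Adversarial Networks (GANs). This work begins with an introduction to these topics and discusses the existing state-of-the-art models. Moreover, it proposes Wasserstein GAN-CLS, a new model for conditional image generation based on the Wasserstein distance, which offers guarantees of stability. It then shows how the novel loss function of Wasserstein GAN-CLS can be used in a Conditional Progressive Growing GAN. In combination with the proposed loss, the model boosts by 7.07% the best Inception Score (on the Caltech dataset) of the models that use only sentence-level visual semantics. The only model that performs better than the Conditional Wasserstein Progressive Growing GAN is the recently proposed AttnGAN, which uses word-level visual semantics.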

arXiv：https://arxiv.org/abs/1805.00676

Note: A heavyweight paper, a full 72 pages!


**《Transferring GANs: generating images from limited data》**

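Abstract：Transferring the knowledge of pretrained networks to new domains by means of finetuning is a widely used practice for applications based on discriminative models. To the best of our knowledge, this practice has not been studied within the context of generative deep networks. We therefore study domain adaptation applied to image generation with generative adversarial networks. We evaluate several aspects of domain adaptation, including the impact of target domain size, the relative distance between source and target domain, and the initialization of conditional GANs. Our results show that using knowledge from pretrained networks can shorten the convergence time and can significantly improve the quality of the generated images, especially when the target data is limited. We show that these conclusions can also be drawn for conditional GANs even when the pretrained model was trained without conditioning. Our results also suggest that density may be more important than diversity, and that a dataset with one or a few densely sampled classes may be a better source model than more diverse datasets such as ImageNet or Places.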

arXiv：https://arxiv.org/abs/1805.01677


**《MEGAN: Mixture of Experts of Generative Adversarial Networks for Multimodal Image Generation》**

IJCAI 2018

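Abstract：Recently, generative adversarial networks (GANs) have shown promising performance in generating realistic images. However, they often struggle to learn the complex underlying modalities in a given dataset, resulting in poor-quality generated images. To mitigate this problem, we propose a novel approach called mixture of experts GAN (MEGAN), an ensemble approach of multiple generator networks. Each generator network in MEGAN specializes in generating images with a particular subset of modalities, e.g., an image class. Instead of incorporating a separate step of handcrafted clustering of multiple modalities, our proposed model is trained through end-to-end learning of multiple generators via gating networks, which are responsible for choosing the appropriate generator network for a given condition. We adopt the categorical reparameterization trick so that the categorical decision of selecting a generator is made while maintaining the flow of gradients. We demonstrate that individual generators learn different and salient subparts of the data, and achieve a multiscale structural similarity (MS-SSIM) score of 0.2470 for CelebA and a competitive unsupervised Inception Score of 8.33 on CIFAR-10.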

arXiv：https://arxiv.org/abs/1805.02481v2
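
The "categorical reparameterization trick" mentioned in the abstract is commonly realized as the Gumbel-Softmax relaxation. The sketch below assumes that is what is meant; it only shows the sampling step in plain Python (no gradients), and the gate logits are made-up values.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Relaxed one-hot sample over len(logits) choices (Gumbel-Softmax)."""
    noisy = []
    for logit in logits:
        # Clamp U away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))      # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    m = max(noisy)                            # subtract max for stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical gating-network scores for three generators.
gate_logits = [2.0, 0.5, -1.0]
weights = gumbel_softmax(gate_logits, temperature=0.5, rng=random.Random(0))
chosen_generator = max(range(len(weights)), key=lambda i: weights[i])
```

Lower temperatures push `weights` toward a one-hot vector. In an autodiff framework the same expression stays differentiable with respect to the logits, which is what lets a gate be trained end to end.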


**《Unpaired Multi-Domain Image Generation via Regularized Conditional GANs》**

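Abstract：In this paper, we study the problem of multi-domain image generation, the goal of which is to generate pairs of corresponding images from different domains. With the recent development of generative models, image generation has achieved great progress and has been applied to various computer vision tasks. However, multi-domain image generation may not achieve the desired performance due to the difficulty of learning the correspondence of images from different domains, especially when the information of paired samples is not given. To tackle this problem, we propose Regularized Conditional GAN (RegCGAN), which is capable of learning to generate corresponding images in the absence of paired training data. RegCGAN is based on the conditional GAN, and we introduce two regularizers to guide the model to learn the corresponding semantics of different domains. We evaluate the proposed model on several tasks for which paired training data is not given, including the generation of edges and photos, the generation of faces with different attributes, etc. The experimental results show that our model can successfully generate corresponding images for all these tasks, while outperforming the baseline methods. We also introduce an approach for applying RegCGAN to unsupervised domain adaptation.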

arXiv：https://arxiv.org/abs/1805.02456

================================================
FILE: 2018/07/06.md
================================================
**2018-07-06**

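This post covers 2 papers, both on object detection. One is RefineDet, which combines ideas from SSD, RPN, and FPN; the other is DES, which improves on the SSD network. Note that both are CVPR 2018 papers.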

# Object Detection

《Single-Shot Refinement Neural Network for Object Detection》

CVPR 2018

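Abstract：For object detection, two-stage approaches (e.g., Faster R-CNN) have been achieving the highest accuracy, whereas one-stage approaches (e.g., SSD) have the advantage of high efficiency. To inherit the merits of both while overcoming their disadvantages, in this paper we propose a novel single-shot based detector, called RefineDet, that achieves better accuracy than two-stage methods and maintains the efficiency of one-stage methods. RefineDet consists of two inter-connected modules, namely the anchor refinement module and the object detection module. Specifically, the former aims to (1) filter out negative anchors to reduce the search space for the classifier, and (2) coarsely adjust the locations and sizes of anchors to provide better initialization for the subsequent regressor. The latter module takes the refined anchors from the former as input to further improve the regression and predict multi-class labels. Meanwhile, we design a transfer connection block to transfer the features in the anchor refinement module to predict locations, sizes and class labels of objects in the object detection module. The multi-task loss function enables us to train the whole network in an end-to-end way. Extensive experiments on PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO demonstrate that RefineDet achieves state-of-the-art detection accuracy with high efficiency.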

arXiv：https://arxiv.org/abs/1711.06897

github：https://github.com/sfzhang15/RefineDet

Note: A close-reading post on this paper will follow!

《Single-Shot Object Detection with Enriched Semantics》

CVPR 2018

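Abstract：We propose a novel single-shot object detection network named Detection with Enriched Semantics (DES). Our motivation is to enrich the semantics of object detection features within a typical deep detector, by a semantic segmentation branch and a global activation module. The segmentation branch is supervised by weak segmentation ground-truth, i.e., no extra annotation is required. In conjunction with that, we employ a global activation module which learns the relationship between channels and object classes in a self-supervised manner. Comprehensive experimental results on both the PASCAL VOC and MS COCO detection datasets demonstrate the effectiveness of the proposed method. In particular, with a VGG16-based DES, we achieve an mAP of 81.7 on VOC2007 test and an mAP of 32.8 on COCO test-dev with an inference speed of 31.5 milliseconds per image on a Titan Xp GPU. With a lower-resolution version, we achieve an mAP of 79.7 on VOC2007 with an inference speed of 13.0 milliseconds per image.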

arXiv：https://arxiv.org/abs/1712.00433

Note: A close-reading post on this paper will follow!

================================================
FILE: 2018/07/07.md
================================================
**2018-07-07**

This post covers 2 image segmentation papers, both from CVPR 2018: one proposes the CCB-Cut cost, and the other improves on the FCN.

# Image Segmentation


**《Compassionately Conservative Balanced Cuts for Image Segmentation》**

CVPR 2018

Abstract：The Normalized Cut (NCut) objective function, widely used in data clustering and image segmentation, quantifies the cost of graph partitioning in a way that biases clusters or segments that are balanced towards having lower values than unbalanced partitionings. However, this bias is so strong that it avoids any singleton partitions, even when vertices are very weakly connected to the rest of the graph. Motivated by the Bühler-Hein family of balanced cut costs, we propose the family of Compassionately Conservative Balanced (CCB) Cut costs, which are indexed by a parameter that can be used to strike a compromise between the desire to avoid too many singleton partitions and the notion that all partitions should be balanced. We show that CCB-Cut minimization can be relaxed into an orthogonally constrained ℓ_τ-minimization problem that coincides with the problem of computing Piecewise Flat Embeddings (PFE) for one particular index value, and we present an algorithm for solving the relaxed problem by iteratively minimizing a sequence of reweighted Rayleigh quotients (IRRQ). Using images from the BSDS500 database, we show that image segmentation based on CCB-Cut minimization provides better accuracy with respect to ground truth and greater variability in region size than NCut-based image segmentation.

arXiv：https://arxiv.org/abs/1803.09903
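
To make the NCut bias against singleton partitions concrete, here is a small sketch of the standard two-way Normalized Cut cost, cut(A, B)/vol(A) + cut(A, B)/vol(B), on a toy graph. The graph and the helper name `ncut` are illustrative assumptions; the CCB-Cut family itself is not implemented here.

```python
def ncut(weights, part_a):
    """Normalized Cut cost of splitting a graph into part_a and the rest.

    weights is a symmetric adjacency matrix (nested lists) and part_a is a
    set of vertex indices. Both sides must have non-zero volume.
    """
    n = len(weights)
    part_b = set(range(n)) - set(part_a)
    cut = sum(weights[i][j] for i in part_a for j in part_b)
    vol_a = sum(weights[i][j] for i in part_a for j in range(n))
    vol_b = sum(weights[i][j] for i in part_b for j in range(n))
    return cut / vol_a + cut / vol_b


# Two tight pairs {0, 1} and {2, 3} joined by a 0.1 bridge, plus vertex 4
# hanging off vertex 3 by a very weak 0.01 edge.
weights = [
    [0, 1, 0, 0, 0],
    [1, 0, 0.1, 0, 0],
    [0, 0.1, 0, 1, 0],
    [0, 0, 1, 0, 0.01],
    [0, 0, 0, 0.01, 0],
]
balanced_cost = ncut(weights, {0, 1})   # cuts the 0.1 bridge
singleton_cost = ncut(weights, {4})     # cuts only the 0.01 edge
```

Cutting off vertex 4 removes ten times less edge weight than the balanced split, yet its NCut cost is above 1 because cut(A, B)/vol(A) equals 1 for any singleton. This is the behavior the abstract describes.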


**《Quantization of Fully Convolutional Networks for Accurate Biomedical Image Segmentation》**

CVPR 2018

Abstract：With pervasive applications of medical imaging in health-care, biomedical image segmentation plays a central role in quantitative analysis, clinical diagnosis, and medical intervention. Since manual annotation suffers limited reproducibility, arduous efforts, and excessive time, automatic segmentation is desired to process increasingly larger scale histopathological data. Recently, deep neural networks (DNNs), particularly fully convolutional networks (FCNs), have been widely applied to biomedical image segmentation, attaining much improved performance. At the same time, quantization of DNNs has become an active research topic, which aims to represent weights with less memory (precision) to considerably reduce memory and computation requirements of DNNs while maintaining acceptable accuracy. In this paper, we apply quantization techniques to FCNs for accurate biomedical image segmentation. Unlike existing literature on quantization which primarily targets memory and computation complexity reduction, we apply quantization as a method to reduce overfitting in FCNs for better accuracy. Specifically, we focus on a state-of-the-art segmentation framework, suggestive annotation [22], which judiciously extracts representative annotation samples from the original training dataset, obtaining an effective small-sized balanced training dataset. We develop two new quantization processes for this framework: (1) suggestive annotation with quantization for highly representative training samples, and (2) network training with quantization for high accuracy. Extensive experiments on the MICCAI Gland dataset show that both quantization processes can improve the segmentation performance, and our proposed method exceeds the current state-of-the-art performance by up to 1%. In addition, our method has a reduction of up to 6.4x on memory usage.

arXiv：https://arxiv.org/abs/1803.04907
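
The abstract does not say which quantization scheme the authors use. As a generic illustration of "representing weights with less precision", the sketch below applies plain uniform quantization to a list of weights. The function `quantize` and the sample weights are assumptions made for this example.

```python
def quantize(weights, bits):
    """Snap each weight to one of 2**bits evenly spaced levels in [min, max]."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1            # number of steps between lo and hi
    if hi == lo or levels == 0:
        return list(weights)          # nothing to quantize
    step = (hi - lo) / levels
    return [lo + round((w - lo) / step) * step for w in weights]


# Hypothetical layer weights, quantized to 2 bits (4 levels).
weights = [-1.0, -0.4, 0.1, 0.35, 1.0]
quantized = quantize(weights, bits=2)
```

Each quantized value can be stored as a 2-bit index plus the shared `lo` and `step`, which is where the memory saving comes from. The rounding error per weight is at most half a step.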

================================================
FILE: 2018/07/19.md
================================================
**2018-07-19**

This post covers 2 ECCV 2018 papers: one on semantic segmentation, the other on depth prediction.

# Semantic Segmentation

**《Effective Use of Synthetic Data for Urban Scene Semantic Segmentation》**

ECCV 2018

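Abstract：Training a deep network to perform semantic segmentation requires large amounts of labeled data. To alleviate the manual effort of annotating real images, researchers have investigated the use of synthetic data, which can be labeled automatically. Unfortunately, a network trained on synthetic data performs relatively poorly on real images. While this can be addressed by domain adaptation, existing methods all require access to real images during training. In this paper, we introduce a drastically different way to handle synthetic images that does not require seeing any real images at training time. Our approach builds on the observation that foreground and background classes are not affected in the same manner by the domain shift, and thus should be treated differently. In particular, the former should be handled in a detection-based manner to better account for the fact that, while their texture in synthetic images is not photo-realistic, their shape looks natural. Our experiments evidence the effectiveness of our approach on Cityscapes and CamVid with models trained on synthetic data only.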

arXiv：https://arxiv.org/abs/1807.06132

Note: Domain adaptation has been a hot concept lately!

# Stereo

**《ActiveStereoNet: End-to-End Self-Supervised Learning for Active Stereo Systems》**

ECCV 2018

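Abstract：In this paper we present ActiveStereoNet, the first deep learning solution for active stereo systems. Due to the lack of ground truth, our method is fully self-supervised, yet it produces precise depth with a subpixel precision of 1/30th of a pixel; it does not suffer from the common over-smoothing issues; it preserves the edges; and it explicitly handles occlusions. We introduce a novel reconstruction loss that is more robust to noise and texture-less patches, and is invariant to illumination changes. The proposed loss is optimized using a window-based cost aggregation with an adaptive support weight scheme. This cost aggregation is edge-preserving and smooths the loss function, which is key to allowing the network to reach compelling results. Finally, we show how the task of predicting invalid regions, such as occlusions, can be trained end-to-end without ground truth. This component is crucial to reduce blur and particularly improves predictions along depth discontinuities. Extensive quantitative and qualitative evaluations on real and synthetic data demonstrate state-of-the-art results in many challenging scenes.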

arXiv：https://arxiv.org/abs/1807.06009

================================================
FILE: 2018/07/23.md
================================================
**2018-07-23**

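This post introduces two recent ECCV 2018 papers: one proposes the Convolutional Block Attention Module, which can be integrated seamlessly into any CNN architecture; the other uses a GAN to enable multi-view 3D reconstruction.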

# CNN

**《CBAM: Convolutional Block Attention Module》**

**ECCV 2018**

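Abstract：We propose the Convolutional Block Attention Module (CBAM), a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, our module sequentially infers attention maps along two separate dimensions, channel and spatial, then multiplies the attention maps with the input feature map for adaptive feature refinement. Because CBAM is a lightweight and general module, it can be integrated into any CNN architecture seamlessly with negligible overhead and is end-to-end trainable along with the base CNN. We validate CBAM through extensive experiments on the ImageNet-1K, MS COCO detection, and VOC 2007 detection datasets. Our experiments show improvements in classification and detection performance across various models, demonstrating the wide applicability of CBAM. The code and models will be made publicly available.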

arXiv：https://arxiv.org/abs/1807.06521

Note: A great paper; it should help quite a few students write their own papers (and coast a little).
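
The CBAM abstract describes a two-step refinement: infer a channel attention map, multiply it in, then infer a spatial attention map on the result and multiply again. The sketch below follows that order but is heavily simplified: it replaces the module's learned layers with plain averages followed by a sigmoid, so it has no trainable parameters. All function names and the toy feature map are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(features):
    """One weight per channel, from that channel's spatial average."""
    weights = []
    for channel in features:
        values = [v for row in channel for v in row]
        weights.append(sigmoid(sum(values) / len(values)))
    return weights


def spatial_attention(features):
    """One weight per pixel, from the average across channels."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [[sigmoid(sum(features[c][y][x] for c in range(channels)) / channels)
             for x in range(width)] for y in range(height)]


def cbam_like(features):
    """Apply channel attention, then spatial attention, to a [C][H][W] map."""
    ch = channel_attention(features)
    refined = [[[v * ch[c] for v in row] for row in channel]
               for c, channel in enumerate(features)]
    sp = spatial_attention(refined)   # computed on the channel-refined map
    return [[[v * sp[y][x] for x, v in enumerate(row)]
             for y, row in enumerate(channel)] for channel in refined]


# Toy feature map with 2 channels of size 2x2 (hypothetical values).
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[-1.0, 0.0], [0.5, -0.5]]]
refined = cbam_like(features)
```

The output has the same shape as the input, which is why such a block can be dropped between existing layers of a CNN.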

# Multi-View Reconstruction

**《Specular-to-Diffuse Translation for Multi-View Reconstruction》**

**ECCV 2018** 

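Abstract：Most multi-view 3D reconstruction algorithms, especially when shape-from-shading cues are used, assume that object appearance is predominantly diffuse. To alleviate this restriction, we introduce S2Dnet, a generative adversarial network for transferring multiple views of objects with specular reflection into diffuse ones, so that multi-view reconstruction methods can be applied more effectively. Our network extends unsupervised image-to-image translation to multi-view "specular to diffuse" translation. To preserve object appearance across multiple views, we introduce a Multi-View Coherence loss (MVC) that evaluates the similarity and faithfulness of local patches after the view transformation. Our MVC loss ensures that the similarity of local correspondences among multi-view images is preserved under the image-to-image translation. As a result, our network yields significantly better results than several single-view baseline techniques. In addition, we carefully design and generate a large synthetic training dataset using physically-based rendering. During testing, our network takes only the raw glossy images as input, without extra information such as segmentation masks or lighting estimation. Results demonstrate that multi-view reconstruction can be significantly improved using the images filtered by our network. We also show promising performance on real-world training and testing data.

arXiv：https://arxiv.org/abs/1807.05439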

================================================
FILE: 2018/07/27.md
================================================
**2018-07-27**

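This post introduces two recent ECCV 2018 papers: one models the visual context around objects to augment object detection datasets; the other proposes an integrated Bayesian model that coherently reasons about the observed images, identities, partial knowledge about names, and the situational context of each observation.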

# Data Augmentation

**《Modeling Visual Context is Key to Augmenting Object Detection Datasets》**

ECCV 2018

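Abstract：Data augmentation is well known to be important for training deep visual recognition systems. By artificially increasing the number of training examples, it helps reduce overfitting and improves generalization. For object detection, classical approaches to data augmentation consist of generating images obtained by basic geometrical transformations and color changes of the original training images. In this work, we go one step further and leverage segmentation annotations to increase the number of object instances present in the training data. For this approach to be successful, we show that appropriately modeling the visual context surrounding objects is crucial to place them in the right environment. Otherwise, we show that the previous strategy actually hurts. With our context model, we achieve significant mean average precision improvements when few labeled examples are available on the VOC'12 benchmark.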

arXiv：https://arxiv.org/abs/1807.07428

# Face Recognition

**《From Face Recognition to Models of Identity: A Bayesian Approach to Learning about Unknown Identities from Unsupervised Data》**

ECCV 2018

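Abstract：Current face recognition systems robustly recognize identities across a wide variety of imaging conditions. In these systems, recognition is performed via classification into known identities obtained from supervised identity annotations. There are two problems with this current paradigm: (1) current systems are unable to benefit from unlabelled data which may be available in large quantities; and (2) current systems equate successful recognition with labelling a given input image. Humans, on the other hand, regularly identify individuals in a completely unsupervised way, recognising the identity of someone they have seen before even without being able to name that individual. How can we go beyond the current classification paradigm towards a more human understanding of identities? We propose an integrated Bayesian model that coherently reasons about the observed images, identities, partial knowledge about names, and the situational context of each observation. While our model achieves good recognition performance against known identities, it can also discover new identities from unsupervised data and learns to associate identities with different contexts depending on which identities tend to be observed together. In addition, the proposed semi-supervised component is able to handle not only acquaintances, whose names are known, but also unlabelled familiar faces and complete strangers in a unified framework.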

arXiv：https://arxiv.org/abs/1807.07872

================================================
FILE: 2018/07/31.md
================================================
**2018-07-31**

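This post introduces two recent ECCV 2018 papers: one proposes semi-convolutional operators, among other contributions, to improve Mask RCNN; the other proposes CrossNet, an end-to-end, fully convolutional deep neural network that uses cross-scale warping for super-resolution.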

# Instance Segmentation

**《Semi-convolutional Operators for Instance Segmentation》**

ECCV 2018

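Abstract：Object detection and instance segmentation are dominated by region-based methods such as Mask RCNN. However, there is a growing interest in reducing these problems to pixel labeling tasks, as the latter could be more efficient, could be integrated seamlessly into the image-to-image network architectures used in many other tasks, and could be more accurate for objects that are not well approximated by bounding boxes. In this paper we show theoretically and empirically that constructing dense pixel embeddings that can separate object instances cannot be easily achieved using convolutional operators. At the same time, we show that simple modifications, which we call semi-convolutional, perform better at this task. We show that these operators can also be used to improve approaches such as Mask RCNN, demonstrating better segmentation of complex biological shapes and PASCAL VOC categories than achievable by Mask RCNN alone.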

arXiv：https://arxiv.org/abs/1807.10712

# Super Resolution


**《CrossNet: An End-to-end Reference-based Super Resolution Network using Cross-scale Warping》**

ECCV 2018

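Abstract：Reference-based Super-resolution (RefSR) super-resolves a low-resolution (LR) image given an external high-resolution (HR) reference image, where the reference image and the LR image share a similar viewpoint but have a significant resolution gap (8x). Existing RefSR methods work in a cascaded way, such as patch matching followed by a synthesis pipeline with two independently defined objective functions, leading to inter-patch misalignment, grid effects and inefficient optimization. To resolve these issues, we present CrossNet, an end-to-end and fully-convolutional deep neural network using cross-scale warping. Our network contains image encoders, cross-scale warping layers, and a fusion decoder: the encoder extracts multi-scale features from both the LR and the reference images; the cross-scale warping layers spatially align the reference feature map with the LR feature map; the decoder finally aggregates feature maps from both domains to synthesize the HR output. Using cross-scale warping, our network is able to perform spatial alignment at pixel level in an end-to-end fashion, which improves on existing schemes in both precision (around 2dB-4dB) and efficiency (more than 100 times faster).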

arXiv：https://arxiv.org/abs/1807.10547

================================================
FILE: 2018/08/03.md
================================================
**2018-08-03**

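This post introduces two recent ECCV 2018 papers: one proposes a new convolutional neural network (CNN) based density estimation approach for counting crowds in images; the other proposes StereoNet, an end-to-end deep architecture for real-time stereo matching that achieves depth prediction with sub-pixel matching precision.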

# Crowd Counting

**《Iterative Crowd Counting》**

**ECCV 2018**

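Abstract：In this work, we tackle the problem of crowd counting in images. We present a Convolutional Neural Network (CNN) based density estimation approach to solve this problem. Predicting a high-resolution density map in one go is a challenging task. Hence, we present a two-branch CNN architecture for generating high-resolution density maps, where the first branch generates a low-resolution density map, and the second branch incorporates the low-resolution prediction and feature maps from the first branch to generate a high-resolution density map. We also propose a multi-stage extension of our approach where each stage in the pipeline utilizes the predictions from all the previous stages. Empirical comparison with the previous state-of-the-art crowd counting methods shows that our method achieves the lowest mean absolute error on three challenging crowd counting benchmarks: the Shanghaitech, WorldExpo'10, and UCF datasets.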

arXiv：https://arxiv.org/abs/1807.09959
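
In density-based crowd counting, the predicted count for an image is conventionally the sum of its density map, and the benchmark metric named in the abstract, mean absolute error, compares those counts with the true head counts. The sketch below assumes that convention; the density maps and counts are made-up numbers.

```python
def count_from_density(density_map):
    """Estimated crowd count: the sum of all density map values."""
    return sum(sum(row) for row in density_map)


def mean_absolute_error(predicted_counts, true_counts):
    errors = [abs(p - t) for p, t in zip(predicted_counts, true_counts)]
    return sum(errors) / len(errors)


# Two tiny hypothetical density maps and their ground-truth counts.
density_maps = [
    [[0.0, 0.5], [0.5, 1.0]],     # sums to 2.0
    [[0.25, 0.25], [0.0, 0.0]],   # sums to 0.5
]
true_counts = [3, 1]
predicted_counts = [count_from_density(d) for d in density_maps]
mae = mean_absolute_error(predicted_counts, true_counts)
```

Because only the sum matters for the count, a low-resolution and a high-resolution density map can agree on the count while differing in how well they localize people.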

# Depth Prediction

**《StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction》**

**ECCV 2018**

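Abstract：This paper presents StereoNet, the first end-to-end deep architecture for real-time stereo matching that runs at 60 fps on an NVidia Titan X, producing high-quality, edge-preserved, quantization-free disparity maps. A key insight of this paper is that the network achieves a sub-pixel matching precision that is an order of magnitude higher than that of traditional stereo matching approaches. This allows us to achieve real-time performance by using a very low resolution cost volume that encodes all the information needed to achieve high disparity precision. Spatial precision is achieved by employing a learned edge-aware upsampling function. Our model uses a Siamese network to extract features from the left and right images. A first estimate of the disparity is computed in a very low resolution cost volume, then the model hierarchically re-introduces high-frequency details through a learned upsampling function that uses compact pixel-to-pixel refinement networks. Leveraging the color input as a guide, this function is capable of producing high-quality edge-aware output. We achieve state-of-the-art results on multiple benchmarks.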

arXiv：https://arxiv.org/abs/1807.08865

Note: Wow, real-time stereo matching!

================================================
FILE: 2018/08/07.md
================================================
**2018-08-07**

This post introduces two recent ECCV 2018 papers: one proposes a new convolutional mesh autoencoder for generating 3D faces; the other proposes RFNet for image captioning.

# 3D Face

**《Generating 3D faces using Convolutional Mesh Autoencoders》**

**ECCV 2018**

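Abstract：3D representations of human faces are useful for computer vision problems such as 3D face tracking and reconstruction from images, as well as graphics applications such as character generation and animation. Traditional models learn a latent representation of a face using linear subspaces or higher-order tensor generalizations. Due to this linearity, they cannot capture extreme deformations and non-linear expressions. To address this, we introduce a versatile model that learns a non-linear representation of a face using spectral convolutions on a mesh surface. We introduce mesh sampling operations that enable a hierarchical mesh representation that captures non-linear variations in shape and expression at multiple scales within the model. In a variational setting, our model samples diverse realistic 3D faces from a multivariate Gaussian distribution. Our training data consists of 20,466 meshes of extreme expressions captured over 12 different subjects. Despite limited training data, our trained model outperforms state-of-the-art face models with 50% lower reconstruction error, while using 75% fewer parameters. We also show that replacing the expression space of an existing state-of-the-art face model with our autoencoder achieves a lower reconstruction error.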

arXiv：https://arxiv.org/abs/1807.10267

github：https://github.com/anuragranj/coma

# Image Captioning

**《Recurrent Fusion Network for Image Captioning》**

**ECCV 2018** 

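Abstract：Recently, much progress has been made in image captioning, and all state-of-the-art models have adopted an encoder-decoder framework. Under this framework, an input image is encoded by a convolutional neural network (CNN) and then translated into natural language with a recurrent neural network (RNN). The existing models relying on this framework employ only one kind of CNN, e.g., ResNet or Inception-X, which describes the image contents from only one specific viewpoint. Thus, the semantic meaning of the input image cannot be comprehensively understood, which restricts captioning performance. In this paper, in order to exploit the complementary information from multiple encoders, we propose a novel recurrent fusion network (RFNet) for image captioning. The fusion process in our model can exploit the interactions among the outputs of the image encoders and then generate new compact yet informative representations for the decoder. Experiments on the MSCOCO dataset demonstrate the effectiveness of our proposed RFNet, which sets a new state of the art for image captioning.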

arXiv：https://arxiv.org/abs/1807.09986

Note: Image captioning is really interesting, a perfect combination of CNN and RNN!


================================================
FILE: 2018/08/11.md
================================================
**2018-08-11**

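This post introduces two recent ECCV 2018 papers: one proposes a new network based on disentangled representations for image-to-image translation; the other proposes SPG masks, which effectively produce high-quality object localization maps.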

# Image to Image Translation

**《Diverse Image-to-Image Translation via Disentangled Representations》**

**ECCV 2018（oral）**

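Abstract：Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for many applications: 1) the lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representation for producing diverse outputs without paired training images. To achieve diversity, we propose to embed images onto two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time. To handle unpaired training data, we introduce a novel cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative comparisons, we measure realism with a user study and diversity with a perceptual distance metric. We apply the proposed model to domain adaptation and show competitive performance when compared to the state of the art on the MNIST-M and LineMod datasets.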

arXiv：https://arxiv.org/abs/1808.00948

homepage：http://vllab.ucmerced.edu/hylee/DRIT/

github：https://github.com/HsinYingLee/DRIT

# Object Localization

**《Self-produced Guidance for Weakly-supervised Object Localization》**

**ECCV 2018**

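Abstract：Weakly supervised methods usually generate localization results based on attention maps produced by classification networks. However, the attention maps exhibit only the most discriminative parts of the object, which are small and sparse. We propose to generate Self-produced Guidance (SPG) masks which separate the foreground, the object of interest, from the background to provide the classification networks with spatial correlation information of pixels. A stagewise approach is proposed to incorporate high-confidence object regions to learn the SPG masks. The high-confidence regions within attention maps are utilized to progressively learn the SPG masks. The masks are then used as auxiliary pixel-level supervision to facilitate the training of classification networks. Extensive experiments on ILSVRC demonstrate that SPG is effective in producing high-quality object localization maps. In particular, the proposed SPG achieves a Top-1 localization error rate of 43.83% on the ILSVRC validation set, which is a new state-of-the-art error rate.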

arXiv：https://arxiv.org/abs/1807.08902


================================================
FILE: 2018/08/15.md
================================================
**2018-08-15**

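This post introduces two recent ECCV 2018 papers: one proposes a novel Motion Transformation Variational Auto-Encoder (MT-VAE) for learning motion sequence generation; the other uses FiLM to condition image-based convolutional network computation on language for visual reasoning.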

# VAE

**《MT-VAE: Learning Motion Transformations to Generate Multimodal Human Dynamics》**

**ECCV 2018**

Abstract：Long-term human motion can be represented as a series of motion modes---motion sequences that capture short-term temporal dynamics---with transitions between them. We leverage this structure and present a novel Motion Transformation Variational Auto-Encoders (MT-VAE) for learning motion sequence generation. Our model jointly learns a feature embedding for motion modes (that the motion sequence can be reconstructed from) and a feature transformation that represents the transition of one motion mode to the next motion mode. Our model is able to generate multiple diverse and plausible motion sequences in the future from the same input. We apply our approach to both facial and full body motion, and demonstrate applications like analogy-based motion transfer and video synthesis.

摘要：长期（long-term）人体运动可以表示为一系列运动模式 - 捕捉短期时间动态的运动序列 - 它们之间的过渡。我们利用这种结构，提出了一种新颖的运动变换变分自动编码器（MT-VAE），用于学习运动序列生成。我们的模型联合学习运动模式的特征嵌入（可以从中重建运动序列）和表示一个运动模式到下一个运动模式的转换的特征变换。我们的模型能够从相同的输入生成"未来"的多种多样且可信的运动序列。我们将此方法应用于面部和全身运动，并演示了基于类比的运动传递和视频合成等应用。

arXiv：https://arxiv.org/abs/1808.04545

# Visual Reasoning

**《Visual Reasoning with Multi-hop Feature Modulation》**

**ECCV 2018**

Abstract：Recent breakthroughs in computer vision and natural language processing have spurred interest in challenging multi-modal tasks such as visual question-answering and visual dialogue. For such tasks, one successful approach is to condition image-based convolutional network computation on language via Feature-wise Linear Modulation (FiLM) layers, i.e., per-channel scaling and shifting. We propose to generate the parameters of FiLM layers going up the hierarchy of a convolutional network in a multi-hop fashion rather than all at once, as in prior work. By alternating between attending to the language input and generating FiLM layer parameters, this approach is better able to scale to settings with longer input sequences such as dialogue. We demonstrate that multi-hop FiLM generation achieves state-of-the-art for the short input sequence task ReferIt --- on-par with single-hop FiLM generation --- while also significantly outperforming prior state-of-the-art and single-hop FiLM generation on the GuessWhat?! visual dialogue task.

摘要：最近计算机视觉和自然语言处理方面的突破激发了人们对挑战多模式任务（如视觉问答和视觉对话）的兴趣。对于这样的任务，一种成功的方法是通过特征线性调制（FiLM）层（即，每通道缩放和移位）来调节语言上基于图像的卷积网络计算。我们提出以多跳方式生成在卷积网络的层次结构上的FiLM层的参数，而不是像在先前的工作中那样一次生成。通过在参与语言输入和生成FiLM层参数之间交替，这种方法能够更好地扩展到具有较长输入序列的设置，例如对话（dialogue）。我们证明了多跳FiLM生成实现了短输入序列任务的最新技术参考 - 与单跳FiLM生成相媲美 - 同时也明显优于先前的先进技术GuessWhat上的单跳FiLM生成？！视觉对话任务。

arXiv：https://arxiv.org/abs/1808.04446

Note: Amusi believes that combining CV and NLP has great research significance and promise.
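
The abstract defines a FiLM layer as per-channel scaling and shifting of a feature map. A minimal sketch of just that operation follows. In the paper the scale and shift come from the language input; here `gamma` and `beta` are hard-coded placeholders, and the function name `film` is an assumption.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation of a [C][H][W] feature map.

    Channel c is scaled by gamma[c] and shifted by beta[c].
    """
    return [[[gamma[c] * v + beta[c] for v in row] for row in channel]
            for c, channel in enumerate(features)]


# Two channels of size 1x2 (hypothetical values).
features = [[[1.0, 2.0]],
            [[3.0, 4.0]]]
gamma = [2.0, 0.5]    # placeholder for language-conditioned scales
beta = [0.0, -1.0]    # placeholder for language-conditioned shifts
modulated = film(features, gamma, beta)
```

The multi-hop variant described above would produce a new `(gamma, beta)` pair for each FiLM layer in turn, attending to the language input between layers, rather than predicting all pairs at once.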

================================================
FILE: 2018/08/25.md
================================================
**2018-08-21**

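This post introduces two recent ECCV 2018 papers: one proposes a new weakly and semi-supervised framework that enables semantic segmentation with an unlimited number of labels; the other proposes using a stereo matching network as a proxy to learn depth from synthetic data and using predicted stereo disparity maps to supervise a monocular depth estimation network.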

# Semantic Segmentation

**《Concept Mask: Large-Scale Segmentation from Semantic Concepts》**

**ECCV 2018**

Abstract：Existing works on semantic segmentation typically consider a small number of labels, ranging from tens to a few hundreds. With a large number of labels, training and evaluation of such task become extremely challenging due to correlation between labels and lack of datasets with complete annotations. We formulate semantic segmentation as a problem of image segmentation given a semantic concept, and propose a novel system which can potentially handle an unlimited number of concepts, including objects, parts, stuff, and attributes. We achieve this using a weakly and semi-supervised framework leveraging multiple datasets with different levels of supervision. We first train a deep neural network on a 6M stock image dataset with only image-level labels to learn visual-semantic embedding on 18K concepts. Then, we refine and extend the embedding network to predict an attention map, using a curated dataset with bounding box annotations on 750 concepts. Finally, we train an attention-driven class agnostic segmentation network using an 80-category fully annotated dataset. We perform extensive experiments to validate that the proposed system performs competitively to the state of the art on fully supervised concepts, and is capable of producing accurate segmentations for weakly learned and unseen concepts.

摘要：关于语义分割的现有工作通常考虑少量标签，范围从几十到几百。由于标签之间的相关性以及缺少具有完整注释的数据集，因此对于大量标签，对此类任务的训练和评估变得极具挑战性。我们将语义分割表示为给定语义概念的图像分割问题，并提出一种新颖的系统，它可以处理无限数量的概念，包括对象，部件，东西和属性。我们使用弱监督和半监督框架来实现这一目标，该框架利用具有不同监督级别的多个数据集。我们首先在6M图像数据集上训练深度神经网络，仅使用图像级标签来学习18K概念的视觉语义嵌入。然后，我们使用带有750个概念的边界框注释的curated 数据集来优化和扩展嵌入网络以预测注意力图。最后，我们使用80类完全注释的数据集训练注意力驱动的类不可知分割网络。我们进行了大量实验，以验证所提出的系统在完全监督的概念上与现有技术相比具有竞争力，并且能够为弱学习和看不见的概念产生准确的分割。

arXiv：https://arxiv.org/abs/1808.06032

# Monocular Depth Estimation

**《Learning Monocular Depth by Distilling Cross-domain Stereo Networks》**

**ECCV 2018**

Abstract：Monocular depth estimation aims at estimating a pixelwise depth map for a single image, which has wide applications in scene understanding and autonomous driving. Existing supervised and unsupervised methods face great challenges. Supervised methods require large amounts of depth measurement data, which are generally difficult to obtain, while unsupervised methods are usually limited in estimation accuracy. Synthetic data generated by graphics engines provide a possible solution for collecting large amounts of depth data. However, the large domain gaps between synthetic and realistic data make directly training with them challenging. In this paper, we propose to use the stereo matching network as a proxy to learn depth from synthetic data and use predicted stereo disparity maps for supervising the monocular depth estimation network. Cross-domain synthetic data could be fully utilized in this novel framework. Different strategies are proposed to ensure learned depth perception capability well transferred across different domains. Our extensive experiments show state-of-the-art results of monocular depth estimation on KITTI dataset.

摘要：单目深度估计旨在估计单个图像的像素深度图，其在场景理解和自动驾驶中具有广泛的应用。现有的监督和无监督方法面临巨大挑战。监督方法需要大量深度测量数据，这些数据通常难以获得，而无监督方法通常在估计精度方面受到限制。合成数据为收集大量深度数据提供了可能的解决方案。然而，合成数据和实际数据之间存在较大的域（domain）差距，这使得直接训练具有一定挑战性。在本文中，我们建议使用立体匹配网络作为proxy 来从合成数据中学习深度，并使用预测的立体视差图来监督单目深度估计网络。跨域合成数据可以在这个新颖的框架中得到充分利用。提出了不同的策略来确保学习深度感知能力在不同域之间良好地传递。我们的广泛实验显示了KITTI数据集上单目深度估计的最新结果。

arXiv：https://arxiv.org/abs/1808.06586

================================================
FILE: 2018/10/12.md
================================================
**2018-10-12**

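This post introduces two recent ECCV 2018 papers. The first proposes IoU-Net, which learns to predict the IoU between each detected bounding box and its matched ground truth. The network thereby acquires a localization confidence, which improves NMS by preserving accurately localized bounding boxes. It also proposes an optimization-based bounding box refinement method in which the predicted IoU is formulated as the objective. The second proposes DetNet, a novel backbone network designed specifically for object detection.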

# Object Detection

**《Acquisition of Localization Confidence for Accurate Object Detection》**

Abstract：Modern CNN-based object detectors rely on bounding box regression and non-maximum suppression to localize objects. While the probabilities for class labels naturally reflect classification confidence, localization confidence is absent. This makes properly localized bounding boxes degenerate during iterative regression or even suppressed during NMS. In the paper we propose IoU-Net learning to predict the IoU between each detected bounding box and the matched ground-truth. The network acquires this confidence of localization, which improves the NMS procedure by preserving accurately localized bounding boxes. Furthermore, an optimization-based bounding box refinement method is proposed, where the predicted IoU is formulated as the objective. Extensive experiments on the MS-COCO dataset show the effectiveness of IoU-Net, as well as its compatibility with and adaptivity to several state-of-the-art object detectors.

摘要：现代的基于CNN的物体检测器依靠边界框回归和非最大抑制(NMS)来定位对象。 虽然类标签的概率自然反映了分类置信度(classification confidence)，但缺乏定位置信度(localization confidence)。 这使得正确定位的边界框在迭代回归期间 degenerate 或甚至在NMS期间被抑制。 在本文中，**我们提出了IoU-Net学习来预测每个检测到的边界框与匹配的ground truth 之间的IoU**。 网络获得了定位置信度，通过保留精确的定位边界框来改进NMS。 此外，提出了一种基于优化的边界框细化方法，其中将预测的IoU表示为目标。 MS-COCO数据集上的大量实验表明了IoU-Net的有效性，以及它与几种最先进的物体探测器的兼容性和适应性。

arXiv：https://arxiv.org/abs/1807.11590

Note: Source code has not been released.
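
For readers new to the terms in the IoU-Net abstract, the sketch below computes the IoU of two axis-aligned boxes and runs a plain greedy NMS whose ranking score can be swapped. Ranking by a localization score instead of the classification score is a simplified stand-in for the idea in the abstract, not the paper's actual procedure. All boxes and scores are made up.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.5):
    """Greedy NMS: keep boxes in descending score order unless they overlap
    an already kept box by more than the IoU threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
class_scores = [0.9, 0.8, 0.7]           # hypothetical classification confidence
localization_scores = [0.6, 0.85, 0.7]   # hypothetical predicted IoU
kept_by_class = nms(boxes, class_scores)
kept_by_localization = nms(boxes, localization_scores)
```

Boxes 0 and 1 overlap heavily, so only one survives. Which one depends on the ranking score, which is how a better-localized box can be suppressed when only classification confidence is used.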

**《DetNet: A Backbone network for Object Detection》**

Abstract：Recent CNN based object detectors, no matter one-stage methods like YOLO, SSD, and RetinaNet or two-stage detectors like Faster R-CNN, R-FCN and FPN are usually trying to directly finetune from ImageNet pre-trained models designed for image classification. There has been little work discussing on the backbone feature extractor specifically designed for the object detection. More importantly, there are several differences between the tasks of image classification and object detection. 1. Recent object detectors like FPN and RetinaNet usually involve extra stages against the task of image classification to handle the objects with various scales. 2. Object detection not only needs to recognize the category of the object instances but also spatially locate the position. Large downsampling factor brings large valid receptive field, which is good for image classification but compromises the object location ability. Due to the gap between the image classification and object detection, we propose DetNet in this paper, which is a novel backbone network specifically designed for object detection. Moreover, DetNet includes the extra stages against traditional backbone network for image classification, while maintains high spatial resolution in deeper layers. Without any bells and whistles, state-of-the-art results have been obtained for both object detection and instance segmentation on the MSCOCO benchmark based on our DetNet (4.8G FLOPs) backbone. The code will be released for the reproduction.

摘要：最近的基于CNN的物体探测器，无论是像YOLO，SSD和RetinaNet这样的one-stage方法，还是像Faster R-CNN，R-FCN和FPN这样的two-stage探测器，都经常试图直接从ImageNet预先训练好的图像模型中进行微调分类。关于专门为物体检测设计的 backbone 特征提取器的讨论很少。更重要的是，**图像分类和对象检测的任务之间存在若干差异**。(1)最近的物体探测器如FPN和RetinaNet通常涉及额外的阶段，以防止图像分类的任务处理各种尺度的物体。 (2)目标检测不仅需要识别对象实例的类别，还需要在空间上定位位置。较大的下采样因子带来了较大的有效感受野，有利于图像分类，但会损害对象定位能力。由于图像分类和物体检测之间存在差距，本文提出了DetNet，这是一种专门用于物体检测的新型 backbone 网络。此外，DetNet还包括针对传统backbone网络的额外阶段，用于图像分类，同时在更深层中保持高空间分辨率。在没有任何其它tricks的情况下，基于我们的DetNet~（4.8G FLOP）backbone，在MSCOCO基准测试中获得了目标检测和实例分割的最优结果。

arXiv：https://arxiv.org/abs/1804.06215

Note: Source code has not been released.

================================================
FILE: 2018/10/17.md
================================================
**2018-10-17**

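This post introduces two recent ECCV 2018 papers on semantic segmentation: one proposes the Bilateral Segmentation Network (BiSeNet), which achieves real-time inference speed without sacrificing spatial resolution; the other proposes a UDA framework and the CBST framework, and introduces spatial priors to refine the generated labels.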

# Semantic Segmentation

**《BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation》**

Abstract：Semantic segmentation requires both rich spatial information and sizeable receptive field. However, modern approaches usually compromise spatial resolution to achieve real-time inference speed, which leads to poor performance. In this paper, we address this dilemma with a novel Bilateral Segmentation Network (BiSeNet). We first design a Spatial Path with a small stride to preserve the spatial information and generate high-resolution features. Meanwhile, a Context Path with a fast downsampling strategy is employed to obtain sufficient receptive field. On top of the two paths, we introduce a new Feature Fusion Module to combine features efficiently. The proposed architecture makes a right balance between the speed and segmentation performance on Cityscapes, CamVid, and COCO-Stuff datasets. Specifically, for a 2048x1024 input, we achieve 68.4% Mean IOU on the Cityscapes test dataset with speed of 105 FPS on one NVIDIA Titan XP card, which is significantly faster than the existing methods with comparable performance.

arXiv：http://arxiv.org/abs/1808.00897

Note：source code not released
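
The BiSeNet abstract reports its accuracy as Mean IOU. As a reminder of what that metric measures, here is a minimal sketch over flat label lists. It is a per-sample simplification: benchmark tooling such as the Cityscapes evaluation accumulates counts over the whole dataset and handles ignore labels, which this sketch does not. The function name and the toy labels are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both label maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In this toy case the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5.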

**《Unsupervised Domain Adaptation for Semantic Segmentation via Class-Balanced Self-Training》**

Abstract：Recent deep networks achieved state of the art performance on a variety of semantic segmentation tasks. Despite such progress, these models often face challenges in real world “wild tasks” where large difference between labeled training/source data and unseen test/target data exists. In particular, such difference is often referred to as “domain gap”, and could cause significantly decreased performance which cannot be easily remedied by further increasing the representation power. Unsupervised domain adaptation (UDA) seeks to overcome such problem without target domain labels. In this paper, we propose a novel UDA framework based on an iterative self-training (ST) procedure, where the problem is formulated as latent variable loss minimization, and can be solved by alternatively generating pseudo labels on target data and re-training the model with these labels. On top of ST, we also propose a novel class-balanced self-training (CBST) framework to avoid the gradual dominance of large classes on pseudo-label generation, and introduce spatial priors to refine generated labels. Comprehensive experiments show that the proposed methods achieve state of the art semantic segmentation performance under multiple major UDA settings.

paper：http://openaccess.thecvf.com/content_ECCV_2018/papers/Yang_Zou_Unsupervised_Domain_Adaptation_ECCV_2018_paper.pdf
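
The class-balancing idea in the abstract above can be sketched as per-class confidence thresholds for pseudo-label selection. This is a simplified illustration, not the paper's exact formulation: the `keep_fraction` parameter, the `IGNORE` label, and the function name are assumptions, and the spatial priors are left out.

```python
IGNORE = -1  # hypothetical label for pixels left without a pseudo label


def class_balanced_pseudo_labels(probs, keep_fraction):
    """probs: one list of class probabilities per pixel.

    Keeps, for every predicted class, roughly the top `keep_fraction`
    most confident pixels and marks the rest as IGNORE.
    """
    preds = [max(range(len(p)), key=p.__getitem__) for p in probs]
    confs = [p[k] for p, k in zip(probs, preds)]

    # One confidence threshold per class instead of a single global one.
    thresholds = {}
    for c in set(preds):
        class_confs = sorted((cf for cf, k in zip(confs, preds) if k == c), reverse=True)
        n_keep = max(1, int(len(class_confs) * keep_fraction))
        thresholds[c] = class_confs[n_keep - 1]

    return [k if cf >= thresholds[k] else IGNORE for k, cf in zip(preds, confs)]


# Class 0 is large and confident; class 1 is small and less confident.
probs = [[0.95, 0.05], [0.90, 0.10], [0.85, 0.15], [0.80, 0.20],
         [0.40, 0.60], [0.45, 0.55]]
print(class_balanced_pseudo_labels(probs, keep_fraction=0.5))
```

With a single global threshold of, say, 0.7, class 1 would receive no pseudo labels at all in this toy example; the per-class threshold keeps its most confident pixel.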

================================================
FILE: 2018/11/05-09.md
================================================
**2018-11-05~2018-11-09**

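This post covers 43 papers on CNNs, image classification, data augmentation, faces, image segmentation, OCR, GANs, style transfer, object tracking, datasets, pose estimation, and more.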

# **Datasets**

**《The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale》**

IJCV

arXiv：https://arxiv.org/abs/1811.00982

Dataset website：https://storage.googleapis.com/openimages/web/index.html

Note：9.2 million images


**《Toward Driving Scene Understanding: A Dataset for Learning Driver Behavior and Causal Reasoning》**

CVPR 2018

arXiv：https://arxiv.org/abs/1811.02307

datasets：https://usa.honda-ri.com/hdd


# **CNN**

**《Invertible Residual Networks》**

arXiv：https://arxiv.org/abs/1811.00995


**《You Only Search Once: Single Shot Neural Architecture Search via Direct Sparse Optimization》**

ICLR2019 Submission

arXiv：https://arxiv.org/abs/1811.01567

Note：work by an intern at TuSimple; said to outperform NAS


**《Bi-Real Net: Binarizing Deep Network Towards Real-Network Performance》**

Submitted to IJCV 2018

arXiv：https://arxiv.org/abs/1811.01335


**《Activation Functions: Comparison of trends in Practice and Research for Deep Learning》**

arXiv：https://arxiv.org/abs/1811.03378


**《Microscopic Nuclei Classification, Segmentation and Detection with improved Deep Convolutional Neural Network (DCNN) Approaches》**

arXiv：https://arxiv.org/abs/1811.03447


**《ColorUNet: A convolutional classification approach to colorization》**

arXiv：https://arxiv.org/abs/1811.03120


**《ExGate: Externally Controlled Gating for Feature-based Attention in Artificial Neural Networks》**

arXiv：https://arxiv.org/abs/1811.03403


# **Image Classification**

**《Learning from Large-scale Noisy Web Data with Ubiquitous Reweighting for Image Classification》**

arXiv：https://arxiv.org/abs/1811.00700


# **Data Augmentation**

**《Hide-and-Seek: A Data Augmentation Technique for Weakly-Supervised Localization and Beyond》**

TPAMI 

arXiv：https://arxiv.org/abs/1811.02545


# **Face**

**《Exposing DeepFake Videos By Detecting Face Warping Artifacts》**

arXiv：https://arxiv.org/abs/1811.00656


**《Exposing Deep Fakes Using Inconsistent Head Poses》**

arXiv：https://arxiv.org/abs/1811.00661


**《Fast Face Image Synthesis with Minimal Training》**

WACV 2019

arXiv：https://arxiv.org/abs/1811.01474

datasets：https://cvrl.nd.edu/projects/data/


**《Facial Landmark Detection for Manga Images》**

arXiv：https://arxiv.org/abs/1811.03214


# **Specific Object Detection**

**《Real-time Driver Drowsiness Detection for Android Application Using Deep Neural Networks Techniques》**

arXiv：https://arxiv.org/abs/1811.01627


**《Query-based Logo Segmentation》**

arXiv：https://arxiv.org/abs/1811.01395


# **Image Segmentation**

**《Prediction Error Meta Classification in Semantic Segmentation: Detection via Aggregated Dispersion Measures of Softmax Probabilities》**

arXiv：https://arxiv.org/abs/1811.00648


**《Unsupervised RGBD Video Object Segmentation Using GANs》**

ACCV workshop

arXiv：https://arxiv.org/abs/1811.01526


**《DUNet: A deformable network for retinal vessel segmentation》**

arXiv：https://arxiv.org/abs/1811.01206


**《Ischemic Stroke Lesion Segmentation in CT Perfusion Scans using Pyramid Pooling and Focal Loss》**

2018 MICCAI workshop

arXiv：https://arxiv.org/abs/1811.01085


**《An End-to-end Approach to Semantic Segmentation with 3D CNN and Posterior-CRF in Medical Images》**

NIPS 2018 Workshop

arXiv：https://arxiv.org/abs/1811.03549


**《Adaptive Semantic Segmentation with a Strategic Curriculum of Proxy Labels》**

arXiv：https://arxiv.org/abs/1811.03542


**《Deep Semantic Instance Segmentation of Tree-like Structures Using Synthetic Data》**

WACV 2019

arXiv：https://arxiv.org/abs/1811.03208


# **GAN**

**《Improving GAN with neighbors embedding and gradient matching》**

AAAI 2019

arXiv：https://arxiv.org/abs/1811.01333


**《A General Theory of Equivariant CNNs on Homogeneous Spaces》**

arXiv：https://arxiv.org/abs/1811.02017


**《Triple consistency loss for pairing distributions in GAN-based face synthesis》**

arXiv：https://arxiv.org/abs/1811.03492

github：https://github.com/ESanchezLozano/GANnotation

youtube：https://youtu.be/-8r7zexg4yg


# **OCR**

**《Auto-ML Deep Learning for Rashi Scripts OCR》**

arXiv：https://arxiv.org/abs/1811.01290


# **Irregular Text Recognition**

**《Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition》**

arXiv：https://arxiv.org/abs/1811.00751


# **Style Transfer**

**《Evolvement Constrained Adversarial Learning for Video Style Transfer》**

arXiv：https://arxiv.org/abs/1811.02476


# **Competition Workshop**

**《Introduction to the 1st Place Winning Model of OpenImages Relationship Detection Challenge》**

arXiv：https://arxiv.org/abs/1811.00662


# **Pose Estimation**

**《Improving Multi-Person Pose Estimation using Label Correction》**

arXiv：https://arxiv.org/abs/1811.03331


# **Object Tracking**

**《High Speed Tracking With A Fourier Domain Kernelized Correlation Filter》**

arXiv：https://arxiv.org/abs/1811.03236


# **Zero-shot Learning**

**《Model Selection for Generalized Zero-shot Learning》**

arXiv：https://arxiv.org/abs/1811.03252


# **3D**

**《SPNet: Deep 3D Object Classification and Retrieval using Stereographic Projection》**

arXiv：https://arxiv.org/abs/1811.01571


# **Filtering**

**《Fast Adaptive Bilateral Filtering》**

TIP

arXiv：https://arxiv.org/abs/1811.02308


**《Fast High-Dimensional Bilateral and Nonlocal Means Filtering》**

TIP

arXiv：https://arxiv.org/abs/1811.02363


# **Others**

**《Continual Occlusions and Optical Flow Estimation》**

ACCV 2018

arXiv：https://arxiv.org/abs/1811.01602


**《Texture Synthesis Guided Deep Hashing for Texture Image Retrieval》**

arXiv：https://arxiv.org/abs/1811.01401


**《Semantic bottleneck for computer vision tasks》**

ACCV 2018

arXiv：https://arxiv.org/abs/1811.02234


**《3DCapsule: Extending the Capsule Architecture to Classify 3D Point Clouds》**

WACV 2019

arXiv：https://arxiv.org/abs/1811.02191


**《Automatic Thresholding of SIFT Descriptors》**

ICIP 2016

arXiv：https://arxiv.org/abs/1811.03173


**《DragonPaint: Rule based bootstrapping for small data with an application to cartoon coloring》**

arXiv：https://arxiv.org/abs/1811.03151

================================================
FILE: 2018/11/19.md
================================================
**2018-11-19**

This post covers 12 papers on CNNs, faces, 3D, OCR, GANs, object detection, and more.

# CNN

**《GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism》**

arXiv：https://arxiv.org/abs/1811.06965

**《Residual Convolutional Neural Network Revisited with Active Weighted Mapping》**

arXiv：https://arxiv.org/abs/1811.06878

Note：interesting, this adds weights to ResNet!

**《DropFilter: A Novel Regularization Method for Learning Convolutional Neural Networks》**

submitted to CVPR19

arXiv：https://arxiv.org/abs/1811.06783

# Face
**《Image Pre-processing Using OpenCV Library on MORPH-II Face Database》**

arXiv：https://arxiv.org/abs/1811.06934

# 3D
**《The Perfect Match: 3D Point Cloud Matching with Smoothed Densities》**

arXiv：https://arxiv.org/abs/1811.06879

# Object Detection

**《DeRPN: Taking a further step toward more general object detection》**

AAAI 2019

arXiv：https://arxiv.org/abs/1811.06700

github：https://github.com/HCIILAB/DeRPN

**《Improving Fingerprint Pore Detection with a Small FCN》**

arXiv：https://arxiv.org/abs/1811.06846

github：https://github.com/gdahia/fingerprint-pore-detection

Note：impressive, it can even detect fingerprint pores. Not for readers with trypophobia!

**《Detecting The Objects on The Road Using Modular Lightweight Network》**

arXiv：https://arxiv.org/abs/1811.06641

# GAN
**《Conditional GANs for Multi-Illuminant Color Constancy: Revolution or Yet Another Approach?》**

arXiv：https://arxiv.org/abs/1811.06604

# Other

**《Automatic Paper Summary Generation from Visual and Textual Information》**

ICMV 2018

arXiv：https://arxiv.org/abs/1811.06943

github：https://cvpaperchallenge.github.io/AutoPaperSummaryGen/

Note：automatic generation of paper summaries. Impressive!

**《Anomaly Detection using Deep Learning based Image Completion》**

ICMLA 2018

arXiv：https://arxiv.org/abs/1811.06861

**《Ground Plane Polling for 6DoF Pose Estimation of Objects on the Road》**

arXiv：https://arxiv.org/abs/1811.06666

================================================
FILE: 2018/11/20.md
================================================
**2018-11-20**

This post covers 46 papers on CNNs, faces, image classification, object detection, image segmentation, GANs, Re-ID, SLAM, transfer learning, and more.

# CNN

**《Deeper Interpretability of Deep Networks》**

arXiv：https://arxiv.org/abs/1811.07807

> Deep Convolutional Neural Networks (CNNs) have been one of the most influential recent developments in computer vision, particularly for categorization. There is an increasing demand for explainable AI as these systems are deployed in the real world. However, understanding the information represented and processed in CNNs remains in most cases challenging. Within this paper, we explore the use of new information theoretic techniques developed in the field of neuroscience to enable novel understanding of how a CNN represents information. We trained a 10-layer ResNet architecture to identify 2,000 face identities from 26M images generated using a rigorously controlled 3D face rendering model that produced variations of intrinsic (i.e. face morphology, gender, age, expression and ethnicity) and extrinsic factors (i.e. 3D pose, illumination, scale and 2D translation). With our methodology, we demonstrate that unlike human's network overgeneralizes face identities even with extreme changes of face shape, but it is more sensitive to changes of texture. To understand the processing of information underlying these counterintuitive properties, we visualize the features of shape and texture that the network processes to identify faces. Then, we shed a light into the inner workings of the black box and reveal how hidden layers represent these features and whether the representations are invariant to pose. We hope that our methodology will provide an additional valuable tool for interpretability of CNNs.

**《Deep Shape-from-Template: Wide-Baseline, Dense and Fast Registration and Deformable Reconstruction from a Single Image》**

arXiv：https://arxiv.org/abs/1811.07791

> We present Deep Shape-from-Template (DeepSfT), a novel Deep Neural Network (DNN) method for solving real-time automatic registration and 3D reconstruction of a deformable object viewed in a single monocular image. DeepSfT advances the state-of-the-art in various aspects. Compared to existing DNN SfT methods, it is the first fully convolutional real-time approach that handles an arbitrary object geometry, topology and surface representation. It also does not require ground truth registration with real data and scales well to very complex object models with large numbers of elements. Compared to previous non-DNN SfT methods, it does not involve numerical optimization at run-time, and is a dense, wide-baseline solution that does not demand, and does not suffer from, feature-based matching. It is able to process a single image with significant deformation and viewpoint changes, and handles well the core challenges of occlusions, weak texture and blur. DeepSfT is based on residual encoder-decoder structures and refining blocks. It is trained end-to-end with a novel combination of supervised learning from simulated renderings of the object model and semi-supervised automatic fine-tuning using real data captured with a standard RGB-D camera. The cameras used for fine-tuning and run-time can be different, making DeepSfT practical for real-world use. We show that DeepSfT significantly outperforms state-of-the-art wide-baseline approaches for non-trivial templates, with quantitative and qualitative evaluation.

**《Do Normalization Layers in a Deep ConvNet Really Need to Be Distinct?》**

arXiv：https://arxiv.org/abs/1811.07727

> Yes, they do. This work investigates a perspective for deep learning: whether different normalization layers in a ConvNet require different normalizers. This is the first step towards understanding this phenomenon. We allow each convolutional layer to be stacked before a switchable normalization (SN) that learns to choose a normalizer from a pool of normalization methods. Through systematic experiments in ImageNet, COCO, Cityscapes, and ADE20K, we answer three questions: (a) Is it useful to allow each normalization layer to select its own normalizer? (b) What impacts the choices of normalizers? (c) Do different tasks and datasets prefer different normalizers? Our results suggest that (1) using distinct normalizers improves both learning and generalization of a ConvNet; (2) the choices of normalizers are more related to depth and batch size, but less relevant to parameter initialization, learning rate decay, and solver; (3) different tasks and datasets have different behaviors when learning to select normalizers.
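
One way to picture a layer that "learns to choose a normalizer from a pool" is a softmax-weighted mix of the statistics computed by several normalizers. The sketch below assumes the pool is instance, layer, and batch normalization, shares one set of weights between means and variances, and omits the learnable scale and shift; none of these details are stated in the abstract, and the function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_var(values):
    mu = sum(values) / len(values)
    return mu, sum((v - mu) ** 2 for v in values) / len(values)


def switchable_norm(x, logits, eps=1e-5):
    """x[n][c] is a flat list of spatial activations for sample n, channel c.

    `logits` are the (assumed) learned scores for instance, layer, and
    batch statistics, in that order.
    """
    w_in, w_ln, w_bn = softmax(logits)
    num_samples, num_channels = len(x), len(x[0])
    # Layer statistics: one (mean, var) per sample, over all channels.
    ln = [mean_var([v for c in range(num_channels) for v in x[n][c]])
          for n in range(num_samples)]
    # Batch statistics: one (mean, var) per channel, over all samples.
    bn = [mean_var([v for n in range(num_samples) for v in x[n][c]])
          for c in range(num_channels)]
    out = []
    for n in range(num_samples):
        row = []
        for c in range(num_channels):
            mu_in, var_in = mean_var(x[n][c])  # instance statistics
            mu = w_in * mu_in + w_ln * ln[n][0] + w_bn * bn[c][0]
            var = w_in * var_in + w_ln * ln[n][1] + w_bn * bn[c][1]
            row.append([(v - mu) / math.sqrt(var + eps) for v in x[n][c]])
        out.append(row)
    return out
```

When one logit dominates, the layer behaves like the corresponding single normalizer; with equal logits it averages all three sets of statistics.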

**《Self-Referenced Deep Learning》**

arXiv：https://arxiv.org/abs/1811.07598

> Knowledge distillation is an effective approach to transferring knowledge from a teacher neural network to a student target network for satisfying the low-memory and fast running requirements in practice use. Whilst being able to create stronger target networks compared to the vanilla non-teacher based learning strategy, this scheme needs to train additionally a large teacher model with expensive computational cost. In this work, we present a Self-Referenced Deep Learning (SRDL) strategy. Unlike both vanilla optimisation and existing knowledge distillation, SRDL distils the knowledge discovered by the in-training target model back to itself to regularise the subsequent learning procedure therefore eliminating the need for training a large teacher model. SRDL improves the model generalisation performance compared to vanilla learning and conventional knowledge distillation approaches with negligible extra computational cost. Extensive evaluations show that a variety of deep networks benefit from SRDL resulting in enhanced deployment performance on both coarse-grained object categorisation tasks (CIFAR10, CIFAR100, Tiny ImageNet, and ImageNet) and fine-grained person instance identification tasks (Market-1501).
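
The abstract does not give the SRDL loss, so the following is only a stand-in: it uses the standard temperature-softened distillation form, with the reference distribution coming from the model's own earlier predictions instead of a separate teacher. The function names, `alpha`, and `temperature` are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def self_referenced_loss(current_logits, earlier_logits, label,
                         alpha=0.5, temperature=2.0):
    """Cross-entropy on the true label plus a KL term that pulls the current
    prediction toward the model's own earlier (softened) prediction."""
    probs = softmax(current_logits)
    cross_entropy = -math.log(probs[label])
    reference = softmax(earlier_logits, temperature)
    current = softmax(current_logits, temperature)
    kl = sum(r * math.log(r / c) for r, c in zip(reference, current) if r > 0)
    return (1 - alpha) * cross_entropy + alpha * temperature ** 2 * kl


# The KL term vanishes when the model agrees with its earlier self.
same = self_referenced_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0], label=0)
drifted = self_referenced_loss([2.0, 0.5, -1.0], [0.5, 2.0, -1.0], label=0)
print(same < drifted)
```

Because the reference comes from the network being trained, no second, larger model has to be trained or stored, which is the cost saving the abstract emphasizes.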

**《Multimodal Densenet》**

arXiv：https://arxiv.org/abs/1811.07407

> Humans make accurate decisions by interpreting complex data from multiple sources. Medical diagnostics, in particular, often hinge on human interpretation of multi-modal information. In order for artificial intelligence to make progress in automated, objective, and accurate diagnosis and prognosis, methods to fuse information from multiple medical imaging modalities are required. However, combining information from multiple data sources has several challenges, as current deep learning architectures lack the ability to extract useful representations from multimodal information, and often simple concatenation is used to fuse such information. In this work, we propose Multimodal DenseNet, a novel architecture for fusing multimodal data. Instead of focusing on concatenation or early and late fusion, our proposed architectures fuses information over several layers and gives the model flexibility in how it combines information from multiple sources. We apply this architecture to the challenge of polyp characterization and landmark identification in endoscopy. Features from white light images are fused with features from narrow band imaging or depth maps. This study demonstrates that Multimodal DenseNet outperforms monomodal classification as well as other multimodal fusion techniques by a significant margin on two different datasets.

**《RePr: Improved Training of Convolutional Filters》**

arXiv：https://arxiv.org/abs/1811.07275

> A well-trained Convolutional Neural Network can easily be pruned without significant loss of performance. This is because of unnecessary overlap in the features captured by the network's filters. Innovations in network architecture such as skip/dense connections and Inception units have mitigated this problem to some extent, but these improvements come with increased computation and memory requirements at run-time. We attempt to address this problem from another angle - not by changing the network structure but by altering the training method. We show that by temporarily pruning and then restoring a subset of the model's filters, and repeating this process cyclically, overlap in the learned features is reduced, producing improved generalization. We show that the existing model-pruning criteria are not optimal for selecting filters to prune in this context and introduce inter-filter orthogonality as the ranking criteria to determine under-expressive filters. Our method is applicable both to vanilla convolutional networks and more complex modern architectures, and improves the performance across a variety of tasks, especially when applied to smaller networks.
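
A minimal sketch of ranking filters by inter-filter orthogonality, as described above: each filter is flattened and normalized, and its score is its average absolute cosine similarity to the other filters in the layer. The exact scoring formula and the helper names are assumptions; the abstract only names the criterion.

```python
import math


def orthogonality_scores(filters):
    """filters: flattened, non-zero weight vectors of one layer's filters.

    Returns one score per filter; a higher score means more overlap with
    the other filters (less orthogonal).
    """
    unit = []
    for f in filters:
        norm = math.sqrt(sum(v * v for v in f))
        unit.append([v / norm for v in f])
    n = len(unit)
    scores = []
    for i in range(n):
        # Row i of |W W^T - I|: absolute cosine similarity to every other filter.
        overlap = sum(abs(sum(a * b for a, b in zip(unit[i], unit[j])))
                      for j in range(n) if j != i)
        scores.append(overlap / n)
    return scores


def filters_to_prune(filters, k):
    """Indices of the k least orthogonal filters (candidates for temporary pruning)."""
    scores = orthogonality_scores(filters)
    return sorted(range(len(filters)), key=lambda i: scores[i], reverse=True)[:k]


# Filters 0 and 2 are nearly parallel; filter 1 is orthogonal to filter 0.
layer = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
print(filters_to_prune(layer, k=1))
```

In the training scheme the abstract describes, the selected filters would be dropped for a while and later restored, with the cycle repeated; that training loop is not shown here.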

**《PydMobileNet: Improved Version of MobileNets with Pyramid Depthwise Separable Convolution》**

arXiv：https://arxiv.org/abs/1811.07083

> Convolutional neural networks (CNNs) have shown remarkable performance in various computer vision tasks in recent years. However, the increasing model size has raised challenges in adopting them in real-time applications as well as mobile and embedded vision applications. Many works try to build networks as small as possible while still have acceptable performance. The state-of-the-art architecture is MobileNets. They use Depthwise Separable Convolution (DWConvolution) in place of standard Convolution to reduce the size of networks. This paper describes an improved version of MobileNet, called Pyramid Mobile Network. Instead of using just a 3×3 kernel size for DWConvolution like in MobileNet, the proposed network uses a pyramid kernel size to capture more spatial information. The proposed architecture is evaluated on two highly competitive object recognition benchmark datasets (CIFAR-10, CIFAR-100). The experiments demonstrate that the proposed network achieves better performance compared with MobileNet as well as other state-of-the-art networks. Additionally, it is more flexible in fine-tuning the trade-off between accuracy, latency and model size than MobileNets.
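
The size argument behind depthwise separable convolution, and the cost of adding a pyramid of kernel sizes, can be checked with simple parameter counts. The sketch assumes bias-free layers, one depthwise branch per kernel size over all input channels, and branch outputs summed before a single pointwise convolution; the abstract does not say how the branches are combined, and the channel counts and kernel sizes are illustrative.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


def pyramid_depthwise_params(kernel_sizes, c_in, c_out):
    """Several depthwise branches (assumed summed) plus one pointwise convolution."""
    depthwise = sum(k * k * c_in for k in kernel_sizes)
    return depthwise + c_in * c_out


c_in, c_out = 64, 128  # assumed channel counts
print(standard_conv_params(3, c_in, c_out))
print(depthwise_separable_params(3, c_in, c_out))
print(pyramid_depthwise_params((3, 5, 7), c_in, c_out))
```

Under these assumptions the pyramid variant costs more than a single 3x3 depthwise separable layer but remains far smaller than a standard 3x3 convolution with the same channel counts.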

# Face

**《Aff-Wild2: Extending the Aff-Wild Database for Affect Recognition》**

arXiv：https://arxiv.org/abs/1811.07770

> Automatic understanding of human affect using visual signals is a problem that has attracted significant interest over the past 20 years. However, human emotional states are quite complex. To appraise such states displayed in real-world settings, we need expressive emotional descriptors that are capable of capturing and describing this complexity. The circumplex model of affect, which is described in terms of valence (i.e., how positive or negative is an emotion) and arousal (i.e., power of the activation of the emotion), can be used for this purpose. Recent progress in the emotion recognition domain has been achieved through the development of deep neural architectures and the availability of very large training databases. To this end, Aff-Wild has been the first large-scale "in-the-wild" database, containing around 1,200,000 frames. In this paper, we build upon this database, extending it with 260 more subjects and 1,413,000 new video frames. We call the union of Aff-Wild with the additional data, Aff-Wild2. The videos are downloaded from Youtube and have large variations in pose, age, illumination conditions, ethnicity and profession. Both database-specific as well as cross-database experiments are performed in this paper, by utilizing the Aff-Wild2, along with the RECOLA database. The developed deep neural architectures are based on the joint training of state-of-the-art convolutional and recurrent neural networks with attention mechanism; thus exploiting both the invariant properties of convolutional features, while modeling temporal dynamics that arise in human behaviour via the recurrent layers. The obtained results show premise for utilization of the extended Aff-Wild, as well as of the developed deep neural architectures for visual analysis of human behaviour in terms of continuous emotion dimensions.

# Image Classification

**《High Order Neural Networks for Video Classification》**

arXiv：https://arxiv.org/abs/1811.07519

> Capturing spatiotemporal correlations is an essential topic in video classification. In this paper, we present high order operations as a generic family of building blocks for capturing high order correlations from high dimensional input video space. We prove that several successful architectures for visual classification tasks are in the family of high order neural networks, theoretical and experimental analysis demonstrates their underlying mechanism is high order. We also propose a new LEarnable hiGh Order (LEGO) block, whose goal is to capture spatiotemporal correlation in a feedforward manner. Specifically, LEGO blocks implicitly learn the relation expressions for spatiotemporal features and use the learned relations to weight input features. This building block can be plugged into many neural network architectures, achieving evident improvement without introducing much overhead. On the task of video classification, even using RGB only without fine-tuning with other video datasets, our high order models can achieve results on par with or better than the existing state-of-the-art methods on both Something-Something (V1 and V2) and Charades datasets.

**《DeepConsensus: using the consensus of features from multiple layers to attain robust image classification》**

arXiv：https://arxiv.org/abs/1811.07266

> We consider a classifier whose test set is exposed to various perturbations that are not present in the training set. These test samples still contain enough features to map them to the same class as their unperturbed counterpart. Current architectures exhibit rapid degradation of accuracy when trained on standard datasets but then used to classify perturbed samples of that data. To address this, we present a novel architecture named DeepConsensus that significantly improves generalization to these test-time perturbations. Our key insight is that deep neural networks should directly consider summaries of low and high level features when making classifications. Existing convolutional neural networks can be augmented with DeepConsensus, leading to improved resistance against large and small perturbations on MNIST, EMNIST, FashionMNIST, CIFAR10 and SVHN datasets.


# Object Detection

**《Weakly Supervised Soft-detection-based Aggregation Method for Image Retrieval》**

arXiv：https://arxiv.org/abs/1811.07619

> In recent years, the compact representations based on activations of Convolutional Neural Network (CNN) achieve remarkable performance in image retrieval. Some interested object only takes up a small part of the whole image. Therefore, it is significant to extract the discriminative representations that contain regional information of pivotal small object. In this paper, we propose a novel weakly supervised soft-detection-based aggregation (SDA) method free from bounding box annotations for image retrieval. In order to highlight the certain discriminative pattern of objects and suppress the noise of background, we employ trainable soft region proposals that indicate the probability of interested object and reflect the significance of candidate regions. We conduct comprehensive experiments on standard image retrieval datasets. Our weakly supervised SDA method achieves state-of-the-art performance on most benchmarks. The results demonstrate that the proposed SDA method is effective for image retrieval.

**《Fast Efficient Object Detection Using Selective Attention》**

arXiv：https://arxiv.org/abs/1811.07502

> Deep learning object detectors achieve state-of-the-art accuracy at the expense of high computational overheads, impeding their utilization on embedded systems such as drones. A primary source of these overheads is the exhaustive classification of typically 10^4-10^5 regions per image. Given that most of these regions contain uninformative background, the detector designs seem extremely superfluous and inefficient. In contrast, biological vision systems leverage selective attention for fast and efficient object detection. Recent neuroscientific findings shedding new light on the mechanism behind selective attention allowed us to formulate a new hypothesis of object detection efficiency and subsequently introduce a new object detection paradigm. To that end, we leverage this knowledge to design a novel region proposal network and empirically show that it achieves high object detection performance on the COCO dataset. Moreover, the model uses two to three orders of magnitude fewer computations than state-of-the-art models and consequently achieves inference speeds exceeding 500 frames/s, thereby making it possible to achieve object detection on embedded systems.

**《FotonNet: A HW-Efficient Object Detection System Using 3D-Depth Segmentation and 2D-DNN Classifier》**

arXiv：https://arxiv.org/abs/1811.07493

> Object detection and classification is one of the most important computer vision problems. Ever since the introduction of deep learning (Krizhevsky et al., 2012), we have witnessed a dramatic increase in the accuracy of this object detection problem. However, most of these improvements have occurred using conventional 2D image processing. Recently, low-cost 3D-image sensors, such as the Microsoft Kinect (Time-of-Flight) or the Apple FaceID (Structured-Light), can provide 3D-depth or point cloud data that can be added to a convolutional neural network, acting as an extra set of dimensions. In our proposed approach, we introduce a new 2D + 3D system that takes the 3D-data to determine the object region followed by any conventional 2D-DNN, such as AlexNet. In this method, our approach can easily dissociate the information collection from the Point Cloud and 2D-Image data and combine both operations later. Hence, our system can use any existing trained 2D network on a large image dataset, and does not require a large 3D-depth dataset for new training. Experimental object detection results across 30 images show an accuracy of 0.67, versus 0.54 and 0.51 for RCNN and YOLO, respectively.

**《R2CNN++: Multi-Dimensional Attention Based Rotation Invariant Detector with Robust Anchor Strategy》**

arXiv：https://arxiv.org/abs/1811.07126

> Object detection plays a vital role in natural scene and aerial scene and is full of challenges. Although many advanced algorithms have succeeded in the natural scene, the progress in the aerial scene has been slow due to the complexity of the aerial image and the large degree of freedom of remote sensing objects in scale, orientation, and density. In this paper, a novel multi-category rotation detector is proposed, which can efficiently detect small objects, arbitrary direction objects, and dense objects in complex remote sensing images. Specifically, the proposed model adopts a targeted feature fusion strategy called inception fusion network, which fully considers factors such as feature fusion, anchor sampling, and receptive field to improve the ability to handle small objects. Then we combine the pixel attention network and the channel attention network to weaken the noise information and highlight the objects feature. Finally, the rotational object detection algorithm is realized by redefining the rotating bounding box. Experiments on public datasets including DOTA, NWPU VHR-10 demonstrate that the proposed algorithm significantly outperforms state-of-the-art methods. The code and models will be available at https://github.com/DetectionTeamUCAS/R2CNN-Plus-Plus_Tensorflow.


# Saliency Detection

**《Global and Local Sensitivity Guided Key Salient Object Re-augmentation for Video Saliency Detection》**

arXiv：https://arxiv.org/abs/1811.07480

> The existing still-static deep learning based saliency researches do not consider the weighting and highlighting of extracted features from different layers, all features contribute equally to the final saliency decision-making. Such methods always evenly detect all "potentially significant regions" and unable to highlight the key salient object, resulting in detection failure of dynamic scenes. In this paper, based on the fact that salient areas in videos are relatively small and concentrated, we propose a **key salient object re-augmentation method (KSORA) using top-down semantic knowledge and bottom-up feature guidance** to improve detection accuracy in video scenes. KSORA includes two sub-modules (WFE and KOS): WFE processes local salient feature selection using bottom-up strategy, while KOS ranks each object in global fashion by top-down statistical knowledge, and chooses the most critical object area for local enhancement. The proposed KSORA can not only strengthen the saliency value of the local key salient object but also ensure global saliency consistency. Results on three benchmark datasets suggest that our model has the capability of improving the detection accuracy on complex scenes. The significant performance of KSORA, with a speed of 17FPS on modern GPUs, has been verified by comparisons with other ten state-of-the-art algorithms.

# Scene Text Detection

**《Pixel-Anchor: A Fast Oriented Scene Text Detector with Combined Networks》**

arXiv：https://arxiv.org/abs/1811.07432

> Recently, semantic segmentation and general object detection frameworks have been widely adopted by scene text detecting tasks. However, both of them alone have obvious shortcomings in practice. In this paper, we propose a novel end-to-end trainable deep neural network framework, named Pixel-Anchor, which combines semantic segmentation and SSD in one network by feature sharing and anchor-level attention mechanism to detect oriented scene text. To deal with scene text which has large variances in size and aspect ratio, we combine FPN and ASPP operation as our encoder-decoder structure in the semantic segmentation part, and propose a novel Adaptive Predictor Layer in the SSD. Pixel-Anchor detects scene text in a single network forward pass, no complex post-processing other than an efficient fusion Non-Maximum Suppression is involved. We have benchmarked the proposed Pixel-Anchor on the public datasets. Pixel-Anchor outperforms the competing methods in terms of text localization accuracy and run speed, more specifically, on the ICDAR 2015 dataset, the proposed algorithm achieves an F-score of 0.8768 at 10 FPS for 960 x 1728 resolution images.

**《Improving Rotated Text Detection with Rotation Region Proposal Networks》**

arXiv：https://arxiv.org/abs/1811.07031

> A significant number of images shared on social media platforms such as Facebook and Instagram contain text in various forms. It's increasingly becoming commonplace for bad actors to share misinformation, hate speech or other kinds of harmful content as text overlaid on images on such platforms. A scene-text understanding system should hence be able to handle text in various orientations that the adversary might use. Moreover, such a system can be incorporated into screen readers used to aid the visually impaired. In this work, we extend the scene-text extraction system at Facebook, Rosetta, to efficiently handle text in various orientations. Specifically, we incorporate the Rotation Region Proposal Networks (RRPN) in our text extraction pipeline and offer practical suggestions for building and deploying a model for detecting and recognizing text in arbitrary orientations efficiently. Experimental results show a significant improvement on detecting rotated text.


# Image Segmentation

**《OrthoSeg: A Deep Multimodal Convolutional Neural Network for Semantic Segmentation of Orthoimagery》**

arXiv：https://arxiv.org/abs/1811.07859

> This paper addresses the task of semantic segmentation of orthoimagery using multimodal data e.g. optical RGB, infrared and digital surface model. We propose a deep convolutional neural network architecture termed OrthoSeg for semantic segmentation using multimodal, orthorectified and coregistered data. We also propose a training procedure for supervised training of OrthoSeg. The training procedure complements the inherent architectural characteristics of OrthoSeg for preventing complex co-adaptations of learned features, which may arise due to probable high dimensionality and spatial correlation in multimodal and/or multispectral coregistered data. OrthoSeg consists of parallel encoding networks for independent encoding of multimodal feature maps and a decoder designed for efficiently fusing independently encoded multimodal feature maps. A softmax layer at the end of the network uses the features generated by the decoder for pixel-wise classification. The decoder fuses feature maps from the parallel encoders locally as well as contextually at multiple scales to generate per-pixel feature maps for final pixel-wise classification resulting in segmented output. We experimentally show the merits of OrthoSeg by demonstrating state-of-the-art accuracy on the ISPRS Potsdam 2D Semantic Segmentation dataset. Adaptability is one of the key motivations behind OrthoSeg so that it serves as a useful architectural option for a wide range of problems involving the task of semantic segmentation of coregistered multimodal and/or multispectral imagery. Hence, OrthoSeg is designed to enable independent scaling of parallel encoder networks and decoder network to better match application requirements, such as the number of input channels, the effective field-of-view, and model capacity.

**《M2U-Net: Effective and Efficient Retinal Vessel Segmentation for Resource-Constrained Environments》**

arXiv：https://arxiv.org/abs/1811.07738

> In this paper, we present a novel neural network architecture for retinal vessel segmentation that improves over the state of the art on two benchmark datasets, is the first to run in real time on high resolution images, and its small memory and processing requirements make it deployable in mobile and embedded systems. 
The M2U-Net has a new encoder-decoder architecture that is inspired by the U-Net. It adds pretrained components of MobileNetV2 in the encoder part and novel contractive bottleneck blocks in the decoder part that, combined with bilinear upsampling, drastically reduce the parameter count to 0.55M compared to 31.03M in the original U-Net. 
We have evaluated its performance against a wide body of previously published results on three public datasets. On two of them, the M2U-Net achieves new state-of-the-art performance by a considerable margin. When implemented on a GPU, our method is the first to achieve real-time inference speeds on high-resolution fundus images. We also implemented our proposed network on an ARM-based embedded system where it segments images in between 0.6 and 15 sec, depending on the resolution. Thus, the M2U-Net enables a number of applications of retinal vessel structure extraction, such as early diagnosis of eye diseases, retinal biometric authentication systems, and robot assisted microsurgery.
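
To put the parameter counts quoted above in perspective, the following short calculation uses only the two figures from the abstract (0.55M and 31.03M); the variable names are chosen for this example.

```python
# Parameter counts in millions, as quoted in the abstract.
unet_params_m = 31.03
m2unet_params_m = 0.55

reduction_factor = unet_params_m / m2unet_params_m
remaining_fraction = m2unet_params_m / unet_params_m

print(f"{reduction_factor:.1f}x fewer parameters")
print(f"{remaining_fraction:.2%} of the original parameter count")
```

In other words, M2U-Net keeps under 2% of the parameters of the original U-Net, a reduction of roughly 56 times.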

# Object Tracking

**《Robust Visual Tracking using Multi-Frame Multi-Feature Joint Modeling》**

arXiv：https://arxiv.org/abs/1811.07498

> It remains a huge challenge to design effective and efficient trackers under complex scenarios, including occlusions, illumination changes and pose variations. To cope with this problem, a promising solution is to integrate the temporal consistency across consecutive frames and multiple feature cues in a unified model. Motivated by this idea, we propose a novel correlation filter-based tracker in this work, in which the temporal relatedness is reconciled under a multi-task learning framework and the multiple feature cues are modeled using a multi-view learning approach. We demonstrate the resulting regression model can be efficiently learned by exploiting the structure of blockwise diagonal matrix. A fast blockwise diagonal matrix inversion algorithm is developed thereafter for efficient online tracking. Meanwhile, we incorporate an adaptive scale estimation mechanism to strengthen the stability of scale variation tracking. We implement our tracker using two types of features and test it on two benchmark datasets. Experimental results demonstrate the superiority of our proposed approach when compared with other state-of-the-art trackers.

Project homepage：http://bmal.hust.edu.cn/project/KMF2JMTtracking.html

**《Deep Siamese Networks with Bayesian non-Parametrics for Video Object Tracking》**

arXiv：https://arxiv.org/abs/1811.07386

> We present a novel algorithm utilizing a deep Siamese neural network as a general object similarity function in combination with a Bayesian optimization (BO) framework to encode spatio-temporal information for efficient object tracking in video. In particular, we treat the video tracking problem as a dynamic (i.e. temporally-evolving) optimization problem. Using Gaussian Process priors, we model a dynamic objective function representing the location of a tracked object in each frame. By exploiting temporal correlations, the proposed method queries the search space in a statistically principled and efficient way, offering several benefits over current state of the art video tracking methods.

**《Exploit the Connectivity: Multi-Object Tracking with TrackletNet》**

arXiv：https://arxiv.org/abs/1811.07258

> Multi-object tracking (MOT) is an important and practical task related to both surveillance systems and moving camera applications, such as autonomous driving and robotic vision. However, due to unreliable detection, occlusion and fast camera motion, tracked targets can be easily lost, which makes MOT very challenging. Most recent works treat tracking as a re-identification (Re-ID) task, but how to combine appearance and temporal features is still not well addressed. In this paper, we propose an innovative and effective tracking method called TrackletNet Tracker (TNT) that combines temporal and appearance information together as a unified framework. First, we define a graph model which treats each tracklet as a vertex. The tracklets are generated by appearance similarity with CNN features and intersection-over-union (IOU) with epipolar constraints to compensate camera movement between adjacent frames. Then, for every pair of two tracklets, the similarity is measured by our designed multi-scale TrackletNet. Afterwards, the tracklets are clustered into groups which represent individual object IDs. Our proposed TNT has the ability to handle most of the challenges in MOT, and achieve promising results on MOT16 and MOT17 benchmark datasets compared with other state-of-the-art methods.
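
The final step described above, grouping tracklets (graph vertices) into object IDs from pairwise similarities, can be illustrated with a deliberately simple stand-in. The abstract does not say which clustering procedure TNT uses; the sketch below assumes thresholded similarities and connected components via union-find, and the function name, threshold, and scores are all hypothetical.

```python
def cluster_tracklets(num_tracklets, similarity, threshold=0.5):
    """Group tracklets whose pairwise similarity reaches the threshold.

    similarity maps (i, j) index pairs to a score in [0, 1]; in TNT this
    score would come from TrackletNet, here it is supplied by hand.
    """
    parent = list(range(num_tracklets))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), score in similarity.items():
        if score >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri

    groups = {}
    for t in range(num_tracklets):
        groups.setdefault(find(t), []).append(t)
    return sorted(groups.values())


# Hypothetical similarity scores between five tracklets.
similarity = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.1, (3, 4): 0.7}
groups = cluster_tracklets(5, similarity)
print(groups)
```

Each resulting group corresponds to one object ID; tracklets 2 and 3 stay apart because their similarity falls below the threshold.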


# GAN

**《Injecting and removing malignant features in mammography with CycleGAN: Investigation of an automated adversarial attack using neural networks》**

arXiv：https://arxiv.org/abs/1811.07767

> Purpose To train a cycle-consistent generative adversarial network (CycleGAN) on mammographic data to inject or remove features of malignancy, and to determine whether these AI-mediated attacks can be detected by radiologists. Material and Methods From the two publicly available datasets, BCDR and INbreast, we selected images from cancer patients and healthy controls. An internal dataset served as test data, withheld during training. We ran two experiments training CycleGAN on low and higher resolution images (256×256 px and 512×408 px). Three radiologists read the images and rated the likelihood of malignancy on a scale from 1-5 and the likelihood of the image being manipulated. The readout was evaluated by ROC analysis (Area under the ROC curve = AUC). Results At the lower resolution, only one radiologist exhibited markedly lower detection of cancer (AUC=0.85 vs 0.63, p=0.06), while the other two were unaffected (0.67 vs. 0.69 and 0.75 vs. 0.77, p=0.55). Only one radiologist could discriminate between original and modified images slightly better than guessing/chance (0.66, p=0.008). At the higher resolution, all radiologists showed significantly lower detection rate of cancer in the modified images (0.77-0.84 vs. 0.59-0.69, p=0.008), however, they were now able to reliably detect modified images due to better visibility of artifacts (0.92, 0.92 and 0.97). Conclusion A CycleGAN can implicitly learn malignant features and inject or remove them so that a substantial proportion of small mammographic images would consequently be misdiagnosed. At higher resolutions, however, the method is currently limited and has a clear trade-off between manipulation of images and introduction of artifacts.
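
The readout above is summarised by the area under the ROC curve (AUC). As a minimal sketch of what that number means, AUC can be computed as the probability that a randomly chosen positive case receives a higher rating than a randomly chosen negative one, with ties counted as half. The ratings below are invented for illustration and are not data from the study.

```python
def auc(scores_positive, scores_negative):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a correct ranking."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Hypothetical 1-5 malignancy ratings from one reader.
malignant_ratings = [5, 4, 4, 2]
healthy_ratings = [1, 2, 3, 4]
value = auc(malignant_ratings, healthy_ratings)
print(value)
```

An AUC of 0.5 corresponds to guessing, which is why the abstract compares the readers' manipulation-detection scores against chance.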

**《SEIGAN: Towards Compositional Image Generation by Simultaneously Learning to Segment, Enhance, and Inpaint》**

arXiv：https://arxiv.org/abs/1811.07630

> We present a novel approach to image manipulation and understanding by simultaneously learning to segment object masks, paste objects to another background image, and remove them from original images. For this purpose, we develop a novel generative model for compositional image generation, SEIGAN (Segment-Enhance-Inpaint Generative Adversarial Network), which learns these three operations together in an adversarial architecture with additional cycle consistency losses. To train, SEIGAN needs only bounding box supervision and does not require pairing or ground truth masks. SEIGAN produces better generated images (evaluated by human assessors) than other approaches and produces high-quality segmentation masks, improving over other adversarially trained approaches and getting closer to the results of fully supervised training.

**《GAN-QP: A Novel GAN Framework without Gradient Vanishing and Lipschitz Constraint》**

arXiv：https://arxiv.org/abs/1811.07296

> We know SGAN may have a risk of gradient vanishing. A significant improvement is WGAN, which uses a 1-Lipschitz constraint on the discriminator to prevent gradient vanishing. Is there a GAN with neither gradient vanishing nor a 1-Lipschitz constraint on the discriminator? We do find one, called GAN-QP.
Constructing a new Generative Adversarial Network (GAN) framework usually involves three steps: 1. choose a probability divergence; 2. convert it into a dual form; 3. play a min-max game. In this article, we demonstrate that the first step is not necessary. We can analyse the property of divergence and even construct new divergence in dual space directly. As a reward, we obtain a simpler alternative to WGAN: GAN-QP. We demonstrate that GAN-QP has better performance than WGAN in theory and practice.

# 3D

**《Modeling Local Geometric Structure of 3D Point Clouds using Geo-CNN》**

arXiv：https://arxiv.org/abs/1811.07782

> Recent advances in deep convolutional neural networks (CNNs) have motivated researchers to adapt CNNs to directly model points in 3D point clouds. Modeling local structure has been proven to be important for the success of convolutional architectures, and researchers exploited the modeling of local point sets in the feature extraction hierarchy. However, limited attention has been paid to explicitly model the geometric structure amongst points in a local region. To address this problem, we propose Geo-CNN, which applies a generic convolution-like operation dubbed as GeoConv to each point and its local neighborhood. Local geometric relationships among points are captured when extracting edge features between the center and its neighboring points. We first decompose the edge feature extraction process onto three orthogonal bases, and then aggregate the extracted features based on the angles between the edge vector and the bases. This encourages the network to preserve the geometric structure in Euclidean space throughout the feature extraction hierarchy. GeoConv is a generic and efficient operation that can be easily integrated into 3D point cloud analysis pipelines for multiple applications. We evaluate Geo-CNN on ModelNet40 and KITTI and achieve state-of-the-art performance.
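
The angle-based aggregation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the squared-cosine weighting, the axis-aligned bases chosen by the sign of each edge component, the function names `geoconv_weights` and `aggregate`, and the toy feature vectors are all assumptions made for this example.

```python
import math


def geoconv_weights(center, neighbor):
    """Return {basis_name: weight} for the edge from center to neighbor.

    The edge is projected onto the three axis directions it points towards
    (for example '+x' or '-y'); each weight is the squared cosine of the
    angle between the edge and that basis, so the weights sum to 1.
    """
    edge = [n - c for n, c in zip(neighbor, center)]
    norm = math.sqrt(sum(e * e for e in edge))
    if norm == 0.0:
        return {}
    weights = {}
    for axis, component in zip("xyz", edge):
        sign = "+" if component >= 0 else "-"
        weights[sign + axis] = (component / norm) ** 2
    return weights


def aggregate(per_basis_features, weights):
    """Blend per-direction feature vectors using the angle-based weights."""
    dim = len(next(iter(per_basis_features.values())))
    out = [0.0] * dim
    for name, w in weights.items():
        for k, value in enumerate(per_basis_features[name]):
            out[k] += w * value
    return out


weights = geoconv_weights((0.0, 0.0, 0.0), (1.0, -2.0, 2.0))
# Hypothetical features, one vector per basis direction.
features = {"+x": [1.0, 0.0], "-y": [0.0, 1.0], "+z": [1.0, 1.0]}
edge_feature = aggregate(features, weights)
print(weights, edge_feature)
```

In a real network each basis direction would have its own learned transform producing the per-direction features; here they are fixed lists so the weighting itself is easy to inspect.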

**《PointConv: Deep Convolutional Networks on 3D Point Clouds》**

arXiv：https://arxiv.org/abs/1811.07246

> Unlike images which are represented in regular dense grids, 3D point clouds are irregular and unordered, hence applying convolution on them can be difficult. In this paper, we extend the dynamic filter to a new convolution operation, named PointConv. PointConv can be applied on point clouds to build deep convolutional networks. We treat convolution kernels as nonlinear functions of the local coordinates of 3D points comprised of weight and density functions. With respect to a given point, the weight functions are learned with multi-layer perceptron networks and the density functions through kernel density estimation. A novel reformulation is proposed for efficiently computing the weight functions, which allowed us to dramatically scale up the network and significantly improve its performance. The learned convolution kernel can be used to compute translation-invariant and permutation-invariant convolution on any point set in the 3D space. Besides, PointConv can also be used as deconvolution operators to propagate features from a subsampled point cloud back to its original resolution. Experiments on ModelNet40, ShapeNet, and ScanNet show that deep convolutional neural networks built on PointConv are able to achieve state-of-the-art on challenging semantic segmentation benchmarks on 3D point clouds. Besides, our experiments converting CIFAR-10 into a point cloud showed that networks built on PointConv can match the performance of convolutional networks in 2D images of a similar structure.
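
The abstract states that density functions are obtained through kernel density estimation. A minimal sketch of that idea, assuming a Gaussian kernel and a simple inverse-density reweighting, is shown below. How PointConv actually feeds the density into the convolution is not described in the abstract, so the bandwidth, the normalisation, and the function names are assumptions for this example.

```python
import math


def gaussian_kde(points, bandwidth=0.5):
    """Estimate the density at every 3D point with a Gaussian kernel
    summed over all points in the set (including the point itself)."""
    n = len(points)
    norm = (bandwidth * math.sqrt(2.0 * math.pi)) ** 3  # 3D Gaussian normaliser
    densities = []
    for p in points:
        total = 0.0
        for q in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        densities.append(total / (n * norm))
    return densities


def inverse_density_weights(points, bandwidth=0.5):
    """Weights proportional to 1 / density, normalised to sum to 1, so that
    points in sparsely sampled regions contribute more."""
    densities = gaussian_kde(points, bandwidth)
    inv = [1.0 / d for d in densities]
    total = sum(inv)
    return [w / total for w in inv]


# Three clustered points and one isolated point.
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (3.0, 3.0, 3.0)]
point_weights = inverse_density_weights(points)
print(point_weights)
```

The isolated point receives the largest weight, which compensates for non-uniform sampling when features are summed over a neighbourhood.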

**《Topology-Aware Non-Rigid Point Cloud Registration》**

arXiv：https://arxiv.org/abs/1811.07014

> In this paper, we introduce a non-rigid registration pipeline for pairs of unorganized point clouds that may be topologically different. Standard warp field estimation algorithms, even under robust, discontinuity-preserving regularization, tend to produce erratic motion estimates on boundaries associated with "close-to-open" topology changes. We overcome this limitation by exploiting backward motion: in the opposite motion direction, a "close-to-open" event becomes "open-to-close", which is by default handled correctly. At the core of our approach lies a general, topology-agnostic warp field estimation algorithm, similar to those employed in recently introduced dynamic reconstruction systems from RGB-D input. We improve motion estimation on boundaries associated with topology changes in an efficient post-processing phase. Based on both forward and (inverted) backward warp hypotheses, we explicitly detect regions of the deformed geometry that undergo topological changes by means of local deformation criteria and broadly classify them as "contacts" or "separations". Subsequently, the two motion hypotheses are seamlessly blended on a local basis, according to the type and proximity of detected events. Our method achieves state-of-the-art motion estimation accuracy on the MPI Sintel dataset. Experiments on a custom dataset with topological event annotations demonstrate the effectiveness of our pipeline in estimating motion on event boundaries, as well as promising performance in explicit topological event detection.


# Re-ID

**《Past, Present, and Future Approaches Using Computer Vision for Animal Re-Identification from Camera Trap Data》**

arXiv：https://arxiv.org/abs/1811.07749

> The ability of a researcher to re-identify (re-ID) an individual animal upon re-encounter is fundamental for addressing a broad range of questions in the study of ecosystem function, community and population dynamics, and behavioural ecology. In this review, we describe a brief history of camera traps for re-ID, present a collection of computer vision feature engineering methodologies previously used for animal re-ID, provide an introduction to the underlying mechanisms of deep learning relevant to animal re-ID, highlight the success of deep learning methods for human re-ID, describe the few ecological studies currently utilizing deep learning for camera trap analyses, and offer our predictions for near future methodologies based on the rapid development of deep learning methods. By utilizing novel deep learning methods for object detection and similarity comparisons, ecologists can extract animals from image/video data and train deep learning classifiers to re-ID animal individuals beyond the capabilities of a human observer. This methodology will allow ecologists with camera/video trap data to re-identify individuals that exit and re-enter the camera frame. Our expectation is that this is just the beginning of a major trend that could stand to revolutionize the analysis of camera trap data and, ultimately, our approach to animal ecology.

**《CA3Net: Contextual-Attentional Attribute-Appearance Network for Person Re-Identification》**

arXiv：https://arxiv.org/abs/1811.07544

> Person re-identification aims to identify the same pedestrian across non-overlapping camera views. Deep learning techniques have been applied for person re-identification recently, towards learning representation of pedestrian appearance. This paper presents a novel Contextual-Attentional Attribute-Appearance Network (CA3Net) for person re-identification. The CA3Net simultaneously exploits the complementarity between semantic attributes and visual appearance, the semantic context among attributes, visual attention on attributes as well as spatial dependencies among body parts, leading to discriminative and robust pedestrian representation. Specifically, an attribute network within CA3Net is designed with an Attention-LSTM module. It concentrates the network on latent image regions related to each attribute as well as exploits the semantic context among attributes by a LSTM module. An appearance network is developed to learn appearance features from the full body, horizontal and vertical body parts of pedestrians with spatial dependencies among body parts. The CA3Net jointly learns the attribute and appearance features in a multi-task learning manner, generating comprehensive representation of pedestrians. Extensive experiments on two challenging benchmarks, i.e., Market-1501 and DukeMTMC-reID datasets, have demonstrated the effectiveness of the proposed approach.

**《Re-Identification with Consistent Attentive Siamese Networks》**

arXiv：https://arxiv.org/abs/1811.07487

> We propose a new deep architecture for person re-identification (re-id). While re-id has seen much recent progress, spatial localization and view-invariant representation learning for robust cross-view matching remain key, unsolved problems. We address these questions by means of a new attention-driven Siamese learning architecture, called the Consistent Attentive Siamese Network. Our key innovations compared to existing, competing methods include (a) a flexible framework design that produces attention with only identity labels as supervision, (b) explicit mechanisms to enforce attention consistency among images of the same person, and (c) a new Siamese framework that integrates attention and attention consistency, producing principled supervisory signals as well as the first mechanism that can explain the reasoning behind the Siamese framework's predictions. We conduct extensive evaluations on the CUHK03-NP, DukeMTMC-ReID, and Market-1501 datasets, and establish a new state of the art, with our proposed method resulting in mAP performance improvements of 6.4%, 4.2%, and 1.2% respectively.


# SLAM

**《Collaborative Dense SLAM》**

arXiv：https://arxiv.org/abs/1811.07632

> In this paper, we present a new system for live collaborative dense surface reconstruction. Cooperative robotics, multi participant augmented reality and human-robot interaction are all examples of situations where collaborative mapping can be leveraged for greater agent autonomy. Our system builds on ElasticFusion to allow a number of cameras starting with unknown initial relative positions to maintain local maps utilising the original algorithm. Carrying out visual place recognition across these local maps the system can identify when two maps overlap in space, providing an inter-map constraint from which the system can derive the relative poses of the two maps. Using these resulting pose constraints, our system performs map merging, allowing multiple cameras to fuse their measurements into a single shared reconstruction. The advantage of this approach is that it avoids replication of structures subsequent to loop closures, where multiple cameras traverse the same regions of the environment. Furthermore, it allows cameras to directly exploit and update regions of the environment previously mapped by other cameras within the system. We provide both quantitative and qualitative analyses using the synthetic ICL-NUIM dataset and the real-world Freiburg dataset including the impact of multi-camera mapping on surface reconstruction accuracy, camera pose estimation accuracy and overall processing time. We also include qualitative results in the form of sample reconstructions of room sized environments with up to 3 cameras undergoing intersecting and loopy trajectories.
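
The idea of deriving the relative pose of two maps from an inter-map constraint can be illustrated in a simplified setting. The sketch below uses 2D poses `(x, y, theta)` instead of the 6-DoF poses a dense SLAM system would use, and it assumes place recognition has established that one physical camera pose is known in both maps. The function names and numbers are hypothetical and not taken from the paper.

```python
import math


def compose(a, b):
    """Pose composition: apply pose b in the frame defined by pose a."""
    ax, ay, at = a
    bx, by, bt = b
    return (ax + math.cos(at) * bx - math.sin(at) * by,
            ay + math.sin(at) * bx + math.cos(at) * by,
            at + bt)


def invert(p):
    """Inverse of a 2D rigid pose (x, y, theta)."""
    x, y, t = p
    c, s = math.cos(t), math.sin(t)
    return (-(c * x + s * y), -(-s * x + c * y), -t)


# The same physical camera pose expressed in map A and in map B.
pose_in_a = (2.0, 1.0, math.pi / 2)
pose_in_b = (0.5, 0.0, 0.0)

# Transform that takes poses expressed in map B into map A.
map_b_in_a = compose(pose_in_a, invert(pose_in_b))

# Sanity check: mapping the shared pose from B recovers its pose in A.
aligned = compose(map_b_in_a, pose_in_b)
print(map_b_in_a, aligned)
```

Once `map_b_in_a` is known, every pose and surface element in map B can be brought into map A's frame with the same composition, which is the precondition for merging the two maps.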

# Transfer Learning

**《An Efficient Transfer Learning Technique by Using Final Fully-Connected Layer Output Features of Deep Networks》**

arXiv：https://arxiv.org/abs/1811.07459

> In this paper, we propose a computationally efficient transfer learning approach using the output vector of final fully-connected layer of deep convolutional neural networks for classification. Our proposed technique uses a single layer perceptron classifier designed with hyper-parameters to focus on improving computational efficiency without adversely affecting the performance of classification compared to the baseline technique. Our investigations show that our technique converges much faster than baseline yielding very competitive classification results. We execute thorough experiments to understand the impact of similarity between pre-trained and new classes, similarity among new classes, number of training samples in the performance of classification using transfer learning of the final fully-connected layer's output features.

**《Transfer Learning with Deep CNNs for Gender Recognition and Age Estimation》**

arXiv：https://arxiv.org/abs/1811.07344

> In this project, competition-winning deep neural networks with pretrained weights are used for image-based gender recognition and age estimation. Transfer learning is explored using both VGG19 and VGGFace pretrained models by testing the effects of changes in various design schemes and training parameters in order to improve prediction accuracy. Training techniques such as input standardization, data augmentation, and label distribution age encoding are compared. Finally, a hierarchy of deep CNNs is tested that first classifies subjects by gender, and then uses separate male and female age models to predict age. A gender recognition accuracy of 98.7% and an MAE of 4.1 years is achieved. This paper shows that, with proper training techniques, good results can be obtained by retasking existing convolutional filters towards a new purpose.
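
Two ingredients mentioned above, label distribution age encoding and the mean absolute error (MAE) metric, can be sketched briefly. The abstract does not give the encoding details, so the Gaussian shape, `sigma=2.0`, the 0 to 100 age range, and the function names are assumptions for this example; the predictions in the MAE call are invented.

```python
import math


def encode_age(age, max_age=100, sigma=2.0):
    """Encode a scalar age as a normalised Gaussian over integer age bins."""
    raw = [math.exp(-((a - age) ** 2) / (2.0 * sigma ** 2))
           for a in range(max_age + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def decode_age(distribution):
    """Decode a distribution over age bins as its expected value."""
    return sum(a * p for a, p in enumerate(distribution))


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true ages."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


dist = encode_age(30)
mae = mean_absolute_error([28.0, 41.5, 63.0], [30, 40, 60])
print(decode_age(dist), mae)
```

Training against a soft distribution instead of a single class tells the model that predicting 29 for a 30-year-old is a smaller error than predicting 60.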

# Style Transfer

**《GLStyleNet: Higher Quality Style Transfer Combining Global and Local Pyramid Features》**

arXiv：https://arxiv.org/abs/1811.07260

> Recent studies using deep neural networks have shown remarkable success in style transfer especially for artistic and photo-realistic images. However, the approaches using global feature correlations fail to capture small, intricate textures and maintain correct texture scales of the artworks, and the approaches based on local patches are defective on global effect. In this paper, we present a novel feature pyramid fusion neural network, dubbed GLStyleNet, which sufficiently takes into consideration multi-scale and multi-level pyramid features by best aggregating layers across a VGG network, and performs style transfer hierarchically with multiple losses of different scales. Our proposed method retains high-frequency pixel information and low-frequency construct information of images from two aspects: loss function constraint and feature fusion. Our approach is not only flexible to adjust the trade-off between content and style, but also controllable between global and local. Compared to state-of-the-art methods, our method can transfer not just large-scale, obvious style cues but also subtle, exquisite ones, and dramatically improves the quality of style transfer. We demonstrate the effectiveness of our approach on portrait style transfer, artistic style transfer, photo-realistic style transfer and Chinese ancient painting style transfer tasks. Experimental results indicate that our unified approach improves image style transfer quality over previous state-of-the-art methods, while also accelerating the whole process to a certain extent. Our code is available at https://github.com/EndyWon/GLStyleNet.
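
The "global feature correlations" that the abstract contrasts with local patches are commonly computed as a Gram matrix over feature channels. The sketch below shows that standard statistic and a matching loss; it is not GLStyleNet's loss, and the function names and toy feature values are assumptions for this example.

```python
def gram_matrix(features):
    """features: list of C channels, each a flat list of N activations.
    Returns the C x C matrix of channel correlations averaged over positions."""
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]


def style_loss(feat_generated, feat_style):
    """Mean squared difference between the two Gram matrices."""
    g1, g2 = gram_matrix(feat_generated), gram_matrix(feat_style)
    c = len(g1)
    return sum((g1[i][j] - g2[i][j]) ** 2
               for i in range(c) for j in range(c)) / (c * c)


# Two channels with two spatial positions each (toy values).
style_features = [[1.0, 2.0], [3.0, 4.0]]
# Same activations with the spatial positions swapped.
shuffled_features = [[2.0, 1.0], [4.0, 3.0]]

gram = gram_matrix(style_features)
loss = style_loss(shuffled_features, style_features)
print(gram, loss)
```

Swapping spatial positions leaves the Gram matrix unchanged, so the loss is zero. That is exactly why a purely global statistic cannot constrain where fine textures appear, which motivates combining it with local, patch-based terms.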


# Image Caption

**《Intention Oriented Image Captions with Guiding Objects》**

arXiv：https://arxiv.org/abs/1811.07662

> Although existing image caption models can produce promising results using recurrent neural networks (RNNs), it is difficult to guarantee that an object we care about is contained in generated descriptions, for example in the case that the object is inconspicuous in the image. Problems become even harder when these objects did not appear in the training stage. In this paper, we propose a novel approach for generating image captions with guiding objects (CGO). The CGO constrains the model to involve a human-concerned object, when the object is in the image, in the generated description while maintaining fluency. Instead of generating the sequence from left to right, we start the description with a selected object and generate other parts of the sequence based on this object. To achieve this, we design a novel framework combining two LSTMs in opposite directions. We demonstrate the characteristics of our method on MSCOCO to generate descriptions for each detected object in images. With CGO, we can extend the ability of description to the objects being neglected in image caption labels and provide a set of more comprehensive and diverse descriptions for an image. CGO shows obvious advantages when applied to the task of describing novel objects. We show experiment results on both MSCOCO and ImageNet datasets. Evaluations show that our method outperforms the state-of-the-art models in the task with average F1 75.8, leading to better descriptions in terms of both content accuracy and fluency.

# Few-Shot Learning

**《Deep Comparison: Relation Columns for Few-Shot Learning》**

arXiv：https://arxiv.org/abs/1811.07100

> Few-shot deep learning is a topical challenge area for scaling visual recognition to open-ended growth in the space of categories to recognise. A promising line of work towards realising this vision is deep networks that learn to match queries with stored training images. However, methods in this paradigm usually train a deep embedding followed by a single linear classifier. Our insight is that effective general-purpose matching requires discrimination with regards to features at multiple abstraction levels. We therefore propose a new framework termed Deep Comparison Network (DCN) that decomposes embedding learning into a sequence of modules, and pairs each with a relation module. The relation modules compute a non-linear metric to score the match using the corresponding embedding module's representation. To ensure that all embedding modules' features are used, the relation modules are deeply supervised. Finally, generalisation is further improved by a learned noise regulariser. The resulting network achieves state-of-the-art performance on both miniImageNet and tieredImageNet, while retaining the appealing simplicity and efficiency of deep metric learning approaches.


# Datasets

**《iQIYI-VID: A Large Dataset for Multi-modal Person Identification》**

arXiv：https://arxiv.org/abs/1811.07548

> Person identification in the wild is very challenging due to great variation in poses, face quality, clothes, makeup and so on. Traditional research, such as face recognition, person re-identification, and speaker recognition, often focuses on a single modality of information, which is inadequate to handle all the situations in practice. Multi-modal person identification is a more promising approach, in which we can jointly utilize face, head, body, audio features, and so on. In this paper, we introduce iQIYI-VID, the largest video dataset for multi-modal person identification. It is composed of 600K video clips of 5,000 celebrities. These video clips are extracted from 400K hours of online videos of various types, ranging from movies, variety shows, TV series, to news broadcasting. All video clips pass through a careful human annotation process, and the error rate of labels is lower than 0.2%. We evaluated the state-of-the-art models of face recognition, person re-identification, and speaker recognition on the iQIYI-VID dataset. Experimental results show that these models are still far from being perfect for the task of person identification in the wild. We further demonstrate that a simple fusion of multi-modal features can improve person identification considerably. We have released the dataset online to promote multi-modal person identification research.


# Other

**《Addressing the Invisible: Street Address Generation for Developing Countries with Deep Learning》**

NIPS 2018 Workshop

arXiv：https://arxiv.org/abs/1811.07769

> More than half of the world's roads lack adequate street addressing systems. Lack of addresses is even more visible in daily lives of people in developing countries. We would like to object to the assumption that having an address is a luxury, by proposing a generative address design that maps the world in accordance with streets. The addressing scheme is designed considering several traditional street addressing methodologies employed in the urban development scenarios around the world. Our algorithm applies deep learning to extract roads from satellite images, converts the road pixel confidences into a road network, partitions the road network to find neighborhoods, and labels the regions, roads, and address units using graph- and proximity-based algorithms. We present our results on a sample US city, and several developing cities, compare travel times of users using current ad hoc and new complete addresses, and contrast our addressing solution to current industrial and open geocoding alternatives.

**《Handwriting Recognition of Historical Documents with few labeled data》**

arXiv：https://arxiv.org/abs/1811.07768

> Historical documents present many challenges for offline handwriting recognition systems, among them, the segmentation and labeling steps. Carefully annotated textlines are needed to train an HTR system. In some scenarios, transcripts are only available at the paragraph level with no text-line information. In this work, we demonstrate how to train an HTR system with few labeled data. Specifically, we train a deep convolutional recurrent neural network (CRNN) system on only 10% of manually labeled text-line data from a dataset and propose an incremental training procedure that covers the rest of the data. Performance is further increased by augmenting the training set with specially crafted multiscale data. We also propose a model-based normalization scheme which considers the variability in the writing scale at the recognition phase. We apply this approach to the publicly available READ dataset. Our system achieved the second best result during the ICDAR2017 competition.

**《GroundNet: Segmentation-Aware Monocular Ground Plane Estimation with Geometric Consistency》**

arXiv：https://arxiv.org/abs/1811.07222

> We focus on the problem of estimating the orientation of the ground plane with respect to a mobile monocular camera platform (e.g., ground robot, wearable camera, assistive robotic platform). To address this problem, we formulate the ground plane estimation problem as an inter-mingled multi-task prediction problem by jointly optimizing for point-wise surface normal direction, 2D ground segmentation, and depth estimates. Our proposed model -- GroundNet -- estimates the ground normal in two streams separately and then a consistency loss is applied on top of the two streams to enforce geometric consistency. A semantic segmentation stream isolates the ground regions, which are then used to selectively back-propagate parameter updates only through the ground regions in the image. Our experiments on KITTI and ApolloScape datasets verify that GroundNet is able to predict consistent depth and normal within the ground region. It also achieves top performance on ground plane normal estimation and horizon line detection.

**《Image-to-GPS Verification Through A Bottom-Up Pattern Matching Network》**

arXiv：https://arxiv.org/abs/1811.07288

> The image-to-GPS verification problem asks whether a given image is taken at a claimed GPS location. In this paper, we treat it as an image verification problem -- whether a query image is taken at the same place as a reference image retrieved at the claimed GPS location. We make three major contributions: 1) we propose a novel custom bottom-up pattern matching (BUPM) deep neural network solution; 2) we demonstrate that the verification can be directly done by cross-checking a perspective-looking query image and a panorama reference image, and 3) we collect and clean a dataset of 30K pairs of query and reference images. Our experimental results show that the proposed BUPM solution outperforms the state-of-the-art solutions in terms of both verification and localization.

**《Matching RGB Images to CAD Models for Object Pose Estimation》**

arXiv：https://arxiv.org/abs/1811.07249

> We propose a novel method for 3D object pose estimation in RGB images, which does not require pose annotations of objects in images in the training stage. We tackle the pose estimation problem by learning how to establish correspondences between RGB images and rendered depth images of CAD models. During training, our approach only requires textureless CAD models and aligned RGB-D frames of a subset of object instances, without explicitly requiring pose annotations for the RGB images. We employ a deep quadruplet convolutional neural network for joint learning of suitable keypoints and their associated descriptors in pairs of rendered depth images which can be matched across modalities with aligned RGB-D views. During testing, keypoints are extracted from a query RGB image and matched to keypoints extracted from rendered depth images, followed by establishing 2D-3D correspondences. The object's pose is then estimated using the RANSAC and PnP algorithms. We conduct experiments on the recently introduced Pix3D dataset and demonstrate the efficacy of our proposed approach in object pose estimation as well as generalization to object instances not seen during training.

**《Optical Flow Dataset and Benchmark for Visual Crowd Analysis》**

arXiv：https://arxiv.org/abs/1811.07170

> The performance of optical flow algorithms greatly depends on the specifics of the content and the application for which it is used. Existing and well established optical flow datasets are limited to rather particular contents from which none is close to crowd behavior analysis; whereas such applications heavily utilize optical flow. We introduce a new optical flow dataset exploiting the possibilities of a recent video engine to generate sequences with ground-truth optical flow for large crowds in different scenarios. We break with the development of the last decade of introducing ever increasing displacements to pose new difficulties. Instead we focus on real-world surveillance scenarios where numerous small, partly independent, non rigidly moving objects observed over a long temporal range pose a challenge. By evaluating different optical flow algorithms, we find that results of established datasets can not be transferred to these new challenges. In exhaustive experiments we are able to provide new insight into optical flow for crowd analysis. Finally, the results have been validated on the real-world UCF crowd tracking benchmark while achieving competitive results compared to more sophisticated state-of-the-art crowd tracking approaches.

**《Simulating LIDAR Point Cloud for Autonomous Driving using Real-world Scenes and Traffic Flows》**

arXiv：https://arxiv.org/abs/1811.07112

> We present a LIDAR simulation framework that can automatically generate 3D point cloud based on LIDAR type and placement. The point cloud, annotated with ground truth semantic labels, is to be used as training data to improve environmental perception capabilities for autonomous driving vehicles. Different from previous simulators, we generate the point cloud based on real environment and real traffic flow. More specifically we employ a mobile LIDAR scanner with cameras to capture real world scenes. The input to our simulation framework includes dense 3D point cloud and registered color images. Moving objects (such as cars, pedestrians, bicyclists) are automatically identified and recorded. These objects are then removed from the input point cloud to restore a static background (e.g., environment without movable objects). With that we can insert synthetic models of various obstacles, such as vehicles and pedestrians in the static background to create various traffic scenes. A novel LIDAR renderer takes the composite scene to generate new realistic LIDAR points that are already annotated at point level for synthetic objects. Experimental results show that our system is able to close the performance gap between simulation and real data to be 1 ~ 6% in different applications, and for model fine tuning, only 10% ~ 20% extra real data could help to outperform the original model trained with full real dataset.

**《DSCnet: Replicating Lidar Point Clouds with Deep Sensor Cloning》**

arXiv：https://arxiv.org/abs/1811.07070

> Convolutional neural networks (CNNs) have become increasingly popular for solving a variety of computer vision tasks, ranging from image classification to image segmentation. Recently, autonomous vehicles have created a demand for depth information, which is often obtained using hardware sensors such as Light detection and ranging (LIDAR). Although it can provide precise distance measurements, most LIDARs are still far too expensive to sell in mass-produced consumer vehicles, which has motivated methods to generate depth information from commodity automotive sensors like cameras.
>
> In this paper, we propose an approach called Deep Sensor Cloning (DSC). The idea is to use Convolutional Neural Networks in conjunction with inexpensive sensors to replicate the 3D point-clouds that are created by expensive LIDARs. To accomplish this, we develop a new dataset (DSDepth) and a new family of CNN architectures (DSCnets). While previous tasks such as KITTI depth prediction use interpolated RGB-D images as ground-truth for training, we instead use DSCnets to directly predict LIDAR point-clouds. When we compare the output of our models to a $75,000 LIDAR, we find that our most accurate DSCnet achieves a relative error of 5.77% using a single camera and 4.69% using stereo cameras.

================================================
FILE: 2018/12/10.md
================================================
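**[Computer Vision Paper Digest] 2018-12-10**

This post shares 12 papers in total, covering image classification, object detection, image segmentation, GANs, 3D reconstruction, and other areas.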

[TOC]

# Image Classification

**《Variational Saccading: Efficient Inference for Large Resolution Images》**

NIPS 2018 Bayesian Deep Learning Workshop

arXiv：https://arxiv.org/abs/1812.03170

> Image classification with deep neural networks is typically restricted to images of small dimensionality such as 224x224 in ResNet models. This limitation excludes the 4000x3000 dimensional images that are taken by modern smartphone cameras and smart devices. In this work, we aim to mitigate the prohibitive inferential and memory costs of operating in such large dimensional spaces. To sample from the high-resolution original input distribution, we propose using a smaller proxy distribution to learn the co-ordinates that correspond to regions of interest in the high-dimensional space. We introduce a new principled variational lower bound that captures the relationship of the proxy distribution's posterior and the original image's co-ordinate space in a way that maximizes the conditional classification likelihood. We empirically demonstrate on one synthetic benchmark and one real world large resolution DSLR camera image dataset that our method produces comparable results with 10x faster inference and lower memory consumption than a model that utilizes the entire original input distribution.

**《LNEMLC: Label Network Embeddings for Multi-Label Classification》**

arXiv：https://arxiv.org/abs/1812.02956

> Multi-label classification aims to classify instances with discrete non-exclusive labels. Most approaches on multi-label classification focus on effective adaptation or transformation of existing binary and multi-class learning approaches but fail in modelling the joint probability of labels or do not preserve generalization abilities for unseen label combinations. To address these issues we propose a new multi-label classification scheme, LNEMLC - Label Network Embedding for Multi-Label Classification, that embeds the label network and uses it to extend input space in learning and inference of any base multi-label classifier. The approach allows capturing of labels' joint probability at low computational complexity providing results comparable to the best methods reported in the literature. We demonstrate how the method reveals statistically significant improvements over the simple kNN baseline classifier. We also provide hints for selecting the robust configuration that works satisfactorily across data domains.


# Object Detection

**《ROI-10D: Monocular Lifting of 2D Detection to 6D Pose and Metric Shape》**

arXiv：https://arxiv.org/abs/1812.02781

> We present a deep learning method for end-to-end monocular 3D object detection and metric shape retrieval. We propose a novel loss formulation by lifting 2D detection, orientation, and scale estimation into 3D space. Instead of optimizing these quantities separately, the 3D instantiation allows to properly measure the metric misalignment of boxes. We experimentally show that our 10D lifting of sparse 2D Regions of Interests (RoIs) achieves great results both for 6D pose and recovery of the textured metric geometry of instances. This further enables 3D synthetic data augmentation via inpainting recovered meshes directly onto the 2D scenes. We evaluate on KITTI3D against other strong monocular methods and demonstrate that our approach doubles the AP on the 3D pose metrics on the official test set, defining the new state of the art.


# Image Segmentation

**《Scale-aware multi-level guidance for interactive instance segmentation》**

arXiv：https://arxiv.org/abs/1812.02967

> In interactive instance segmentation, users give feedback to iteratively refine segmentation masks. The user-provided clicks are transformed into guidance maps which provide the network with necessary cues on the whereabouts of the object of interest. Guidance maps used in current systems are purely distance-based and are either too localized or non-informative. We propose a novel transformation of user clicks to generate scale-aware guidance maps that leverage the hierarchical structural information present in an image. Using our guidance maps, even the most basic FCNs are able to outperform existing approaches that require state-of-the-art segmentation networks pre-trained on large scale segmentation datasets. We demonstrate the effectiveness of our proposed transformation strategy through comprehensive experimentation in which we significantly raise state-of-the-art on four standard interactive segmentation benchmarks.

**《A High-Order Scheme for Image Segmentation via a modified Level-Set method》**

arXiv：https://arxiv.org/abs/1812.03026

> The method is based on an adaptive "filtered" scheme recently introduced by the authors. The main feature of the scheme is the possibility to stabilize an a priori unstable high-order scheme via a filter function which allows to combine a high-order scheme in the regularity regions and a monotone scheme elsewhere, in presence of singularities. The filtered scheme considered in this paper uses the local Lax-Friedrichs scheme as monotone scheme and the Lax-Wendroff scheme as high-order scheme but other couplings are possible. Moreover, we introduce also a modified velocity function for the level-set model used in segmentation, this velocity allows to obtain more accurate results with respect to other velocities proposed in the literature. Some numerical tests on synthetic and real images confirm the accuracy of the proposed method and the advantages given by the new velocity.

# GAN

**《Color Constancy by GANs: An Experimental Survey》**

arXiv：https://arxiv.org/abs/1812.03085

> In this paper, we formulate the color constancy task as an image-to-image translation problem using GANs. By conducting a large set of experiments on different datasets, an experimental survey is provided on the use of different types of GANs to solve for color constancy i.e. CC-GANs (Color Constancy GANs). Based on the experimental review, recommendations are given for the design of CC-GAN architectures based on different criteria, circumstances and datasets.

**《StoryGAN: A Sequential Conditional GAN for Story Visualization》**

arXiv：https://arxiv.org/abs/1812.02784

> In this work we propose a new task called Story Visualization. Given a multi-sentence paragraph, the story is visualized by generating a sequence of images, one for each sentence. In contrast to video generation, story visualization focuses less on the continuity in generated images (frames), but more on the global consistency across dynamic scenes and characters -- a challenge that has not been addressed by any single-image or video generation methods. Therefore, we propose a new story-to-image-sequence generation model, StoryGAN, based on the sequential conditional GAN framework. Our model is unique in that it consists of a deep Context Encoder that dynamically tracks the story flow, and two discriminators at the story and image levels, respectively, to enhance the image quality and the consistency of the generated sequences. To evaluate the model, we modified existing datasets to create the CLEVR-SV and Pororo-SV datasets. Empirically, StoryGAN outperformed state-of-the-art models in image quality, contextual consistency metrics, and human evaluation.


# 3D Reconstruction

**《Real-time Indoor Scene Reconstruction with RGBD and Inertia Input》**

arXiv：https://arxiv.org/abs/1812.03015

> Camera motion estimation is a key technique for 3D scene reconstruction and Simultaneous localization and mapping (SLAM). To make it feasible, previous works usually assume slow camera motions, which limits their usage in many real cases. We propose an end-to-end 3D reconstruction system which combines color, depth and inertial measurements to achieve robust reconstruction with fast sensor motions. Our framework extends the Kalman filter to fuse the three kinds of information and involves an iterative method to jointly optimize feature correspondences, camera poses and scene geometry. We also propose a novel geometry-aware patch deformation technique to adapt the feature appearance in image domain, leading to a more accurate feature matching under fast camera motions. Experiments show that our patch deformation method improves the accuracy of feature tracking, and our 3D reconstruction outperforms the state-of-the-art solutions under fast camera motions.

**《SeFM: A Sequential Feature Point Matching Algorithm for Object 3D Reconstruction》**

arXiv：https://arxiv.org/abs/1812.02925

> 3D reconstruction is a fundamental issue in many applications and the feature point matching problem is a key step while reconstructing target objects. Conventional algorithms can only find a small number of feature points from two images which is quite insufficient for reconstruction. To overcome this problem, we propose SeFM, a sequential feature point matching algorithm. We first utilize the epipolar geometry to find the epipole of each image. Rotating along the epipole, we generate a set of the epipolar lines and reserve those intersecting with the input image. Next, a rough matching phase, followed by a dense matching phase, is applied to find the matching dot-pairs using dynamic programming. Furthermore, we also remove wrong matching dot-pairs by calculating the validity. Experimental results illustrate that SeFM can achieve around 1,000 to 10,000 times as many matching dot-pairs as conventional algorithms, depending on the individual image, and the object reconstruction with only two images is semantically visible. Moreover, it outperforms conventional algorithms, such as SIFT and SURF, regarding precision and recall.


# Re-ID

**《Optimizing Speed/Accuracy Trade-Off for Person Re-identification via Knowledge Distillation》**

arXiv：https://arxiv.org/abs/1812.02937

> Finding a person across a camera network plays an important role in video surveillance. For a real-world person re-identification application, in order to guarantee an optimal time response, it is crucial to find the balance between accuracy and speed. We analyse this trade-off, comparing a classical method, that comprises hand-crafted feature description and metric learning, in particular, LOMO and XQDA, with state-of-the-art deep learning techniques, using image classification networks, ResNet and MobileNets. Additionally, we propose and analyse network distillation as a learning strategy to reduce the computational cost of the deep learning approach at test time. We evaluate both methods on the Market-1501 and DukeMTMC-reID large-scale datasets.
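
The distillation strategy mentioned in the abstract can be illustrated with a small loss computation. This is a minimal sketch of a generic temperature-based distillation loss, not the paper's exact training objective; the function names, the temperature of 4.0, and the weighting factor `alpha` are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-target term (hypothetical weights)."""
    # Soft-target term: KL(teacher || student) at the given temperature, scaled by T^2.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    soft = (temperature ** 2) * kl
    # Hard-target term: cross-entropy of the student against the identity label.
    hard = -math.log(softmax(student_logits)[true_label])
    return alpha * soft + (1 - alpha) * hard

# A student that already matches the teacher incurs no soft-target penalty.
matched = distillation_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0], 0, alpha=1.0)
mismatched = distillation_loss([0.0, 1.0, 2.0], [2.0, 1.0, 0.0], 0, alpha=1.0)
```

With `alpha=1.0` only the soft-target term remains, so the loss is zero when the small network reproduces the large network's logits and positive otherwise.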


# Other

**《Graph Cut Segmentation Methods Revisited with a Quantum Algorithm》**

arXiv：https://arxiv.org/abs/1812.03050

> The design and performance of computer vision algorithms are greatly influenced by the hardware on which they are implemented. CPUs, multi-core CPUs, FPGAs and GPUs have inspired new algorithms and enabled existing ideas to be realized. This is notably the case with GPUs, which have significantly changed the landscape of computer vision research through deep learning. As the end of Moore's law approaches, researchers and hardware manufacturers are exploring alternative hardware computing paradigms. Quantum computers are a very promising alternative and offer polynomial or even exponential speed-ups over conventional computing for some problems. This paper presents a novel approach to image segmentation that uses new quantum computing hardware. Segmentation is formulated as a graph cut problem that can be mapped to the quantum approximate optimization algorithm (QAOA). This algorithm can be implemented on current and near-term quantum computers. Encouraging results are presented on artificial and medical imaging data. This represents an important, practical step towards leveraging quantum computers for computer vision.
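
To make the graph cut formulation concrete, the sketch below writes down a standard binary segmentation energy (per-pixel costs plus a penalty for every cut edge) and minimizes it by brute force on a tiny graph. It is only an illustration of the kind of objective an optimizer such as QAOA would be asked to approximately minimize: the exhaustive search stands in for the quantum step, and the pixel costs, edge weights, and function names are hypothetical. The paper's exact energy may differ.

```python
from itertools import product

def cut_energy(labels, unary, edges):
    """Energy of a binary labeling: per-pixel costs plus a penalty for each cut edge."""
    data_term = sum(unary[i][label] for i, label in enumerate(labels))
    smooth_term = sum(w for i, j, w in edges if labels[i] != labels[j])
    return data_term + smooth_term

def brute_force_segmentation(unary, edges):
    """Enumerate all 2^n labelings; feasible only for very small graphs."""
    n = len(unary)
    return min(product((0, 1), repeat=n),
               key=lambda labels: cut_energy(labels, unary, edges))

# Hypothetical row of 4 pixels: pixels 0-1 look like background, pixels 2-3 like foreground.
unary = [(0.1, 0.9), (0.2, 0.8), (0.8, 0.2), (0.9, 0.1)]  # (cost of label 0, cost of label 1)
edges = [(0, 1, 0.5), (1, 2, 0.5), (2, 3, 0.5)]           # (pixel i, pixel j, weight)
best = brute_force_segmentation(unary, edges)
best_energy = cut_energy(best, unary, edges)
```

The minimizer cuts only the edge between pixels 1 and 2, which is the cheapest way to honor the per-pixel costs on this toy example.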

**《Neural Image Decompression: Learning to Render Better Image Previews》**

arXiv：https://arxiv.org/abs/1812.02831

> A rapidly increasing portion of Internet traffic is dominated by requests from mobile devices with limited- and metered-bandwidth constraints. To satisfy these requests, it has become standard practice for websites to transmit small and extremely compressed image previews as part of the initial page-load process. Recent work, based on an adaptive triangulation of the target image, has shown the ability to generate thumbnails of full images at extreme compression rates: 200 bytes or less with impressive gains (in terms of PSNR and SSIM) over both JPEG and WebP standards. However, qualitative assessments and preservation of semantic content can be less favorable. We present a novel method to significantly improve the reconstruction quality of the original image with no changes to the encoded information. Our neural-based decoding not only achieves higher PSNR and SSIM scores than the original methods, but also yields a substantial increase in semantic-level content preservation. In addition, by keeping the same encoding stream, our solution is completely inter-operable with the original decoder. The end result is suitable for a range of small-device deployments, as it involves only a single forward-pass through a small, scalable network.


================================================
FILE: 2018/12/17-21.md
================================================
[Computer Vision Paper Digest] 2018-12-17~12-21

- [ ] 2018-12-17
- [ ] 2018-12-18
- [ ] 2018-12-19
- [ ] 2018-12-20
- [x] 2018-12-21

This post shares 9 papers in total, covering CNNs, GANs, pose estimation, meta-learning, and other areas.

[TOC]

# CNN

**《DAC: Data-free Automatic Acceleration of Convolutional Networks》**

WACV 2019

arXiv：https://arxiv.org/abs/1812.08374

> Deploying a deep learning model on mobile/IoT devices is a challenging task. The difficulty lies in the trade-off between computation speed and accuracy. A complex deep learning model with high accuracy runs slowly on resource-limited devices, while a light-weight model that runs much faster loses accuracy. In this paper, we propose a novel decomposition method, namely DAC, that is capable of factorizing an ordinary convolutional layer into two layers with much fewer parameters. DAC computes the corresponding weights for the newly generated layers directly from the weights of the original convolutional layer. Thus, no training (or fine-tuning) or any data is needed. The experimental results show that DAC reduces a large number of floating-point operations (FLOPs) while maintaining high accuracy of a pre-trained model. If a 2% accuracy drop is acceptable, DAC saves 53% FLOPs of VGG16 image classification model on ImageNet dataset, 29% FLOPs of SSD300 object detection model on PASCAL VOC2007 dataset, and 46% FLOPs of a multi-person pose estimation model on Microsoft COCO dataset. Compared to other existing decomposition methods, DAC achieves better performance.
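
A quick parameter count shows why splitting one convolutional layer into two can shrink a model. The abstract does not say which two layers DAC produces, so the sketch below assumes a depthwise layer followed by a pointwise (1x1) layer with a small `rank`, which is one common factorization; the layer sizes and the rank are hypothetical.

```python
def conv_params(k, c_in, c_out):
    """Weights in an ordinary k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def factorized_params(k, c_in, c_out, rank):
    """Weights after an assumed depthwise + pointwise split with `rank` filters per channel."""
    depthwise = k * k * c_in * rank   # rank k x k filters for each input channel
    pointwise = c_in * rank * c_out   # 1 x 1 convolution mixing the c_in * rank maps
    return depthwise + pointwise

# Hypothetical layer: 3 x 3 kernel, 256 input and 256 output channels, rank 2.
k, c_in, c_out, rank = 3, 256, 256, 2
original = conv_params(k, c_in, c_out)
factorized = factorized_params(k, c_in, c_out, rank)
ratio = factorized / original
```

Under these assumptions the factorized pair keeps roughly 23% of the original weights. The saving disappears once `rank` grows past about `k*k*c_out / (k*k + c_out)`.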

# GAN

**《RankGAN: A Maximum Margin Ranking GAN for Generating Faces》**

ACCV 2018 Best Student Paper Award

arXiv：https://arxiv.org/abs/1812.08196

> We present a new stage-wise learning paradigm for training generative adversarial networks (GANs). The goal of our work is to progressively strengthen the discriminator and thus, the generators, with each subsequent stage without changing the network architecture. We call this proposed method the RankGAN. We first propose a margin-based loss for the GAN discriminator. We then extend it to a margin-based ranking loss to train the multiple stages of RankGAN. We focus on face images from the CelebA dataset in our work and show visual as well as quantitative improvements in face generation and completion tasks over other GAN approaches, including WGAN and LSGAN.


# Pose Estimation

**《OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields》**

arXiv：https://arxiv.org/abs/1812.08008

arXiv(old)：https://arxiv.org/abs/1611.08050

datasets：https://cmu-perceptual-computing-lab.github.io/foot_keypoint_dataset/

> Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that a PAF-only refinement rather than both PAF and body part location refinement results in a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an internal annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.
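
The way Part Affinity Fields associate body parts can be sketched as a line integral: sample the field along the segment between two candidate joints and measure how well it aligns with the segment direction. The sketch follows the idea described in the earlier version of the paper linked above; representing the field as a Python callable, the sample count, and the function name are assumptions made for illustration.

```python
import math

def paf_association_score(paf, joint_a, joint_b, num_samples=10):
    """Average alignment between a PAF and the segment joint_a -> joint_b.

    `paf` is a hypothetical callable (x, y) -> (vx, vy); num_samples must be >= 2.
    """
    (xa, ya), (xb, yb) = joint_a, joint_b
    dx, dy = xb - xa, yb - ya
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length          # unit vector along the candidate limb
    total = 0.0
    for s in range(num_samples):
        t = s / (num_samples - 1)              # evenly spaced points from joint_a to joint_b
        vx, vy = paf(xa + t * dx, ya + t * dy)
        total += vx * ux + vy * uy             # dot product with the limb direction
    return total / num_samples

# Toy field that points along +x everywhere, as if a horizontal limb filled the image.
def toy_paf(x, y):
    return (1.0, 0.0)

aligned = paf_association_score(toy_paf, (0, 0), (10, 0))
perpendicular = paf_association_score(toy_paf, (0, 0), (0, 10))
reversed_limb = paf_association_score(toy_paf, (10, 0), (0, 0))
```

Candidate joint pairs with high scores are likely to belong to the same person, while perpendicular or reversed pairs score near zero or negative.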

# Image Caption

**《Nocaps: novel object captioning at scale》**

arXiv：https://arxiv.org/abs/1812.08658

homepage：https://nocaps.org/

> Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task. Dubbed 'nocaps', for novel object captioning at scale, our benchmark consists of 166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets. The associated training data consists of COCO image-caption pairs, plus Open Images image-level labels and object bounding boxes. Since Open Images contains many more classes than COCO, more than 500 object classes seen in test images have no training captions (hence, nocaps). We evaluate several existing approaches to novel object captioning on our challenging benchmark. In automatic evaluations these approaches show modest improvements over a strong baseline trained only on image-caption data. However, even when using ground-truth object detections, the results are significantly weaker than our human baseline - indicating substantial room for improvement.


# Medical Image Analysis

**《Unconstrained Iris Segmentation using Convolutional Neural Networks》**

arXiv：https://arxiv.org/abs/1812.08245

> The extraction of consistent and identifiable features from an image of the human iris is known as iris recognition. Identifying which pixels belong to the iris, known as segmentation, is the first stage of iris recognition. Errors in segmentation propagate to later stages. Current segmentation approaches are tuned to specific environments. We propose using a convolutional neural network for iris segmentation. Our algorithm is accurate when trained in a single environment and tested in multiple environments. Our network builds on the Mask R-CNN framework (He et al., ICCV 2017). Our approach segments faster than previous approaches including the Mask R-CNN network. Our network is accurate when trained on a single environment and tested with different sensors (either visible light or near-infrared). Its accuracy degrades when trained with a visible light sensor and tested with a near-infrared sensor (and vice versa). A small amount of retraining of the visible light model (using a few samples from a near-infrared dataset) yields a tuned network accurate in both settings. For training and testing, this work uses the Casia v4 Interval, Notre Dame 0405, Ubiris v2, and IITD datasets.

# Meta-Learning

**《Unsupervised Meta-learning of Figure-Ground Segmentation via Imitating Visual Effects》**

arXiv：https://arxiv.org/abs/1812.08442

> This paper presents a "learning to learn" approach to figure-ground image segmentation. By exploring webly-abundant images of specific visual effects, our method can effectively learn the visual-effect internal representations in an unsupervised manner and uses this knowledge to differentiate the figure from the ground in an image. Specifically, we formulate the meta-learning process as a compositional image editing task that learns to imitate a certain visual effect and derive the corresponding internal representation. Such a generative process can help instantiate the underlying figure-ground notion and enables the system to accomplish the intended image segmentation. Whereas existing generative methods are mostly tailored to image synthesis or style transfer, our approach offers a flexible learning mechanism to model a general concept of figure-ground segmentation from unorganized images that have no explicit pixel-level annotations. We validate our approach via extensive experiments on six datasets to demonstrate that the proposed model can be end-to-end trained without ground-truth pixel labeling yet outperforms the existing methods of unsupervised segmentation tasks.


# Other

**《Explanatory Graphs for CNNs》**

arXiv：https://arxiv.org/abs/1812.07997

> This paper introduces a graphical model, namely an explanatory graph, which reveals the knowledge hierarchy hidden inside conv-layers of a pre-trained CNN. Each filter in a conv-layer of a CNN for object classification usually represents a mixture of object parts. We develop a simple yet effective method to disentangle object-part pattern components from each filter. We construct an explanatory graph to organize the mined part patterns, where a node represents a part pattern, and each edge encodes co-activation relationships and spatial relationships between patterns. More crucially, given a pre-trained CNN, the explanatory graph is learned without a need of annotating object parts. Experiments show that each graph node consistently represented the same object part through different images, which boosted the transferability of CNN features. We transferred part patterns in the explanatory graph to the task of part localization, and our method significantly outperformed other approaches.

**《Deep Paper Gestalt》**

arXiv：https://arxiv.org/abs/1812.08775

github：https://github.com/vt-vl-lab/paper-gestalt

> Recent years have witnessed a significant increase in the number of paper submissions to computer vision conferences. The sheer volume of paper submissions and the insufficient number of competent reviewers cause a considerable burden for the current peer review system. In this paper, we learn a classifier to predict whether a paper should be accepted or rejected based solely on the visual appearance of the paper (i.e., the gestalt of a paper). Experimental results show that our classifier can safely reject 50% of the bad papers while wrongly rejecting only 0.4% of the good papers, and thus dramatically reduce the workload of the reviewers. We also provide tools for providing suggestions to authors so that they can improve the gestalt of their papers.

**《One-Class Feature Learning Using Intra-Class Splitting》**

arXiv：https://arxiv.org/abs/1812.08468

> This paper proposes a novel generic one-class feature learning method which is based on intra-class splitting. In one-class classification, feature learning is challenging, because only samples of one class are available during training. Hence, state-of-the-art methods require reference multi-class datasets to pretrain feature extractors. In contrast, the proposed method realizes feature learning by splitting the given normal class into typical and atypical normal samples. By introducing closeness loss and dispersion loss, an intra-class joint training procedure between the two subsets after splitting enables the extraction of valuable features for one-class classification. Various experiments on three well-known image classification datasets demonstrate the effectiveness of our method which outperformed other baseline models in 25 of 30 experiments.


================================================
FILE: 2018/12/24-28.md
================================================
[Computer Vision Paper Digest] 2018-12-24~12-28

- [x] 2018-12-24
- [ ] 2018-12-25
- [ ] 2018-12-26
- [x] 2018-12-27
- [ ] 2018-12-28

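This post shares 39 papers in total, covering image classification, object detection, semantic segmentation, GAN, pose estimation, SLAM, salient object detection, and zero-shot learning.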

[TOC]

# CNN

**《Slimmable Neural Networks》**

ICLR 2019

arXiv：https://arxiv.org/abs/1812.08928

github：https://github.com/JiahuiYu/slimmable_networks

> We present a simple and general method to train a single neural network executable at different widths (number of channels in a layer), permitting instant and adaptive accuracy-efficiency trade-offs at runtime. Instead of training individual networks with different width configurations, we train a shared network with switchable batch normalization. At runtime, the network can adjust its width on the fly according to on-device benchmarks and resource constraints, rather than downloading and offloading different models. Our trained networks, named slimmable neural networks, achieve similar (and in many cases better) ImageNet classification accuracy than individually trained models of MobileNet v1, MobileNet v2, ShuffleNet and ResNet-50 at different widths respectively. We also demonstrate better performance of slimmable models compared with individual ones across a wide range of applications including COCO bounding-box object detection, instance segmentation and person keypoint detection without tuning hyper-parameters. Lastly we visualize and discuss the learned features of slimmable networks. Code and models are available at: https://github.com/JiahuiYu/slimmable_networks
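
The "switchable batch normalization" mentioned in the abstract can be pictured with a small sketch. It assumes the idea is to keep a separate set of normalization statistics and affine parameters for each supported width while all other weights are shared. The class name `SwitchableBatchNorm`, its parameters, and the simplified inference-only setting (one scalar activation per channel) are illustrative assumptions, not the API of the linked repository.

```python
import math


class SwitchableBatchNorm:
    """Hypothetical sketch: one set of BN parameters per supported width."""

    def __init__(self, max_channels, width_mults, eps=1e-5):
        self.eps = eps
        self.params = {}
        for w in width_mults:
            # Number of channels that stay active at this width multiplier.
            c = max(1, int(round(max_channels * w)))
            self.params[w] = {
                "channels": c,
                "mean": [0.0] * c,   # running mean per channel
                "var": [1.0] * c,    # running variance per channel
                "gamma": [1.0] * c,  # learned scale
                "beta": [0.0] * c,   # learned shift
            }
        self.active = max(width_mults)

    def switch(self, width_mult):
        """Select which width (and therefore which BN parameters) to use."""
        if width_mult not in self.params:
            raise KeyError(f"unsupported width multiplier: {width_mult}")
        self.active = width_mult

    def __call__(self, x):
        """Normalize the first `channels` entries of x (inference mode)."""
        p = self.params[self.active]
        return [
            p["gamma"][i] * (x[i] - p["mean"][i]) / math.sqrt(p["var"][i] + self.eps)
            + p["beta"][i]
            for i in range(p["channels"])
        ]


bn = SwitchableBatchNorm(max_channels=8, width_mults=[0.25, 0.5, 1.0])
bn.switch(0.5)               # run the layer at half width
half_width_out = bn([1.0] * 8)
```

Switching the width only changes which parameter set is looked up, which is why a single stored model can serve several accuracy-efficiency trade-offs in this sketch.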

**《ChamNet: Towards Efficient Network Design through Platform-Aware Model Adaptation》**

arXiv：https://arxiv.org/abs/1812.08934

> This paper proposes an efficient neural network (NN) architecture design methodology called Chameleon that honors given resource constraints. Instead of developing new building blocks or using computationally-intensive reinforcement learning algorithms, our approach leverages existing efficient network building blocks and focuses on exploiting hardware traits and adapting computation resources to fit target latency and/or energy constraints. We formulate platform-aware NN architecture search in an optimization framework and propose a novel algorithm to search for optimal architectures aided by efficient accuracy and resource (latency and/or energy) predictors. At the core of our algorithm lies an accuracy predictor built atop Gaussian Process with Bayesian optimization for iterative sampling. With a one-time building cost for the predictors, our algorithm produces state-of-the-art model architectures on different platforms under given constraints in just minutes. Our results show that adapting computation resources to building blocks is critical to model performance. Without the addition of any bells and whistles, our models achieve significant accuracy improvements against state-of-the-art hand-crafted and automatically designed architectures. We achieve 73.8% and 75.3% top-1 accuracy on ImageNet at 20ms latency on a mobile CPU and DSP. At reduced latency, our models achieve up to 8.5% (4.8%) and 6.6% (9.3%) absolute top-1 accuracy improvements compared to MobileNetV2 and MnasNet, respectively, on a mobile CPU (DSP), and 2.7% (4.6%) and 5.6% (2.6%) accuracy gains over ResNet-101 and ResNet-152, respectively, on an Nvidia GPU (Intel CPU).


**《Learning from Web Data: the Benefit of Unsupervised Object Localization》**

IEEE TIP 2019

arXiv：https://arxiv.org/abs/1812.09232

> Annotating a large number of training images is very time-consuming. In this background, this paper focuses on learning from easy-to-acquire web data and utilizes the learned model for fine-grained image classification in labeled datasets. Currently, the performance gain from training with web data is incremental, like a common saying "better than nothing, but not by much". Conventionally, the community looks to correcting the noisy web labels to select informative samples. In this work, we first systematically study the built-in gap between the web and standard datasets, i.e. different data distributions between the two kinds of data. Then, in addition to using web labels, we present an unsupervised object localization method, which provides critical insights into the object density and scale in web images. Specifically, we design two constraints on web data to substantially reduce the difference of data distributions for the web and standard datasets. First, we present a method to control the scale, localization and number of objects in the detected region. Second, we propose to select the regions containing objects that are consistent with the web tag. Based on the two constraints, we are able to process web images to reduce the gap, and the processed web data is used to better assist the standard dataset to train CNNs. Experiments on several fine-grained image classification datasets confirm that our method performs favorably against the state-of-the-art methods.

# Image Classification

**《Chinese Herbal Recognition based on Competitive Attentional Fusion of Multi-hierarchies Pyramid Features》**

arXiv：https://arxiv.org/abs/1812.09648

github：https://github.com/scut-aitcm/Chinese-Herbs-Dataset

Note: Chinese herbal medicine recognition; quite interesting.

> Convolutional neural networks (CNNs) have been successfully applied to image recognition tasks. In this study, we explore automatic herbal recognition with CNNs and first build standard Chinese herbs datasets. According to the characteristics of herbal images, we propose competitive attentional fusion pyramid networks to model the features of herbal images, which model the relationship of feature maps from different levels and re-weight multi-level channels with a channel-wise attention mechanism. In this way, we can dynamically adjust the weight of feature maps from various layers, according to the visual characteristics of each herbal image. Moreover, we also introduce spatial attention to recalibrate the misaligned features caused by sampling in feature amalgamation. Extensive experiments are conducted on our proposed datasets and validate the superior performance of our proposed models. The Chinese herbs datasets will be released upon acceptance to facilitate the research of Chinese herbal recognition.


# Face

**《A Survey to Deep Facial Attribute Analysis》**

arXiv：https://arxiv.org/abs/1812.10265

Note: a good survey.

> Facial attribute analysis has received considerable attention with the development of deep neural networks in the past few years. Facial attribute analysis contains two crucial issues: Facial Attribute Estimation (FAE), which recognizes whether facial attributes are present in given images, and Facial Attribute Manipulation (FAM), which synthesizes or removes desired facial attributes. In this paper, we provide a comprehensive survey on deep facial attribute analysis covering FAE and FAM. First, we present the basic knowledge of the two stages (i.e., data pre-processing and model construction) in the general deep facial attribute analysis pipeline. Second, we summarize the commonly used datasets and performance metrics. Third, we create a taxonomy of the state-of-the-arts and review detailed algorithms in FAE and FAM, respectively. Furthermore, we introduce several additional facial attribute related issues and applications. Finally, the possible challenges and future research directions are discussed.


**《A Smart Security System with Face Recognition》**

arXiv：https://arxiv.org/abs/1812.09127

> Web-based technology has improved drastically in the past decade. As a result, security technology has become a major help to protect our daily life. In this paper, we propose a robust security system based on face recognition (SoF). In particular, we develop this system to give authenticated users access to a home. The classifier is trained by using a new adaptive learning method. The training data are initially collected from social networks. The accuracy of the classifier is incrementally improved as the user starts using the system. A novel method has been introduced to improve the classifier model by human interaction and social media. By using a deep learning framework, TensorFlow, it will be easy to reuse the framework with many devices and applications.

# Object Detection

**《3D multirater RCNN for multimodal multiclass detection and characterisation of extremely small objects》**

MIDL 2019 submission

arXiv：https://arxiv.org/abs/1812.09046

> Extremely small objects (ESO) have become observable on clinical routine magnetic resonance imaging acquisitions, thanks to a reduction in acquisition time at higher resolution. Despite their small size (usually <10 voxels per object for an image of more than 10^6 voxels), these markers reflect tissue damage and need to be accounted for to investigate the complete phenotype of complex pathological pathways. In addition to their very small size, variability in shape and appearance leads to high labelling variability across human raters, resulting in a very noisy gold standard. Such objects are notably present in the context of cerebral small vessel disease where enlarged perivascular spaces and lacunes, commonly observed in the ageing population, are thought to be associated with acceleration of cognitive decline and risk of dementia onset. In this work, we redesign the RCNN model to scale to 3D data, and to jointly detect and characterise these important markers of age-related neurovascular changes. We also propose training strategies enforcing the detection of extremely small objects, ensuring a tractable and stable training process.

**《Detection of distal radius fractures trained by a small set of X-ray images and Faster R-CNN》**

arXiv：https://arxiv.org/abs/1812.09025

> Distal radius fractures are the most common fractures of the upper extremity in humans. As such, they account for a significant portion of the injuries that present to emergency rooms and clinics throughout the world. We trained a Faster R-CNN, a machine vision neural network for object detection, to identify and locate distal radius fractures in anteroposterior X-ray images. We achieved an accuracy of 96% in identifying fractures and mean Average Precision, mAP, of 0.866. This is significantly more accurate than the detection achieved by physicians and radiologists. These results were obtained by training the deep learning network with only 38 original images of anteroposterior hands X-ray images with fractures. This opens the possibility to detect with this type of neural network rare diseases or rare symptoms of common diseases, where only a small set of diagnosed X-ray images could be collected for each disease.

**《Practical Adversarial Attack Against Object Detector》**

arXiv：https://arxiv.org/abs/1812.10217

Note: a very interesting study.

> In this paper, we proposed the first practical adversarial attacks against object detectors in realistic situations: the adversarial examples are placed at different angles and distances, especially at long distance (over 20m) and wide angles (120 degrees). To improve the robustness of adversarial examples, we proposed the nested adversarial examples and introduced the image transformation techniques. Transformation methods aim to simulate the variance factors such as distances, angles, illuminations, etc., in the physical world. Two kinds of attacks were implemented on YOLO V3, a state-of-the-art real-time object detector: a hiding attack that makes the detector unable to recognize the object, and an appearing attack that makes the detector recognize a non-existent object. The adversarial examples are evaluated in three environments: indoor lab, outdoor environment, and the real road, and demonstrated to achieve a success rate of up to 92.4% over the distance range from 1m to 25m. In particular, the real road testing of the hiding attack on a straight road and a crossing road produced success rates of 75% and 64% respectively, and the appearing attack obtained success rates of 63% and 81% respectively, which, we believe, should catch the attention of the autonomous driving community.


# Semantic Segmentation

**《Curriculum Domain Adaptation for Semantic Segmentation of Urban Scenes》**

arXiv：https://arxiv.org/abs/1812.09953

> During the last half decade, convolutional neural networks (CNNs) have triumphed over semantic segmentation, which is one of the core tasks in many applications such as autonomous driving and augmented reality. However, to train CNNs requires a considerable amount of data, which is difficult to collect and laborious to annotate. Recent advances in computer graphics make it possible to train CNNs on photo-realistic synthetic imagery with computer-generated annotations. Despite this, the domain mismatch between the real images and the synthetic data hinders the models' performance. Hence, we propose a curriculum-style learning approach to minimizing the domain gap in urban scene semantic segmentation. The curriculum domain adaptation solves easy tasks first to infer necessary properties about the target domain; in particular, the first task is to learn global label distributions over images and local distributions over landmark superpixels. These are easy to estimate because images of urban scenes have strong idiosyncrasies (e.g., the size and spatial relations of buildings, streets, cars, etc.). We then train a segmentation network, while regularizing its predictions in the target domain to follow those inferred properties. In experiments, our method outperforms the baselines on two datasets and two backbone networks. We also report extensive ablation studies about our approach.

# GAN

**《Improving MMD-GAN Training with Repulsive Loss Function》**

ICLR 2019

arXiv：https://arxiv.org/abs/1812.09916

> Generative adversarial nets (GANs) are widely used to learn the data sampling process and their performance may heavily depend on the loss functions, given a limited computational budget. This study revisits MMD-GAN that uses the maximum mean discrepancy (MMD) as the loss function for GAN and makes two contributions. First, we argue that the existing MMD loss function may discourage the learning of fine details in data as it attempts to contract the discriminator outputs of real data. To address this issue, we propose a repulsive loss function to actively learn the difference among the real data by simply rearranging the terms in MMD. Second, inspired by the hinge loss, we propose a bounded Gaussian kernel to stabilize the training of MMD-GAN with the repulsive loss function. The proposed methods are applied to the unsupervised image generation tasks on CIFAR-10, STL-10, CelebA, and LSUN bedroom datasets. Results show that the repulsive loss function significantly improves over the MMD loss at no additional computational cost and outperforms other representative loss functions. The proposed methods achieve an FID score of 16.21 on the CIFAR-10 dataset using a single DCGAN network and spectral normalization.
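
For readers unfamiliar with the quantity being rearranged, here is a minimal sketch of the squared maximum mean discrepancy (MMD) with a plain Gaussian kernel, using the standard biased estimator. The function names and the toy inputs are illustrative assumptions. The abstract does not give the exact form of the repulsive loss or of the bounded kernel, so neither is reproduced here; the sketch only exposes the three kernel terms that such a rearrangement would operate on.

```python
import math


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two vectors given as lists of floats."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma=1.0):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2_terms(real, fake, sigma=1.0):
    """Return the three terms of the biased MMD^2 estimate separately."""
    k_rr = mean_kernel(real, real, sigma)  # real vs. real
    k_rf = mean_kernel(real, fake, sigma)  # real vs. generated
    k_ff = mean_kernel(fake, fake, sigma)  # generated vs. generated
    return k_rr, k_rf, k_ff


def mmd2_biased(real, fake, sigma=1.0):
    """Biased estimate: MMD^2 = E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')]."""
    k_rr, k_rf, k_ff = mmd2_terms(real, fake, sigma)
    return k_rr - 2.0 * k_rf + k_ff


# Toy 1-D "discriminator outputs" for real and generated samples.
same = mmd2_biased([[0.0], [1.0]], [[0.0], [1.0]])
different = mmd2_biased([[0.0]], [[3.0]])
```

In an MMD-GAN the inputs would be discriminator outputs rather than raw samples; the estimate is zero when the two sets coincide and grows as they separate.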


# Visual Tracking

**《Saliency Guided Hierarchical Robust Visual Tracking》**

arXiv：https://arxiv.org/abs/1812.08973

> A saliency guided hierarchical visual tracking (SHT) algorithm containing global and local search phases is proposed in this paper. In global search, a top-down saliency model is novelly developed to handle abrupt motion and appearance variation problems. Nineteen feature maps are extracted first and combined with online learnt weights to produce the final saliency map and estimated target locations. After the evaluation of integration mechanism, the optimum candidate patch is passed to the local search. In local search, a superpixel based HSV histogram matching is performed jointly with an L2-RLS tracker to take both color distribution and holistic appearance feature of the object into consideration. Furthermore, a linear refinement search process with fast iterative solver is implemented to attenuate the possible negative influence of dominant particles. Both qualitative and quantitative experiments are conducted on a series of challenging image sequences. The superior performance of the proposed method over other state-of-the-art algorithms is demonstrated by comparative study.

# 3D

**《Perceptually-based single-image depth super-resolution》**

arXiv：https://arxiv.org/abs/1812.09874

> RGBD images, combining high-resolution color and lower-resolution depth from various types of depth sensors, are increasingly common. One can significantly improve the resolution of depth images by taking advantage of color information; deep learning methods make combining color and depth information particularly easy. However, fusing these two sources of data may lead to a variety of artifacts. If depth maps are used to reconstruct 3D shapes, e.g., for virtual reality applications, the visual quality of upsampled images is particularly important. To achieve high-quality results, visual metrics need to be taken into account. The main idea of our approach is to measure the quality of depth map upsampling using renderings of resulting 3D surfaces. We demonstrate that a simple visual appearance-based loss, when used with either a trained CNN or simply a deep prior, yields significantly improved 3D shapes, as measured by a number of existing perceptual metrics. We compare this approach with a number of existing optimization and learning-based techniques.

**《A Survey on Non-rigid 3D Shape Analysis》**

arXiv：https://arxiv.org/abs/1812.10111

> Shape is an important physical property of natural and manmade 3D objects that characterizes their external appearances. Understanding differences between shapes and modeling the variability within and across shape classes, hereinafter referred to as *shape analysis*, are fundamental problems to many applications, ranging from computer vision and computer graphics to biology and medicine. This chapter provides an overview of some of the recent techniques that studied the shape of 3D objects that undergo non-rigid deformations including bending and stretching. Recent surveys that covered some aspects such as classification, retrieval, recognition, and rigid or nonrigid registration, focused on methods that use shape descriptors. Descriptors, however, provide abstract representations that do not enable the exploration of shape variability. In this chapter, we focus on recent techniques that treated the shape of 3D objects as points in some high dimensional space where paths describe deformations. Equipping the space with a suitable metric enables the quantification of the range of deformations of a given shape, which in turn enables (1) comparing and classifying 3D objects based on their shape, (2) computing smooth deformations, i.e. geodesics, between pairs of objects, and (3) modeling and exploring continuous shape variability in a collection of 3D models. This article surveys and classifies recent developments in this field, outlines fundamental issues, discusses their potential applications in computer vision and graphics, and highlights opportunities for future research. Our primary goal is to bridge the gap between various techniques that have been often independently proposed by different communities including mathematics and statistics, computer vision and graphics, and medical image analysis.


# Pose Estimation

**《Learning a Disentangled Embedding for Monocular 3D Shape Retrieval and Pose Estimation》**

arXiv：https://arxiv.org/abs/1812.09899

> We propose a novel approach to jointly perform 3D object retrieval and pose estimation from monocular images. In order to make the method robust to real world scene variations in the images, e.g. texture, lighting and background, we learn an embedding space from 3D data that only includes the relevant information, namely the shape and pose. Our method can then be trained for robustness under real world scene variations without having to render a large training set simulating these variations. Our learned embedding explicitly disentangles a shape vector and a pose vector, which alleviates both pose bias for 3D shape retrieval and categorical bias for pose estimation. Having the learned disentangled embedding, we train a CNN to map the images to the embedding space, and then retrieve the closest 3D shape from the database and estimate the 6D pose of the object using the embedding vectors. Our method achieves 10.8 median error for pose estimation and 0.514 top-1-accuracy for category agnostic 3D object retrieval on the Pascal3D+ dataset. It therefore outperforms the previous state-of-the-art methods on both tasks.

**《Structure-Aware 3D Hourglass Network for Hand Pose Estimation from Single Depth Image》**

BMVC 2018

arXiv：https://arxiv.org/abs/1812.10320

> In this paper, we propose a novel structure-aware 3D hourglass network for hand pose estimation from a single depth image, which achieves state-of-the-art results on MSRA and NYU datasets. Compared to existing works that perform image-to-coordination regression, our network takes 3D voxel as input and directly regresses 3D heatmap for each joint. To be specific, we use hourglass network as our backbone network and modify it into 3D form. We explicitly model tree-like finger bone into the network as well as in the loss function in an end-to-end manner, in order to take the skeleton constraints into consideration. Final estimation can then be easily obtained from voxel density map with simple post-processing. Experimental results show that the proposed structure-aware 3D hourglass network is able to achieve a mean joint error of 7.4 mm in MSRA and 8.9 mm in NYU datasets, respectively.
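
The abstract does not say what the "simple post-processing" on the per-joint 3D heatmap is. One common choice, assumed here purely for illustration, is to take the voxel with the highest response and convert its index to metric coordinates. The function names, the `(x, y, z)` index order, and the `origin` and `voxel_size` parameters are all hypothetical.

```python
def argmax_voxel(heatmap):
    """Return the (x, y, z) index of the largest value in a nested z/y/x list."""
    best_value = None
    best_index = (0, 0, 0)
    for z, plane in enumerate(heatmap):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                if best_value is None or value > best_value:
                    best_value = value
                    best_index = (x, y, z)
    return best_index


def voxel_to_metric(index, origin, voxel_size):
    """Map a voxel index to the metric position of that voxel's center."""
    return tuple(o + (i + 0.5) * voxel_size for o, i in zip(origin, index))


# Toy 2x2x2 heatmap for one joint, indexed as heatmap[z][y][x].
heatmap = [
    [[0.0, 0.1], [0.0, 0.2]],
    [[0.1, 0.9], [0.3, 0.0]],
]
joint_index = argmax_voxel(heatmap)
joint_mm = voxel_to_metric(joint_index, origin=(0.0, 0.0, 0.0), voxel_size=10.0)
```

A hard argmax limits precision to the voxel size; whether the paper refines this further is not stated in the abstract.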


# Text

**《TextNet: Irregular Text Reading from Images with an End-to-End Trainable Network》**

ACCV 2018 oral

From Baidu, so it is bound to be good. Bookmarked!

arXiv：https://arxiv.org/abs/1812.09900

> Reading text from images remains challenging due to multi-orientation, perspective distortion and especially the curved nature of irregular text. Most of existing approaches attempt to solve the problem in two or multiple stages, which is considered to be the bottleneck to optimize the overall performance. To address this issue, we propose an end-to-end trainable network architecture, named TextNet, which is able to simultaneously localize and recognize irregular text from images. Specifically, we develop a scale-aware attention mechanism to learn multi-scale image features as a backbone network, sharing fully convolutional features and computation for localization and recognition. In text detection branch, we directly generate text proposals in quadrangles, covering oriented, perspective and curved text regions. To preserve text features for recognition, we introduce a perspective RoI transform layer, which can align quadrangle proposals into small feature maps. Furthermore, in order to extract effective features for recognition, we propose to encode the aligned RoI features by RNN into context information, combining spatial attention mechanism to generate text sequences. This overall pipeline is capable of handling both regular and irregular cases. Finally, text localization and recognition tasks can be jointly trained in an end-to-end fashion with designed multi-task loss. Experiments on standard benchmarks show that the proposed TextNet can achieve state-of-the-art performance, and outperform existing approaches on irregular datasets by a large margin.


# Saliency

**《SMILER: Saliency Model Implementation Library for Experimental Research》**

arXiv：https://arxiv.org/abs/1812.08848

github：https://github.com/TsotsosLab/SMILER

> The Saliency Model Implementation Library for Experimental Research (SMILER) is a new software package which provides an open, standardized, and extensible framework for maintaining and executing computational saliency models. This work drastically reduces the human effort required to apply saliency algorithms to new tasks and datasets, while also ensuring consistency and procedural correctness for results and conclusions produced by different parties. At its launch SMILER already includes twenty three saliency models (fourteen models based in MATLAB and nine supported through containerization), and the open design of SMILER encourages this number to grow with future contributions from the community. The project may be downloaded and contributed to through its GitHub page: https://github.com/TsotsosLab/SMILER

# SLAM

**《A Unified Framework for Mutual Improvement of SLAM and Semantic Segmentation》**

arXiv：https://arxiv.org/abs/1812.10016

> This paper presents a novel framework for simultaneously implementing localization and segmentation, which are two of the most important vision-based tasks for robotics. While the goals and techniques used for them were considered to be different previously, we show that by making use of the intermediate results of the two modules, their performance can be enhanced at the same time. Our framework is able to handle both the instantaneous motion and long-term changes of instances in localization with the help of the segmentation result, which also benefits from the refined 3D pose information. We conduct experiments on various datasets, and prove that our framework works effectively on improving the precision and robustness of the two tasks and outperforms existing localization and segmentation algorithms.

# Salient Object Detection

**《Selectivity or Invariance: Boundary-aware Salient Object Detection》**

arXiv：https://arxiv.org/abs/1812.10066

> Typically, a salient object detection (SOD) model faces opposite requirements in processing object interiors and boundaries. The features of interiors should be invariant to strong appearance change so as to pop-out the salient object as a whole, while the features of boundaries should be selective to slight appearance change to distinguish salient objects and background. To address this selectivity-invariance dilemma, we propose a novel boundary-aware network with successive dilation for image-based SOD. In this network, the feature selectivity at boundaries is enhanced by incorporating a boundary localization stream, while the feature invariance at interiors is guaranteed with a complex interior perception stream. Moreover, a transition compensation stream is adopted to amend the probable failures in transitional regions between interiors and boundaries. In particular, an integrated successive dilation module is proposed to enhance the feature invariance at interiors and transitional regions. Extensive experiments on four datasets show that the proposed approach outperforms 11 state-of-the-art methods.

# Re-ID

**《3D PersonVLAD: Learning Deep Global Representations for Video-based Person Re-identification》**

IEEE Transactions on Neural Networks and Learning Systems

arXiv：https://arxiv.org/abs/1812.10222

> In this paper, we introduce a global video representation to video-based person re-identification (re-ID) that aggregates local 3D features across the entire video extent. Most of the existing methods rely on 2D convolutional networks (ConvNets) to extract frame-wise deep features which are pooled temporally to generate the video-level representations. However, 2D ConvNets lose temporal input information immediately after the convolution, and a separate temporal pooling is limited in capturing human motion in shorter sequences. To this end, we present a *global* video representation (3D PersonVLAD), complementary to 3D ConvNets as a novel layer to capture the appearance and motion dynamics in full-length videos. However, encoding each video frame in its entirety and computing an aggregate global representation across all frames is tremendously challenging due to occlusions and misalignments. To resolve this, our proposed network is further augmented with 3D part alignment module to learn local features through soft-attention module. These attended features are statistically aggregated to yield identity-discriminative representations. Our global 3D features are demonstrated to achieve state-of-the-art results on three benchmark datasets: MARS, iLIDS-VID, and PRID 2011.

**《Spatial and Temporal Mutual Promotion for Video-based Person Re-identification》**

AAAI 2019

arXiv：https://arxiv.org/abs/1812.10305

> Video-based person re-identification is a crucial task of matching video sequences of a person across multiple camera views. Generally, features directly extracted from a single frame suffer from occlusion, blur, illumination and posture changes. This leads to false activation or missing activation in some regions, which corrupts the appearance and motion representation. How to explore the abundant spatial-temporal information in video sequences is the key to solve this problem. To this end, we propose a Refining Recurrent Unit (RRU) that recovers the missing parts and suppresses noisy parts of the current frame's features by referring historical frames. With RRU, the quality of each frame's appearance representation is improved. Then we use the Spatial-Temporal clues Integration Module (STIM) to mine the spatial-temporal information from those upgraded features. Meanwhile, the multi-level training objective is used to enhance the capability of RRU and STIM. Through the cooperation of those modules, the spatial and temporal features mutually promote each other and the final spatial-temporal feature representation is more discriminative and robust. Extensive experiments are conducted on three challenging datasets, i.e., iLIDS-VID, PRID-2011 and MARS. The experimental results demonstrate that our approach outperforms existing state-of-the-art methods of video-based person re-identification on iLIDS-VID and MARS and achieves favorable results on PRID-2011.

**《Cluster Loss for Person Re-Identification》**

ICVGIP 2018

arXiv：https://arxiv.org/abs/1812.10325

> Person re-identification (ReID) is an important problem in computer vision, especially for video surveillance applications. The problem focuses on identifying people across different cameras or across different frames of the same camera. The main challenge lies in identifying the similarity of the same person against large appearance and structure variations, while differentiating between individuals. Recently, deep learning networks with triplet loss have become a common framework for person ReID. However, triplet loss focuses on obtaining correct orders on the training set. We demonstrate that it performs inferior in a clustering task. In this paper, we design a cluster loss, which can lead to the model output with a larger inter-class variation and a smaller intra-class variation compared to the triplet loss. As a result, our model has a better generalization ability and can achieve higher accuracy on the test set especially for a clustering task. We also introduce a batch hard training mechanism for improving the results and faster convergence of training.
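
As background for the comparison in this abstract, the following is a minimal sketch of the triplet loss with batch-hard mining, the baseline setup the abstract refers to. It is not the paper's cluster loss, whose definition is not given in the abstract. The function names, the Euclidean distance, and the margin value of 0.3 are illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """For each anchor, use its farthest positive and its closest negative."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [
            euclidean(anchor, e)
            for j, (e, l) in enumerate(zip(embeddings, labels))
            if l == label and j != i
        ]
        negatives = [
            euclidean(anchor, e)
            for e, l in zip(embeddings, labels)
            if l != label
        ]
        if not positives or not negatives:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, max(positives) - min(negatives) + margin))
    return sum(losses) / len(losses) if losses else 0.0


# Two identities, two samples each; identities are well separated here.
separated = batch_hard_triplet_loss(
    [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]], [0, 0, 1, 1]
)
# Interleaved identities on a line, so every anchor violates the margin.
interleaved = batch_hard_triplet_loss(
    [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]], [0, 0, 1, 1]
)
```

The loss only constrains the relative order of distances per anchor, which is the property the abstract contrasts with a loss that directly shapes inter-class and intra-class variation.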

**《EgoReID: Person re-identification in Egocentric Videos Acquired by Mobile Devices with First-Person Point-of-View》**

arXiv：https://arxiv.org/abs/1812.09570

> Widespread use of wearable cameras and recording devices such as cellphones have opened the door to a lot of interesting research in first-person Point-of-view (POV) videos (egocentric videos). In recent years, we have seen the performance of video-based person Re-Identification (ReID) methods improve considerably. However, with the influx of varying video domains, such as egocentric videos, it has become apparent that there are still many open challenges to be faced. These challenges are a result of factors such as poor video quality due to ego-motion, blurriness, severe changes in lighting conditions and perspective distortions. To facilitate the research towards conquering these challenges, this paper contributes a new, first-of-its-kind dataset called EgoReID. The dataset is captured using 3 mobile cellphones with non-overlapping field-of-view. It contains 900 IDs and around 10,200 tracks with a total of 176,000 detections. Moreover, for each video we also provide 12-sensor meta data. Directly applying current approaches to our dataset results in poor performance. Considering the unique nature of our dataset, we propose a new framework which takes advantage of both visual and sensor meta data to successfully perform Person ReID. In this paper, we propose to adopt human body region parsing to extract local features from different body regions and then employ 3D convolution to better encode temporal information of each sequence of body parts. In addition, we also employ sensor meta data to determine target's next camera and their estimated time of arrival, such that the search is only performed among tracks present in the predicted next camera around the estimated time. This considerably improves our ReID performance as it significantly reduces our search space.


# Super-resolution

**《3DSRnet: Video Super-resolution using 3D Convolutional Neural Networks》**

arXiv：https://arxiv.org/abs/1812.09079

> In video super-resolution, the spatio-temporal coherence between, and among the frames must be exploited appropriately for accurate prediction of the high resolution frames. Although 2D convolutional neural networks (CNNs) are powerful in modelling images, 3D-CNNs are more suitable for spatio-temporal feature extraction as they can preserve temporal information. To this end, we propose an effective 3D-CNN for video super-resolution, called the 3DSRnet that does not require motion alignment as preprocessing. Our 3DSRnet maintains the temporal depth of spatio-temporal feature maps to maximally capture the temporally nonlinear characteristics between low and high resolution frames, and adopts residual learning in conjunction with the sub-pixel outputs. It outperforms the most state-of-the-art method with average 0.45 and 0.36 dB higher in PSNR for scales 3 and 4, respectively, in the Vidset4 benchmark. Our 3DSRnet first deals with the performance drop due to scene change, which is important in practice but has not been previously considered.


# Image Denoising

**《A Multiscale Image Denoising Algorithm Based On Dilated Residual Convolution Network》**

arXiv：https://arxiv.org/abs/1812.09131

> Image denoising is a classical problem in low level computer vision. Model-based optimization methods and deep learning approaches have been the two main strategies for solving the problem. Model-based optimization methods are flexible for handling different inverse problems but are usually time-consuming. In contrast, deep learning methods have fast testing speed but the performance of these CNNs is still inferior. To address this issue, here we propose a novel deep residual learning model that combines the dilated residual convolution and multi-scale convolution groups. Due to the complex patterns and structures of inside an image, the multiscale convolution group is utilized to learn those patterns and enlarge the receptive field. Specifically, the residual connection and batch normalization are utilized to speed up the training process and maintain the denoising performance. In order to decrease the gridding artifacts, we integrate the hybrid dilated convolution design into our model. To this end, this paper aims to train a lightweight and effective denoiser based on multiscale convolution group. Experimental results have demonstrated that the enhanced denoiser can not only achieve promising denoising results, but also become a strong competitor in practical application.

# Zero-Shot Learning

**《Domain-Aware Generalized Zero-Shot Learning》**

arXiv：https://arxiv.org/abs/1812.09903

> Generalized zero-shot learning (GZSL) is the problem of learning a classifier where some classes have samples, and others are learned from side information, like semantic attributes or text description, in a zero-shot learning fashion (ZSL). A major challenge in GZSL is to learn consistently for those two different domains. Here we describe a probabilistic approach that breaks the model into three modular components, and then combines them in a consistent way. Specifically, our model consists of three classifiers: A "gating" model that softly decides if a sample is from a "seen" class and two experts: a ZSL expert, and an expert model for seen classes. We address two main difficulties in this approach: How to provide an accurate estimate of the gating probability without any training samples for unseen classes; and how to use an expert predictions when it observes samples outside of its domain. 
The key insight in our approach is to pass information between the three models to improve each others accuracy, while keeping the modular structure. We test our approach, Domain-Aware GZSL (DAZL) on three standard GZSL benchmark datasets (AWA, CUB, SUN), and find that it largely outperforms state-of-the-art GZSL models. DAZL is also the first model that closes the gap and surpasses the performance of generative models for GZSL, even-though it is a light-weight model that is much easier to train and tune.
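
One way to read the gating idea in this abstract is as a soft mixture of the two experts, weighted by the gate's probability that the sample comes from a seen class. The sketch below shows only that mixture step. The function name and the dictionary-based interface are assumptions, and the paper's information passing between the three models is not modeled here.

```python
def combine_experts(p_seen, seen_probs, unseen_probs):
    """Mix two expert distributions with a soft gate.

    p_seen       -- gate output: probability the sample is from a seen class
    seen_probs   -- dict label -> probability from the seen-class expert
    unseen_probs -- dict label -> probability from the ZSL expert
    """
    if not 0.0 <= p_seen <= 1.0:
        raise ValueError("p_seen must lie in [0, 1]")
    combined = {label: p_seen * p for label, p in seen_probs.items()}
    for label, p in unseen_probs.items():
        combined[label] = combined.get(label, 0.0) + (1.0 - p_seen) * p
    return combined


# Hypothetical outputs for one sample.
mixed = combine_experts(
    0.8,
    {"cat": 0.9, "dog": 0.1},
    {"zebra": 1.0},
)
```

If both experts output normalized distributions, the mixture is normalized as well, so the result can be compared across seen and unseen labels directly.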

# Few-Shot

**《Learning Compositional Representations for Few-Shot Recognition》**

arXiv：https://arxiv.org/abs/1812.09213

> One of the key limitations of modern deep learning based approaches lies in the amount of data required to train them. Humans, on the other hand, can learn to recognize novel categories from just a few examples. Instrumental to this rapid learning ability is the compositional structure of concept representations in the human brain - something that deep learning models are lacking. In this work we make a step towards bridging this gap between human and machine learning by introducing a simple regularization technique that allows the learned representation to be decomposable into parts. We evaluate the proposed approach on three datasets: CUB-200-2011, SUN397, and ImageNet, and demonstrate that our compositional representations require fewer examples to learn classifiers for novel categories, outperforming state-of-the-art few-shot learning approaches by a significant margin.

**《Similarity R-C3D for Few-shot Temporal Activity Detection》**

arXiv：https://arxiv.org/abs/1812.10000

> Many activities of interest are rare events, with only a few labeled examples available. Therefore models for temporal activity detection which are able to learn from a few examples are desirable. In this paper, we present a conceptually simple and general yet novel framework for few-shot temporal activity detection which detects the start and end time of the few-shot input activities in an untrimmed video. Our model is end-to-end trainable and can benefit from more few-shot examples. At test time, each proposal is assigned the label of the few-shot activity class corresponding to the maximum similarity score. Our Similarity R-C3D method outperforms previous work on three large-scale benchmarks for temporal activity detection (THUMOS14, ActivityNet1.2, and ActivityNet1.3 datasets) in the few-shot setting. Our code will be made available.
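
The test-time rule described above (each proposal takes the label of the few-shot class with the maximum similarity score) can be sketched as follows. Cosine similarity, the feature vectors, and the function names are assumptions for illustration; the abstract does not state which similarity measure the paper uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def assign_label(proposal_feature, class_features):
    """Return the few-shot class whose feature is most similar to the proposal."""
    return max(
        class_features,
        key=lambda name: cosine_similarity(proposal_feature, class_features[name]),
    )


# Hypothetical few-shot class features and one proposal feature.
few_shot_classes = {"jump": [1.0, 0.0], "run": [0.0, 1.0]}
predicted = assign_label([0.9, 0.2], few_shot_classes)
```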


# Other

**《A Scale Invariant Approach for Sparse Signal Recovery》**

arXiv：https://arxiv.org/abs/1812.08852

> In this paper, we study the ratio of the L1 and L2 norms, denoted as L1/L2, to promote sparsity. Due to the non-convexity and non-linearity, there has been little attention to this scale-invariant metric. Compared to popular models in the literature such as the Lp model for p∈(0,1) and the transformed L1 (TL1), this ratio model is parameter free. Theoretically, we present a weak null space property (wNSP) and prove that any sparse vector is a local minimizer of the L1/L2 model provided with this wNSP condition. Computationally, we focus on a constrained formulation that can be solved via the alternating direction method of multipliers (ADMM). Experiments show that the proposed approach is comparable to the state-of-the-art methods in sparse recovery. In addition, a variant of the L1/L2 model to apply on the gradient is also discussed with a proof-of-concept example of MRI reconstruction.
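
A small numeric sketch of why the L1/L2 ratio is called scale-invariant and why it favors sparse vectors. The function name and example vectors are illustrative assumptions; the ADMM solver from the paper is not shown.

```python
import math


def l1_over_l2(x):
    """Ratio of the L1 norm to the L2 norm of a nonzero vector."""
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    if l2 == 0.0:
        raise ValueError("L1/L2 is undefined for the zero vector")
    return l1 / l2


sparse = [0.0, 3.0, 0.0, 0.0]       # one nonzero entry
dense = [1.0, 1.0, 1.0, 1.0]        # four equal entries
scaled = [10.0 * v for v in dense]  # same direction, larger magnitude

ratio_sparse = l1_over_l2(sparse)
ratio_dense = l1_over_l2(dense)
ratio_scaled = l1_over_l2(scaled)
```

For a vector of length n the ratio ranges from 1 (a single nonzero entry) to sqrt(n) (all entries of equal magnitude), and multiplying the vector by a nonzero constant leaves it unchanged.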

**《Polygonal approximation of digital planar curve using novel significant measure》**

arXiv：https://arxiv.org/abs/1812.09271

> This paper presents an iterative smoothing technique for polygonal approximation of digital image boundary. The technique starts with finest initial segmentation points of a curve. The contribution of initially segmented points towards preserving the original shape of the image boundary is determined by computing the significant measure of every initial segmentation points which is sensitive to sharp turns, which may be missed easily when conventional significant measures are used for detecting dominant points. The proposed method differentiates between the situations when a point on the curve between two points on a curve projects directly upon the line segment or beyond this line segment. It not only identifies these situations, but also computes its significant contribution for these situations differently. This situation-specific treatment allows preservation of points with high curvature even as revised set of dominant points are derived. The experimental results show that the proposed technique competes well with the state of the art techniques.

**《Quicker ADC : Unlocking the hidden potential of Product Quantization with SIMD》**

arXiv：https://arxiv.org/abs/1812.09162

github：https://github.com/technicolor-research/faiss-quickeradc

> Efficient Nearest Neighbor (NN) search in high-dimensional spaces is a foundation of many multimedia retrieval systems. A common approach is to rely on Product Quantization that allows storing large vector databases in memory and also allows efficient distance computations. Yet, implementations of nearest neighbor search with Product Quantization have their performance limited by the many memory accesses they perform. Following this observation, André et al. proposed more efficient implementations of m×4 product quantizers (PQ) leveraging specific SIMD instructions. 
Quicker ADC contributes additional implementations not limited to m×4 codes and relying on AVX-512, the latest revision of the SIMD instruction set. In doing so, Quicker ADC faces the challenge of efficiently using 5-, 6- and 7-bit shuffles that do not align to computer bytes or words. To this end, we introduce (i) irregular product quantizers combining sub-quantizers of different granularity and (ii) split tables allowing lookup tables larger than registers. We evaluate Quicker ADC with multiple indexes including Inverted Multi-Indexes and IVF HNSW and show that it outperforms FAISS PQ implementation and optimization (i.e., Polysemous codes) for numerous configurations. Finally, we open-source a fork of FAISS that includes Quicker ADC.
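
For readers unfamiliar with Product Quantization, the following is a minimal pure-Python sketch of asymmetric distance computation (ADC) with per-query lookup tables, which is the memory-access pattern the abstract describes. It is not the SIMD implementation from the paper or from FAISS. The codebooks are hand-picked toy values and all function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split_vector(vec, m):
    """Split a vector into m contiguous sub-vectors of equal length."""
    if len(vec) % m != 0:
        raise ValueError("vector length must be divisible by m")
    d = len(vec) // m
    return [vec[i * d:(i + 1) * d] for i in range(m)]


def encode(vec, codebooks):
    """Store, per sub-vector, only the index of its nearest centroid."""
    code = []
    for sub, centroids in zip(split_vector(vec, len(codebooks)), codebooks):
        dists = [squared_distance(sub, c) for c in centroids]
        code.append(dists.index(min(dists)))
    return code


def build_tables(query, codebooks):
    """Per-query tables: distance from each query sub-vector to every centroid."""
    subs = split_vector(query, len(codebooks))
    return [
        [squared_distance(sub, c) for c in centroids]
        for sub, centroids in zip(subs, codebooks)
    ]


def adc_distance(code, tables):
    """Approximate squared distance: one table lookup per sub-quantizer."""
    return sum(table[index] for index, table in zip(code, tables))


# Toy setup: m = 2 sub-quantizers with 2 centroids each (1-bit codes).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
database_code = encode([0.9, 1.1, 0.1, -0.1], codebooks)
tables = build_tables([1.0, 1.0, 0.0, 0.0], codebooks)
approx = adc_distance(database_code, tables)
```

The tables are built once per query, so scanning the database costs one lookup and one addition per sub-quantizer per vector. Keeping such tables in SIMD registers is what constrains the code sizes discussed in the abstract.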

**《Writer-Aware CNN for Parsimonious HMM-Based Offline Handwritten Chinese Text Recognition》**

arXiv：https://arxiv.org/abs/1812.09809

> Recently, the hybrid convolutional neural network hidden Markov model (CNN-HMM) has been introduced for offline handwritten Chinese text recognition (HCTR) and has achieved state-of-the-art performance. In a CNN-HMM system, a handwritten text line is modeled by a series of cascading HMMs, each representing one character, and the posterior distributions of HMM states are calculated by CNN. However, modeling each of the large vocabulary of Chinese characters with a uniform and fixed number of hidden states requires high memory and computational costs and makes the tens of thousands of HMM state classes confusing. Another key issue of CNN-HMM for HCTR is the diversified writing style, which leads to model strain and a significant performance decline for specific writers. To address these issues, we propose a writer-aware CNN based on parsimonious HMM (WCNN-PHMM). Validated on the ICDAR 2013 competition of CASIA-HWDB database, the more compact WCNN-PHMM of a 7360-class vocabulary can achieve a relative character error rate (CER) reduction of 16.6% over the conventional CNN-HMM without considering language modeling. Moreover, the state-tying results of PHMM explicitly show the information sharing among similar characters and the confusion reduction of tied state classes. Finally, we visualize the learned writer codes and demonstrate the strong relationship with the writing styles of different writers. To the best of our knowledge, WCNN-PHMM yields the best results on the ICDAR 2013 competition set, demonstrating its power when enlarging the size of the character vocabulary.


**《Color Image Enhancement Method Based on Weighted Image Guided Filtering》**

arXiv：https://arxiv.org/abs/1812.09930

> A novel color image enhancement method is proposed based on Retinex to enhance color images under non-uniform illumination or poor visibility conditions. Different from the conventional Retinex algorithms, the Weighted Guided Image Filter is used as a surround function instead of the Gaussian filter to estimate the background illumination, which can overcome the drawbacks of local blur and halo artifact that may appear by Gaussian filter. To avoid color distortion, the image is converted to the HSI color model, and only the intensity channel is enhanced. Then a linear color restoration algorithm is adopted to convert the enhanced intensity image back to the RGB color model, which ensures the hue is constant and undistorted. Experimental results show that the proposed method is effective to enhance both color and gray images with low exposure and non-uniform illumination, resulting in better visual quality than traditional method. At the same time, the objective evaluation indicators are also superior to the conventional methods. In addition, the efficiency of the proposed method is also improved thanks to the linear color restoration algorithm.


**《Coupled Recurrent Network (CRN)》**

arXiv：https://arxiv.org/abs/1812.10071

> Many semantic video analysis tasks can benefit from multiple, heterogeneous signals. For example, in addition to the original RGB input sequences, sequences of optical flow are usually used to boost the performance of human action recognition in videos. To learn from these heterogeneous input sources, existing methods rely on two-stream architectural designs that contain independent, parallel streams of Recurrent Neural Networks (RNNs). However, two-stream RNNs do not fully exploit the reciprocal information contained in the multiple signals, let alone exploit it in a recurrent manner. To this end, we propose in this paper a novel recurrent architecture, termed Coupled Recurrent Network (CRN), to deal with multiple input sources. In CRN, the parallel streams of RNNs are coupled together. The key design of CRN is a Recurrent Interpretation Block (RIB) that supports learning of reciprocal feature representations from multiple signals in a recurrent manner. Different from RNNs which stack the training loss at each time step or the last time step, we propose an effective and efficient training strategy for CRN. Experiments show the efficacy of the proposed CRN. In particular, we achieve the new state of the art on the benchmark datasets of human action recognition and multi-person pose estimation.


**《RegNet: Learning the Optimization of Direct Image-to-Image Pose Registration》**

arXiv：https://arxiv.org/abs/1812.10212

> Direct image-to-image alignment that relies on the optimization of photometric error metrics suffers from limited convergence range and sensitivity to lighting conditions. Deep learning approaches have been applied to address this problem by learning better feature representations using convolutional neural networks, yet still require a good initialization. In this paper, we demonstrate that the inaccurate numerical Jacobian limits the convergence range, which could be improved greatly using learned approaches. Based on this observation, we propose a novel end-to-end network, RegNet, to learn the optimization of image-to-image pose registration. By jointly learning feature representation for each pixel and partial derivatives that replace handcrafted ones (e.g., numerical differentiation) in the optimization step, the neural network facilitates end-to-end optimization. The energy landscape is constrained on both the feature representation and the learned Jacobian, hence providing more flexibility for the optimization, which in turn leads to more robust and faster convergence. In a series of experiments, including a broad ablation study, we demonstrate that RegNet is able to converge for large-baseline image pairs with fewer iterations.

**《Learning Not to Learn: Training Deep Neural Networks with Biased Data》**

arXiv：https://arxiv.org/abs/1812.10352

> We propose a novel regularization algorithm to train deep neural networks, in which data at training time is severely biased. Since a neural network efficiently learns data distribution, a network is likely to learn the bias information to categorize input data. It leads to poor performance at test time, if the bias is, in fact, irrelevant to the categorization. In this paper, we formulate a regularization loss based on mutual information between feature embedding and bias. Based on the idea of minimizing this mutual information, we propose an iterative algorithm to unlearn the bias information. We employ an additional network to predict the bias distribution and train the network adversarially against the feature embedding network. At the end of learning, the bias prediction network is not able to predict the bias not because it is poorly trained, but because the feature embedding network successfully unlearns the bias information. We also demonstrate quantitative and qualitative experimental results which show that our algorithm effectively removes the bias information from feature embedding.


**《Dynamic Runtime Feature Map Pruning》**

arXiv：https://arxiv.org/abs/1812.09922

github：https://github.com/liangtailin/darknet-modified

Note: this looks quite interesting, and the code is already open source.

> High bandwidth requirements are an obstacle for accelerating the training and inference of deep neural networks. Most previous research focuses on reducing the size of kernel maps for inference. We analyze parameter sparsity of six popular convolutional neural networks - AlexNet, MobileNet, ResNet-50, SqueezeNet, TinyNet, and VGG16. Of the networks considered, those using ReLU (AlexNet, SqueezeNet, VGG16) contain a high percentage of 0-valued parameters and can be statically pruned. Networks with Non-ReLU activation functions in some cases may not contain any 0-valued parameters (ResNet-50, TinyNet). We also investigate runtime feature map usage and find that input feature maps comprise the majority of bandwidth requirements when depth-wise and point-wise convolutions are used. We introduce dynamic runtime pruning of feature maps and show that 10% of dynamic feature map execution can be removed without loss of accuracy. We then extend dynamic pruning to allow for values within an epsilon of zero and show a further 5% reduction of feature map loading with a 1% loss of accuracy in top-1.
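
One plausible reading of the runtime pruning described above is to skip any feature map whose values are all zero, or all within an epsilon of zero, before it is loaded by the next layer. The sketch below illustrates only that selection step. The function name, the nested-list representation, and the whole-map criterion are assumptions and are not taken from the linked darknet fork.

```python
def select_active_feature_maps(feature_maps, epsilon=0.0):
    """Return indices of feature maps that contain at least one value
    whose magnitude exceeds epsilon; the rest could be skipped at runtime."""
    active = []
    for index, fmap in enumerate(feature_maps):
        if any(abs(value) > epsilon for row in fmap for value in row):
            active.append(index)
    return active


# Three hypothetical 2x2 post-activation feature maps.
maps = [
    [[0.0, 0.0], [0.0, 0.0]],     # exactly zero everywhere
    [[0.5, 0.0], [0.0, 0.0]],     # clearly active
    [[0.01, 0.0], [0.0, -0.02]],  # near zero everywhere
]

exact = select_active_feature_maps(maps)            # prune exact zeros only
relaxed = select_active_feature_maps(maps, 0.05)    # also prune near-zero maps
```

Raising epsilon prunes more maps at the cost of discarding small activations, which mirrors the accuracy trade-off the abstract reports.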

**《Informative Object Annotations: Tell Me Something I Don't Know》**

arXiv：https://arxiv.org/abs/1812.10358

> Capturing the interesting components of an image is a key aspect of image understanding. When a speaker annotates an image, selecting labels that are informative greatly depends on the prior knowledge of a prospective listener. Motivated by cognitive theories of categorization and communication, we present a new unsupervised approach to model this prior knowledge and quantify the informativeness of a description. Specifically, we compute how knowledge of a label reduces uncertainty over the space of labels and utilize this to rank candidate labels for describing an image. While the full estimation problem is intractable, we describe an efficient algorithm to approximate entropy reduction using a tree-structured graphical model. We evaluate our approach on the open-images dataset using a new evaluation set of 10K ground-truth ratings and find that it achieves ~65% agreement with human raters, largely outperforming other unsupervised baseline approaches.


================================================
FILE: 2018/12/31.md
================================================
[Computer Vision Paper Express] 2018-12-31

- [x] 2018-12-31

This post shares 16 papers covering CNN, semantic segmentation, GAN, 3D, salient object detection, and related topics.

[TOC]

# CNN

# Face

**《Deception Detection by 2D-to-3D Face Reconstruction from Videos》**

arXiv：https://arxiv.org/abs/1812.10558

Note: a really impressive research topic!

> Lies and deception are common phenomena in society, both in our private and professional lives. However, humans are notoriously bad at accurate deception detection. Based on the literature, human accuracy of distinguishing between lies and truthful statements is 54% on average, in other words it is slightly better than a random guess. While people do not much care about this issue, in high-stakes situations such as interrogations for serious crimes and for evaluating the testimonies in court cases, accurate deception detection methods are highly desirable. To achieve a reliable, covert, and non-invasive deception detection, we propose a novel method that jointly extracts reliable low- and high-level facial features namely, 3D facial geometry, skin reflectance, expression, head pose, and scene illumination in a video sequence. Then these features are modeled using a Recurrent Neural Network to learn temporal characteristics of deceptive and honest behavior. We evaluate the proposed method on the Real-Life Trial (RLT) dataset that contains high-stake deceptive and honest videos recorded in courtrooms. Our results show that the proposed method (with an accuracy of 72.8%) improves the state of the art as well as outperforming the use of manually coded facial attributes (67.6%) in deception detection.

# Semantic Segmentation

**《Coarse-to-fine Semantic Segmentation from Image-level Labels》**

arXiv：https://arxiv.org/abs/1812.10885

> Deep neural network-based semantic segmentation generally requires large-scale, costly annotations for training to obtain better performance. To avoid pixel-wise segmentation annotations which are needed for most methods, recently some researchers attempted to use object-level labels (e.g. bounding boxes) or image-level labels (e.g. image categories). In this paper, we propose a novel recursive coarse-to-fine semantic segmentation framework based on only image-level category labels. For each image, an initial coarse mask is first generated by a convolutional neural network-based unsupervised foreground segmentation model and then is enhanced by a graph model. The enhanced coarse mask is fed to a fully convolutional neural network to be recursively refined. Unlike existing image-level label-based semantic segmentation methods, which require labels for all categories in images that contain multiple types of objects, our framework only needs one label for each image and can handle images that contain multi-category objects. Trained only on ImageNet, our framework achieves comparable performance on PASCAL VOC dataset as other image-level label-based state-of-the-arts of semantic segmentation. Furthermore, our framework can be easily extended to foreground object segmentation task and achieves comparable performance with the state-of-the-art supervised methods on the Internet Object dataset.


**《S4-Net: Geometry-Consistent Semi-Supervised Semantic Segmentation》**

arXiv：https://arxiv.org/abs/1812.10717

> We show that it is possible to learn semantic segmentation from very limited amounts of manual annotations, by enforcing geometric 3D constraints between multiple views. More exactly, image locations corresponding to the same physical 3D point should all have the same label. We show that introducing such constraints during learning is very effective, even when no manual label is available for a 3D point, and can be done simply by employing techniques from 'general' semi-supervised learning to the context of semantic segmentation. To demonstrate this idea, we use RGB-D image sequences of rigid scenes, for a 4-class segmentation problem derived from the ScanNet dataset. Starting from RGB-D sequences with a few annotated frames, we show that we can incorporate RGB-D sequences without any manual annotations to improve the performance, which makes our approach very convenient. Furthermore, we demonstrate our approach for semantic segmentation of objects on the LabelFusion dataset, where we show that one manually labeled image in a scene is sufficient for high performance on the whole scene.

**《Future semantic segmentation of time-lapsed videos with large temporal displacement》**

arXiv：https://arxiv.org/abs/1812.10786

homepage：https://samarth-b.github.io/skycamIrr/

> An important aspect of video understanding is the ability to predict the evolution of its content in the future. This paper presents a future frame semantic segmentation technique for predicting semantic masks of the current and future frames in a time-lapsed video. We specifically focus on time-lapsed videos with large temporal displacement to highlight the model's ability to capture large motions in time. We first introduce a unique semantic segmentation prediction dataset with over 120,000 time-lapsed sky-video frames and all corresponding semantic masks captured over a span of five years in North America region. The dataset has immense practical value for cloud cover analysis, where clouds are treated as non-rigid objects of interest. Next, our proposed recurrent network architecture departs from the existing trend of using temporal convolutional networks (TCN) (or feed-forward networks), by explicitly learning internal representations for the evolution of video content with time. Experimental evaluation shows an improvement of mean IoU over TCNs in the segmentation task by 10.8% for 10 mins (21% over 60 mins) ahead of time predictions. Further, our model simultaneously measures both the current and future solar irradiance from the same video frames with a normalized-MAE of 10.5% over two years. These results indicate that recurrent memory networks with attention mechanism are able to capture complex advective and diffused flow characteristic of dense fluids even with sparse temporal sampling and are more suitable for future frame prediction tasks for longer duration videos.

# GAN

**《InstaGAN: Instance-aware Image-to-Image Translation》**

ICLR 2019

arXiv：https://arxiv.org/abs/1812.10889

github：https://github.com/sangwoomo/instagan

> Unsupervised image-to-image translation has gained considerable attention due to the recent impressive progress based on generative adversarial networks (GANs). However, previous methods often fail in challenging cases, in particular, when an image has multiple target instances and a translation task involves significant changes in shape, e.g., translating pants to skirts in fashion images. To tackle the issues, we propose a novel method, coined instance-aware GAN (InstaGAN), that incorporates the instance information (e.g., object segmentation masks) and improves multi-instance transfiguration. The proposed method translates both an image and the corresponding set of instance attributes while maintaining the permutation invariance property of the instances. To this end, we introduce a context preserving loss that encourages the network to learn the identity function outside of target instances. We also propose a sequential mini-batch inference/training technique that handles multiple instances with a limited GPU memory and enhances the network to generalize better for multiple instances. Our comparative evaluation demonstrates the effectiveness of the proposed method on different image datasets, in particular, in the aforementioned challenging cases.

**《TOP-GAN: Label-Free Cancer Cell Classification Using Deep Learning with a Small Training Set》**

arXiv：https://arxiv.org/abs/1812.11006

> We propose a new deep learning approach for medical imaging that copes with the problem of a small training set, the main bottleneck of deep learning, and apply it for classification of healthy and cancer cells acquired by quantitative phase imaging. The proposed method, called transferring of pre-trained generative adversarial network (TOP-GAN), is a hybridization between transfer learning and generative adversarial networks (GANs). Healthy cells and cancer cells of different metastatic potential have been imaged by low-coherence off-axis holography. After the acquisition, the optical path delay maps of the cells have been extracted and directly used as an input to the deep networks. In order to cope with the small number of classified images, we have used GANs to train a large number of unclassified images from another cell type (sperm cells). After this preliminary training, and after transforming the last layer of the network with new ones, we have designed an automatic classifier for the correct cell type (healthy/primary cancer/metastatic cancer) with 90-99% accuracy, although small training sets of down to several images have been used. These results are better in comparison to other classic methods that aim at coping with the same problem of a small training set. We believe that our approach makes the combination of holographic microscopy and deep learning networks more accessible to the medical field by enabling a rapid, automatic and accurate classification in stain-free imaging flow cytometry. Furthermore, our approach is expected to be applicable to many other medical image classification tasks, suffering from a small training set.

**《Finger-GAN: Generating Realistic Fingerprint Images Using Connectivity Imposed GAN》**

arXiv：https://arxiv.org/abs/1812.10482

> Generating realistic biometric images has been an interesting and, at the same time, challenging problem. Classical statistical models fail to generate realistic-looking fingerprint images, as they are not powerful enough to capture the complicated texture representation in fingerprint images. In this work, we present a machine learning framework based on generative adversarial networks (GAN), which is able to generate fingerprint images sampled from a prior distribution (learned from a set of training images). We also add a suitable regularization term to the loss function, to impose the connectivity of generated fingerprint images. This is highly desirable for fingerprints, as the lines in each finger are usually connected. We apply this framework to two popular fingerprint databases, and generate images which look very realistic, and similar to the samples in those databases. Through experimental results, we show that the generated fingerprint images have a good diversity, and are able to capture different parts of the prior distribution. We also evaluate the Frechet Inception distance (FID) of our proposed model, and show that our model is able to achieve good quantitative performance in terms of this score.

# 3D

**《Learning to Reconstruct Shapes from Unseen Classes》**

arXiv：https://arxiv.org/abs/1812.11166

homepage：http://genre.csail.mit.edu/

> From a single image, humans are able to perceive the full 3D shape of an object by exploiting learned shape priors from everyday life. Contemporary single-image 3D reconstruction algorithms aim to solve this task in a similar fashion, but often end up with priors that are highly biased by training classes. Here we present an algorithm, Generalizable Reconstruction (GenRe), designed to capture more generic, class-agnostic shape priors. We achieve this with an inference network and training procedure that combine 2.5D representations of visible surfaces (depth and silhouette), spherical shape representations of both visible and non-visible surfaces, and 3D voxel-based representations, in a principled manner that exploits the causal structure of how 3D shapes give rise to 2D images. Experiments demonstrate that GenRe performs well on single-view shape reconstruction, and generalizes to diverse novel objects from categories not seen during training.

**《3D Point-Capsule Networks》**

arXiv：https://arxiv.org/abs/1812.10775

> In this paper, we propose 3D point-capsule networks, an auto-encoder designed to process sparse 3D point clouds while preserving spatial arrangements of the input data. 3D capsule networks arise as a direct consequence of our novel unified 3D auto-encoder formulation. Their dynamic routing scheme and the peculiar 2D latent space deployed by our approach bring in improvements for several common point cloud-related tasks, such as object classification, object reconstruction and part segmentation as substantiated by our extensive evaluations. Moreover, it enables new applications such as part interpolation and replacement.

**《Deflecting 3D Adversarial Point Clouds Through Outlier-Guided Removal》**

arXiv：https://arxiv.org/abs/1812.11017

> Neural networks are vulnerable to adversarial examples, which poses a threat to their application in security sensitive systems. We propose simple random sampling (SRS) and statistical outlier removal (SOR) as defenses for 3D point cloud classification, where both methods remove points by estimating probability of points serving as adversarial points. Compared with ensemble adversarial training which is the state-of-the-art defending method, SOR has several advantages: better defense performance, randomization makes the network more robust to adversarial point clouds, no additional training or fine-tuning required, and few computations are needed by adding the points-removal layer. In particular, our experiments on ModelNet40 show that SOR is very effective as defense in practice. The strength of those defenses lies in their non-differentiable nature and inherent randomness, which makes it difficult for an adversary to circumvent the defenses. Our best defense eliminates 81.4% of strong white-box attacks by C&W and l2 loss based attack methods.
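
The statistical outlier removal (SOR) idea mentioned in the abstract can be sketched in a few lines. This is only an illustrative sketch of the commonly used formulation (drop points whose mean distance to their k nearest neighbours exceeds the global mean plus a multiple of the standard deviation); it is not the authors' code. The function name `statistical_outlier_removal` and the parameters `k` and `alpha` are assumptions made for this example, and the brute-force neighbour search is only suitable for tiny point sets.

```python
import math
import statistics


def statistical_outlier_removal(points, k=2, alpha=1.0):
    """Keep points whose mean k-NN distance is not unusually large.

    points: list of (x, y, z) tuples.
    k, alpha: hypothetical defaults chosen for this sketch, not values from the paper.
    """
    mean_knn = []
    for i, p in enumerate(points):
        # Brute-force distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn.append(sum(dists[:k]) / k)

    # Global statistics of the per-point mean k-NN distance.
    mu = statistics.fmean(mean_knn)
    sigma = statistics.pstdev(mean_knn)
    threshold = mu + alpha * sigma

    return [p for p, d in zip(points, mean_knn) if d <= threshold]


# Toy example: the eight corners of a unit cube plus one far-away point.
cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
cleaned = statistical_outlier_removal(cube + [(10.0, 10.0, 10.0)], k=2, alpha=1.0)
```

In the toy example the far-away point has a much larger mean neighbour distance than the cube corners, so it falls above the threshold and is dropped.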


# Salient Object Detection

**《Salient Object Detection via High-to-Low Hierarchical Context Aggregation》**

arXiv：https://arxiv.org/abs/1812.10956

Note: Work from Ming-Ming Cheng's team; the code will be open-sourced soon.

> Recent progress on salient object detection mainly aims at exploiting how to effectively integrate convolutional side-output features in convolutional neural networks (CNN). Based on this, most of the existing state-of-the-art saliency detectors design complex network structures to fuse the side-output features of the backbone feature extraction networks. However, should the fusion strategies be more and more complex for accurate salient object detection? In this paper, we observe that the contexts of a natural image can be well expressed by a high-to-low self-learning of side-output convolutional features. As we know, the contexts of an image usually refer to the global structures, and the top layers of CNN usually learn to convey global information. On the other hand, it is difficult for the intermediate side-output features to express contextual information. Here, we design an hourglass network with intermediate supervision to learn contextual features in a high-to-low manner. The learned hierarchical contexts are aggregated to generate the hybrid contextual expression for an input image. At last, the hybrid contextual features can be used for accurate saliency estimation. We extensively evaluate our method on six challenging saliency datasets, and our simple method achieves state-of-the-art performance under various evaluation metrics. Code will be released upon paper acceptance.


# Action Recognition

**《Learning to Recognize 3D Human Action from A New Skeleton-based Representation Using Deep Convolutional Neural Networks》**

arXiv：https://arxiv.org/abs/1812.10550

> Recognizing human actions in untrimmed videos is an important challenging task. An effective 3D motion representation and a powerful learning model are two key factors influencing recognition performance. In this paper we introduce a new skeleton-based representation for 3D action recognition in videos. The key idea of the proposed representation is to transform 3D joint coordinates of the human body carried in skeleton sequences into RGB images via a color encoding process. By normalizing the 3D joint coordinates and dividing each skeleton frame into five parts, where the joints are concatenated according to the order of their physical connections, the color-coded representation is able to represent spatio-temporal evolutions of complex 3D motions, independently of the length of each sequence. We then design and train different Deep Convolutional Neural Networks (D-CNNs) based on the Residual Network architecture (ResNet) on the obtained image-based representations to learn 3D motion features and classify them into classes. Our method is evaluated on two widely used action recognition benchmarks: MSR Action3D and NTU-RGB+D, a very large-scale dataset for 3D human action recognition. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art approaches whilst requiring less computation for training and prediction.
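
The color-encoding step described in the abstract (3D joint coordinates mapped to RGB pixels) might look roughly like the sketch below. It is a simplified illustration, not the authors' implementation: the function name `encode_skeleton_sequence`, the min-max normalization over the whole sequence, and the `joint_order` argument (standing in for the five-part, physically connected joint ordering) are all assumptions made for this example.

```python
def encode_skeleton_sequence(frames, joint_order):
    """Map a skeleton sequence to a grid of RGB pixels.

    frames: list of frames, each a list of (x, y, z) joint coordinates.
    joint_order: joint indices in the order they should appear along the rows
                 (hypothetical stand-in for the body-part ordering).
    Returns image[row][col] = (r, g, b), with rows = joints and cols = frames.
    """
    # Min-max normalize all coordinates of the sequence into [0, 255].
    coords = [c for frame in frames for joint in frame for c in joint]
    lo, hi = min(coords), max(coords)
    scale = 255.0 / (hi - lo) if hi > lo else 0.0

    def to_pixel(joint):
        # (x, y, z) becomes (R, G, B).
        return tuple(int(round((c - lo) * scale)) for c in joint)

    return [[to_pixel(frame[j]) for frame in frames] for j in joint_order]


# Two frames with two joints each; rows follow joint_order, columns follow time.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    [(0.2, 0.0, 1.0), (1.0, 0.0, 0.0)],
]
image = encode_skeleton_sequence(frames, joint_order=[1, 0])
```

Each column of the resulting grid is one frame and each row is one joint, so motion over time shows up as color changes along a row. An image built this way could then be resized to a fixed resolution before being passed to a CNN.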


# Image Restoration

**《Residual Dense Network for Image Restoration》**

arXiv：https://arxiv.org/abs/1812.10477

> Convolutional neural network has recently achieved great success for image restoration (IR) and also offered hierarchical features. However, most deep CNN based IR models do not make full use of the hierarchical features from the original low-quality images, thereby achieving relatively-low performance. In this paper, we propose a novel residual dense network (RDN) to address this problem in IR. We fully exploit the hierarchical features from all the convolutional layers. Specifically, we propose residual dense block (RDB) to extract abundant local features via densely connected convolutional layers. RDB further allows direct connections from the state of preceding RDB to all the layers of current RDB, leading to a contiguous memory mechanism. To adaptively learn more effective features from preceding and current local features and stabilize the training of wider network, we proposed local feature fusion in RDB. After fully obtaining dense local features, we use global feature fusion to jointly and adaptively learn global hierarchical features in a holistic way. We demonstrate the effectiveness of RDN with three representative IR applications, single image super-resolution, Gaussian image denoising, and image compression artifact reduction. Experiments on benchmark datasets show that our RDN achieves favorable performance against state-of-the-art methods for each IR task.

# Other

**《Chart-Text: A Fully Automated Chart Image Descriptor》**

arXiv：https://arxiv.org/abs/1812.10636

Demo：http://ml.spi-global.com:8000/

Note: An interesting project that generates text descriptions from charts.

> Images greatly help in understanding, interpreting and visualizing data. Adding textual description to images is the first and foremost principle of web accessibility. Visually impaired users using screen readers will use these textual descriptions to get better understanding of images present in digital contents. In this paper, we propose Chart-Text a novel fully automated system that creates textual description of chart images. Given a PNG image of a chart, our Chart-Text system creates a complete textual description of it. First, the system classifies the type of chart and then it detects and classifies the labels and texts in the charts. Finally, it uses specific image processing algorithms to extract relevant information from the chart images. Our proposed system achieves an accuracy of 99.72% in classifying the charts and an accuracy of 78.9% in extracting the data and creating the corresponding textual description.

**《TROVE Feature Detection for Online Pose Recovery by Binocular Cameras》**

arXiv：https://arxiv.org/abs/1812.10967

> This paper proposes a new and efficient method to estimate 6-DoF ego-states: attitudes and positions in real time. The proposed method extracts information of ego-states by observing a feature called "TROVE" (Three Rays and One VErtex). TROVE features are projected from structures that are ubiquitous on man-made constructions and objects. The proposed method does not search for conventional corner-type features nor use Perspective-n-Point (PnP) methods, and it achieves a real-time estimation of attitudes and positions up to 60 Hz. The accuracy of attitude estimates can reach 0.3 degrees and that of position estimates can reach 2 cm in an indoor environment. The result shows a promising approach for unmanned robots to localize in an environment that is rich in man-made structures.


**《Center Emphasized Visual Saliency and Contrast-based Full Reference Image Quality Index》**

arXiv：https://arxiv.org/abs/1812.11163

homepage：http://layek.khu.ac.kr/CEQI/

> Objective Image Quality Assessment (IQA) is imperative in this multimedia-intensive world to assess the visual quality of an image close to the human ability. There are many parameters that bring human attention to an image and if the center part contains any visually salient information then it draws the attention even more. To the best of our knowledge, any previous IQA method did not give extra importance to the center part. In this paper, we propose a full reference image quality assessment (FR-IQA) approach using visual saliency and contrast, however, we give extra attention to the center by raising-up sensitivity of the similarity maps in that region. We evaluated our method on three popular benchmark databases (TID2008, CSIQ and LIVE) and compared with 13 state-of-the-art approaches which reveal the stronger correlation of our method with human evaluated values. The prediction of quality score is consistent for distortion-specific as well as distortion-independent cases. Moreover, faster processing makes it applicable to any real-time application. The MATLAB pcode is publicly available online to test the algorithm and can be found at http://layek.khu.ac.kr/CEQI/.


================================================
FILE: 2018/cvpr2018-paper-list.csv
================================================
Paper ID,Type,Title,Author(s)5,Poster,Single-Shot Refinement Neural Network for Object Detection,"Shifeng Zhang, CBSR, NLPR, CASIA; Longyin Wen, GE Global Research Center; Xiao Bian, ; Zhen Lei, Chinese Academy of Sciences ; Stan Li,"7,Poster,Video Captioning via Hierarchical Reinforcement Learning,"Xin Wang, UCSB; Wenhu Chen, ; Jiawei Wu, UCSB; Yuan-Fang Wang, UCSB; William Yang Wang, UCSB"12,Oral,DensePose: Multi-Person Dense Human Pose Estimation In The Wild,"Alp Guler, INRIA; Natalia Neverova, Facebook AI Research; Iasonas Kokkinos, FAIR/UCL"12,Poster,DensePose: Multi-Person Dense Human Pose Estimation In The Wild,"Alp Guler, INRIA; Natalia Neverova, Facebook AI Research; Iasonas Kokkinos, FAIR/UCL"19,Poster,Frustum PointNets for 3D Object Detection from RGB-D Data,"Charles R. Qi, Stanford University; Wei Liu, ; Chenxia Wu, ; hao Su, ; Leonidas J. Guibas,"21,Poster,Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge,"Damien Teney, Unversity of Adelaide; Peter Anderson, Australian National University; Xiaodong He, ; Anton Van den Hengel, University of Adelaide"24,Poster,Rethinking the Faster R-CNN Architecture for Temporal Action Localization,"Yu-Wei Chao, University of Michigan; Sudheendra Vijayanarasimhan, Google Research; Bryan Seybold, Google Research; David Ross, Google Research; Jia Deng, ; Rahul Sukthankar, Google Research"27,Spotlight,Shape from Shading through Shape Evolution,"Dawei Yang, University of Michigan; Jia Deng,"27,Poster,Shape from Shading through Shape Evolution,"Dawei Yang, University of Michigan; Jia Deng,"34,Poster,A High-Quality Denoising Dataset for Smartphone Cameras,"Abdelrahman Abdelhamed, York University; Stephen Lin, Microsoft Research Asia, China; Michael Brown, York University"35,Poster,Improving Color Reproduction Accuracy in the Camera Imaging Pipeline,"Hakki Karaimer, York University; Michael Brown, York University"37,Spotlight,End-to-End Dense Video Captioning with Masked 
Transformer,"Luowei Zhou, University of Michigan; Yingbo Zhou, Salesforce; Jason Corso, ; Richard Socher, Meta-Mind; Caiming Xiong, Salesforce"37,Poster,End-to-End Dense Video Captioning with Masked Transformer,"Luowei Zhou, University of Michigan; Yingbo Zhou, Salesforce; Jason Corso, ; Richard Socher, Meta-Mind; Caiming Xiong, Salesforce"41,Poster,pOSE: Pseudo Object Space Error for Initialization-Free Bundle Adjustment,"Je Hyeong Hong, University of Cambridge; Christopher Zach, Toshiba Research"47,Poster,Learning to Segment Every Thing,"Ronghang Hu, UC Berkeley; Piotr Dollar, Facebook AI Research, Menlo Park, USA; Kaiming He, ; Trevor Darrell, UC Berkeley, USA; Ross Girshick,"48,Poster,Density-aware Single Image De-raining using a Multi-stream Dense Network,"He Zhang, Rutgers; Vishal Patel,"49,Poster,Densely Connected Pyramid Dehazing Network,"He Zhang, Rutgers; Vishal Patel,"52,Poster,Embodied Question Answering,"Abhishek Das, Georgia Tech; Samyak Datta, Georgia Tech; Georgia Gkioxari, Facebook; Devi Parikh, Georgia Tech; Dhruv Batra, Georgia Tech; Stefan Lee, Georgia Tech"53,Spotlight,TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-rays,"Xiaosong Wang, NIH; Yifan Peng, NIH NLM; Le Lu, Nvidia Corp; Zhiyong Lu, ; Ronald Summers,"53,Poster,TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-rays,"Xiaosong Wang, NIH; Yifan Peng, NIH NLM; Le Lu, Nvidia Corp; Zhiyong Lu, ; Ronald Summers,"64,Poster,Towards Open-Set Identity Preserving Face Synthesis,"Jianmin Bao, USTC; Dong Chen, Microsoft Research Asia; Fang Wen, ; Houqiang Li, ; Gang Hua, Microsoft Research"67,Poster,Baseline Desensitizing In Translation Averaging,"Bingbing Zhuang, National University of Singapore; Loong Fah Cheong, National University of Singapore; Gim Hee Lee, National University of SIngapore"68,Poster,Learning from the Deep: A Revised Underwater Image Formation Model,"Derya 
Akkaynak, University of Haifa; Tali Treibitz, University of Haifa"76,Oral,Context Encoding for Semantic Segmentation,"Hang Zhang, Rutgers University; Kristin Dana, ; Jianping Shi, SenseTime; Zhongyue Zhang, Amazon; Xiaogang Wang, Chinese University of Hong Kong; Ambrish Tyagi, Amazon; Amit Agrawal, Amazon"76,Poster,Context Encoding for Semantic Segmentation,"Hang Zhang, Rutgers University; Kristin Dana, ; Jianping Shi, SenseTime; Zhongyue Zhang, Amazon; Xiaogang Wang, Chinese University of Hong Kong; Ambrish Tyagi, Amazon; Amit Agrawal, Amazon"77,Poster,Deep Texture Manifold for Ground Terrain Recognition,"Jia Xue, Rutgers; Hang Zhang, Rutgers University; Kristin Dana,"83,Poster,DS*: Tighter Lifting-Free Convex Relaxations for Quadratic Matching Problems,"Florian Bernard, ; Christian Theobalt, MPI Informatics; Michael Moeller, University of Siegen"85,Poster,"Sparse, Smart Contours to Represent and Edit Images","Tali Dekel, Google; Dilip Krishnan, Google; Chuang Gan, Tsinghua University; Ce Liu, Google, Cambridge, USA; William Freeman, Google"92,Poster,Every Smile is Unique: Landmark-guided Diverse Smile Generation,"Wei Wang, University of Trento; Xavier Alameda-Pineda, University of Trento; Dan Xu, ; Elisa Ricci, U. 
Perugia; Nicu Sebe, University of Trento"95,Poster,Generative Non-Rigid Shape Completion with Graph Convolutional Autoencoders,"Or Litany, Tel Aviv University; Alex Bronstein, ; Michael Bronstein, ; Ameesh Makadia, Google Research"97,Poster,Learning a Discriminative Prior for Blind Image Deblurring,"Lerenhan Li, HUST; Jinshan Pan, UC Merced; Wei-Sheng Lai, University of California, Merced; Changxin Gao, HUST; Nong Sang, ; Ming-Hsuan Yang, UC Merced"100,Poster,Attentional ShapeContextNet for Point Cloud Recognition,"Saining Xie, UCSD; Sainan Liu, UCSD; Zeyu Chen, UCSD; Zhuowen Tu, UCSD, USA"102,Poster,Learning Superpixels with Segmentation-Aware Affinity Loss,"Wei-Chih Tu, National Taiwan University; Ming-Yu Liu, NVIDIA; Varun Jampani, NVIDIA Research; Deqing Sun, NVIDIA; Shao-Yi Chien, National Taiwan University; Ming-Hsuan Yang, UC Merced; Jan Kautz, NVIDIA"103,Spotlight,"Real-World Repetition Estimation by Div, Grad and Curl","Tom Runia, University of Amsterdam; Cees Snoek, University of Amsterdam; Arnold Smeulders, University of Amsterdam, Netherlands"103,Poster,"Real-World Repetition Estimation by Div, Grad and Curl","Tom Runia, University of Amsterdam; Cees Snoek, University of Amsterdam; Arnold Smeulders, University of Amsterdam, Netherlands"106,Poster,Recurrent Saliency Transformation Network: Incorporating Multi-Stage Visual Cues for Small Organ Segmentation,"Qihang Yu, Peking University; Lingxi Xie, UCLA; Yan Wang, JHU; Yuyin Zhou, JHU; Elliot Fishman, ; Alan Yuille, JHU"109,Poster,MegaDepth: Learning Single-View Depth Prediction from Internet Photos,"Zhengqi Li, Cornell University; Noah Snavely, Cornell University / Google"110,Spotlight,Learning Intrinsic Image Decomposition from Watching the World,"Zhengqi Li, Cornell University; Noah Snavely, Cornell University / Google"110,Poster,Learning Intrinsic Image Decomposition from Watching the World,"Zhengqi Li, Cornell University; Noah Snavely, Cornell University / Google"112,Poster,Don't Just 
Assume; Look and Answer: Overcoming Priors for Visual Question Answering,"Aishwarya Agrawal, Georgia Institute of Technology; Dhruv Batra, Georgia Tech; Devi Parikh, Georgia Tech; Aniruddha Kembhavi, Allen Institute for Artificial Intelligence"116,Poster,Human-centric Indoor Scene Synthesis Using Stochastic Grammar,"Siyuan Qi, UCLA; Yixin Zhu, UCLA; Siyuan Huang, UCLA; Chenfanfu Jiang, ; Song-Chun Zhu,"120,Poster,Learning by Asking Questions,"Ishan Misra, CMU; Ross Girshick, ; Rob Fergus, New York University; Martial Hebert, Carnegie Mellon University; Abhinav Gupta, ; Laurens van der Maaten, Facebook"121,Poster,Instance Embedding Transfer to Unsupervised Video Object Segmentation,"Siyang Li, USC; Bryan Seybold, Google Research; Alexey Vorobyov, Google Inc.; Alireza Fathi, Stanford University; Qin Huang, University of Southern California; C.-C. Jay Kuo, University of Southern California"122,Poster,Detect-and-Track: Efficient Pose Estimation in Videos,"Rohit Girdhar, CMU; Georgia Gkioxari, Facebook; Lorenzo Torresani, Darthmout College, USA; Manohar Paluri, ; Du Tran, Dartmouth College"124,Poster,Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval,"Chao Li, Xidian University; Cheng Deng, Xidian University; Ning Li, Xidian University; Wei Liu, ; Dacheng Tao, University of Sydney; Xinbo Gao,"125,Poster,Guided Proofreading of Automatic Segmentations for Connectomics,"Daniel Haehn, Harvard University; Verena Kaynig, ; James Tompkin, Brown University; Jeff Lichtman, Harvard University; Hanspeter Pfister, Harvard University"128,Oral,Augmented Skeleton Space Transfer for Depth-based Hand Pose Estimation,"Seungryul Baek, Imperial College London; Kwang In Kim, University of Bath; Tae-Kyun Kim, Imperial College London"128,Poster,Augmented Skeleton Space Transfer for Depth-based Hand Pose Estimation,"Seungryul Baek, Imperial College London; Kwang In Kim, University of Bath; Tae-Kyun Kim, Imperial College London"130,Poster,Context-aware Synthesis for 
Video Frame Interpolation,"Simon Niklaus, Portland State University; Feng Liu, Portland State University"131,Poster,2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning,"Diogo Luvizon, ETIS Lab; David Picard, ETIS /LIP6; Hedi Tabia, ETIS / ENSEA"135,Poster,NAG: Network for Adversary Generation,"Konda Reddy Mopuri, Indian Institute of Science; Utkarsh Ojha, MNNIT Allahabad; Utsav Garg, Nanyang Technological University; Venkatesh Babu Radhakrishnan, Indian Institute of Science"136,Spotlight,LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation,"Tak-Wai Hui, The Chinese University of Hong Kong; Chen-Change Loy, the Chinese University of Hong Kong; Xiaoou Tang, Chinese University of Hong Kong"136,Poster,LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation,"Tak-Wai Hui, The Chinese University of Hong Kong; Chen-Change Loy, the Chinese University of Hong Kong; Xiaoou Tang, Chinese University of Hong Kong"137,Poster,Avatar-Net: Multi-scale Zero-shot Style Transfer by Feature Decoration,"Lu Sheng, The Chinese University of HK; Jing Shao, The Sensetime Group Limited; Ziyi Lin, SenseTime Co. 
Ltd.; Xiaogang Wang, Chinese University of Hong Kong"142,Spotlight,Multi-view Harmonized Bilinear Network for 3D Object Recognition,"Tan Yu, Nanyang Technological Univ; Jingjing Meng, ; Junsong Yuan, Nanyang Technological University"142,Poster,Multi-view Harmonized Bilinear Network for 3D Object Recognition,"Tan Yu, Nanyang Technological Univ; Jingjing Meng, ; Junsong Yuan, Nanyang Technological University"144,Spotlight,Tangent Convolutions for Dense Prediction in 3D,"Maxim Tatarchenko, Freiburg; Jaesik Park, Intel Labs; Qian-Yi Zhou, ABQ Technologies; Vladlen Koltun, Intel Labs"144,Poster,Tangent Convolutions for Dense Prediction in 3D,"Maxim Tatarchenko, Freiburg; Jaesik Park, Intel Labs; Qian-Yi Zhou, ABQ Technologies; Vladlen Koltun, Intel Labs"145,Oral,Semi-parametric Image Synthesis,"Xiaojuan Qi, CUHK; Qifeng Chen, Intel Labs; Jiaya Jia, Chinese University of Hong Kong; Vladlen Koltun, Intel Labs"145,Poster,Semi-parametric Image Synthesis,"Xiaojuan Qi, CUHK; Qifeng Chen, Intel Labs; Jiaya Jia, Chinese University of Hong Kong; Vladlen Koltun, Intel Labs"147,Poster,Interactive Image Segmentation with Latent Diversity,"Zhuwen Li, Intel Labs; Qifeng Chen, Intel Labs; Vladlen Koltun, Intel Labs"155,Spotlight,3D Hand Pose Estimation: From Current Achievements to Future Goals,"Shanxin Yuan, Imperial College London; Guillermo Garcia-Hernando, Imperial College London; Bjorn Stenger, ; Tae-Kyun Kim, Imperial College London; Gyeongsik Moon, Seoul National University; Ju Yong Chang, Kwangwoon University; Kyoung Mu Lee, ; Pavlo Molchanov, NVIDIA Research; Liuhao Ge, NTU; Junsong Yuan, Nanyang Technological University; Xinghao Chen, Tsinghua University; Guijin Wang, Tsinghua University; Fan Yang, Nara institute of science and technology; Kai Akiyama, Nara Institute of Science and Technology; Yang Wu, Nara Institute of Science and Technology; Qingfu Wan, Fudan University; Meysam Madadi, Autonomus University of Barcelona and Computer Vision Center, Barcelona, Spain; 
Sergio Escalera, University of Barcelona; Shile Li, Technical University of Munich; Dongheui Lee, Technical University of Munich; Iason Oikonomidis, FORTH; Antonis Argyros, FORTH"155,Poster,3D Hand Pose Estimation: From Current Achievements to Future Goals,"Shanxin Yuan, Imperial College London; Guillermo Garcia-Hernando, Imperial College London; Bjorn Stenger, ; Tae-Kyun Kim, Imperial College London; Gyeongsik Moon, Seoul National University; Ju Yong Chang, Kwangwoon University; Kyoung Mu Lee, ; Pavlo Molchanov, NVIDIA Research; Liuhao Ge, NTU; Junsong Yuan, Nanyang Technological University; Xinghao Chen, Tsinghua University; Guijin Wang, Tsinghua University; Fan Yang, Nara institute of science and technology; Kai Akiyama, Nara Institute of Science and Technology; Yang Wu, Nara Institute of Science and Technology; Qingfu Wan, Fudan University; Meysam Madadi, Autonomus University of Barcelona and Computer Vision Center, Barcelona, Spain; Sergio Escalera, University of Barcelona; Shile Li, Technical University of Munich; Dongheui Lee, Technical University of Munich; Iason Oikonomidis, FORTH; Antonis Argyros, FORTH"165,Poster,W2F: A Weakly-Supervised to Fully-Supervised Framework for Object Detection,"Yongqiang Zhang, Harbin institute of Technology/KAUST; Yancheng Bai, Kaust/Iscas; Mingli Ding, ; Yongqiang Li, ; Bernard Ghanem,"167,Spotlight,BlockDrop: Dynamic Inference Paths in Residual Networks,"Zuxuan Wu, University of Maryland; Tushar Nagarajan, University of Texas at Austin; Abhishek Kumar, ; Steven Rennie, ; Larry Davis, University of Maryland, USA; Kristen Grauman, ; Rogerio Feris, IBM"167,Poster,BlockDrop: Dynamic Inference Paths in Residual Networks,"Zuxuan Wu, University of Maryland; Tushar Nagarajan, University of Texas at Austin; Abhishek Kumar, ; Steven Rennie, ; Larry Davis, University of Maryland, USA; Kristen Grauman, ; Rogerio Feris, IBM"168,Spotlight,MapNet: Geometry-Aware Learning of Maps for Camera Localization,"Samarth Brahmbhatt, Georgia 
Tech; Jinwei Gu, NVIDIA; Kihwan Kim, NVIDIA Research; James Hays, Georgia Tech; Jan Kautz, NVIDIA"168,Poster,MapNet: Geometry-Aware Learning of Maps for Camera Localization,"Samarth Brahmbhatt, Georgia Tech; Jinwei Gu, NVIDIA; Kihwan Kim, NVIDIA Research; James Hays, Georgia Tech; Jan Kautz, NVIDIA"170,Poster,BPGrad: Towards Global Optimality in Deep Learning via Branch and Pruning,"Ziming Zhang, MERL; Yuanwei Wu, University of Kansas; Guanghui Wang, University of Kansas"178,Poster,Salient Object Detection Driven by Fixation Prediction,"Wenguan Wang, Beijing Institute of Technology; Jianbing Shen, Beijing Institute of Technolog; Xingping Dong, Beijing Institute of Technology; Ali Borji, UCF"179,Poster,3D Object Detection with Latent Support Surfaces,"Zhile Ren, Brown University; Erik Sudderth, UC Irvine"181,Oral,Practical Block-wise Neural Network Architecture Generation,"Zhao Zhong, Institute of Automation,CAS; Junjie Yan, ; Wei Wu, ; Jing Shao, The Sensetime Group Limited; cheng-lin Liu,"181,Poster,Practical Block-wise Neural Network Architecture Generation,"Zhao Zhong, Institute of Automation,CAS; Junjie Yan, ; Wei Wu, ; Jing Shao, The Sensetime Group Limited; cheng-lin Liu,"182,Poster,Glimpse Clouds: Human Activity Recognition from Unstructured Feature Points,"Fabien Baradel, LIRIS, INSA-Lyon; Christian Wolf, INRIA, INSA-Lyon, CITI, LIRIS; Julien Mille, INSA Val de Loire; Graham Taylor, University of Guelph"185,Oral,Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning,"Qi Wu, University of Adelaide; Peng Wang, ; Chunhua Shen, University of Adelaide; Ian Reid, ; Anton Van den Hengel, University of Adelaide"185,Poster,Are You Talking to Me? 
Reasoned Visual Dialog Generation through Adversarial Learning,"Qi Wu, University of Adelaide; Peng Wang, ; Chunhua Shen, University of Adelaide; Ian Reid, ; Anton Van den Hengel, University of Adelaide"186,Poster,Visual Grounding via Accumulated Attention,"chaorui Deng, ; Qi Wu, University of Adelaide; Fuyuan Hu, ; Fan Lyu, Suzhou University of Science and Technology; Mingkui Tan, South China University of Technology; Qingyao Wu, School of Software Engineering, South China University of Technology"191,Poster,Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors,"Xuanyi Dong, UTS; Shoou-I Yu, Oculus; Xinshuo Weng, Carnegie Mellon University; Shih-En Wei, Oculus Research; Yi Yang, ; Yaser Sheikh,"195,Poster,ISTA-Net: Interpretable Optimization-Inspired Deep Network for Image Compressive Sensing,"Jian Zhang, KAUST; Bernard Ghanem,"200,Poster,Perturbative Neural Networks: Rethinking Convolution in CNNs,"Felix Juefei-Xu, Carnegie Mellon University; Vishnu Naresh Boddeti, Michigan State University; Marios Savvides, Carnegie Mellon University"203,Spotlight,Nonlinear 3D Face Morphable Model,"LUAN TRAN, Michigan State University; Xiaoming Liu, Michigan State University"203,Poster,Nonlinear 3D Face Morphable Model,"LUAN TRAN, Michigan State University; Xiaoming Liu, Michigan State University"205,Spotlight,Neural Baby Talk,"Jiasen Lu, Georgia Institute of Technology; Jianwei Yang, Georgia Tech; Dhruv Batra, Georgia Tech; Devi Parikh, Georgia Tech"205,Poster,Neural Baby Talk,"Jiasen Lu, Georgia Institute of Technology; Jianwei Yang, Georgia Tech; Dhruv Batra, Georgia Tech; Devi Parikh, Georgia Tech"216,Poster,Towards Pose Invariant Face Recognition in the Wild,"Jian Zhao, NUS; Yu Cheng, Nanyang Technological University; Yan Xu, Core Technology Group, Learning & Vision, Panasonic R&D Center Singapore; Lin Xiong, Core Technology Group, Learning & Vision, Panasonic R&D Center Singapore; Jianshu Li, National 
University of Singapo; Fang Zhao, National University of Singapore; Karlekar Jayashree, Core Technology Group, Learning & Vision, Panasonic R&D Center Singapore; Sugiri Pranata, Core Technology Group, Learning & Vision, Panasonic R&D Center Singapore; Shengmei Shen, Core Technology Group, Learning & Vision, Panasonic R&D Center Singapore; Junliang Xing, Institute of Automation, Chinese Academy of Sciences; Shuicheng Yan, National University of Singapore; Jiashi Feng,"224,Poster,MoNet: Deep Motion Exploitation for Video Object Segmentation,"Huaxin Xiao, Nudt; Jiashi Feng, ; Guosheng Lin, Nanyang Technological Universi; Yu Liu, NUDT; Maojun Zhang,"229,Poster,Exploring Disentangled Feature Representation Beyond Face Identification,"Yu Liu, CUHK; Fangyin Wei, Peking University; Jing Shao, The Sensetime Group Limited; Lu Sheng, The Chinese University of HK; Junjie Yan, ; Xiaogang Wang, Chinese University of Hong Kong"232,Poster,Towards Effective Low-bitwidth Convolutional Neural Networks,"Bohan Zhuang, The University of Adelaide; Chunhua Shen, University of Adelaide; Mingkui Tan, South China University of Technology; Lingqiao Liu, University of Adelaide; Ian Reid,"234,Poster,Parallel Attention: A Unified Framework for Visual Object Discovery through Dialogs and Queries,"Bohan Zhuang, The University of Adelaide; Qi Wu, University of Adelaide; Chunhua Shen, University of Adelaide; Ian Reid, ; Anton Van den Hengel, University of Adelaide"237,Poster,Learning Facial Action Units from Web Images with Scalable Weakly Supervised Clustering,"Kaili Zhao, Beijing University of Post & T; Wen-Sheng Chu, Carnegie Mellon University; Aleix Martinez, The ohio state university"242,Spotlight,Few-Shot Image Recognition by Predicting Parameters from Activations,"Siyuan Qiao, Johns Hopkins University; Chenxi Liu, JHU; Wei Shen, Shanghai University; Alan Yuille,"242,Poster,Few-Shot Image Recognition by Predicting Parameters from Activations,"Siyuan Qiao, Johns Hopkins University; 
Chenxi Liu, JHU; Wei Shen, Shanghai University; Alan Yuille,"246,Poster,Single-Shot Object Detection with Enriched Semantics,"Zhishuai Zhang, Johns Hopkins University; Siyuan Qiao, Johns Hopkins University; Cihang Xie, JHU; Wei Shen, Shanghai University; Bo Wang, HikVision USA Inc.; Alan Yuille, JHU"250,Poster,Unifying Identification and Context Learning for Person Recognition,"Qingqiu Huang, CUHK; Yu Xiong, CUHK; Dahua Lin, CUHK"252,Poster,Separating Self-Expression and Visual Content in Hashtag Supervision,"Andreas Veit, Cornel Tech ; Maximillian Nickel, ; Serge Belongie, ; Laurens van der Maaten, Facebook"255,Poster,Multi-Cue Correlation Filters for Robust Visual Tracking,"Ning Wang, USTC; Wengang Zhou, USTC; Qi Tian, ; Richang Hong, ; Meng Wang, HeFei University of Technology; Houqiang Li,"260,Poster,Beyond Trade-off: Accelerate FCN-based Face Detection with Higher Accuracy,"Guanglu Song, Beihang University; Yu Liu, CUHK; Ming Jiang, BUAA; Yujie Wang, Beihang university"261,Poster,On the Robustness of Semantic Segmentation Models to Adversarial Attacks,"Anurag Arnab, University of Oxford; Ondrej Miksik, University of Oxford; Phil Torr, Oxford"266,Oral,"PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume","Deqing Sun, NVIDIA; Xiaodong Yang, NVIDIA; Ming-Yu Liu, NVIDIA; Jan Kautz, NVIDIA"266,Poster,"PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume","Deqing Sun, NVIDIA; Xiaodong Yang, NVIDIA; Ming-Yu Liu, NVIDIA; Jan Kautz, NVIDIA"270,Oral,Illuminant Spectra-based Source Separation Using Flash Photography,"Zhuo Hui, Carnegie Mellon University; Kalyan Sunkavalli, Adobe Systems Inc.; Sunil Hadap, ; Aswin Sankaranarayanan, Carnegie Mellon University"270,Poster,Illuminant Spectra-based Source Separation Using Flash Photography,"Zhuo Hui, Carnegie Mellon University; Kalyan Sunkavalli, Adobe Systems Inc.; Sunil Hadap, ; Aswin Sankaranarayanan, Carnegie Mellon University"281,Spotlight,Tracking Multiple Objects Outside 
the Line of Sight using Speckle Imaging,"Brandon Smith, University of Wisconsin-Madiso; Matthew O'Toole, Stanford University; Mohit Gupta, Wisconsin"281,Poster,Tracking Multiple Objects Outside the Line of Sight using Speckle Imaging,"Brandon Smith, University of Wisconsin-Madiso; Matthew O'Toole, Stanford University; Mohit Gupta, Wisconsin"285,Poster,Improved Human Pose Estimation through Adversarial Data Augmentation,"Zhiqiang Tang, Rutgers; Xi Peng, ; Fei Yang, facebook; Rogerio Feris, IBM; Dimitris Metaxas, Rutgers"289,Poster,Generative Adversarial Learning Towards Fast Weakly Supervised Detection,"Yunhang Shen, Xiamen University; Rongrong Ji, ; Shengchuan Zhang, ; Wangmeng Zuo, Harbin Institute of Technology; Yan Wang, Microsoft"298,Spotlight,Audio to Body Dynamics,"Eli Shlizerman, Facebook; Lucio Dery, Stanford; Hayden Schoen, Facebook; Ira Kemelmacher,"298,Poster,Audio to Body Dynamics,"Eli Shlizerman, Facebook; Lucio Dery, Stanford; Hayden Schoen, Facebook; Ira Kemelmacher,"299,Poster,The Unreasonable Effectiveness of Deep Features as a Perceptual Metric,"Richard Zhang, UC Berkeley; Phillip Isola, UC Berkeley; Alexei Efros, UC Berkeley; Eli Shechtman, Adobe Research; Oliver Wang, Adobe"303,Poster,Frame-Recurrent Video Super-Resolution,"Mehdi S. M. 
Sajjadi, Max Planck Institute for Intel; Raviteja Vemulapalli, Google; Matthew Brown,"304,Poster,Deep Mutual Learning,"Ying Zhang, QMUL; Tao Xiang, Queen Mary University of London; Timothy Hospedales, University of Edinburgh; Huchuan Lu, Dalian University of Technology"308,Poster,Real-world Anomaly Detection in Surveillance Videos,"Waqas Sultani, ; Chen Chen, University of Central Florida; Mubarak Shah, UCF"310,Poster,Soccer on Your Tabletop,"Konstantinos Rematas, University of Washington; Ira Kemelmacher, ; Brian Curless, Washington; Steve Seitz, Washington/Google"312,Poster,Diversity Regularized Spatiotemporal Attention for Video-based Person Re-identification,"Shuang Li, The Chinese University of HK; Slawomir Bak, Disney Research; Peter Carr, Disney Research"313,Poster,HashGAN: Deep Learning to Hash with Pair Conditional Wasserstein GAN,"Yue Cao, Tsinghua University; Mingsheng Long, Tsinghua University; Bin Liu, Tsinghua University; Jianmin Wang,"316,Poster,Excitation Backprop for RNNs,"Sarah Bargal, Boston University; Andrea Zunino, Istituto Italiano di Tecnologia; Donghyun Kim, Boston University; Jianming Zhang, Adobe Research; Vittorio Murino, Istituto Italiano di Tecnologia; Stan Sclaroff, Boston University"319,Poster,Dynamic-Structured Semantic Propagation Network,"Xiaodan Liang, Carnegie Mellon University; Hongfei Zhou, ; Eric Xing, Carnegie Mellon University"325,Spotlight,Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation,"Huaizu Jiang, UMass Amherst; Deqing Sun, NVIDIA; Varun Jampani, NVIDIA Research; Ming-Hsuan Yang, UC Merced; Erik Miller, ; Jan Kautz, NVIDIA"325,Poster,Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation,"Huaizu Jiang, UMass Amherst; Deqing Sun, NVIDIA; Varun Jampani, NVIDIA Research; Ming-Hsuan Yang, UC Merced; Erik Miller, ; Jan Kautz, NVIDIA"326,Oral,SPLATNet: Sparse Lattice Networks for Point Cloud Processing,"Hang Su, University of 
Massachusetts, Amherst; Varun Jampani, NVIDIA Research; Deqing Sun, NVIDIA; Evangelos Kalogerakis, UMass; Subhransu Maji, ; Ming-Hsuan Yang, UC Merced; Jan Kautz, NVIDIA"326,Poster,SPLATNet: Sparse Lattice Networks for Point Cloud Processing,"Hang Su, University of Massachusetts, Amherst; Varun Jampani, NVIDIA Research; Deqing Sun, NVIDIA; Evangelos Kalogerakis, UMass; Subhransu Maji, ; Ming-Hsuan Yang, UC Merced; Jan Kautz, NVIDIA"329,Poster,Video Representation Learning Using Discriminative Pooling,"Jue Wang, ANU; Anoop Cherian, ; Fatih Porikli, NICTA, Australia; Stephen Gould, Australian National University"330,Poster,Attend and Interact: Higher-Order Object Interactions for Video Understanding,"CHIH-YAO MA, GEORGIA TECH; Asim Kadav, NEC Labs; Iain Melvin, ; Zsolt Kira, ; Ghassan AlRegib, ; Hans Peter Graf,"342,Poster,Human Pose Estimation with Parsing Induced Learner,"Xuecheng Nie, National University of Singapo; Jiashi Feng, ; Yiming Zuo, Tsinghua University; Shuicheng Yan,"345,Poster,4D Human Body Correspondences from Panoramic Depth Maps,"Zhong Li, University of Delaware; Minye Wu, ShanghaiTech; Wangyiteng Zhou, ShanghaiTech University; Jingyi Yu, University of Delaware, USA"346,Poster,Recognizing Human Actions as Evolution of Pose Estimation Maps,"Mengyuan Liu, Nanyang Technological University; Junsong Yuan,"348,Poster,GraphBit: Bitwise Interaction Mining via Deep Reinforcement Learning,"Yueqi Duan, Tsinghua University; Ziwei Wang, Tsinghua University; Jiwen Lu, Tsinghua University; Xudong Lin, Tsinghua University; Jie Zhou,"350,Spotlight,Deep Adversarial Metric Learning,"Yueqi Duan, Tsinghua University; Wenzhao Zheng, Tsinghua University; Xudong Lin, Tsinghua University; Jiwen Lu, Tsinghua University; Jie Zhou,"350,Poster,Deep Adversarial Metric Learning,"Yueqi Duan, Tsinghua University; Wenzhao Zheng, Tsinghua University; Xudong Lin, Tsinghua University; Jiwen Lu, Tsinghua University; Jie Zhou,"353,Poster,Revisiting Video Saliency: A Large-scale 
Benchmark and a New Model,"Wenguan Wang, Beijing Institute of Technology; Jianbing Shen, Beijing Institute of Technolog; Fang Guo, Beijing Institute of Technology; Ming-Ming Cheng, Nankai University; Ali Borji, UCF"362,Poster,Graph-Cut RANSAC,"Daniel Barath, MTA SZTAKI; Jiri Matas,"363,Poster,Five-point Fundamental Matrix Estimation for Uncalibrated Cameras,"Daniel Barath, MTA SZTAKI"367,Poster,Hashing as Tie-Aware Learning to Rank,"Kun He, Boston University; Fatih Cakir, Boston University; Sarah Bargal, Boston University; Stan Sclaroff, Boston University"368,Poster,Optimizing Local Feature Descriptors for Nearest Neighbor Matching,"Kun He, Boston University; Yan Lu, ; Stan Sclaroff, Boston University"369,Oral,"Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies","Hanbyul Joo, CMU; Tomas Simon, Oculus Research; Yaser Sheikh,"369,Poster,"Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies","Hanbyul Joo, CMU; Tomas Simon, Oculus Research; Yaser Sheikh,"374,Spotlight,Consensus Maximization for Semantic Region Correspondences,"Pablo Speciale, ETH; Danda Paudel, ; Martin Oswald, ETH Zurich; Hayko Riemenschneider, Computer Vision Lab, ETH Zurich; Luc Van Gool, KTH; Marc Pollefeys, ETH"374,Poster,Consensus Maximization for Semantic Region Correspondences,"Pablo Speciale, ETH; Danda Paudel, ; Martin Oswald, ETH Zurich; Hayko Riemenschneider, Computer Vision Lab, ETH Zurich; Luc Van Gool, KTH; Marc Pollefeys, ETH"380,Poster,ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing,"Chen-Hsuan Lin, CMU; Ersin Yumer, Argo AI; Oliver Wang, Adobe; Eli Shechtman, Adobe Research; Simon Lucey,"391,Poster,Motion-Guided Cascaded Refinement Network for Video Object Segmentation,"Ping Hu, ; Gang Wang, ; Xiangfei Kong, Nanyang Technological University; Jason Kuen, NTU, Singapore; Yap-Peng Tan,"397,Poster,Zigzag Learning for Weakly Supervised Object Detection,"Xiaopeng Zhang, National University of 
Singapore; Jiashi Feng, ; Hongkai Xiong, Shanghai Jiao Tong University; Qi Tian,"405,Spotlight,"Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models","Jiuxiang Gu, Nanyang Technological Universi; Jianfei Cai, ; Joty Shafiq Rayhan, ; Li Niu, Rice University; Gang Wang,"405,Poster,"Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models","Jiuxiang Gu, Nanyang Technological Universi; Jianfei Cai, ; Joty Shafiq Rayhan, ; Li Niu, Rice University; Gang Wang,"406,Spotlight,VITON: An Image-based Virtual Try-on Network,"Xintong Han, University of Maryland; Zuxuan Wu, University of Maryland; Zhe Wu, University of Maryland; Ruichi Yu, ; Larry Davis, University of Maryland, USA"406,Poster,VITON: An Image-based Virtual Try-on Network,"Xintong Han, University of Maryland; Zuxuan Wu, University of Maryland; Zhe Wu, University of Maryland; Ruichi Yu, ; Larry Davis, University of Maryland, USA"408,Poster,Cross-Domain Self-supervised Multi-task Feature Learning Using Synthetic Game Imagery,"Zhongzheng Ren, UC Davis; Yong Jae Lee, UC Davis"409,Poster,LayoutNet: Reconstructing the 3D Room Layout from a Single RGB Image,"Chuhang Zou, UIUC; Alex Colburn, Zillow Group Inc.; Qi Shan, Zillow Group; Derek Hoiem,"418,Poster,Thoracic Disease Identification and Localization with Limited Supervision,"Zhe Li, Syracuse University; Chong Wang, Google Inc; Mei Han, Google Inc; Yuan Xue, Google; Wei Wei, Google Inc.; Li-jia Li, Google Inc; Fei-Fei Li, Google Inc."
419,Poster,Stochastic Downsampling for Cost-Adjustable Inference and Improved Regularization in Convolutional Networks,"Jason Kuen, NTU, Singapore; Xiangfei Kong, Nanyang Technological University; Zhe Lin, Adobe Systems, Inc.; Gang Wang, ; Jianxiong Yin, NVIDIA; Simon See, NVIDIA; Yap-Peng Tan,"420,Poster,Learning Pixel-level Semantic Affinity with Image-level Supervision for Weakly Supervised Semantic Segmentation,"Jiwoon Ahn, DGIST; Suha Kwak, POSTECH"421,Poster,Deep End-to-End Time-of-Flight Imaging,"Shuochen Su, University of British Columbia; Felix Heide, Stanford University; Gordon Wetzstein, ; Wolfgang Heidrich,"423,Spotlight,Fast and Accurate Online Video Object Segmentation via Tracking Parts,"Jingchun Cheng, Tsinghua University; Yi-Hsuan Tsai, NEC Labs America; Wei-Chih Hung, University of California, Merced; Shengjin Wang, ; Ming-Hsuan Yang, UC Merced"423,Poster,Fast and Accurate Online Video Object Segmentation via Tracking Parts,"Jingchun Cheng, Tsinghua University; Yi-Hsuan Tsai, NEC Labs America; Wei-Chih Hung, University of California, Merced; Shengjin Wang, ; Ming-Hsuan Yang, UC Merced"425,Poster,Min-Entropy Latent Model for Weakly Supervised Object Detection,"Fang Wan, UCAS; Pengxu Wei, ; Jianbin Jiao, ; Zhenjun Han, ; Qixiang Ye,"429,Poster,Future Frame Prediction for Anomaly Detection - A New Baseline,"Wen Liu, ShanghaiTech University; Weixin Luo, ShanghaiTech University; Dongze Lian, ShanghaiTech University; Shenghua Gao, ShanghaiTech University"430,Poster,Face Aging with Identity-Preserved Conditional Generative Adversarial Networks,"Zongwei WANG, ; Xu Tang, ; Weixin Luo, ShanghaiTech University; Shenghua Gao, ShanghaiTech University"431,Poster,Learning to Compare: Relation Network for Few-Shot Learning,"Flood Sung, Independent Researcher; Yongxin Yang, Queen Mary University of London; Li Zhang, Queen Mary University of London; Tao Xiang, Queen Mary University of London; Phil Torr, Oxford; Timothy Hospedales, University of Edinburgh"
435,Oral,Deep Layer Aggregation,"Fisher Yu, UC Berkeley; Dequan Wang, UC Berkeley; Evan Shelhamer, UC Berkeley; Trevor Darrell, UC Berkeley, USA"435,Poster,Deep Layer Aggregation,"Fisher Yu, UC Berkeley; Dequan Wang, UC Berkeley; Evan Shelhamer, UC Berkeley; Trevor Darrell, UC Berkeley, USA"436,Poster,Style Aggregated Network for Facial Landmark Detection,"Xuanyi Dong, UTS; Yan Yan, UTS; Wanli Ouyang, The University of Sydney; Yi Yang,"442,Spotlight,M3: Multimodal Memory Modelling for Video Captioning,"Junbo Wang, Institute of Automation, Chine; Wei Wang, ; Yan Huang, ; Liang Wang, unknown; Tieniu Tan, NLPR China"442,Poster,M3: Multimodal Memory Modelling for Video Captioning,"Junbo Wang, Institute of Automation, Chine; Wei Wang, ; Yan Huang, ; Liang Wang, unknown; Tieniu Tan, NLPR China"449,Poster,Classification Driven Dynamic Image Enhancement,"Vivek Sharma, Karlsruhe Institute of Technology; Ali Diba, ; Davy Neven, KU Leuven; Michael Brown, York University; Luc Van Gool, KTH; Rainer Stiefelhagen, Karlsruhe Institute of Technology"456,Poster,Generative Image Inpainting with Contextual Attention,"Jiahui Yu, UIUC; Zhe Lin, Adobe Systems, Inc.; Jimei Yang, ; Xiaohui Shen, Adobe Research; Xin Lu, ; Thomas Huang,"458,Spotlight,Iterative Visual Reasoning Beyond Convolutions,"Xinlei Chen, Facebook; Li-jia Li, Google Inc; Fei-Fei Li, Google Inc.; Abhinav Gupta,"458,Poster,Iterative Visual Reasoning Beyond Convolutions,"Xinlei Chen, Facebook; Li-jia Li, Google Inc; Fei-Fei Li, Google Inc.; Abhinav Gupta,"460,Poster,Dual Attention Matching Network for Context-Aware Feature Sequence based Person Re-Identification,"Jianlou Si, BUPT; Honggang Zhang, ; Chun-Guang Li, Beijing Univ. 
of Posts&Telecom; Jason Kuen, NTU, Singapore; Xiangfei Kong, Nanyang Technological University; Alex Kot, ; Gang Wang,"465,Spotlight,Textbook Question Answering under Teacher Guidance with Memory Networks,"Juzheng Li, Tsinghua University; Hang Su, Tsinghua University; Jun Zhu, Tsinghua University; Siyu Wang, ; Bo Zhang,"465,Poster,Textbook Question Answering under Teacher Guidance with Memory Networks,"Juzheng Li, Tsinghua University; Hang Su, Tsinghua University; Jun Zhu, Tsinghua University; Siyu Wang, ; Bo Zhang,"468,Poster,Multi-Level Factorisation Net for Person Re-Identification,"Xiaobin Chang, Queen Mary Univ. of London; Timothy Hospedales, University of Edinburgh; Tao Xiang, Queen Mary University of London"471,Spotlight,Functional Map of the World,"Gordon Christie, JHU/APL; Neil Fendley, JHU/APL; James Wilson, DigitalGlobe; Ryan Mukherjee, JHU/APL"471,Poster,Functional Map of the World,"Gordon Christie, JHU/APL; Neil Fendley, JHU/APL; James Wilson, DigitalGlobe; Ryan Mukherjee, JHU/APL"473,Poster,A Two-Step Disentanglement Method,"Naama Hadad, Tel Aviv University; Lior Wolf, Tel Aviv University, Israel; Moni Shahar, Tel Aviv University"475,Poster,Towards Faster Training of Global Covariance Pooling Networks by Iterative Matrix Square Root Normalization,"Peihua Li, ; Jiangtao Xie, ; Qilong Wang, ; Zilin Gao, Dalian University of Technology"482,Poster,Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?,"Kensho Hara, AIST; Hirokatsu Kataoka, AIST; Yutaka Satoh, AIST"483,Oral,Left-Right Comparative Recurrent Model for Stereo Matching,"Zequn Jie, ; Pengfei Wang, NUS; Yonggen Ling, Tencent; Bo Zhao, ; Jiashi Feng, ; Wei Liu,"483,Poster,Left-Right Comparative Recurrent Model for Stereo Matching,"Zequn Jie, ; Pengfei Wang, NUS; Yonggen Ling, Tencent; Bo Zhao, ; Jiashi Feng, ; Wei Liu,"487,Oral,Analytic Expressions for Probabilistic Moments of PL-DNN with Gaussian Input,"Adel Bibi, KAUST; Modar Alfadly, King Abdullah University of 
Science and Technology; Bernard Ghanem,"487,Poster,Analytic Expressions for Probabilistic Moments of PL-DNN with Gaussian Input,"Adel Bibi, KAUST; Modar Alfadly, King Abdullah University of Science and Technology; Bernard Ghanem,"488,Spotlight,Zero-Shot Sketch-Image Hashing,"Yuming Shen, University of East Anglia; Li Liu, University of East Anglia; Fumin Shen, ; Ling Shao, University of East Anglia"488,Poster,Zero-Shot Sketch-Image Hashing,"Yuming Shen, University of East Anglia; Li Liu, University of East Anglia; Fumin Shen, ; Ling Shao, University of East Anglia"490,Spotlight,Interpretable Convolutional Neural Networks,"Quanshi Zhang, UCLA; Yingnian Wu, ; Song-Chun Zhu,"490,Poster,Interpretable Convolutional Neural Networks,"Quanshi Zhang, UCLA; Yingnian Wu, ; Song-Chun Zhu,"491,Poster,Reconstructing Thin Structures of Manifold Surfaces by Integrating Spatial Curves,"Shiwei Li, HKUST; Yao Yao, HKUST; Tian Fang, HKUST; Long Quan, The Hong Kong University of Science and Technology, Hong Kong"493,Poster,Enhancing the Spatial Resolution of Stereo Images using a Parallax Prior,"Daniel S. Jeon, KAIST; Seung-Hwan Baek, KAIST; Inchang Choi, ; Min H. 
Kim, KAIST"494,Poster,Anticipating Traffic Accidents with Adaptive Loss and Large-scale Incident DB,"Tomoyuki Suzuki, Keio University; Hirokatsu Kataoka, AIST; Yoshimitsu Aoki, Keio University; Yutaka Satoh, AIST"500,Spotlight,Generating Synthetic X-ray Images of a Person from the Surface Geometry,"Brian Teixeira, Siemens Healthineers; Vivek Singh, Siemens Healthineers; Kai Ma, Siemens Healthineers; Birgi Tamersoy, Siemens Healthineers; Terrence Chen, Siemens Healthineers; Yifan Wu, Temple University; Elena Balashova, Princeton University; Dorin Comaniciu, Siemens Healthineers"500,Poster,Generating Synthetic X-ray Images of a Person from the Surface Geometry,"Brian Teixeira, Siemens Healthineers; Vivek Singh, Siemens Healthineers; Kai Ma, Siemens Healthineers; Birgi Tamersoy, Siemens Healthineers; Terrence Chen, Siemens Healthineers; Yifan Wu, Temple University; Elena Balashova, Princeton University; Dorin Comaniciu, Siemens Healthineers"505,Poster,Attentive Fashion Grammar Network for Fashion Landmark Detection and Clothing Category Classification,"Wenguan Wang, Beijing Institute of Technology; Yuanlu Xu, University of California, Los Angeles; Jianbing Shen, Beijing Institute of Technolog; Song-Chun Zhu,"506,Poster,Unsupervised CCA,"Yedid Hoshen, Facebook AI Research (FAIR); Lior Wolf, Tel Aviv University, Israel"510,Poster,Discovering Point Lights with Intensity Distance Fields,"Edward Zhang, University of Washington; Michael Cohen, ; Brian Curless, Washington"512,Poster,Universal Denoising Networks : A Novel CNN-based Network Architecture for Image Denoising,"Stamatios Lefkimmiatis, Skolkovo Institute of Science"517,Poster,Easy Identification from Better Constraints: Multi-Shot Person Re-Identification from Reference Constraints,"Jiahuan Zhou, Northwestern University; Bing Su, Chinese Academy of Sciences; Ying Wu, Northwestern University, USA"533,Spotlight,Recurrent Pixel Embedding for Instance Grouping,"Shu Kong, University of California, Irvine; 
Charless Fowlkes, University of California, Irvine, USA"533,Poster,Recurrent Pixel Embedding for Instance Grouping,"Shu Kong, University of California, Irvine; Charless Fowlkes, University of California, Irvine, USA"534,Poster,Recurrent Scene Parsing with Perspective Understanding in the Loop,"Shu Kong, University of California, Irvine; Charless Fowlkes, University of California, Irvine, USA"540,Poster,Learning to Hash by Discrepancy Minimization,"Zhixiang Chen, Tsinghua University; Xin Yuan, Tsinghua University; Jiwen Lu, Tsinghua University; Jie Zhou,"542,Poster,Fast End-to-End Trainable Guided Filter,"Huikai Wu, CASIA; Shuai Zheng, EBay; Junge Zhang, ; Kaiqi Huang, National Laboratory of Pattern Recognition"550,Poster,Disentangling Structure and Aesthetics for Content-aware Image Completion,"Andrew Gilbert, University of Surrey; John Collomosse, University of Surrey, UK.; Hailin Jin, ; Brian Price,"552,Oral,An Analysis of Scale Invariance in Object Detection - SNIP,"Bharat Singh, ; Larry Davis, University of Maryland, USA"552,Poster,An Analysis of Scale Invariance in Object Detection - SNIP,"Bharat Singh, ; Larry Davis, University of Maryland, USA"561,Poster,CSGNet: Neural Shape Parser for Constructive Solid Geometry,"Gopal Sharma, University of Massachusetts; Subhransu Maji, ; Rishabh Goyal, Indian Institute of Technology, Kanpu; Difan Liu, UMass Amherst; Evangelos Kalogerakis, UMass"565,Oral,Finding Tiny Faces in the Wild with Generative Adversarial Network,"Yancheng Bai, Kaust/Iscas; Yongqiang Zhang, Harbin institute of Technology/KAUST; Mingli Ding, ; Bernard Ghanem,"565,Poster,Finding Tiny Faces in the Wild with Generative Adversarial Network,"Yancheng Bai, Kaust/Iscas; Yongqiang Zhang, Harbin institute of Technology/KAUST; Mingli Ding, ; Bernard Ghanem,"567,Spotlight,SSNet: Scale Selection Network for Online 3D Action Prediction,"Jun Liu, Nanyang Technological University; Amir Shahroudy, NTU Singapore; Gang Wang, ; Ling-Yu Duan, ; Alex Kot,"
567,Poster,SSNet: Scale Selection Network for Online 3D Action Prediction,"Jun Liu, Nanyang Technological University; Amir Shahroudy, NTU Singapore; Gang Wang, ; Ling-Yu Duan, ; Alex Kot,"568,Spotlight,Integrated facial landmark localization and super-resolution of real-world very low resolution faces in arbitrary poses with GANs,"Adrian Bulat, ; Georgios Tzimiropoulos,"568,Poster,Integrated facial landmark localization and super-resolution of real-world very low resolution faces in arbitrary poses with GANs,"Adrian Bulat, ; Georgios Tzimiropoulos,"569,Poster,The Best of Both Worlds: Combining CNNs and Geometric Constraints for Hierarchical Motion Segmentation,"Pia Bideau, University of Massachusetts; Aruni RoyChowdhury, University of Massachusetts; Rakesh Radhakrishnan Menon, University of Massachusetts; Erik Miller,"573,Poster,In-Place Activated BatchNorm for Memory-Optimized Training of DNNs,"Samuel Rota Bulo', Mapillary Research; Lorenzo Porzi, Mapillary Research; Peter Kontschieder,"574,Poster,Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks,"Zhenhua Feng, University of Surrey; Muhammad Awais, University of Surrey; Josef Kittler, ; Patrik Huber, University of Surrey; Xiaojun Wu, Jiangnan University"581,Spotlight,Deep Cross-media Knowledge Transfer,"Xin Huang, Peking University; Yuxin Peng, Peking University"581,Poster,Deep Cross-media Knowledge Transfer,"Xin Huang, Peking University; Yuxin Peng, Peking University"588,Poster,Coupled End-to-end Transfer Learning with Generalized Fisher Information,"Shixing Chen, Wayne State University; Caojin Zhang, Wayne State University; Ming Dong,"589,Poster,Knowledge Aided Consistency for Weakly Supervised Phrase Grounding,"Kan Chen, Univ. 
of Southern California; Jiyang Gao, ; Ram Nevatia,"593,Poster,Viewpoint-aware Attentive Multi-view Inference for Vehicle Re-identification,"Yi Zhou, University of East Anglia; Ling Shao, University of East Anglia"594,Poster,MatNet: Modular Attention Network for Referring Expression Comprehension,"Licheng Yu, UNC Chapel Hill; Zhe Lin, Adobe Systems, Inc.; Xiaohui Shen, Adobe Research; Jimei Yang, ; Xin Lu, ; Mohit Bansal, UNC Chapel Hill; Tamara Berg, University of North Carolina"598,Poster,CBMV: A Coalesced Bidirectional Matching Volume for Disparity Estimation,"Konstantinos Batsos, Stevens Institute of Technolog; Changjiang Cai, ; Philippos Mordohai, Stevens Institute of Technology"601,Spotlight,NISP: Pruning Networks using Neuron Importance Score Propagation,"Ruichi Yu, ; Ang Li, Google DeepMind; Chun-Fu (Richard) Chen, IBM T.J. Watson Research Cente; Jui-Hsin Lai, ; Vlad Morariu, University of Maryland; Xintong Han, University of Maryland; Mingfei Gao, University of Maryland; Ching-Yung Lin, ; Larry Davis, University of Maryland, USA"601,Poster,NISP: Pruning Networks using Neuron Importance Score Propagation,"Ruichi Yu, ; Ang Li, Google DeepMind; Chun-Fu (Richard) Chen, IBM T.J. Watson Research Cente; Jui-Hsin Lai, ; Vlad Morariu, University of Maryland; Xintong Han, University of Maryland; Mingfei Gao, University of Maryland; Ching-Yung Lin, ; Larry Davis, University of Maryland, USA"603,Poster,Who Let The Dogs Out? 
Modeling Dog Behavior From Visual Data,"KIANA EHSANI, 1993; Hessam Bagherinezhad, University of Washington; Joe Redmon, University of Washington; Roozbeh Mottaghi, Allen Institute for Artificial Intelligence; Ali Farhadi,"609,Poster,Efficient Video Object Segmentation via Network Modulation,"Linjie Yang, Snap Research; YANRAN WANG, NORTHWESTERN; Xuehan Xiong, Snapchat; Jianchao Yang, Snap; Aggelos Katsaggelos, Northwestern University"615,Poster,Learning Deep Models for Face Anti-Spoofing: Binary or Auxiliary Supervision,"Yaojie Liu, Michigan State University; Amin Jourabloo, ; Xiaoming Liu, Michigan State University"618,Poster,Feedback-prop: Convolutional Neural Network Inference under Partial Evidence,"Tianlu Wang, 1994; Kota Yamaguchi, CyberAgent, Inc.; Vicente Ordonez, University of Virginia"619,Poster,A Memory Network Approach for Story-based Temporal Summarization of 360 Videos,"Sangho Lee, Seoul National University; Jinyoung Sung, Seoul National University; Youngjae Yu, ; Gunhee Kim, Carnegie Mellon University"620,Poster,Improving Occlusion and Hard Negative Handling for Single-Stage Object Detectors,"Junhyug Noh, Seoul National University; Soochan Lee, ; Beomsu Kim, ; Gunhee Kim, Carnegie Mellon University"623,Poster,UV-GAN: Adversarial Facial UV Map Completion for Pose-invariant Face Recognition,"Jiankang Deng, Imperial College London; Shiyang Cheng, Imperial College London; Niannan Xue, Imperial College London; Yuxiang Zhou, Imperial College; Stefanos Zafeiriou, Imperial College London"630,Spotlight,Learning a Toolchain for Image Restoration,"Ke Yu, CUHK; Chao Dong, Sensetime Co. Ltd ; Chen-Change Loy, the Chinese University of Hong Kong"630,Poster,Learning a Toolchain for Image Restoration,"Ke Yu, CUHK; Chao Dong, Sensetime Co. 
Ltd ; Chen-Change Loy, the Chinese University of Hong Kong"631,Poster,Learning to Act Properly: Predicting and Explaining Affordances from Images,"Ching-Yao Chuang, University of Toronto; Jiaman Li, University of Toronto; Antonio Torralba, MIT; Sanja Fidler,"632,Poster,Learning a Discriminative Feature Network for Semantic Segmentation,"Changqian Yu, HUST; Jingbo Wang, Peking University; Chao Peng, Megvii; Changxin Gao, HUST; Gang Yu, Face++; Nong Sang,"633,Poster,Optimizing Video Object Detection via a Scale-Time Lattice,"Kai Chen, CUHK; Jiaqi Wang, CUHK; Shuo Yang, ; Xingcheng Zhang, CUHK; Yuanjun Xiong, Amazon ; Chen-Change Loy, the Chinese University of Hong Kong; Dahua Lin, CUHK"642,Poster,ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices,"Xiangyu Zhang, Megvii Inc; Xinyu Zhou, Megvii Technology Inc.; Mengxiao Lin, Megvii Technology Ltd.(Face++); Jian Sun,"643,Poster,Cascaded Pyramid Network for Multi-Person Pose Estimation,"Yilun Chen, Beihang University; Zhicheng Wang, Megvii(Face++); Yuxiang Peng, Tsinghua University; Zhiqiang Zhang, HUST; Gang Yu, Face++; Jian Sun,"648,Poster,Seeing Temporal Modulation of Lights from Standard Cameras,"Naoki Sakakibara, Nagoya Institute of Technology; Fumihiko Sakaue, Nagoya Institute of Technology; JUN SATO, Nagoya Institute of Technology"649,Poster,Point-wise Convolutional Neural Networks,"Binh-Son Hua, SUTD; Khoi Tran, SUTD; Sai-Kit Yeung,"668,Spotlight,Fine-grained Video Captioning for Sports Narrative,"Huanyu Yu, Shanghai Jiao Tong University; Shuo Cheng, SJTU; Bingbing Ni, ; Minsi Wang, Shanghai Jiao Tong University; Zhang Jian, Shanghai Jiao Tong University; Xiaokang Yang,"668,Poster,Fine-grained Video Captioning for Sports Narrative,"Huanyu Yu, Shanghai Jiao Tong University; Shuo Cheng, SJTU; Bingbing Ni, ; Minsi Wang, Shanghai Jiao Tong University; Zhang Jian, Shanghai Jiao Tong University; Xiaokang Yang,"671,Poster,Dense 3D Regression for Hand Pose Estimation,"Chengde 
Wan, ; Thomas Probst, ; Luc Van Gool, KTH; Angela Yao, University of Bonn"672,Poster,Missing Slice Recovery for Tensors Using a Low-rank Model in Embedded Space,"Tatsuya Yokota, Nagoya Institute of Technology; Burak Erem, ; Seyhmus Guler, ; Simon Warfield, Harvard Medical School; Hidekata Hontani,"673,Poster,Learning Convolutional Networks for Content-weighted Image Compression,"Mu LI, PolyU; Wangmeng Zuo, Harbin Institute of Technology; Shuhang Gu, ; debin Zhao, ; David Zhang, Hong Kong Polytechnic University"678,Poster,Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking,"Qiang Wang, CASIA; Zhu Teng, Beijing Jiaotong University; Junliang Xing, Institute of Automation, Chinese Academy of Sciences; Jin Gao, Institute of Automation, Chinese Academy of Sciences; Weiming Hu,"680,Poster,Deep Cost-Sensitive and Order-Preserving Feature Learning for Cross-Population Age Estimation,"Kai Li, Chinese Academy of Sciences; Junliang Xing, Institute of Automation, Chinese Academy of Sciences; Chi Su, KingSoft; Weiming Hu, ; Yundong Zhang, Vimicro Corporation; Stephen Maybank, Birkbeck University of London"683,Poster,First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations,"Guillermo Garcia-Hernando, Imperial College London; Shanxin Yuan, Imperial College London; Seungryul Baek, Imperial College London; Tae-Kyun Kim, Imperial College London"687,Spotlight,Hand PointNet: 3D Hand Pose Estimation using Point Sets,"Liuhao Ge, NTU; Junwu Weng, Nanyang Technological Univ.; Yujun Cai, NTU; Junsong Yuan, Nanyang Technological University"687,Poster,Hand PointNet: 3D Hand Pose Estimation using Point Sets,"Liuhao Ge, NTU; Junwu Weng, Nanyang Technological Univ.; Yujun Cai, NTU; Junsong Yuan, Nanyang Technological University"695,Poster,Recovering Realistic Texture in Image Super-resolution by Spatial Feature Modulation,"Xintao Wang, CUHK University; Ke Yu, CUHK; Chao Dong, Sensetime Co. 
Ltd ; Chen-Change Loy, the Chinese University of Hong Kong"700,Poster,Cube Padding for Weakly-Supervised Saliency Prediction in 360$^{\circ}$ Videos,"Hsien-Tzu Cheng, National Tsing Hua University; Chun-Hung Chao, ; Jin-Dong Dong, ; Hao-Kai Wen, ; Tyng-Luh Liu, IIS/Academia Sinica; Min Sun, University of Washington"710,Poster,A Face to Face Neural Conversation Model,"Hang Chu, University of Toronto; Sanja Fidler,"711,Poster,SurfConv: Bridging 3D and 2D Convolution for RGBD Images,"Hang Chu, University of Toronto; Wei-Chiu Ma, MIT; Kaustav Kundu, University of Toronto; Raquel Urtasun, University of Toronto; Sanja Fidler,"717,Poster,Dynamic Video Segmentation Network,"Yu-Shuan Xu, National Tsing Hua University; Chun-Yi Lee, National Tsing Hua University; TSUJUI FU, NTHUCS; HsuanKung Yang, National Tsing Hua University"721,Poster,Multiple Granularity Group Interaction Prediction,"Taiping Yao, Shanghai Jiaotong University; Minsi Wang, Shanghai Jiao Tong University; Huawei Wei, Shanghai Jiao Tong University; Bingbing Ni, ; Xiaokang Yang,"732,Spotlight,Visual Question Reasoning on General Dependency Tree,"Qingxing Cao, Sun Yat-Sen University; Xiaodan Liang, Carnegie Mellon University; Bailin Li, SUN-YAT SEN UNIVERSITY; Liang Lin,"732,Poster,Visual Question Reasoning on General Dependency Tree,"Qingxing Cao, Sun Yat-Sen University; Xiaodan Liang, Carnegie Mellon University; Bailin Li, SUN-YAT SEN UNIVERSITY; Liang Lin,"733,Poster,From Lifestyle VLOGs to Everyday Interactions,"David Fouhey, UC Berkeley; WEICHENG KUO, Berkeley; Alexei Efros, UC Berkeley; Jitendra Malik,"735,Poster,COCO-Stuff: Thing and Stuff Classes in Context,"Holger Caesar, University of Edinburgh; Jasper Uijlings, Google; Vitto Ferrari,"736,Spotlight,GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB,"Franziska Mueller, MPI Informatics; Florian Bernard, MPI Informatics; Oleksandr Sotnychenko, MPI Informatics; Dushyant Mehta, MPI For Informatics; Srinath Sridhar, ; Dan Casas, MPI; 
Christian Theobalt, MPI Informatics"
736,Poster,GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB,"Franziska Mueller, MPI Informatics; Florian Bernard, MPI Informatics; Oleksandr Sotnychenko, MPI Informatics; Dushyant Mehta, MPI For Informatics; Srinath Sridhar, ; Dan Casas, MPI; Christian Theobalt, MPI Informatics"
739,Poster,Non-local Neural Networks,"Xiaolong Wang, Carnegie Mellon University; Ross Girshick, ; Abhinav Gupta, ; Kaiming He,"
740,Poster,Zero-shot Recognition via Semantic Embeddings and Knowledge Graphs,"Xiaolong Wang, Carnegie Mellon University; Yufei Ye, Carnegie Mellon University; Abhinav Gupta,"
744,Oral,Taskonomy: Disentangling Task Transfer Learning,"Alexander Sax, Stanford University; William Shen, ; Amir Zamir, Stanford, UC Berkeley; Jitendra Malik, ; Silvio Savarese, ; Leonidas J. Guibas,"
744,Poster,Taskonomy: Disentangling Task Transfer Learning,"Alexander Sax, Stanford University; William Shen, ; Amir Zamir, Stanford, UC Berkeley; Jitendra Malik, ; Silvio Savarese, ; Leonidas J. Guibas,"
747,Spotlight,Embodied Real-World Active Perception,"Fei Xia, Stanford University; Amir Zamir, Stanford, UC Berkeley; Zhi-Yang He, Stanford University; Alexander Sax, Stanford University; Jitendra Malik, ; Silvio Savarese,"
747,Poster,Embodied Real-World Active Perception,"Fei Xia, Stanford University; Amir Zamir, Stanford, UC Berkeley; Zhi-Yang He, Stanford University; Alexander Sax, Stanford University; Jitendra Malik, ; Silvio Savarese,"
754,Spotlight,"SfSNet : Learning Shape, Reflectance and Illuminance of Faces `in the wild'","Soumyadip Sengupta, University of Maryland; Angjoo Kanazawa, University of Maryland; Carlos Castillo, ; David Jacobs, University of Maryland"
754,Poster,"SfSNet : Learning Shape, Reflectance and Illuminance of Faces `in the wild'","Soumyadip Sengupta, University of Maryland; Angjoo Kanazawa, University of Maryland; Carlos Castillo, ; David Jacobs, University of Maryland"
756,Poster,End-to-end Recovery of Human Shape and Pose,"Angjoo Kanazawa, University of Maryland; Michael Black, Max Planck Institute for Intelligent Systems; David Jacobs, University of Maryland; Jitendra Malik,"
757,Poster,"Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene","Shubham Tulsiani, UC Berkeley; David Fouhey, UC Berkeley; Saurabh Gupta, ; Alexei Efros, UC Berkeley; Jitendra Malik,"
759,Poster,Multi-view Consistency as Supervisory Signal for Learning Shape and Pose Prediction,"Shubham Tulsiani, UC Berkeley; Alexei Efros, UC Berkeley; Jitendra Malik,"
762,Poster,A Fast Resection-Intersection Method for the Known Rotation Problem,"Qianggong Zhang, The University of Adelaide; Tat-Jun Chin, ; Huu Le, The University of Adelaide"
764,Poster,Image Generation from Scene Graphs,"Justin Johnson, Stanford University; Agrim Gupta, Stanford University; Fei-Fei Li, Stanford University"
765,Spotlight,What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets,"De-An Huang, Stanford University; Vignesh Ramanathan, Facebook; Dhruv Mahajan, ; Juan Carlos Niebles, Stanford University; Fei-Fei Li, Stanford University; Lorenzo Torresani, Dartmouth College, USA; Manohar Paluri,"
765,Poster,What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets,"De-An Huang, Stanford University; Vignesh Ramanathan, Facebook; Dhruv Mahajan, ; Juan Carlos Niebles, Stanford University; Fei-Fei Li, Stanford University; Lorenzo Torresani, Dartmouth College, USA; Manohar Paluri,"
766,Poster,PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation,"Danfei Xu, Stanford University; dragomir Anguelov, Zoox Inc.; Ashesh Jain, Zoox Inc."
768,Oral,High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs,"Ting-Chun Wang, NVIDIA; Ming-Yu Liu, NVIDIA; Jun-Yan Zhu, UC Berkeley; Andrew Tao, NVIDIA; Bryan Catanzaro, NVIDIA; Jan Kautz, NVIDIA"
768,Poster,High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs,"Ting-Chun Wang, NVIDIA; Ming-Yu Liu, NVIDIA; Jun-Yan Zhu, UC Berkeley; Andrew Tao, NVIDIA; Bryan Catanzaro, NVIDIA; Jan Kautz, NVIDIA"
769,Poster,Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks,"Agrim Gupta, Stanford University; Justin Johnson, Stanford University; Fei-Fei Li, Stanford University; Silvio Savarese, ; Alexandre Alahi, EPFL"
777,Spotlight,Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference,"Benoit Jacob, Google; Skirmantas Kligys, Google; Bo Chen, Google; Matthew Tang, Google; Menglong Zhu, ; Andrew Howard, Google; Dmitry Kalenichenko, Google; Hartwig Adam, Google"
777,Poster,Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference,"Benoit Jacob, Google; Skirmantas Kligys, Google; Bo Chen, Google; Matthew Tang, Google; Menglong Zhu, ; Andrew Howard, Google; Dmitry Kalenichenko, Google; Hartwig Adam, Google"
778,Oral,"Finding It"": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Video""","De-An Huang, Stanford University; Shyamal Buch, Stanford University; Lucio Dery, Stanford University; Animesh Garg, Stanford University; Fei-Fei Li, Stanford University; Juan Carlos Niebles, Stanford University"
778,Poster,"Finding It"": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Video""","De-An Huang, Stanford University; Shyamal Buch, Stanford University; Lucio Dery, Stanford University; Animesh Garg, Stanford University; Fei-Fei Li, Stanford University; Juan Carlos Niebles, Stanford University"
779,Poster,Unsupervised Cross-dataset Person Re-identification by Transfer Learning of Spatio-temporal Patterns,"Jianming Lv, South China University of Technology; Weihang Chen, South China University of Technology; Qing Li, City University of Hong Kong; Can Yang, South China University of Technology"
784,Poster,Kernelized Subspace Pooling for Deep Local Descriptors,"Xing Wei, Xi'an Jiaotong University; Yihong Gong, Xi'an Jiaotong University; Yue Zhang, IAIR,Xi'an Jiaotong University; Nanning Zheng, Xi'an Jiaotong University"
786,Poster,Video Rain Removal By Multiscale Convolutional Sparse Coding,"Li Minghan, Xi'an Jiaotong University; Qi Xie, ; Qian Zhao, ; Wei Wei, Xi'an Jiaotong University; Shuhang Gu, ; Jing Tao, ; Deyu Meng, Xi'an Jiaotong University"
789,Poster,Learning from Millions of 3D Scans for Large-scale 3D Face Recognition,"Syed Zulqarnain Gilani, The University of Western Aust; Ajmal Mian, UWA"
792,Poster,Referring Relationships,"Ranjay Krishna, Stanford University; Ines Chami, Stanford University; Michael Bernstein, Stanford University; Fei-Fei Li, Stanford University"
794,Poster,Improving Object Localization with Fitness NMS and Bounded IoU Loss,"Lachlan Tychsen-Smith, CSIRO (Data61); Lars Petersson,"
801,Spotlight,Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination,"Zhirong Wu, UC Berkeley; Yuanjun Xiong, Amazon ; Stella Yu, UC Berkeley / ICSI; Dahua Lin, CUHK"
801,Poster,Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination,"Zhirong Wu, UC Berkeley; Yuanjun Xiong, Amazon ; Stella Yu, UC Berkeley / ICSI; Dahua Lin, CUHK"
809,Spotlight,CVM-Net: Cross-View Matching Network for Image-Based Ground-to-Aerial Geo-Localization,"Sixing Hu, NUS; Mengdan Feng, NUS; Rang Nguyen, National Uni. of Singapore; Gim Hee Lee, National University of Singapore"
809,Poster,CVM-Net: Cross-View Matching Network for Image-Based Ground-to-Aerial Geo-Localization,"Sixing Hu, NUS; Mengdan Feng, NUS; Rang Nguyen, National Uni. of Singapore; Gim Hee Lee, National University of Singapore"
811,Spotlight,Visual Question Generation as Dual Task of Visual Question Answering,"Yikang Li, ; Nan Duan, Microsoft; Bolei Zhou, Massachusetts Institute of Technology; Xiao Chu, Baidu; Wanli Ouyang, The University of Sydney; Xiaogang Wang, Chinese University of Hong Kong"
811,Poster,Visual Question Generation as Dual Task of Visual Question Answering,"Yikang Li, ; Nan Duan, Microsoft; Bolei Zhou, Massachusetts Institute of Technology; Xiao Chu, Baidu; Wanli Ouyang, The University of Sydney; Xiaogang Wang, Chinese University of Hong Kong"
812,Spotlight,Revisiting Dilated Convolution: A Simple Approach for Weakly- and Semi- Supervised Semantic Segmentation,"Yunchao Wei, ; Huaxin Xiao, ; Honghui Shi, UIUC; Zequn Jie, ; Jiashi Feng, ; Thomas Huang,"
812,Poster,Revisiting Dilated Convolution: A Simple Approach for Weakly- and Semi- Supervised Semantic Segmentation,"Yunchao Wei, ; Huaxin Xiao, ; Honghui Shi, UIUC; Zequn Jie, ; Jiashi Feng, ; Thomas Huang,"
816,Poster,Learning Dual Convolutional Neural Networks for Low-Level Vision,"Jinshan Pan, UC Merced; Sifei Liu, ; Deqing Sun, NVIDIA; Jiawei Zhang, City University of Hong Kong; Yang Liu, DUT; Jimmy Ren, SenseTime Group Limited; Zechao Li, Nanjing University of Science and Technology ; Jinhui Tang, ; Huchuan Lu, Dalian University of Technology; Yu-Wing Tai, Tencent YouTu; Ming-Hsuan Yang, UC Merced"
823,Poster,Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation,"Younghyun Jo, Yonsei University; Seoung Wug Oh, Yonsei University; JaeYeon Kang, Yonsei Univ.; Seon Joo Kim, Yonsei University"
836,Spotlight,MegDet: A Large Mini-Batch Object Detector,"Chao Peng, Megvii; Tete Xiao, Peking University; Zeming Li, Tsinghua University, Megvii; Yuning Jiang, Megvii inc.; Xiangyu Zhang, Megvii Inc; Kai Jia, Megvii; Gang Yu, Face++; Jian Sun,"
836,Poster,MegDet: A Large Mini-Batch Object Detector,"Chao Peng, Megvii; Tete Xiao, Peking University; Zeming Li, Tsinghua University, Megvii; Yuning Jiang, Megvii inc.; Xiangyu Zhang, Megvii Inc; Kai Jia, Megvii; Gang Yu, Face++; Jian Sun,"
842,Poster,AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks,"Tao Xu, Lehigh University; Pengchuan Zhang, ; Qiuyuan Huang, ; Han Zhang, Rutgers; Zhe Gan, ; Xiaolei Huang, Lehigh ; Xiaodong He,"
844,Spotlight,TOM-Net: Learning Transparent Object Matting from a Single Image,"Guanying Chen, The University of Hong Kong; Kai Han, ; Kwan-Yee Kenneth Wong, The University of Hong Kong"
844,Poster,TOM-Net: Learning Transparent Object Matting from a Single Image,"Guanying Chen, The University of Hong Kong; Kai Han, ; Kwan-Yee Kenneth Wong, The University of Hong Kong"
847,Poster,End-to-End Deep Kronecker-Product Matching for Person Re-identification,"Yantao Shen, CUHK; Tong Xiao, The Chinese University of HK; Hongsheng Li, ; Shuai Yi, The Chinese University of Hong Kong; Xiaogang Wang, Chinese University of Hong Kong"
849,Poster,Semantic Visual Localization,"Johannes Schönberger, ETH Zurich; Marc Pollefeys, ETH; Andreas Geiger, MPI Tuebingen / ETH Zuerich; Torsten Sattler, ETH Zurich"
851,Poster,Joint Cuts and Matching of Partitions in One Graph,"Tianshu Yu, Arizona State University; Junchi Yan, Shanghai Jiao Tong University; Jieyi Zhao, University of Texas Health Science Center at Houston; Baoxin Li, Arizona State University"
853,Spotlight,Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions,"Torsten Sattler, ETH Zurich; Will Maddern, University of Oxford; Carl Toft, Chalmers University ; Akihiko Torii, Tokyo Institute of Technology; Lars Hammarstrand, Chalmers university of technol; Erik Stenborg, Chalmers University of Tech.; Daniel Safari, DTU; Marc Pollefeys, ETH; Josef Sivic, ; Fredrik Kahl, Chalmers; Tomas Pajdla,"
853,Poster,Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions,"Torsten Sattler, ETH Zurich; Will Maddern, University of Oxford; Carl Toft, Chalmers University ; Akihiko Torii, Tokyo Institute of Technology; Lars Hammarstrand, Chalmers university of technol; Erik Stenborg, Chalmers University of Tech.; Daniel Safari, DTU; Marc Pollefeys, ETH; Josef Sivic, ; Fredrik Kahl, Chalmers; Tomas Pajdla,"
862,Poster,Crowd Counting via Adversarial Cross-Scale Consistency Pursuit,"Zan Shen, Institute of Image Communication and Network Engineering, Shanghai Jiao Tong U; Bingbing Ni, ; Yi Xu, Shanghai Jiao Tong University; Minsi Wang, Shanghai Jiao Tong University; jianguo Hu, Minivision; Xiaokang Yang,"
874,Poster,Deep Group-shuffling Random Walk for Person Re-identification,"Yantao Shen, CUHK; Hongsheng Li, ; Tong Xiao, The Chinese University of HK; Shuai Yi, The Chinese University of Hong Kong; Dapeng Chen, CUHK; Xiaogang Wang, Chinese University of Hong Kong"
878,Spotlight,Learning to Detect Features in Texture Images,"Linguang Zhang, Princeton University; Szymon Rusinkiewicz, Princeton University"
878,Poster,Learning to Detect Features in Texture Images,"Linguang Zhang, Princeton University; Szymon Rusinkiewicz, Princeton University"
888,Poster,Transferable Joint Attribute-Identity Deep Learning for Unsupervised Person Re-Identification,"Jingya Wang, QMUL; Xiatian Zhu, Vision Semantics Ltd.; Shaogang Gong, Queen Mary University; Wei Li, Queen Mary University of Lond"
890,Poster,CarFusion: Combining Point Tracking and Part Detection for Dynamic 3D Reconstruction of Vehicles,"Dinesh reddy Narapureddy, Carnegie mellon university; Minh Vo, CMU; Srinivasa Narasimhan, Carnegie Mellon University"
892,Poster,Context-aware Deep Feature Compression for High-speed Visual Tracking,"Jongwon Choi, ; Hyung Jin Chang, Imperial College London; Tobias Fischer, Imperial College London; Sangdoo Yun, Seoul National University; Jiyeoup Jeong, Seoul National University; kyuewang Lee, Seoul National University; Yiannis Demiris, ; Jin Choi,"
894,Poster,Deep Material-aware Cross-spectral Stereo Matching,"Tiancheng Zhi, Carnegie Mellon University; Bernardo Pires, CMU; Martial Hebert, Carnegie Mellon University; Srinivasa Narasimhan, Carnegie Mellon University"
899,Poster,Deep Extreme Cut: From Extreme Points to Object Segmentation,"Kevis-Kokitsi Maninis, ETH Zurich; Sergi Caelles, ETH Zurich; Jordi Pont-Tuset, ETHZ; Luc Van Gool, KTH"
906,Spotlight,Label Denoising Adversarial Network (LDAN) for Inverse Lighting of Face Images,"Hao Zhou, UMD; Jin Sun, University of Maryland; Yaser Yacoob, Univ of Maryland; David Jacobs, University of Maryland"
906,Poster,Label Denoising Adversarial Network (LDAN) for Inverse Lighting of Face Images,"Hao Zhou, UMD; Jin Sun, University of Maryland; Yaser Yacoob, Univ of Maryland; David Jacobs, University of Maryland"
908,Poster,Harmonious Attention Network for Person Re-Identification,"Wei Li, Queen Mary University of Lond; Xiatian Zhu, Vision Semantics Ltd.; Shaogang Gong, Queen Mary University"
909,Spotlight,Unsupervised Deep Generative Adversarial Hashing Network,"Kamran Ghasedi Dizaji, University of Pittsburgh; Feng Zheng, University of Pittsburgh; Najmeh Sadoughi, University of Texas at Dallas; Heng Huang, University of Pittsburgh"
909,Poster,Unsupervised Deep Generative Adversarial Hashing Network,"Kamran Ghasedi Dizaji, University of Pittsburgh; Feng Zheng, University of Pittsburgh; Najmeh Sadoughi, University of Texas at Dallas; Heng Huang, University of Pittsburgh"
910,Poster,Pseudo-Mask Augmented Object Detection,"Xiangyun Zhao, Northwestern University; Shuang Liang, Tongji University; Yichen Wei, Microsoft Research Asia"914,Spotlight,LSTM stack-based Neural Multi-sequence Alignment TeCHnique (NeuMATCH),"Pelin Dogan, ETH Zurich; Albert Li, Disney Research; Leonid Sigal, University of British Columbia; Markus Gross,"914,Poster,LSTM stack-based Neural Multi-sequence Alignment TeCHnique (NeuMATCH),"Pelin Dogan, ETH Zurich; Albert Li, Disney Research; Leonid Sigal, University of British Columbia; Markus Gross,"927,Poster,Adversarial Complementary Learning for Weakly Supervised Object Localization,"Xiaolin Zhang, University of Technology Sydey; Yunchao Wei, ; Jiashi Feng, ; Yi Yang, ; Thomas Huang,"932,Oral,Unsupervised Discovery of Object Landmarks as Structural Representations,"Yuting Zhang, University of Michigan; Yijie Guo, University of Michigan; Yixin Jin, ; Yijun Luo, University of Michigan; Zhiyuan He, University of Michigan; Honglak Lee, University of Michigan, USA"932,Poster,Unsupervised Discovery of Object Landmarks as Structural Representations,"Yuting Zhang, University of Michigan; Yijie Guo, University of Michigan; Yixin Jin, ; Yijun Luo, University of Michigan; Zhiyuan He, University of Michigan; Honglak Lee, University of Michigan, USA"936,Poster,DeLS-3D: Deep Localization and Segmentation with a 3D Semantic Map,"Peng Wang, Baidu; Ruigang Yang, University of Kentucky; Binbin Cao, Baidu; Wei Xu, ; Yuanqing Lin,"944,Poster,Monocular Relative Depth Perception with Web Stereo Data Supervision,"Ke Xian, Huazhong University of Science and Technology; Chunhua Shen, University of Adelaide; Zhiguo Cao, Huazhong University of Science and Technology; Hao Lu, Huazhong University of Science and Technology; yang xiao, Huazhong University of Science and Technology; Ruibo Li, Huazhong University of Science and Technology; Zhenbo Luo, Samsung Research Beijing"948,Poster,Image-Image Domain Adaptation with Preserved 
Self-Similarity and Domain-Dissimilarity for Person Re-identification,"Weijian Deng, University of Chinese Academy; Liang Zheng, University of Texas at San Ant; GUOLIANG KANG, UTS; Yi Yang, ; Qixiang Ye, ; Jianbin Jiao,"952,Poster,Objects as context for detecting their semantic parts,"Abel Gonzalez-Garcia, University of Edinburgh; Davide Modolo, Amazon; Vitto Ferrari,"954,Poster,Camera Style Adaptation for Person Re-identification,"Zhun Zhong, Xiamen University; Liang Zheng, University of Texas at San Ant; Zhedong Zheng, UTS; Shaozi Li, ; Yi Yang, University of Technology, Sydney"961,Poster,Conditional Generative Adversarial Network for Structured Domain Adaptation,"Weixiang Hong, Nanyang Technological Universi; Zhenzhen Wang, Nanyang Technological University; Ming Yang, Horizon Robotics Inc.; Junsong Yuan, Nanyang Technological University"962,Poster,Rotation-sensitive Regression for Oriented Scene Text Detection,"Minghui Liao, Huazhong University of Science and Technology; Zhen Zhu, Huazhong University of Science and Technology; Baoguang Shi, Huazhong University of Science and Technology; Gui-Song Xia, Wuhan University; Xiang Bai, Huazhong University of Science and Technology"963,Poster,Residual Parameter Transfer for Deep Domain Adaptation,"Artem Rozantsev, EPFL; Mathieu Salzmann, EPFL; Pascal Fua,"967,Spotlight,SGPN: Similarity Group Proposal Network for 3D Point Cloud Instance Segmentation,"Weiyue Wang, USC; Ronald Yu, ; Qiangui Huang, U of Southern CA; Ulrich Neumann, USC"967,Poster,SGPN: Similarity Group Proposal Network for 3D Point Cloud Instance Segmentation,"Weiyue Wang, USC; Ronald Yu, ; Qiangui Huang, U of Southern CA; Ulrich Neumann, USC"974,Spotlight,Weakly Supervised Instance Segmentation using Class Peak Response,"Yanzhao Zhou, UCAS, China; Yi Zhu, UCAS; Qixiang Ye, ; Qiang Qiu, ; Jianbin Jiao,"974,Poster,Weakly Supervised Instance Segmentation using Class Peak Response,"Yanzhao Zhou, UCAS, China; Yi Zhu, UCAS; Qixiang Ye, ; Qiang Qiu, ; 
Jianbin Jiao,"978,Poster,Robust Facial Landmark Detection via a Fully-Convolutional Local-Global Context Network,"Daniel Merget, Technical University of Munich; Matthias Rock, TUM; Rigoll Gerhard, TUM"984,Oral,Rotation Averaging and Strong Duality,"Anders Eriksson, ; Fredrik Kahl, Chalmers; Carl Olsson, Lund University; Tat-Jun Chin,"984,Poster,Rotation Averaging and Strong Duality,"Anders Eriksson, ; Fredrik Kahl, Chalmers; Carl Olsson, Lund University; Tat-Jun Chin,"985,Poster,PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning,"Arun Mallya, UIUC; Lana Lazebnik,"999,Oral,Im2Flow: Motion Hallucination from Static Images for Action Recognition,"Ruohan Gao, University of Texas at Austin; Bo Xiong, UT-Austin; Kristen Grauman,"999,Poster,Im2Flow: Motion Hallucination from Static Images for Action Recognition,"Ruohan Gao, University of Texas at Austin; Bo Xiong, UT-Austin; Kristen Grauman,"1001,Poster,Feature Quantization for Defending Against Distortion of Images,"Zhun Sun, Tohoku University; Mete Ozay, ; Yan Zhang, RIKEN Center for AIP ; Xing Liu, Tohoku University; Takayuki Okatani, Tohoku University/RIKEN AIP"1016,Poster,End-to-end weakly-supervised semantic alignment,"Ignacio ROCCO, Inria; Relja Arandjelovic, DeepMind; Josef Sivic,"1018,Spotlight,PointGrid: A Deep Network for 3D Shape Understanding,"Truc Le, University of Missouri - Columbia; Ye Duan, University of Missouri - Columbia"1018,Poster,PointGrid: A Deep Network for 3D Shape Understanding,"Truc Le, University of Missouri - Columbia; Ye Duan, University of Missouri - Columbia"1019,Poster,Imagine it for me: Generative Adversarial Approach for Zero-Shot Learning from Noisy Texts,"Yizhe Zhu, ; Mohamed Elhoseiny, FAIR; Bingchen Liu, Rutgers; Ahmed Elgammal,"1020,Poster,A Minimalist Approach to Type-Agnostic Detection of Quadrics in Point Clouds,"Tolga Birdal, Technical University of Munich; Benjamin Busam, Framos; Nassir Navab, Technical University of Munich; Slobodan Ilic, 
Siemens AG; Peter Sturm, INRIA Rhone-Alpes"1022,Poster,A Benchmark for Articulated Human Pose Estimation and Tracking,"Mykhaylo Andriluka, MPI Informatics; Umar Iqbal, ; Eldar Insafutdinov, MPI Informatics; Anton Milan, University of Adelaide; Leonid Pishchulin, MPI Informatik; Juergen Gall, University of Bonn, Germany; Bernt Schiele, MPI Informatics Germany"1024,Poster,Boosting Self-Supervised Learning via Knowledge Transfer,"Mehdi Noroozi, University of Bern; Ananthachari Kavalkazhani Vinjimoor, UMBC; Hamed Pirsiavash, ; Paolo Favaro, Bern University, Switzerland"1025,Spotlight,PPFNet: Global Context Aware Local Features for Robust 3D Point Matching,"Haowen Deng, Technical University of Munich; Tolga Birdal, Technical University of Munich; Slobodan Ilic, Siemens AG"1025,Poster,PPFNet: Global Context Aware Local Features for Robust 3D Point Matching,"Haowen Deng, Technical University of Munich; Tolga Birdal, Technical University of Munich; Slobodan Ilic, Siemens AG"1027,Spotlight,Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments,"Peter Anderson, Australian National University; Qi Wu, University of Adelaide; Damien Teney, Unversity of Adelaide; Jake Bruce, ; Mark Johnson, Macquarie University; Niko Snderhauf, Queensland University of Technology; Ian Reid, ; Stephen Gould, Australian National University; Anton Van den Hengel, University of Adelaide"1027,Poster,Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments,"Peter Anderson, Australian National University; Qi Wu, University of Adelaide; Damien Teney, Unversity of Adelaide; Jake Bruce, ; Mark Johnson, Macquarie University; Niko Snderhauf, Queensland University of Technology; Ian Reid, ; Stephen Gould, Australian National University; Anton Van den Hengel, University of Adelaide"1029,Spotlight,Fast Video Object Segmentation by Reference-Guided Mask Propagation,"Seoung Wug Oh, Yonsei Univeristy; 
Joon-Young Lee, ; Kalyan Sunkavalli, Adobe Systems Inc.; Seon Joo Kim, Yonsei University"1029,Poster,Fast Video Object Segmentation by Reference-Guided Mask Propagation,"Seoung Wug Oh, Yonsei Univeristy; Joon-Young Lee, ; Kalyan Sunkavalli, Adobe Systems Inc.; Seon Joo Kim, Yonsei University"1035,Poster,Super-Resolving Very Low-Resolution Face Images with Supplementary Attributes,"Xin Yu, Australian National University; Basura Fernando, ANU Canberra Australia; Richard Hartley, Australian National University Australia; Fatih Porikli, NICTA, Australia"1036,Poster,Video Person Re-identification with Competitive Snippet-similarity Aggregation and Co-attentive Snippet Embedding,"Dapeng Chen, CUHK; Hongsheng Li, ; Tong Xiao, The Chinese University of HK; Shuai Yi, The Chinese University of Hong Kong; Xiaogang Wang, Chinese University of Hong Kong"1037,Poster,One-shot Action Localization by Sequence Matching Network,"Hongtao Yang, Australian National University; Xuming He, ShanghaiTech; Fatih Porikli, NICTA, Australia"1052,Poster,Efficient Subpixel Refinement with Symbolic Linear Predictors,"Vincent Lui, Monash University; Jonathon Geeves, Monash University; Winston Yii, Monash University; Tom Drummond, Monash"1056,Poster,Distort-and-Recover: Color Enhancement using Deep Reinforcement Learning,"Jongchan Park, KAIST; Joon-Young Lee, ; Donggeun Yoo, Lunit; In So Kweon, KAIST"1057,Oral,Group Consistent Similarity Learning via Deep CRFs for Person Re-Identification,"Dapeng Chen, CUHK; Dan Xu, ; Hongsheng Li, ; Nicu Sebe, University of Trento, Italy; Xiaogang Wang, Chinese University of Hong Kong"1057,Poster,Group Consistent Similarity Learning via Deep CRFs for Person Re-Identification,"Dapeng Chen, CUHK; Dan Xu, ; Hongsheng Li, ; Nicu Sebe, University of Trento, Italy; Xiaogang Wang, Chinese University of Hong Kong"1058,Poster,Single Image Reflection Separation with Perceptual Losses,"Xuaner Zhang, UC Berkeley; Qifeng Chen, Intel Labs"1063,Spotlight,AVA: A Video 
Dataset of Spatio-temporally Localized Atomic Visual Actions,"Chunhui Gu, Google; Chen Sun, Google; David Ross, Google Research; Carl Vondrick, Google; Caroline Pantofaru, Google; Yeqng Li, Google Inc.; Sudheendra Vijayanarasimhan, Google Research; George Toderici, Google; Susanna Ricco, Google; Rahul Sukthankar, Google Research; Cordelia Schmid, INRIA Grenoble, France; Jitendra Malik,"1063,Poster,AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions,"Chunhui Gu, Google; Chen Sun, Google; David Ross, Google Research; Carl Vondrick, Google; Caroline Pantofaru, Google; Yeqng Li, Google Inc.; Sudheendra Vijayanarasimhan, Google Research; George Toderici, Google; Susanna Ricco, Google; Rahul Sukthankar, Google Research; Cordelia Schmid, INRIA Grenoble, France; Jitendra Malik,"1067,Poster,Recognize Actions by Disentangling Components of Dynamics,"Yue Zhao, CUHK; Yuanjun Xiong, Amazon ; Dahua Lin, CUHK"1078,Poster,Zoom and Learn: Generalizing Deep Stereo Matching to Novel Domains,"Jiahao Pang, SenseTime Group Limited; Wenxiu Sun, SenseTime Group Limited; Chengxi Yang, SenseTime Group Limited; Jimmy Ren, SenseTime Group Limited; Ruichao Xiao, ; Jin Zeng, The Hong Kong University of Science and Technology; Liang Lin,"1082,Poster,Attention-aware Compositional Network for Person Re-Identification,"Jing Xu, SenseNets Technology Limited; Rui Zhao, SenseNets Technology Limited; Feng Zhu, SenseNets Technology Limited; Huaming Wang, SenseNets Technology Limited; Wanli Ouyang, The University of Sydney"1083,Poster,HATS: Histograms of Averaged Time Surfaces for Robust Event-based Object Classification,"Amos Sironi, Prophesee; Manuele Brambilla, Prophesee; Nicolas Bourdis, prophesee; Xavier Lagorce, Prophesee; Ryad Benosman, Universite Pierre et Marie Curie-Paris"1085,Poster,Mask-guided Contrastive Attention Model for Person Re-Identification,"Chunfeng Song, CASIA; Yan Huang, ; Wanli Ouyang, ; Liang Wang, unknown"1097,Spotlight,Pose-Guided Photorealistic 
Face Rotation,"Yibo Hu, CRIPAC, CASIA; Xiang Wu, Institute of Automation, Chine; Bing Yu, ; Ran He, ; Zhenan Sun, CRIPAC"1097,Poster,Pose-Guided Photorealistic Face Rotation,"Yibo Hu, CRIPAC, CASIA; Xiang Wu, Institute of Automation, Chine; Bing Yu, ; Ran He, ; Zhenan Sun, CRIPAC"1099,Spotlight,Automatic 3D Indoor Scene Modeling from Single Panorama,"Yang Yang, University of Delaware; Shi Jin, ShanghaiTech University; Ruiyang Liu, ; Sing Bing Kang, Microsoft Research; Jingyi Yu, University of Delaware, USA"1099,Poster,Automatic 3D Indoor Scene Modeling from Single Panorama,"Yang Yang, University of Delaware; Shi Jin, ShanghaiTech University; Ruiyang Liu, ; Sing Bing Kang, Microsoft Research; Jingyi Yu, University of Delaware, USA"1101,Spotlight,SobolevFusion: 3D Reconstruction of Scenes Undergoing Free Non-rigid Motion,"Miroslava Slavcheva, Siemens AG; Maximilian Baust, TUM; Slobodan Ilic, Siemens AG"1101,Poster,SobolevFusion: 3D Reconstruction of Scenes Undergoing Free Non-rigid Motion,"Miroslava Slavcheva, Siemens AG; Maximilian Baust, TUM; Slobodan Ilic, Siemens AG"1103,Poster,A Biresolution Spectral framework for Product Quantization,"Lopamudra Mukherjee, University of Wisc Whitewater; Sathya Ravi, University of Wisconsin-Madison; Jiming Peng, University of Houston; Vikas Singh, University of Wisconsin-Madison"1109,Poster,Dynamic Zoom-in Network for Fast Object Detection in Large Images,"Mingfei Gao, University of Maryland; Ruichi Yu, ; Ang Li, Google DeepMind; Vlad Morariu, University of Maryland; Larry Davis, University of Maryland, USA"1110,Poster,On the Importance of Label Quality for Semantic Segmentation,"Aleksandar Zlateski, MIT; ronnachai Jaroensri, Massachusetts Institute of Technology; Prafull Sharma, MIT; Fredo Durand,"1113,Poster,EPINET: A Fully-Convolutional Neural Network for Light Field Depth Estimation by Using Epipolar Geometry,"Changha Shin, Yonsei Univ; Hae-Gon Jeon, KAIST; Youngjin Yoon , ; InSo Kweon, ; Seon Joo Kim, Yonsei 
University"1114,Poster,A Pose-Sensitive Embedding for Person Re-Identification with Expanded Cross Neighborhood Re-Ranking,"M. Saquib Sarfraz, KIT; Arne Schumann, KIT; Andreas Eberle, KIT; Rainer Stiefelhagen, Karlsruhe Institute of Technology"1118,Poster,Erase or Fill? Deep Joint Recurrent Rain Removal and Reconstruction in Videos,"Jiaying Liu, Peking University; Wenhan Yang, Peking University; Shuai Yang, Peking University; Zongming Guo,"1124,Poster,Scalable and Effective Deep CCA via Soft Decorrelation,"Xiaobin Chang, Queen Mary Univ. of London; Tao Xiang, Queen Mary University of London; Timothy Hospedales, University of Edinburgh"1126,Poster,High-order tensor regularization with application to attribute ranking,"Kwang In Kim, University of Bath; Juhyun Park, Lancaster University; James Tompkin, Brown University"1128,Oral,3D-RCNN: Instance-level 3D Scene Understanding via Render-and-Compare,"Abhijit Kundu, Georgia Institute of Technology; Yin Li, Georgia Tech; James Rehg, Georgia Institute of Technology"1128,Poster,3D-RCNN: Instance-level 3D Scene Understanding via Render-and-Compare,"Abhijit Kundu, Georgia Institute of Technology; Yin Li, Georgia Tech; James Rehg, Georgia Institute of Technology"1129,Spotlight,FoldingNet: Interpretable Unsupervised Learning on 3D Point Clouds,"Yaoqing Yang, Carnegie Mellon University; Chen Feng, MERL; Yiru Shen, Clemson University; Dong Tian, Mitsubishi Electric Research Laboratories"1129,Poster,FoldingNet: Interpretable Unsupervised Learning on 3D Point Clouds,"Yaoqing Yang, Carnegie Mellon University; Chen Feng, MERL; Yiru Shen, Clemson University; Dong Tian, Mitsubishi Electric Research Laboratories"1133,Poster,Defocus Blur Detection via Multi-Stream Bottom-Top-Bottom Fully Convolutional Network,"Wenda Zhao, Dalian University of Technolog; Dong Wang, DUT; Huchuan Lu, Dalian University of Technology"1134,Poster,Decorrelated Batch Normalization,"Lei Huang, BeiHang university; Dawei Yang, University of Michigan; Bo 
Lang, Beihang University; Jia Deng,"1139,Spotlight,Unsupervised Textual Grounding: Linking Words to Image Concepts,"Raymond Yeh, UIUC; Minh Do, University of Illinois at Urbana-Champaign; Alex Schwing,"1139,Poster,Unsupervised Textual Grounding: Linking Words to Image Concepts,"Raymond Yeh, UIUC; Minh Do, University of Illinois at Urbana-Champaign; Alex Schwing,"1156,Poster,Scale-recurrent Network for Deep Image Deblurring,"Xin Tao, CUHK; Hongyun Gao, ; Yi Wang, The Chinese University of HK; Xiaoyong Shen, CUHK; Jue Wang, Megvii; Jiaya Jia, Chinese University of Hong Kong"1162,Poster,Low-Shot Recognition with Imprinted Weights,"Hang Qi, UCLA; Matthew Brown, ; David Lowe,"1163,Oral,Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering,"Peter Anderson, Australian National University; Xiaodong He, ; Chris Buehler, ; Damien Teney, Unversity of Adelaide; Mark Johnson, Macquarie University; Stephen Gould, Australian National University; Lei Zhang, Microsoft"1163,Poster,Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering,"Peter Anderson, Australian National University; Xiaodong He, ; Chris Buehler, ; Damien Teney, Unversity of Adelaide; Mark Johnson, Macquarie University; Stephen Gould, Australian National University; Lei Zhang, Microsoft"1164,Poster,Cross-Domain Weakly-Supervised Object Detection through Progressive Domain Adaptation,"Naoto Inoue, The University of Tokyo; Ryosuke Furuta, The University of Tokyo; Toshihiko Yamasaki, The University of Tokyo; Kiyoharu Aizawa,"1170,Poster,Facelet-Bank for Fast Portrait Manipulation,"Ying-Cong Chen, CUHK; Lin Huaijia, the Chinese University of Hong Kong; Ruiyu Li, CUHK; Michelle Shu, ; Xin Tao, CUHK; Yangang Ye, Tencent; Xiaoyong Shen, CUHK; Jiaya Jia, Chinese University of Hong Kong"1172,Poster,Duplex Generative Adversarial Network for Unsupervised Domain Adaptation,"Lanqing Hu, ICT, CAS; Meina Kan, ; Shiguang Shan, Chinese Academy of Sciences; Xilin 
Chen,"1173,Poster,Quantization of Fully Convolutional Networks for Accurate Biomedical Image Segmentation,"Xiaowei Xu, University of Notre Dame; Yiyu Shi, University of Notre Dame; Qing Lu, University of Notre Dame; Lin Yang, University of Notre Dame; Sharon Hu, University of Notre Dame; Danny Chen, University of Notre Dame"1177,Poster,Real-Time Rotation-Invariant Face Detection with Progressive Calibration Networks,"Shi Xuepeng, ICT; Shiguang Shan, Chinese Academy of Sciences; Meina Kan, ; Shuzhe Wu, Chinese Academy of Sciences; Xilin Chen,"1178,Poster,Structure Preserving Video Prediction,"Xu Jingwei, Shanghai Jiao Tong University; Bingbing Ni, ; Zefan Li, Shanghai Jiaotong University; Shuo Cheng, SJTU; Xiaokang Yang,"1182,Poster,Tagging Like Humans: Diverse and Distinct Image Annotation,"Baoyuan Wu, Tencent AI Lab; Weidong Chen, Tencent; Wei Liu, ; Peng Sun, Tencent; Bernard Ghanem, ; Siwei Lyu, SUNY Albany"1185,Poster,Learning to Sketch with Shortcut Cycle Consistency,"Jifei Song, Queen Mary, Uni. 
of London; Kaiyue Pang, QMUL; Yi-Zhe Song, ; Tao Xiang, Queen Mary University of London; Timothy Hospedales, University of Edinburgh"1186,Poster,GroupCap: Group-based Image Captioning with Structured Relevance and Diversity Constraints,"Fuhai Chen, Xiamen university; Rongrong Ji, ; Xiaoshuai Sun, Harbin Institute of Technology; Jinsong Su, Xiamen university"1193,Spotlight,Dynamic Scene Deblurring Using Spatially Variant Recurrent Neural Networks,"Jiawei Zhang, City University of Hong Kong; Jinshan Pan, UC Merced; Jimmy Ren, SenseTime Group Limited; Yibing Song, Tencent AI Lab; Linchao Bao, Tencent AI Lab; Rynson Lau, City University of Hong Kong; Ming-Hsuan Yang, UC Merced"1193,Poster,Dynamic Scene Deblurring Using Spatially Variant Recurrent Neural Networks,"Jiawei Zhang, City University of Hong Kong; Jinshan Pan, UC Merced; Jimmy Ren, SenseTime Group Limited; Yibing Song, Tencent AI Lab; Linchao Bao, Tencent AI Lab; Rynson Lau, City University of Hong Kong; Ming-Hsuan Yang, UC Merced"1194,Poster,Hyperparameter Optimization for Tracking with Continuous Deep Q-Learning,"Xingping Dong, Beijing Institute of Technology; Jianbing Shen, Beijing Institute of Technolog; Wenguan Wang, Beijing Institute of Technology; Yu Liu, Beijing Institute of Technology; Ling Shao, University of East Anglia; Fatih Porikli, NICTA, Australia"1202,Spotlight,Deep Unsupervised Saliency Detection: A Multiple Noisy Labeling Perspective,"Jing Zhang, ; Tong Zhang, Australian National University; Yuchao Dai, Australian National University; Mehrtash Harandi, Australian National University; Richard Hartley, Australian National University Australia"1202,Poster,Deep Unsupervised Saliency Detection: A Multiple Noisy Labeling Perspective,"Jing Zhang, ; Tong Zhang, Australian National University; Yuchao Dai, Australian National University; Mehrtash Harandi, Australian National University; Richard Hartley, Australian National University Australia"1203,Spotlight,NeuralNetwork-Viterbi: A Framework 
for Weakly Supervised Video Learning,"Alexander Richard, University of Bonn; Hilde Kuehne, University of Bonn; Ahsan Iqbal, University of Bonn; Juergen Gall, University of Bonn, Germany"1203,Poster,NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning,"Alexander Richard, University of Bonn; Hilde Kuehne, University of Bonn; Ahsan Iqbal, University of Bonn; Juergen Gall, University of Bonn, Germany"1209,Spotlight,Detecting and Recognizing Human-Object Interactions,"Georgia Gkioxari, Facebook; Ross Girshick, ; Kaiming He, ; Piotr Dollar, Facebook AI Research, Menlo Park, USA"1209,Poster,Detecting and Recognizing Human-Object Interactions,"Georgia Gkioxari, Facebook; Ross Girshick, ; Kaiming He, ; Piotr Dollar, Facebook AI Research, Menlo Park, USA"1213,Poster,Augmenting Crowd-Sourced 3D Reconstructions using Semantic Detections,"True Price, UNC Chapel Hill; Johannes Schönberger, ETH Zurich; Zhen Wei, University of North Carolina; Marc Pollefeys, ETH; Jan-Michael Frahm, UNC Chapel Hill"1219,Poster,Visual Relationship Learning with a Factorization-based Prior,"SEONG JAE HWANG, University of Wisconsin - Madison; Zirui Tao , University of Wisconsin - Madi; Vikas Singh, University of Wisconsin-Madison; Hyunwoo Kim, Amazon Lab 126; Sathya Ravi, University of Wisconsin-Madison; Maxwell Collins,"1224,Poster,Re-weighted Adversarial Adaptation Network for Unsupervised Domain Adaptation,"Qingchao Chen, University College London; Yang Liu, University of Cambridge; Zhaowen Wang, Adobe; Ian Wassell, ; Kevin Chetty,"1226,Poster,Flow Guided Recurrent Neural Encoder for Video Salient Object Detection,"Guanbin Li, ; Yuan Xie, ; Tianhao Wei, ; Liang Lin,"1230,Poster,Disentangling 3D Pose in A Dendritic CNN for Unconstrained 2D Face Alignment,"Amit Kumar, University of Maryland; Rama Chellappa, University of Maryland, USA"1235,Poster,Progressive Attention Guided Recurrent Network for Salient Object Detection,"Xiaoning Zhang, Dalian University of Technolog; 
TIANTIAN WANG, Dalian University of Technolog; Jinqing Qi, ; Huchuan Lu, Dalian University of Technology"1240,Spotlight,Answer with Grounding Snippets: Focal Visual-Text Attention for Visual Question Answering,"Junwei Liang, Carnegie Mellon University; Lu Jiang, ; Liangliang Cao, ; Alexander Hauptmann,"1240,Poster,Answer with Grounding Snippets: Focal Visual-Text Attention for Visual Question Answering,"Junwei Liang, Carnegie Mellon University; Lu Jiang, ; Liangliang Cao, ; Alexander Hauptmann,"1244,Poster,Unsupervised Learning of Depth and Egomotion from Monocular Video Using 3D Geometric Constraints,"Reza Mahjourian, University of Texas at Austin; Martin Wicke, Google Brain; Anelia Angelova, Google Brain"1247,Poster,Repulsion Loss: Detecting Pedestrians in a Crowd,"Xinlong Wang, Tongji University; Tete Xiao, Peking University; Yuning Jiang, Megvii inc.; Shuai Shao, Megvii; Jian Sun, ; Chunhua Shen, University of Adelaide"1248,Poster,PU-Net: Point Cloud Upsampling Network,"Lequan Yu, The Chinese University of Hong; XIANZHI LI, CUHK; Chi-Wing Fu, ; Daniel Cohen-Or, ; Pheng-Ann Heng,"1249,Spotlight,Video Object Segmentation via Inference in A CNN-Based Higher-Order Spatio-Temporal MRF,"Linchao Bao, Tencent AI Lab; Baoyuan Wu, Tencent AI Lab; Wei Liu,"1249,Poster,Video Object Segmentation via Inference in A CNN-Based Higher-Order Spatio-Temporal MRF,"Linchao Bao, Tencent AI Lab; Baoyuan Wu, Tencent AI Lab; Wei Liu,"1251,Poster,PiCANet: Learning Pixel-wise Contextual Attention for Saliency Detection,"Nian Liu, Northwestern Polytechnical University; Junwei Han, Northwestern Polytechnical U.; Ming-Hsuan Yang, UC Merced"1252,Poster,Gated Fusion Network for Single Image Dehazing,"Wenqi Ren, Chinese Academy of Sciences; Lin Ma, Tencent AI Lab; Jiawei Zhang, City University of Hong Kong; Jinshan Pan, UC Merced; Xiaochun Cao, Chinese Academy of Sciences; Wei Liu, ; Ming-Hsuan Yang, UC Merced"1255,Spotlight,Interleaved Structured Sparse Convolutional Neural 
Networks,"Guotian Xie, Sun Yat-Sen University; Ting Zhang, Microsoft Research Asia; Jianhuang Lai, Sun Yat-sen University; Jingdong Wang, Microsoft Research"1255,Poster,Interleaved Structured Sparse Convolutional Neural Networks,"Guotian Xie, Sun Yat-Sen University; Ting Zhang, Microsoft Research Asia; Jianhuang Lai, Sun Yat-sen University; Jingdong Wang, Microsoft Research"1258,Poster,Where and Why Are They Looking? Jointly Inferring Human Attention and Intentions in Complex Tasks,"Ping Wei, Xi'an Jiaotong University; Yang Liu, UCLA; Tianmin Shu, University of California, Los Angeles; Nanning Zheng, Xi'an Jiaotong University; Song-Chun Zhu,"1264,Poster,End-to-end Flow Correlation Tracking with Spatial-temporal Attention,"Zheng Zhu, Institute of Automation, CAS; Wei Wu, ; Wei Zou, ; Junjie Yan,"1271,Poster,Left/Right Asymmetric Layer Skippable Networks,"Changmao Cheng, Fudan University; Yanwei Fu, fudan; Yu-Gang Jiang, Fudan University; Wei Liu, ; wenlian Lu, Fudan; Jianfeng Feng, fudan university; Xiangyang Xue,"1276,Oral,Context Contrasted Feature and Gated Multi-scale Aggregation for Scene Segmentation,"Henghui Ding, Nanyang Technological University; Xudong Jiang, Nanyang Technological University; Bing Shuai, ; Ai Qun Liu, Nanyang Technological University; Gang Wang,"1276,Poster,Context Contrasted Feature and Gated Multi-scale Aggregation for Scene Segmentation,"Henghui Ding, Nanyang Technological University; Xudong Jiang, Nanyang Technological University; Bing Shuai, ; Ai Qun Liu, Nanyang Technological University; Gang Wang,"1280,Spotlight,VITAL: VIsual Tracking via Adversarial Learning,"Yibing Song, Tencent AI Lab; Chao Ma, ; Xiaohe Wu, Harbin Institute of technology; Lijun Gong, City University of Hong Kong; Linchao Bao, Tencent AI Lab; Wangmeng Zuo, Harbin Institute of Technology; Chunhua Shen, University of Adelaide; Rynson Lau, City University of Hong Kong; Ming-Hsuan Yang, UC Merced"1280,Poster,VITAL: VIsual Tracking via Adversarial 
Learning,"Yibing Song, Tencent AI Lab; Chao Ma, ; Xiaohe Wu, Harbin Institute of technology; Lijun Gong, City University of Hong Kong; Linchao Bao, Tencent AI Lab; Wangmeng Zuo, Harbin Institute of Technology; Chunhua Shen, University of Adelaide; Rynson Lau, City University of Hong Kong; Ming-Hsuan Yang, UC Merced"1282,Poster,RotationNet: Joint Object Categorization and Pose Estimation Using Multiviews from Unsupervised Viewpoints,"Asako Kanezaki, National Institute of Advanced; Yasuyuki Matsushita, Osaka University; Yoshifumi Nishida, National Institute of Advanced Industrial Science and Technology (AIST)"1284,Spotlight,Action Sets: Weakly Supervised Action Segmentation without Ordering Constraints,"Alexander Richard, University of Bonn; Hilde Kuehne, University of Bonn; Juergen Gall, University of Bonn, Germany"1284,Poster,Action Sets: Weakly Supervised Action Segmentation without Ordering Constraints,"Alexander Richard, University of Bonn; Hilde Kuehne, University of Bonn; Juergen Gall, University of Bonn, Germany"1287,Oral,Squeeze-and-Excitation Networks,"Jie Hu, Momenta; Li Shen, University of Oxford; Gang Sun, Momenta"1287,Poster,Squeeze-and-Excitation Networks,"Jie Hu, Momenta; Li Shen, University of Oxford; Gang Sun, Momenta"1288,Poster,Edit Probability for Scene Text Recognition,"Fan Bai, Fudan University; Zhanzhan Cheng, Hikvision Research Institute; Yi Niu, Hikvision Research Institute; Shiliang Pu, ; Shuigeng Zhou, Fudan University"1289,Spotlight,Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning,"Jingwen Wang, SCUT; Wenhao Jiang, Tencent AI Lab; Lin Ma, Tencent AI Lab; Wei Liu, ; Yong Xu, South China University of Technology"1289,Poster,Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning,"Jingwen Wang, SCUT; Wenhao Jiang, Tencent AI Lab; Lin Ma, Tencent AI Lab; Wei Liu, ; Yong Xu, South China University of Technology"1290,Poster,Exploit the Unknown Gradually: One-Shot Video-Based Person 
Re-Identification by Stepwise Learning,"Yu Wu, University of technology sydne; Yutian Lin, ; Xuanyi Dong, UTS; Yan Yan, UTS; Wanli Ouyang, The University of Sydney; Yi Yang,"1294,Poster,Learning to Localize Sound Source in Visual Scenes,"Arda Senocak, KAIST; Junsik Kim, Korea Advanced Institute of Science and Technology (KAIST); Tae-Hyun Oh, MIT; Ming-Hsuan Yang, UC Merced; In So Kweon, KAIST"1296,Poster,Dynamic Few-Shot Visual Learning without Forgetting,"Spyros Gidaris, Ecole des Ponts ParisTech ; Nikos Komodakis,"1303,Poster,Weakly-Supervised Semantic Segmentation by Iteratively Mining Common Object Features,"Xiang Wang, Tsinghua University; Shaodi You, Data61; Xi Li, Tsinghua University; Huimin Ma, Tsinghua University"1304,Poster,SINT++: Robust Visual Tracking via Adversarial Hard Positive Generation,"Xiao Wang, Anhui university; Chenglong Li, Anhui University; Bin Luo, ; Jin Tang,"1308,Poster,Real-Time Monocular Depth Estimation using Synthetic Data with Domain Adaptation via Image Style Transfer,"Amir Atapour-Abarghouei, Durham University; Toby Breckon, Durham University"1315,Poster,Fast and Accurate Single Image Super-Resolution via Information Distillation Network,"Zheng Hui, Xidian university; Xiumei Wang, Xidian university; Xinbo Gao,"1317,Spotlight,Low-Latency Video Semantic Segmentation,"Yule Li, Ict; Jianping Shi, SenseTime; Dahua Lin, CUHK"1317,Poster,Low-Latency Video Semantic Segmentation,"Yule Li, Ict; Jianping Shi, SenseTime; Dahua Lin, CUHK"1320,Poster,Domain Adaptive Faster R-CNN for Object Detection in the Wild,"Yuhua Chen, CVL@ETHZ; Wen Li, ETH; Luc Van Gool, KTH"1321,Oral,DoubleFusion: Real-time Capture of Human Performance with Inner Body Shape from a Single Depth Sensor,"Tao Yu, Beihang University; Zerong Zheng, Tsinghua University; Kaiwen Guo, Google; Jianhui Zhao, Beihang University; Qionghai Dai, ; Hao Li, ; Gerard Pons-Moll, Max Planck for Informatics; Yebin Liu, Tsinghua University"1321,Poster,DoubleFusion: Real-time Capture 
of Human Performance with Inner Body Shape from a Single Depth Sensor,"Tao Yu, Beihang University; Zerong Zheng, Tsinghua University; Kaiwen Guo, Google; Jianhui Zhao, Beihang University; Qionghai Dai, ; Hao Li, ; Gerard Pons-Moll, Max Planck for Informatics; Yebin Liu, Tsinghua University"1324,Spotlight,Lean Multiclass Crowdsourcing,"Grant van Horn, California Institute of Technology; Pietro Perona, California Institute of Technology, USA; Serge Belongie,"1324,Poster,Lean Multiclass Crowdsourcing,"Grant van Horn, California Institute of Technology; Pietro Perona, California Institute of Technology, USA; Serge Belongie,"1328,Spotlight,Tell Me Where To Look: Guided Attention Inference Network,"Kunpeng Li, Northeastern University; Ziyan Wu, Siemens Corporation; Kuan-Chuan Peng, Siemens Corporation; Jan Ernst, Siemens Corporation; Yun Fu, Northeastern University"1328,Poster,Tell Me Where To Look: Guided Attention Inference Network,"Kunpeng Li, Northeastern University; Ziyan Wu, Siemens Corporation; Kuan-Chuan Peng, Siemens Corporation; Jan Ernst, Siemens Corporation; Yun Fu, Northeastern University"1329,Spotlight,Residual Dense Network for Image Super-Resolution,"Yulun Zhang, Northeastern University; Yapeng Tian, University of rochester; Yu Kong, Northeastern University; Bineng Zhong, Huaqiao University; Yun Fu, Northeastern University"1329,Poster,Residual Dense Network for Image Super-Resolution,"Yulun Zhang, Northeastern University; Yapeng Tian, University of rochester; Yu Kong, Northeastern University; Bineng Zhong, Huaqiao University; Yun Fu, Northeastern University"1330,Poster,Look at Boundary: A Boundary-Aware Face Alignment Algorithm,"Wayne Wu, SenseTime; Chen Qian, SenseTime; Shuo Yang, ; Quan Wang, SenseTime"1335,Poster,Imagination-IQA: No-reference Image Quality Assessment via Adversarial Learning,"Kwan-Yee Lin, Peking University"1342,Poster,Memory Matching Networks for One-Shot Image Recognition,"Qi Cai, University of Science and Technology of 
China; Yingwei Pan, University of Science and Technology of China; Ting Yao, Microsoft Research Asia; Chenggang Yan, Hangzhou Dianzi University, China; Tao Mei, Microsoft Research Asia"1343,Poster,3D Human Pose Estimation in the Wild by Adversarial Learning,"Wei Yang, The Chinese University of Hong Kong ; Wanli Ouyang, The University of Sydney; Xiaolong Wang, Carnegie Mellon University; Xiaogang Wang, Chinese University of Hong Kong"1349,Spotlight,Unsupervised Training for 3D Morphable Model Regression,"Kyle Genova, Princeton University; Forrester Cole, Google; Aaron Maschinot, Google; Daniel Vlasic, Google; Aaron Sarna, Google; William Freeman, Google"1349,Poster,Unsupervised Training for 3D Morphable Model Regression,"Kyle Genova, Princeton University; Forrester Cole, Google; Aaron Maschinot, Google; Daniel Vlasic, Google; Aaron Sarna, Google; William Freeman, Google"1350,Poster,Scalable Dense Non-rigid Structure-from-Motion: A Grassmannian Perspective,"Suryansh Kumar, Australian National University; Anoop Cherian, ; Yuchao Dai, Australian National University; Hongdong Li, Australian National University"1352,Poster,IQA: Visual Question Answering in Interactive Environments,"Daniel Gordon, University of Washington; Ali Farhadi, ; Aniruddha Kembhavi, Allen Institute for Artificial Intelligence; Dieter Fox, University of Washington; Mohammad Rastegari, AI2; Joe Redmon, University of Washington"1353,Poster,Learning Spatial-Temporal Regularized Correlation Filters for Visual Tracking,"Feng Li, Harbin Institute of Technology; Cheng Tian, Harbin Institute of Technology; Wangmeng Zuo, Harbin Institute of Technology; Lei Zhang, The Hong Kong Polytechnic University; Ming-Hsuan Yang, UC Merced"1356,Spotlight,Low-shot Learning from Imaginary Data,"Yu-Xiong Wang, Carnegie Mellon University; Ross Girshick, ; Martial Hebert, ; Bharath Hariharan, Cornell University"1356,Poster,Low-shot Learning from Imaginary Data,"Yu-Xiong Wang, Carnegie Mellon University; Ross 
Girshick, ; Martial Hebert, ; Bharath Hariharan, Cornell University"1360,Poster,Deep Regression Forests for Age Estimation,"Wei Shen, Shanghai University; Yilu Guo, Shanghai University; Yan Wang, JHU; KAI ZHAO, Nankai University; Bo Wang, HikVision USA Inc.; Alan Yuille,"1363,Spotlight,Partial Transfer Learning with Selective Adversarial Networks,"Zhangjie Cao, Tsinghua University; Mingsheng Long, Tsinghua University; Jianmin Wang,"1363,Poster,Partial Transfer Learning with Selective Adversarial Networks,"Zhangjie Cao, Tsinghua University; Mingsheng Long, Tsinghua University; Jianmin Wang,"1366,Poster,A Bi-directional Message Passing Model for Salient Object Detection,"Lu Zhang, Dalian University of Technolog; Ju Dai, Dalian University of Technolog; Huchuan Lu, Dalian University of Technology; You He, ; Gang Wang,"1369,Poster,Transductive Unbiased Embedding for Zero-Shot Learning,"Jie Song, Zhejiang University; Chengchao Shen, Zhejiang University; Yezhou Yang, Arizona State University; Yang Liu, ; Mingli Song, Zhejiang University"1376,Poster,Scale-Transferrable Object Detection,"Peng Zhou, Sjtu; Bingbing Ni, ; Cong Geng, sjtu; jianguo Hu, Minivision; Yi Xu, Shanghai Jiao Tong University"1378,Poster,Crowd Counting with Deep Negative Correlation Learning,"Zenglin Shi, University of Bern; Le Zhang, Advanced Digital Sciences Cent; XiaoFeng Cao, university of technology sydney; Yun Liu, Nankai University; yangdong Ye, Zhengzhou University, China; Guoyan Zheng, University of Bern"1381,Poster,Deep Cauchy Hashing for Hamming Space Retrieval,"Yue Cao, Tsinghua University; Mingsheng Long, Tsinghua University; Bin Liu, Tsinghua University; Jianmin Wang,"1387,Poster,Demo2Vec: Reasoning Object Affordances from Online Videos,"Te-Lin Wu, USC; Kuan Fang, Stanford University; Daniel Yang, University of Southern California; Joseph Lim, University of Southern California"1389,Poster,GVCNN: Group-View Convolutional Neural Networks for 3D Shape Recognition,"Yifan Feng, Xidian 
university; Zizhao Zhang, ; xibin Zhao, ; Rongrong Ji, ; Yue Gao, Tsinghua University"1390,Poster,An End-to-End TextSpotter with Explicit Alignment and Attention,"Tong He, The University of Adelaide; Zhi Tian, SIAT, CAS; Weilin Huang, The University of Oxford; Chunhua Shen, University of Adelaide; Yu Qiao, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Changming Sun, CSIRO Data61"1392,Poster,Stereoscopic Neural Style Transfer,"Dongdong Chen, ; Lu Yuan, Microsoft Research Asia; Jing Liao, ; Nenghai Yu, ; Gang Hua, Microsoft Research"1401,Poster,Bootstrapping the Performance of Webly Supervised Semantic Segmentation,"Tong Shen, The University of Adelaide; Guosheng Lin, Nanyang Technological Universi; Chunhua Shen, University of Adelaide; Ian Reid,"1408,Poster,Learning Markov Clustering Networks for Scene Text Detection,"ZICHUAN LIU, Nanyang Technological Universi; Guosheng Lin, Nanyang Technological Universi; Sheng Yang, Nanyang Technological University; Jiashi Feng, ; Weisi Lin, Nanyang Technological University; Wangling Goh, Nanyang Technological University"1410,Spotlight,Collaborative and Adversarial Network for Unsupervised domain adaptation,"Weichen Zhang, The University of Sydney; Wanli Ouyang, The University of Sydney; Dong Xu, ; Wen Li, ETH"1410,Poster,Collaborative and Adversarial Network for Unsupervised domain adaptation,"Weichen Zhang, The University of Sydney; Wanli Ouyang, The University of Sydney; Dong Xu, ; Wen Li, ETH"1428,Poster,Reflection Removal for Large-Scale 3D Point Clouds,"Jae-Seong Yun, UNIST; Jae-Young Sim, UNIST"1432,Poster,Pose Transferrable Person Re-Identification,"Jinxian Liu, Shanghai Jiao Tong University; Yichao Yan, Shanghai Jiao Tong University; Bingbing Ni, ; Peng Zhou, Sjtu; Shuo Cheng, SJTU; jianguo Hu, Minivision"1435,Spotlight,Learning to Adapt Structured Output Space for Semantic Segmentation,"Yi-Hsuan Tsai, NEC Labs America; Wei-Chih Hung, University of California, Merced; Samuel Schulter, 
NEC Labs; Kihyuk Sohn, NEC Laboratories America; Ming-Hsuan Yang, UC Merced; Manmohan Chandraker, NEC Labs America"1435,Poster,Learning to Adapt Structured Output Space for Semantic Segmentation,"Yi-Hsuan Tsai, NEC Labs America; Wei-Chih Hung, University of California, Merced; Samuel Schulter, NEC Labs; Kihyuk Sohn, NEC Laboratories America; Ming-Hsuan Yang, UC Merced; Manmohan Chandraker, NEC Labs America"1439,Poster,Efficient Diverse Ensemble for Discriminative Co-Tracking,"Kourosh Meshgi, Kyoto University; Shigeyuki Oba, Kyoto University; Shin Ishii, Kyoto University"1440,Poster,Learning a Single Convolutional Super-Resolution Network for Multiple Degradations,"Kai Zhang, Harbin Institute of Technology; Wangmeng Zuo, Harbin Institute of Technology; Lei Zhang, The Hong Kong Polytechnic University"1443,Poster,Probabilistic Plant Modeling via Multi-View Image-to-Image Translation,"Takahiro Isokane, Osaka university; Fumio Okura, Osaka University; Ayaka Ide, Osaka University; Yasuyuki Matsushita, Osaka University; Yasushi Yagi, Osaka University"1446,Poster,Learning to Parse Wireframes in Images of Man-Made Environments,"Kun Huang, Shanghaitech University; Yifan Wang, ShanghaiTech University; Zihan Zhou, Penn State University; Tianjiao Ding, ; Shenghua Gao, ShanghaiTech University; Yi Ma, EECS, UC Berkeley"1449,Spotlight,A Variational U-Net for Conditional Appearance and Shape Generation,"Ekaterina Sutter, HCI, IWR,Heidelberg University; Patrick Esser, Heidelberg University; Bjorn Ommer, Heidelberg"1449,Poster,A Variational U-Net for Conditional Appearance and Shape Generation,"Ekaterina Sutter, HCI, IWR,Heidelberg University; Patrick Esser, Heidelberg University; Bjorn Ommer, Heidelberg"1453,Oral,Learning to Find Good Correspondences,"Kwang Moo Yi, EPFL; Eduard Trulls, ; Yuki Ono, Sony; Vincent Lepetit, TU Graz; Mathieu Salzmann, EPFL; Pascal Fua,"1453,Poster,Learning to Find Good Correspondences,"Kwang Moo Yi, EPFL; Eduard Trulls, ; Yuki Ono, Sony; Vincent 
Lepetit, TU Graz; Mathieu Salzmann, EPFL; Pascal Fua,"1458,Oral,Actor and Action Video Segmentation from a Sentence,"Kirill Gavrilyuk, University of Amsterdam; Amir Ghodrati, University of Amsterdam; zhenyang Li, University of Amsterdam; Cees Snoek, University of Amsterdam"1458,Poster,Actor and Action Video Segmentation from a Sentence,"Kirill Gavrilyuk, University of Amsterdam; Amir Ghodrati, University of Amsterdam; zhenyang Li, University of Amsterdam; Cees Snoek, University of Amsterdam"1462,Poster,Towards a Mathematical Understanding of the Difficulty in Learning with Feedforward Neural Networks,"Hao Shen, Fortiss GmbH"1467,Poster,Weakly-supervised Deep Convolutional Neural Network Learning for Facial Action Unit Intensity Estimation,"Yong Zhang, CASIA; Weiming Dong, ; Bao-Gang Hu, CASIA; Qiang Ji, RPI"1470,Oral,Maximum Classifier Discrepancy for Unsupervised Domain Adaptation,"Kuniaki Saito, The University of Tokyo; Kohei Watanabe, ; Yoshitaka Ushiku, ; Tatsuya Harada, University of Tokyo"1470,Poster,Maximum Classifier Discrepancy for Unsupervised Domain Adaptation,"Kuniaki Saito, The University of Tokyo; Kohei Watanabe, ; Yoshitaka Ushiku, ; Tatsuya Harada, University of Tokyo"1487,Poster,The power of ensembles for active learning in image classification,"William Beluch, Bosch Center for Artificial Intelligence; Tim Genewein, Robert Bosch Center for AI; Andreas Nürnberger, Otto-von-Guericke-Universität Magdeburg ; Jan Köhler, Bosch Center for AI"1493,Poster,Memory Based Online Learning of Deep Representations from Video Streams,"Federico Pernici, MICC University of Florence; federico Bartoli, Micc - University of Florence; Matteo Bruni, Micc - University of Florence; Alberto Del Bimbo, University of Florence"1494,Poster,Correlation Tracking via Joint Discrimination and Reliability Learning,"Chong Sun, Dalian University of Technology; Dong Wang, DUT; Huchuan Lu, Dalian University of Technology; Ming-Hsuan Yang, UC Merced"1497,Poster,Learning to Generate 
Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks,"Wei Xiong, University of Rochester; Wenhan Luo, Tencent AI Lab; Lin Ma, Tencent AI Lab; Wei Liu, ; Jiebo Luo, University of Rochester"1501,Poster,Learning Discriminative Evaluation Metrics for Image Captioning,"Yin Cui, CornellTech; Guandao Yang, Cornell University; Andreas Veit, Cornel Tech ; Xun Huang, ; Serge Belongie,"1502,Poster,Large Scale Fine-Grained Categorization and the Effectiveness of Domain-Specific Transfer Learning,"Yin Cui, CornellTech; Yang Song, Google; Chen Sun, Google; Andrew Howard, Google; Serge Belongie,"1508,Poster,Curve Reconstruction via the Global Statistics of Natural Curves,"Ehud Barnea, Ben-Gurion University; Ohad Ben-Shahar, Ben-Gurion University"1517,Poster,LAMV: Learning to align and match videos with kernelized temporal layers,"Lorenzo Baraldi, University of Modena; Matthijs Douze, ; Rita Cucchiara, ; Herve Jegou, Facebook AI Research"1520,Spotlight,Attentive Generative Adversarial Network for Raindrop Removal from A Single Image,"Rui Qian, Peking University; Robby Tan, Yale-NUS College Also, Electrical and Computer Engineering, NUS; Wenhan Yang, Peking University; Jiajun Su, Peking University; Jiaying Liu, Peking University"1520,Poster,Attentive Generative Adversarial Network for Raindrop Removal from A Single Image,"Rui Qian, Peking University; Robby Tan, Yale-NUS College Also, Electrical and Computer Engineering, NUS; Wenhan Yang, Peking University; Jiajun Su, Peking University; Jiaying Liu, Peking University"1522,Poster,Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment,"Li Ding, MIT; Chenliang Xu, University of Rochester"1524,Poster,Matryoshka Networks: Predicting 3D Geometry via Nested Shape Layers,"Stephan Richter, TU Darmstadt; Stefan Roth,"1531,Poster,Deep Semantic Face Deblurring,"Ziyi Shen, Beijing Institute of Technology; Wei-Sheng Lai, University of California, Merced; Tingfa Xu, Beijing Institute of 
Technology; Jan Kautz, NVIDIA; Ming-Hsuan Yang, UC Merced"1532,Oral,Detail-Preserving Pooling in Deep Networks,"Faraz Saeedan, TU Darmstadt; Nicolas Weber, ; Michael Goesele, TU Darmstadt; Stefan Roth,"1532,Poster,Detail-Preserving Pooling in Deep Networks,"Faraz Saeedan, TU Darmstadt; Nicolas Weber, ; Michael Goesele, TU Darmstadt; Stefan Roth,"1535,Spotlight,Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation,"Yen-Cheng Liu, National Taiwan University; Yu-Ying Yeh, National Taiwan University; Tzu-Chien Fu, Northwestern University; Wei-Chen Chiu, National Chiao Tung University; Sheng-De Wang, National Taiwan University; Yu-Chiang Frank Wang, Academia Sinica"1535,Poster,Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation,"Yen-Cheng Liu, National Taiwan University; Yu-Ying Yeh, National Taiwan University; Tzu-Chien Fu, Northwestern University; Wei-Chen Chiu, National Chiao Tung University; Sheng-De Wang, National Taiwan University; Yu-Chiang Frank Wang, Academia Sinica"1539,Poster,Visual to Sound: Generating Natural Sound for Videos in the Wild,"Yipin Zhou, UNC-Chapel Hill; Zhaowen Wang, Adobe; Chen Fang, Adobe Research; Trung Bui, ; Tamara Berg, University on North carolina"1543,Poster,Deep Reinforcement Learning of Region Proposal Networks for Object Detection,"Aleksis Pirinen, Lund University; Cristian Sminchisescu,"1549,Poster,When will you do what? 
- Anticipating Temporal Occurrences of Activities,"Alexander Richard, University of Bonn; Juergen Gall, University of Bonn, Germany; Yazan Abu Farha, University of Bonn"1550,Poster,Pixel-Wise Metric Learning for Blazingly Fast Video Object Segmentation,"Yuhua Chen, CVL@ETHZ; Jordi Pont-Tuset, ETHZ; Alberto Montes, ETHZ; Luc Van Gool, KTH"1552,Poster,Global versus Localized Generative Adversarial Nets,"Guo-Jun Qi, University of Central Florida; Liheng Zhang, University of Central Florida; Hao Hu, University of Central Florida"1561,Spotlight,SeGAN: Segmenting and Generating the Invisible,"KIANA EHSANI, 1993; Roozbeh Mottaghi, Allen Institute for Artificial Intelligence; Ali Farhadi,"1561,Poster,SeGAN: Segmenting and Generating the Invisible,"KIANA EHSANI, 1993; Roozbeh Mottaghi, Allen Institute for Artificial Intelligence; Ali Farhadi,"1562,Poster,Name-removed-for-review: A Multi-camera HD Dataset for Dense Unscripted Pedestrian Detection,"Tatjana Chavdarova, Idiap and EPFL; Pierre Baqué, EPFL; Andrii Maksai, ; STÉPHANE BOUQUET, EPFL; Cijo Jose, Idiap and EPFL; Louis Lettry, ETH Zürich; Francois Fleuret, Idiap Research Institute; Pascal Fua, ; Luc Van Gool, KTH"1564,Poster,DeepVoting: A Robust and Explainable Deep Network for Semantic Part Detection under Partial Occlusion,"Zhishuai Zhang, Johns Hopkins University; Cihang Xie, JHU; Jianyu Wang, ; Lingxi Xie, UCLA; Alan Yuille, JHU"1565,Poster,Data Distillation: Towards Omni-Supervised Learning,"Ilija Radosavovic, Facebook AI Research; Piotr Dollar, Facebook AI Research, Menlo Park, USA; Ross Girshick, ; Georgia Gkioxari, Facebook; Kaiming He,"1567,Spotlight,Deep Photo Enhancer: Unsupervised Learning of Image Enhancement from Photographs with GANs,"Yu-Sheng Chen, National Taiwan University; Yu-Ching Wang, National Taiwan University; Man-Hsin Kao, National Taiwan University; Yung-Yu Chuang, National Taiwan University"1567,Poster,Deep Photo Enhancer: Unsupervised Learning of Image Enhancement from Photographs with 
GANs,"Yu-Sheng Chen, National Taiwan University; Yu-Ching Wang, National Taiwan University; Man-Hsin Kao, National Taiwan University; Yung-Yu Chuang, National Taiwan University"1573,Poster,Neighbors Do Help: Deeply Exploiting Local Structures of Point Clouds,"Yiru Shen, Clemson University; Chen Feng, MERL; Yaoqing Yang, Carnegie Mellon University; Dong Tian, Mitsubishi Electric Research Laboratories"1575,Poster,Controllable Video Generation with Sparse Trajectories,"Zekun Hao, ; Xun Huang, ; Serge Belongie,"1580,Spotlight,Context Embedding Networks,"Kun ho Kim, Caltech; Oisin Mac Aodha, Caltech; Pietro Perona, California Institute of Technology, USA"1580,Poster,Context Embedding Networks,"Kun ho Kim, Caltech; Oisin Mac Aodha, Caltech; Pietro Perona, California Institute of Technology, USA"1584,Spotlight,PlaneNet: Piece-wise Planar Reconstruction from a Single RGB Image,"Chen Liu, WUSTL; Jimei Yang, ; Duygu Ceylan, ; Ersin Yumer, Argo AI; Yasutaka Furukawa,"1584,Poster,PlaneNet: Piece-wise Planar Reconstruction from a Single RGB Image,"Chen Liu, WUSTL; Jimei Yang, ; Duygu Ceylan, ; Ersin Yumer, Argo AI; Yasutaka Furukawa,"1589,Spotlight,Multi-Task Adversarial Network for Disentangled Feature Learning,"Yang Liu, University of Cambridge; Zhaowen Wang, Adobe; Hailin Jin, ; Ian Wassell,"1589,Poster,Multi-Task Adversarial Network for Disentangled Feature Learning,"Yang Liu, University of Cambridge; Zhaowen Wang, Adobe; Hailin Jin, ; Ian Wassell,"1590,Poster,Low-shot learning with large-scale diffusion,"Matthijs Douze, ; Arthur Szlam, Facebook AI Research; Bharath Hariharan, Cornell University; Herve Jegou, Facebook AI Research"1593,Spotlight,Learning from Synthetic Data: Semantic Segmentation using Generative Adversarial Networks,"Swami Sankaranarayanan, University of Maryland; Yogesh Balaji, University of Maryland; Arpit Jain, ; Ser-Nam Lim, GE Global Research; Rama Chellappa, University of Maryland, USA"1593,Poster,Learning from Synthetic Data: Semantic 
Segmentation using Generative Adversarial Networks,"Swami Sankaranarayanan, University of Maryland; Yogesh Balaji, University of Maryland; Arpit Jain, ; Ser-Nam Lim, GE Global Research; Rama Chellappa, University of Maryland, USA"1594,Spotlight,Sketch-a-Classifier: Sketch-based Photo Classifier Generation,"Conghui Hu, Queen Mary University of Londo; Da Li, ; Yi-Zhe Song, ; Tao Xiang, Queen Mary University of London; Timothy Hospedales, University of Edinburgh"1594,Poster,Sketch-a-Classifier: Sketch-based Photo Classifier Generation,"Conghui Hu, Queen Mary University of Londo; Da Li, ; Yi-Zhe Song, ; Tao Xiang, Queen Mary University of London; Timothy Hospedales, University of Edinburgh"1597,Spotlight,VizWiz Grand Challenge: Answering Visual Questions from Blind People,"Danna Gurari, University of Texas at Austin; Qing Li, USTC; Abigale Stangl, ; Anhong Guo, ; Chi Lin, ; Kristen Grauman, ; Jiebo Luo, University of Rochester; Jeffrey Bigham,"1597,Poster,VizWiz Grand Challenge: Answering Visual Questions from Blind People,"Danna Gurari, University of Texas at Austin; Qing Li, USTC; Abigale Stangl, ; Anhong Guo, ; Chi Lin, ; Kristen Grauman, ; Jiebo Luo, University of Rochester; Jeffrey Bigham,"1598,Poster,Learning to Look Around: Intelligently Exploring Unseen Environments for Unknown Tasks,"Dinesh Jayaraman, UT Austin ; Kristen Grauman,"1607,Poster,Direct Shape Regression Networks for End-to-End Face Alignment,"Xin Miao, UT Arlington; Xiantong Zhen, Beihang University; Vassilis Athitsos, University of Texas at Arlington; Xianglong Liu, Beihang University; Cheng Deng, Xidian University; Heng Huang, University of Pittsburgh"1620,Poster,Multi-scale Location-aware Kernel Representation for Object Detection,"Hao Wang, Harbin Institute of Technology; Qilong Wang, ; Mingqi Gao, Harbin Institute of Technology; Peihua Li, ; Wangmeng Zuo, Harbin Institute of Technology"1621,Spotlight,Multistage Adversarial Losses for Pose-Based Human Image Synthesis,"Chenyang Si, 
Institute of Automation, Chine; Wei Wang, ; Liang Wang, unknown; Tieniu Tan, NLPR China"1621,Poster,Multistage Adversarial Losses for Pose-Based Human Image Synthesis,"Chenyang Si, Institute of Automation, Chine; Wei Wang, ; Liang Wang, unknown; Tieniu Tan, NLPR China"1622,Poster,MoCoGAN: Decomposing Motion and Content for Video Generation,"Sergey Tulyakov, ; Ming-Yu Liu, NVIDIA; Xiaodong Yang, NVIDIA; Jan Kautz, NVIDIA"1630,Poster,Joint Pose and Expression Modeling for Facial Expression Recognition,"Feifei Zhang, Jiangsu University; Tianzhu Zhang, CASIA; Qirong Mao, Department of Computer Science and Communication Engineering, Jiangsu University; Changsheng Xu,"1632,Poster,Triplet-Center Loss for Multi-View 3D Object Retrieval,"Xinwei He, HUST; Yang Zhou, Huazhong University of Science and Technology; Zhichao Zhou, Huazhong University of Science and Technology; Song Bai, HUST; Xiang Bai, Huazhong University of Science and Technology"1635,Poster,Beyond Holistic Object Recognition: Enriching Image Understanding with Part States,"Cewu Lu, Shanghai Jiao Tong University; hao Su, ; CK Tang, HKUST"1640,Poster,Recurrent Residual Module for Fast Inference in Videos,"Bowen Pan, Shanghai Jiao Tong University; Wuwei Lin, Shanghai Jiao Tong University; Xiaolin Fang, Zhejiang University; Chaoqin Huang, Shanghai Jiaotong University; Bolei Zhou, Massachusetts Institute of Technology; Cewu Lu, Shanghai Jiao Tong University"1643,Spotlight,Environment Upgrade Reinforcement Learning for Non-differentiable Multi-stage Pipelines,"Shuqin Xie, SJTU; Cewu Lu, Shanghai Jiao Tong University; Zitian Chen, Fudan University; Chao Xu, Shanghai Jiao Tong University"1643,Poster,Environment Upgrade Reinforcement Learning for Non-differentiable Multi-stage Pipelines,"Shuqin Xie, SJTU; Cewu Lu, Shanghai Jiao Tong University; Zitian Chen, Fudan University; Chao Xu, Shanghai Jiao Tong University"1644,Spotlight,Separating Style and Content for Generalized Style Transfer,"Yexun Zhang, Shanghai 
Jiao Tong University; Ya Zhang, ; Wenbin Cai,"1644,Poster,Separating Style and Content for Generalized Style Transfer,"Yexun Zhang, Shanghai Jiao Tong University; Ya Zhang, ; Wenbin Cai,"1645,Poster,LiDAR-Video Driving Dataset: Learning Driving Policies Effectively,"Yiping Chen, Xiamen University; Jingkang Wang, Shanghai Jiao Tong University; Cewu Lu, Shanghai Jiao Tong University; Zhipeng Luo, Xiamen University; Jonathan Li, University of Waterloo; Han Xue, Shanghai Jiao Tong University; Cheng Wang, Xiamen University"1653,Poster,Geometry-Aware Scene Text Detection with Instance Transformation Network,"Fangfang Wang, Zhejiang University; Liming Zhao, Zhejiang University; Xi Li, Zhejiang University; Xinchao Wang, ; Dacheng Tao, University of Sydney"1664,Poster,Temporal Hallucinating for Action Recognition with Few Still Images,"Lei Zhou, ; Yali Wang, SIAT, CAS; Yu Qiao, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences"1672,Poster,Deep Sparse Coding for Invariant Multimodal Halle Berry Neurons,"Edward Kim, ; Darryl Hannan, ; Garrett Kenyon,"1676,Spotlight,Learning Spatial-Aware Regressions for Visual Tracking,"Chong Sun, DalianUniversityofTechnology; Dong Wang, DUT; Huchuan Lu, Dalian University of Technology; Ming-Hsuan Yang, UC Merced"1676,Poster,Learning Spatial-Aware Regressions for Visual Tracking,"Chong Sun, DalianUniversityofTechnology; Dong Wang, DUT; Huchuan Lu, Dalian University of Technology; Ming-Hsuan Yang, UC Merced"1679,Poster,Fusing Crowd Density Maps and Visual Object Trackers for People Tracking in Crowd Scenes,"Weihong Ren, City University of Hong Kong; Di Kang, ; Yandong Tang, Shenyang Institute of Automation, Chinese Academy of Sciences; Antoni Chan, City University of Hong Kong, Hong Kong"1688,Spotlight,Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation,"Pengyuan Lyu, Huazhong University of Science and Technology; Cong Yao, Huazhong University of Science and Technology; Wenhao Wu, 
Megvii; Shuicheng Yan, National University of Singapore; Xiang Bai, Huazhong University of Science and Technology"1688,Poster,Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation,"Pengyuan Lyu, Huazhong University of Science and Technology; Cong Yao, Huazhong University of Science and Technology; Wenhao Wu, Megvii; Shuicheng Yan, National University of Singapore; Xiang Bai, Huazhong University of Science and Technology"1696,Poster,Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis,"Seunghoon Hong, POSTECH; Dingdong Yang, University of Michigan; Jongwook Choi, University of Michigan; Honglak Lee, University of Michigan, USA"1697,Spotlight,Optimal Structured Light a la Carte,"Parsa Mirdehghan, University of Toronto; Wenzheng Chen, UofT; Kyros Kutulakos,"1697,Poster,Optimal Structured Light a la Carte,"Parsa Mirdehghan, University of Toronto; Wenzheng Chen, UofT; Kyros Kutulakos,"1699,Poster,FOTS: Fast Oriented Text Spotting with a Unified Network,"Xuebo Liu, SenseTime Group Ltd.; Ding Liang, Sensetime; Shi Yan, SenseTime; Dagui Chen, SenseTime; Yu Qiao, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Junjie Yan,"1704,Poster,Deep Marching Cubes: Learning Explicit Surface Representations,"Yiyi Liao, Zhejiang University; Simon Donné, Ghent University; Andreas Geiger, MPI Tuebingen / ETH Zuerich"1708,Poster,Learning 3D Shape Completion from Point Clouds with Weak Supervision,"David Stutz, MPI Saarbruecken; Andreas Geiger, MPI Tuebingen / ETH Zuerich"1716,Poster,A General Two-Step Quantization Approach for Low-bit Neural Networks with High Accuracy,"Peisong Wang, CASIA; Qinghao Hu, Chinese Academy of Sciences; Yifan Zhang, CASIA; Jian Cheng, Chinese Academy of Sciences"1717,Poster,Clinical Skin Lesion Diagnosis using Representations Inspired by Dermatologist Criteria,"Jufeng Yang, Nankai University; Xiaoxiao Sun, ; Jie Liang, ; Paul Rosin,"1719,Spotlight,RayNet: Learning Volumetric 3D 
Reconstruction with Ray Potentials,"Despoina Paschalidou, MPI Tuebingen; Carolin Schmitt, MPI Tuebingen; Osman Ulusoy, microsoft corporation; Luc Van Gool, KTH; Andreas Geiger, MPI Tuebingen / ETH Zuerich"1719,Poster,RayNet: Learning Volumetric 3D Reconstruction with Ray Potentials,"Despoina Paschalidou, MPI Tuebingen; Carolin Schmitt, MPI Tuebingen; Osman Ulusoy, microsoft corporation; Luc Van Gool, KTH; Andreas Geiger, MPI Tuebingen / ETH Zuerich"1720,Poster,Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition,"Jinmian Ye, University of Electronic Science and Technology of China; Linnan Wang, Brown; Guangxi Li, UESTC; Di Chen, ; Shandian Zhe, School of Computing, University of Utah; Zenglin Xu, University of Electronic Science and Technology of China"1724,Oral,Convolutional Neural Networks with Alternately Updated Clique,"Yibo Yang, Peking Univ.; Zhisheng Zhong, ; Tiancheng Shen, ; Zhouchen Lin, Peking University, China"1724,Poster,Convolutional Neural Networks with Alternately Updated Clique,"Yibo Yang, Peking Univ.; Zhisheng Zhong, ; Tiancheng Shen, ; Zhouchen Lin, Peking University, China"1736,Poster,Deep Progressive Reinforcement Learning for Skeleton-based Action Recognition,"Yansong Tang, Tsinghua University; Yi Tian, ; Peiyang Li, ; Jiwen Lu, Tsinghua University; Jie Zhou,"1738,Poster,Regularizing RNNs for Caption Generation by Reconstructing The Past with The Present,"Xinpeng Chen, Wuhan University; Lin Ma, Tencent AI Lab; Wenhao Jiang, Tencent AI Lab; Jian Yao, ; Wei Liu,"1746,Poster,Dimensionality's Blessing: Detecting the distributions underlying images,"Wen-Yan Lin, ADSC; Yasuyuki Matsushita, Osaka University; Siying Liu, I2r.a-star.edu.sg; Jianhuang Lai, Sun Yat-sen University"1757,Poster,Learning to Promote Saliency Detectors,"Yu Zeng, Dalian University of Technology; Huchuan Lu, Dalian University of Technology; Lihe Zhang, Dalian University of Technology; Mengyang Feng, DUT, student; Ali Borji, UCF"
1761,Poster,Fully Convolutional Adaptation Networks for Semantic Segmentation,"Yiheng Zhang, University of Science and Technology of China; Zhaofan Qiu, University of Science and Technology of China; Ting Yao, Microsoft Research Asia; Dong Liu, Univ Sci Tech China; Tao Mei, Microsoft Research Asia"1765,Poster,Object Referring in Videos with Language and Human Gaze,"Arun Balajee Vasudevan, ETH Zurich; Dengxin Dai, ETH Zurich; Luc Van Gool, KTH"1772,Spotlight,Learning Pose Specific Representations by Predicting different Views,"Georg Poier, Graz University of Technology; David Schinagl, ; Horst Bischof,"1772,Poster,Learning Pose Specific Representations by Predicting different Views,"Georg Poier, Graz University of Technology; David Schinagl, ; Horst Bischof,"1773,Poster,Feature Mapping for Learning Fast and Accurate 3D Pose Inference from Synthetic Images,"Mahdi Rad, TUG; Markus Oberweger, ; Vincent Lepetit, TU Graz"1775,Spotlight,A Papier-Mâché Approach to Learning 3D Surface Generation,"Thibault GROUEIX, École des Ponts ParisTech; Bryan Russell, Adobe; Matthew Fisher, Adobe Systems; Mathieu Aubry, ; Vladimir Kim, Adobe Research"1775,Poster,A Papier-Mâché Approach to Learning 3D Surface Generation,"Thibault GROUEIX, École des Ponts ParisTech; Bryan Russell, Adobe; Matthew Fisher, Adobe Systems; Mathieu Aubry, ; Vladimir Kim, Adobe Research"1790,Poster,Deep PhaseNet for Video Frame Interpolation,"Simone Meyer, ETH Zurich; Abdelaziz Djelouah, The Walt Disney Company; Christopher Schroers, Disney Research Zurich; Brian McWilliams, ; Alexander Sorkine-Hornung, ; Markus Gross,"1792,Poster,Non-blind Deblurring: Handling Kernel Uncertainty with CNNs,"Subeesh Vasu, IIT Madras; Venkatesh Reddy Maligireddy, IIT Madras; A.N. 
Rajagopalan, IIT Madras"1797,Poster,CosFace: Large Margin Cosine Loss for Deep Face Recognition,"Hao Wang, ; Yitong Wang, Tencent AI Lab; Zheng Zhou, ; xing Ji, ; Dihong Gong, ; Zhifeng Li, ; Jingchao Zhou, ; Wei Liu,"1799,Poster,Lightweight Probabilistic Deep Networks,"Jochen Gast, TU Darmstadt; Stefan Roth,"1800,Poster,Occlusion-Aware Rolling Shutter Rectification of 3D Scenes,"Subeesh Vasu, IIT Madras; Mahesh Mohan M R, IIT Madras; A.N. Rajagopalan, IIT Madras"1801,Spotlight,Disentangled Person Image Generation,"Liqian Ma, KU Leuven; Qianru Sun, MPI for Informatics; Stamatios Georgoulis, KU Leuven; Mario Fritz, MPI, Saarbrucken, Germany; Bernt Schiele, MPI Informatics Germany; Luc Van Gool, KU Leuven"1801,Poster,Disentangled Person Image Generation,"Liqian Ma, KU Leuven; Qianru Sun, MPI for Informatics; Stamatios Georgoulis, KU Leuven; Mario Fritz, MPI, Saarbrucken, Germany; Bernt Schiele, MPI Informatics Germany; Luc Van Gool, KU Leuven"1815,Poster,CRRN: Multi-Scale Guided Concurrent Reflection Removal Network,"Renjie Wan, Nanyang Technological Universi; Boxin Shi, Peking University; Ling-Yu Duan, ; Ah-Hwee Tan, ; Alex Kot,"1817,Poster,Natural and Effective Obfuscation by Head Inpainting,"Qianru Sun, MPI for Informatics; Liqian Ma, KU Leuven; Seong Joon Oh, MPI-INF; Mario Fritz, MPI, Saarbrucken, Germany; Luc Van Gool, KU Leuven; Bernt Schiele, MPI Informatics Germany"1830,Oral,Deep Learning of Graph Matching,"Andrei Zanfir, IMAR and Lund University; Cristian Sminchisescu,"1830,Poster,Deep Learning of Graph Matching,"Andrei Zanfir, IMAR and Lund University; Cristian Sminchisescu,"1838,Poster,What do Deep Networks Like to See?,"Sebastian Palacio, DFKI; Joachim Folz, DFKI; Andreas Dengel, DFKI; Jörn Hees, DFKI; Federico Raue, DFKI"1859,Spotlight,FSRNet: End-to-End Learning Face Super-Resolution with Facial Priors,"Yu Chen, NUST; Ying Tai, Tencent; Xiaoming Liu, Michigan State University; Chunhua Shen, University of Adelaide; Jian Yang, Nanjing 
University of Science and Technology"1859,Poster,FSRNet: End-to-End Learning Face Super-Resolution with Facial Priors,"Yu Chen, NUST; Ying Tai, Tencent; Xiaoming Liu, Michigan State University; Chunhua Shen, University of Adelaide; Jian Yang, Nanjing University of Science and Technology"1860,Poster,Person Re-identification with Cascaded Pairwise Convolutions,"Yicheng Wang, ; Zhenzhong Chen, Wuhan University; Feng Wu, ; Gang Wang,"1877,Poster,DA-GAN: Instance-level Image Translation by Deep Attention Generative Adversarial Network,"Shuang Ma, SUNY Buffalo; Jianlong Fu, ; Chang Chen, ; Tao Mei, Microsoft Research Asia"1880,Poster,Deep Cocktail Networks: Multi-source Unsupervised Domain Adaptation with Category Shift,"Ruijia Xu, Sun Yat-sen University; Ziliang Chen, Sun Yat-sen University; Wangmeng Zuo, Harbin Institute of Technology; Junjie Yan, ; Liang Lin,"1883,Spotlight,Learning deep structured active contours end-to-end,"Diego Marcos, ; Devis Tuia, Wageningen University; Benjamin Kellenberger, Wageningen University and Research; Lisa Zhang, University of Toronto; Min Bai, ; Renjie Liao, ; Raquel Urtasun, University of Toronto"1883,Poster,Learning deep structured active contours end-to-end,"Diego Marcos, ; Devis Tuia, Wageningen University; Benjamin Kellenberger, Wageningen University and Research; Lisa Zhang, University of Toronto; Min Bai, ; Renjie Liao, ; Raquel Urtasun, University of Toronto"1887,Poster,Unsupervised Domain Adaptation with Similarity-Based Classifier,"Pedro Pinheiro, EPFL"1893,Poster,Tags2Parts: Discovering Semantic Regions from Shape Tags,"Sanjeev Muralikrishnan, IIT Bombay; Vladimir Kim, Adobe Research; Siddhartha Chaudhuri, IIT Bombay"1895,Poster,A Hierarchical Generative Model for Eye Image Synthesis and Eye Gaze Estimation,"Kang Wang, RPI; Rui Zhao, Rensselaer Polytechnic Institu; Qiang Ji, RPI"1904,Poster,Improved Lossy Image Compression with Priming and Spatially Adaptive Bit Rates for Recurrent Networks,"Nick Johnston, Google; 
Damien Vincent, google.com; David Minnen, google.com; Michele Covell, google.com; Saurabh Singh, Univ. of Illinois at Urbana-Champaign; Sung Jin Hwang, google.com; George Toderici, Google; Troy Chinen, google.com; Joel Shor, google.com"1910,Poster,Neural Sign Language Translation,"Necati Cihan Camgoz, CVSSP; Simon Hadfield, ; Richard Bowden, University of Surrey UK; Oscar Koller, ; Hermann Ney,"1914,Poster,3D Pose Estimation and 3D Model Retrieval for Objects in the Wild,"Alexander Grabner, Graz University of Technology; Peter Roth, Graz University of Technology; Vincent Lepetit, TU Graz"1915,Poster,Large-scale Point Cloud Semantic Segmentation with Superpoint Graphs,"Loic Landrieu, IGN; Martin Simonovsky, Universite Paris Est, ENPC"1916,Spotlight,The iNaturalist Species Classification and Detection Dataset,"Grant van Horn, California Institute of Technology; Oisin Mac Aodha, Caltech; Yang Song, Google; Yin Cui, CornellTech; Chen Sun, Google; Alex Shepard, iNaturalist; Hartwig Adam, Google; Pietro Perona, California Institute of Technology, USA; Serge Belongie,"1916,Poster,The iNaturalist Species Classification and Detection Dataset,"Grant van Horn, California Institute of Technology; Oisin Mac Aodha, Caltech; Yang Song, Google; Yin Cui, CornellTech; Chen Sun, Google; Alex Shepard, iNaturalist; Hartwig Adam, Google; Pietro Perona, California Institute of Technology, USA; Serge Belongie,"1919,Spotlight,Teaching Categories to Human Learners with Visual Explanations,"Oisin Mac Aodha, Caltech; Shihan Su, Caltech; Yuxin Chen, Caltech; Pietro Perona, California Institute of Technology, USA; Yisong Yue,"1919,Poster,Teaching Categories to Human Learners with Visual Explanations,"Oisin Mac Aodha, Caltech; Shihan Su, Caltech; Yuxin Chen, Caltech; Pietro Perona, California Institute of Technology, USA; Yisong Yue,"1924,Poster,Motion-Appearance Co-Memory Networks for Video Question Answering,"Jiyang Gao, ; Runzhou Ge, Univ. of Southern California; Kan Chen, Univ. 
of Southern California; Ram Nevatia,"1925,Poster,Temporal Deformable Residual Networks for Action Segmentation in Videos,"Peng Lei, Oregon State University; Sinisa Todorovic,"1926,Spotlight,Actor and Observer: Joint Modeling of First and Third-Person Videos,"Gunnar Sigurdsson, CMU; Cordelia Schmid, INRIA Grenoble, France; Ali Farhadi, ; Abhinav Gupta, ; Karteek Alahari,"1926,Poster,Actor and Observer: Joint Modeling of First and Third-Person Videos,"Gunnar Sigurdsson, CMU; Cordelia Schmid, INRIA Grenoble, France; Ali Farhadi, ; Abhinav Gupta, ; Karteek Alahari,"1928,Spotlight,Going from Image to Video Saliency: Augmenting Image Salience with Dynamic Attentional Push,"Siavash Gorji, McGill University; James Clark, McGill University"1928,Poster,Going from Image to Video Saliency: Augmenting Image Salience with Dynamic Attentional Push,"Siavash Gorji, McGill University; James Clark, McGill University"1931,Poster,Spatially-Adaptive Filter Units for Deep Neural Networks,"Domen Tabernik, University of Ljubljana; Matej Kristan, University of Ljubljana; Ales Leonardis, University of Birmingham, UK"1939,Poster,Boundary Flow: A Siamese Network that Predicts Boundary Motion without Training on Motion,"Peng Lei, Oregon State University; Fuxin Li, Oregon State University; Sinisa Todorovic,"1944,Poster,DeblurGAN: Blind Motion Deblurring Using Conditional Adversarial Networks,"Orest Kupyn, Ukrainian Catholic University; Volodymyr Budzan, Ukrainian Catholic University; Mykola Mykhailych, UCU; Dmytro Mishkin, Czech Technical University; Jiri Matas,"1946,Poster,Discriminability objective for training descriptive captions,"Ruotian Luo, Toyota Technological Institute; Scott Cohen, ; Brian Price, ; Greg Shakhnarovich,"1949,Poster,Rolling Shutter and Radial Distortion are Features for High Frame Rate Multi-camera Tracking,"Akash Bapat, UNC Chapel Hill; Jan-Michael Frahm, UNC Chapel Hill; True Price, UNC Chapel Hill"1956,Spotlight,Language-Based Image Editing with Recurrent 
attentive Models,"Yelong Shen, Microsoft; Jianbo Chen, UC Berkeley; Jianfeng Gao, ; JingJing Liu, Microsoft; Xiaodong Liu, Microsoft"1956,Poster,Language-Based Image Editing with Recurrent attentive Models,"Yelong Shen, Microsoft; Jianbo Chen, UC Berkeley; Jianfeng Gao, ; JingJing Liu, Microsoft; Xiaodong Liu, Microsoft"1957,Spotlight,SBNet: Sparse Blocks Network for Fast Inference,"Mengye Ren, Uber ATG; Andrei Pokrovsky, Uber ATG; Bin Yang, Uber ATG, UofT; Raquel Urtasun, University of Toronto"1957,Poster,SBNet: Sparse Blocks Network for Fast Inference,"Mengye Ren, Uber ATG; Andrei Pokrovsky, Uber ATG; Bin Yang, Uber ATG, UofT; Raquel Urtasun, University of Toronto"1959,Spotlight,Learning Compositional Visual Concepts with Mutual Consistency,"Yunye Gong, Cornell University; Srikrishna Karanam, Siemens Corporate Technology; Ziyan Wu, Siemens Corporation; Kuan-Chuan Peng, Siemens Corporation; Jan Ernst, Siemens Corporation; Peter Doerschuk, Cornell University"1959,Poster,Learning Compositional Visual Concepts with Mutual Consistency,"Yunye Gong, Cornell University; Srikrishna Karanam, Siemens Corporate Technology; Ziyan Wu, Siemens Corporation; Kuan-Chuan Peng, Siemens Corporation; Jan Ernst, Siemens Corporation; Peter Doerschuk, Cornell University"1960,Poster,Learning Deep Sketch Abstraction,"Umar Riaz Muhammad, Queen Mary Uni of London; Yongxin Yang, Queen Mary University of London; Yi-Zhe Song, ; Tao Xiang, Queen Mary University of London; Timothy Hospedales, University of Edinburgh"1969,Spotlight,Learning to Extract a Video Sequence from a Single Motion-Blurred Image,"Meiguang Jin, University of Bern, Switzerlan; Givi Meishvili, University of Bern, Switzerland; Paolo Favaro, Bern University, Switzerland"1969,Poster,Learning to Extract a Video Sequence from a Single Motion-Blurred Image,"Meiguang Jin, University of Bern, Switzerlan; Givi Meishvili, University of Bern, Switzerland; Paolo Favaro, Bern University, Switzerland"1978,Oral,Synthesizing Images 
of Humans in Unseen Poses,"Guha Balakrishnan, MIT; Adrian Dalca, ; Amy Zhao, MIT; Fredo Durand, ; John Guttag,"1978,Poster,Synthesizing Images of Humans in Unseen Poses,"Guha Balakrishnan, MIT; Adrian Dalca, ; Amy Zhao, MIT; Fredo Durand, ; John Guttag,"1981,Poster,Learning to See in the Dark,"Chen Chen, UIUC; Qifeng Chen, Intel Labs; Jia Xu, Tencent AI Lab; Vladlen Koltun, Intel Labs"1988,Oral,Neural Inverse Kinematics for Unsupervised Motion Retargetting,"Ruben Villegas, University of Michigan; Jimei Yang, ; Duygu Ceylan, ; Honglak Lee, University of Michigan, USA"1988,Poster,Neural Inverse Kinematics for Unsupervised Motion Retargetting,"Ruben Villegas, University of Michigan; Jimei Yang, ; Duygu Ceylan, ; Honglak Lee, University of Michigan, USA"1989,Poster,Eliminating Background-bias for Robust Person Re-identification,"Maoqing Tian, Sensetime Limited; Shuai Yi, The Chinese University of Hong Kong; Hongsheng Li, ; Shihua Li, ; Xuesen Zhang, SenseTime; Jianping Shi, SenseTime; Junjie Yan, ; Xiaogang Wang, Chinese University of Hong Kong"1990,Poster,Uncalibrated Photometric Stereo under Natural Illumination,"Zhipeng Mo, ; Boxin Shi, Peking University; Feng Lu, U. Tokyo; Sai-Kit Yeung, ; Yasuyuki Matsushita, Osaka University"1991,Poster,A2-RL: Aesthetics Aware Reinforcement Learning for Image Cropping,"Debang Li, CASIA; Huikai Wu, CASIA; Junge Zhang, ; Kaiqi Huang,"2013,Poster,Weakly Supervised Action Localization by Sparse Temporal Pooling Network,"Phuc Nguyen, University of California, Irvine; Ting Liu, Google, Inc.; Gautam Prasad, Google, Inc.; Bohyung Han, Seoul National University"2018,Poster,Very Large-Scale Global SfM by Distributed Motion Averaging,"Siyu Zhu, HKUST; Runze Zhang, HKUST; Lei Zhou, HKUST; Tianwei Shen, HKUST; Tian Fang, HKUST; Ping Tan, ; Long Quan, The Hong Kong University of Science and Technology, Hong Kong"2021,Poster,ID-GAN: Learning a Symmetry Three-Player GAN for Identity-Preserving Face Synthesis,"Yujun Shen, Dept. 
of IE, CUHK; Ping Luo, The Chinese University of Hong Kong; Junjie Yan, ; Xiaogang Wang, Chinese University of Hong Kong; Xiaoou Tang, Chinese University of Hong Kong"2022,Spotlight,DenseASPP: Densely Connected Networks for Semantic Segmentation,"Maoke Yang, DeepMotion; Kun Yu, DeepMotion; Kuiyuan Yang, DeepMotion"2022,Poster,DenseASPP: Densely Connected Networks for Semantic Segmentation,"Maoke Yang, DeepMotion; Kun Yu, DeepMotion; Kuiyuan Yang, DeepMotion"2040,Poster,DVQA: Understanding Data Visualization via Question Answering,"Kushal Kafle, ; Brian Price, ; Scott Cohen, ; Christopher Kanan, RIT"2044,Spotlight,iVQA: Inverse Visual Question Answering,"Feng Liu, Southeast Univeristy; Tao Xiang, Queen Mary University of London; Timothy Hospedales, University of Edinburgh; Wankou Yang, Southeast University; Changyin Sun, Southeast University"2044,Poster,iVQA: Inverse Visual Question Answering,"Feng Liu, Southeast Univeristy; Tao Xiang, Queen Mary University of London; Timothy Hospedales, University of Edinburgh; Wankou Yang, Southeast University; Changyin Sun, Southeast University"2048,Poster,Globally Optimal Inlier Set Maximization for Atlanta Frame Estimation,"Kyungdon Joo, ; Tae-Hyun Oh, MIT; In So Kweon, KAIST; Jean-Charles Bazin, KAIST"2058,Spotlight,Recurrent Slice Networks for 3D Segmentation on Point Clouds,"Qiangui Huang, U of Southern CA; Weiyue Wang, USC; Ulrich Neumann, USC"2058,Poster,Recurrent Slice Networks for 3D Segmentation on Point Clouds,"Qiangui Huang, U of Southern CA; Weiyue Wang, USC; Ulrich Neumann, USC"2064,Poster,End-to-end Convolutional Semantic Embeddings,"Quanzeng You, Microsoft; Zhengyou Zhang, Microsoft Research; Jiebo Luo, University of Rochester"2082,Spotlight,"The Easy, The Medium and The Hard: Adapting Across Varied Domain Shifts","Swami Sankaranarayanan, University of Maryland; Yogesh Balaji, University of Maryland; Carlos Castillo, ; Rama Chellappa, University of Maryland, USA"2082,Poster,"The Easy, The Medium and The 
Hard: Adapting Across Varied Domain Shifts","Swami Sankaranarayanan, University of Maryland; Yogesh Balaji, University of Maryland; Carlos Castillo, ; Rama Chellappa, University of Maryland, USA"2089,Poster,Visual Question Answering with Memory-Augmented Networks,"Chao Ma, ; Chunhua Shen, University of Adelaide; Anthony Dick, University of Adelaide; Qi Wu, University of Adelaide; Peng Wang, The University of Adelaide; Anton Van den Hengel, University of Adelaide; Ian Reid,"2096,Spotlight,InLoc: Indoor Visual Localization with Dense Matching and View Synthesis,"Hajime Taira, Tokyo Institute of Technology; Masatoshi Okutomi, Tokyo Institute of Technology; Torsten Sattler, ETH Zurich; Mircea Cimpoi, Czech Institute of Informatics; Marc Pollefeys, ETH; Josef Sivic, ; Tomas Pajdla, ; Akihiko Torii, Tokyo Institute of Technology"2096,Poster,InLoc: Indoor Visual Localization with Dense Matching and View Synthesis,"Hajime Taira, Tokyo Institute of Technology; Masatoshi Okutomi, Tokyo Institute of Technology; Torsten Sattler, ETH Zurich; Mircea Cimpoi, Czech Institute of Informatics; Marc Pollefeys, ETH; Josef Sivic, ; Tomas Pajdla, ; Akihiko Torii, Tokyo Institute of Technology"2097,Poster,MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition,"Yizhou Zhou, Univ of Scienc.&Tech. 
of China; Xiaoyan Sun, Microsoft; Zheng-Jun Zha, ; Wenjun Zeng,"2102,Poster,Content-Sensitive Supervoxels via Uniform Tessellations on Video Manifolds,"Ran Yi, Tsinghua University; Yong-Jin Liu, ; Yu-Kun Lai, Cardiff University"2108,Spotlight,Weakly Supervised Coupled Networks for Visual Sentiment Analysis,"Jufeng Yang, Nankai University; Dongyu She, ; Yu-Kun Lai, Cardiff University; Paul Rosin, ; Ming-Hsuan Yang, UC Merced"2108,Poster,Weakly Supervised Coupled Networks for Visual Sentiment Analysis,"Jufeng Yang, Nankai University; Dongyu She, ; Yu-Kun Lai, Cardiff University; Paul Rosin, ; Ming-Hsuan Yang, UC Merced"2111,Poster,3D Semantic Trajectory Reconstruction from 3D Pixel Continuum,"Jae Yoon, ; Ziwei Li, UMN; Hyun Park,"2113,Spotlight,End-to-End Learning of Motion Representation for Video Understanding,"Lijie Fan, Tsinghua University; Wenbing Huang, Tencent AI Lab; Chuang Gan, Tsinghua University; Stefano Ermon, Stanford University; Junzhou Huang, UT Arlingtron; Boqing Gong, University of Central Florida"2113,Poster,End-to-End Learning of Motion Representation for Video Understanding,"Lijie Fan, Tsinghua University; Wenbing Huang, Tencent AI Lab; Chuang Gan, Tsinghua University; Stefano Ermon, Stanford University; Junzhou Huang, UT Arlingtron; Boqing Gong, University of Central Florida"2114,Poster,Structure Inference Net: Object Detection Using Scene-Level Context and Instance-Level Relationships,"Yong Liu, ICT; Ruiping Wang, Institute of Computing Technology, Chinese Academy of Sciences; Shiguang Shan, Chinese Academy of Sciences; Xilin Chen,"2123,Poster,Feature Selective Networks for Object Detection,"Yao Zhai, University of Science and Technology of China; Jingjing Fu, ; Yan Lu, ; Houqiang Li,"2129,Poster,High-speed Tracking with Multi-kernel Correlation Filters,"Ming Tang, NLPR, IA, CAS; Bin Yu, NLPR, IA, CAS; Fan Zhang, BUPT; Jinqiao Wang,"2130,Spotlight,Weakly Supervised Human Body Part Parsing via Pose-Guided Knowledge Transfer,"Hao-Shu 
Fang, Shanghai Jiao Tong University; Guansong Lu, Shanghai Jiao Tong University; Xiaolin Fang, Zhejiang University; Yu-Wing Tai, Tencent YouTu; Cewu Lu, Shanghai Jiao Tong University"2130,Poster,Weakly Supervised Human Body Part Parsing via Pose-Guided Knowledge Transfer,"Hao-Shu Fang, Shanghai Jiao Tong University; Guansong Lu, Shanghai Jiao Tong University; Xiaolin Fang, Zhejiang University; Yu-Wing Tai, Tencent YouTu; Cewu Lu, Shanghai Jiao Tong University"2131,Poster,Semantic Video Segmentation by Gated Recurrent Flow Propagation,"David Nilsson, Lund University; Cristian Sminchisescu,"2135,Poster,A Constrained Deep Neural Network for Ordinal Regression,"Yanzhu Liu, Nanyang Technological Universi; Adams Kong, NTU Singapore; Chi Keong Goh, Rolls-Royce Advanced Technology Centre"2136,Poster,Encoding Crowd Interaction with Deep Neural Network for Pedestrian Trajectory Prediction,"Yanyu Xu, Shanghaitech University; Zhixin Piao, ; Shenghua Gao, ShanghaiTech University"2138,Poster,Spline Error Weighting for Robust Visual-Inertial Fusion,"Hannes Ovrén, Linköping University; Per-Erik Forssen, Linkoping University"2139,Poster,Mean-Variance Loss for Deep Age Estimation from a Face,"Hongyu Pan, Institute of Computing Technol; Hu Han, ; Shiguang Shan, Chinese Academy of Sciences; Xilin Chen,"2152,Poster,Pose-Robust Face Recognition via Deep Residual Equivariant Mapping,"Kaidi Cao, Tsinghua University; Yu Rong, CUHK; Cheng Li, SenseTime; Chen-Change Loy, the Chinese University of Hong Kong"2161,Spotlight,Viewpoint-aware Video Summarization,"Atsushi Kanehira, University of Tokyo; Luc Van Gool, KTH; Yoshitaka Ushiku, ; Tatsuya Harada, University of Tokyo"2161,Poster,Viewpoint-aware Video Summarization,"Atsushi Kanehira, University of Tokyo; Luc Van Gool, KTH; Yoshitaka Ushiku, ; Tatsuya Harada, University of Tokyo"2162,Poster,Statistical Tomography of Microscopic Life,"Aviad Levis, Technion Institute of Technology; Ronen Talmon, Technion - Israel Institute of 
Technology; Yoav Schechner, Technion Haifa, Israel"2165,Poster,Divide and Conquer for Full-Resolution Light Field Deblurring,"Mahesh Mohan M R, IIT Madras; A.N. Rajagopalan, IIT Madras"2172,Poster,Conditional Probability Models for Deep Image Compression,"Eirikur Agustsson, ETH Zurich; Fabian Mentzer, ETHZ Zürich; Michael Tschannen, ETH Zurich; Radu Timofte, ETH Zurich; Luc Van Gool, KTH"2194,Oral,Direction-aware Spatial Context Features for Shadow Detection,"Xiaowei Hu, CUHK; Lei Zhu, ; Chi-Wing Fu, ; Jing Qin, The Hong Kong Polytechnic University; Pheng-Ann Heng,"2194,Poster,Direction-aware Spatial Context Features for Shadow Detection,"Xiaowei Hu, CUHK; Lei Zhu, ; Chi-Wing Fu, ; Jing Qin, The Hong Kong Polytechnic University; Pheng-Ann Heng,"2195,Poster,Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification,"Xiang Long, Tsinghua University; Chuang Gan, Tsinghua University; Gerard De Melo, Rutgers University; Jiajun Wu, MIT; Xiao Liu, ; Shilei Wen, Baidu Research"2198,Poster,Occluded Pedestrian Detection through Guided Attention in CNNs,"Shanshan Zhang, MPI; Jian Yang, Nanjing University of Science and Technology; Bernt Schiele, MPI Informatics Germany"2201,Poster,SO-Net: Self-Organizing Network for Point Cloud Analysis,"Jiaxin Li, National University of Singapore; Ben Chen, National Univ of Singapore; Gim Hee Lee, National University of Singapore"2205,Poster,CartoonGAN: Generative Adversarial Networks for Photo Cartoonization,"Yang Chen, Tsinghua University; Yu-Kun Lai, Cardiff University; Yong-Jin Liu,"2225,Poster,Conditional Image-to-Image Translation,"Jianxin Lin, USTC; Yingce Xia, ; Tao Qin, ; Zhibo Chen, ; Tie-Yan Liu,"2231,Poster,Human Appearance Transfer,"Mihai Zanfir, IMAR and Lund University; Alin-Ionut Popa, IMAR; Andrei Zanfir, IMAR and Lund University; Cristian Sminchisescu,"2235,Poster,Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes,"Elisabeta Marinoiu, IMAR and Lund 
University; Andrei Zanfir, IMAR and Lund University; Cristian Sminchisescu,"2236,Poster,Egocentric Basketball Motion Planning from a Single First-Person Image,"Gedas Bertasius, University of Pennsylvania; Aaron Chan, U. of Southern California; Jianbo Shi, University of Pennsylvania, USA"2237,Poster,SGAN: An Alternative Training of Generative Adversarial Networks,"Tatjana Chavdarova, Idiap and EPFL; Francois Fleuret, Idiap Research Institute"2240,Poster,3D Human Pose Reconstruction and Action Classification in Robot Assisted Therapy of Children with Autism,"Elisabeta Marinoiu, IMAR and Lund University; Mihai Zanfir, IMAR and Lund University ; Vlad Olaru, ; Cristian Sminchisescu,"2253,Poster,Zero-Shot Super-Resolution using Deep Internal Learning,"Assaf Shocher, Weizmann institut of Science; Michal Irani, Weizmann Institute of Science; Nadav Cohen, Institute for Advanced Study"2257,Poster,Deep Diffeomorphic Transformer Networks,"Nicki Skafte Detlefsen, DTU; Oren Freifeld, Ben-Gurion University; Soren Hauberg, Technical University of Denmark"2260,Poster,Single Image Dehazing via Conditional Generative Adversarial Network,"Runde Li, NJUST; Jinshan Pan, UC Merced; Zechao Li, Nanjing University of Science and Technology ; Jinhui Tang,"2261,Spotlight,Who's Better? Who's Best? Pairwise Deep Ranking for Skill Determination,"Hazel Doughty, University of Bristol; Dima Damen, University of Bristol; Walterio Mayol-Cuevas,"2261,Poster,Who's Better? Who's Best? 
Pairwise Deep Ranking for Skill Determination,"Hazel Doughty, University of Bristol; Dima Damen, University of Bristol; Walterio Mayol-Cuevas,"2266,Spotlight,HSA-RNN: Hierarchical Structure-Adaptive RNN for Video Summarization,"Bin Zhao, Northwestern Polytechnical Uni; Xuelong Li, ; Xiaoqiang Lu,"2266,Poster,HSA-RNN: Hierarchical Structure-Adaptive RNN for Video Summarization,"Bin Zhao, Northwestern Polytechnical Uni; Xuelong Li, ; Xiaoqiang Lu,"2268,Poster,"Detect globally, refine locally: A novel approach to saliency detection","TIANTIAN WANG, Dalian University of Technolog; Lihe Zhang, Dalian University of Technology; Huchuan Lu, Dalian University of Technology; Ali Borji, UCF"2285,Poster,Improving Landmark Localization with Semi-Supervised Learning,"Sina Honari, University of Montreal; Pavlo Molchanov, NVIDIA Research; Jan Kautz, NVIDIA; Stephen Tyree, ; Christopher Pal, Ecole Polytechnique de Montreal; Pascal Vincent, University of Montreal"2290,Poster,Reward Learning by Instruction,"Hsiao-Yu Tung, Carnegie Mellon University; Adam Harley, Carnegie Mellon University; Katerina Fragkiadaki, Carnegie Mellon University"2291,Poster,The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks,"Maxim Berman, ESAT-PSI, KU Leuven; Amal Rannen Triki, KU Leuven; Matthew Blaschko, KU Leuven"2300,Poster,Facial Expression Recognition by De-expression Residue Learning,"Huiyuan Yang, Binghamton University-SUNY; Umur Ciftci, Binghamton University-SUNY; Lijun Yin, Binghamton University State University of New York"2302,Poster,Learning to Understand Image Blur,"Shanghang Zhang, ; Xiaohui Shen, Adobe Research; Zhe Lin, Adobe Systems, Inc.; Radomír Měch, ; João Costeira, ; Jose Moura, Carnegie Mellon University"2307,Poster,Hierarchical Novelty Detection for Visual Object Recognition,"Kibok Lee, University of Michigan; Kimin Lee, KAIST; Kyle Min, University of Michigan; Yuting Zhang, University of Michigan; 
Jinwoo Shin, KAIST; Honglak Lee, University of Michigan, USA"2312,Poster,Learning a Discriminative Filter Bank within a CNN for Fine-grained Recognition,"Yaming Wang, University of Maryland; Vlad Morariu, University of Maryland; Larry Davis, University of Maryland, USA"2313,Poster,Adversarial Data Programming: Using GANs to Relax the Bottleneck of Curated Labeled Data,"Arghya Pal, Indian Institute of Technology; Vineeth Balasubramanian, IIT Hyderabad"2318,Poster,Compare and Contrast: Learning Prominent Visual Differences,"Steven Chen, University of Texas at Austin; Kristen Grauman,"2335,Poster,SketchyGAN: Towards Diverse and Realistic Sketch to Image Synthesis,"Wengling Chen, Georgia Institute of Technolog; James Hays, Georgia Tech"2337,Poster,Grounding Referring Expressions in Images by Variational Context,"Hanwang Zhang, Columbia University; Yulei Niu, Renmin University of China; Shih-Fu Chang,"2342,Spotlight,Multi-Content GAN for Few-Shot Font Style Transfer,"Samaneh Azadi, UC Berkeley; Matthew Fisher, Adobe; Vladimir Kim, Adobe Research; Zhaowen Wang, Adobe; Eli Shechtman, Adobe Research; Trevor Darrell, UC Berkeley, USA"2342,Poster,Multi-Content GAN for Few-Shot Font Style Transfer,"Samaneh Azadi, UC Berkeley; Matthew Fisher, Adobe; Vladimir Kim, Adobe Research; Zhaowen Wang, Adobe; Eli Shechtman, Adobe Research; Trevor Darrell, UC Berkeley, USA"2344,Spotlight,LIME: Live Intrinsic Material Estimation,"Abhimitra Meka, Max Planck Institute for Infor; Maxim Maximov, Graduate School of Computer Science, Saarland University; Michael Zollhöfer, MPI Informatics; Avishek Chatterjee, Max Planck Institute for Informatics; Hans-Peter Seidel, Max Planck Institute for Informatics; Christian Richardt, University of Bath; Christian Theobalt, MPI Informatics"2344,Poster,LIME: Live Intrinsic Material Estimation,"Abhimitra Meka, Max Planck Institute for Infor; Maxim Maximov, Graduate School of Computer Science, Saarland University; Michael Zollhöfer, MPI Informatics; 
Avishek Chatterjee, Max Planck Institute for Informatics; Hans-Peter Seidel, Max Planck Institute for Informatics; Christian Richardt, University of Bath; Christian Theobalt, MPI Informatics"2347,Spotlight,Multi-Agent Diverse Generative Adversarial Networks,"Viveka Kulharia, University of Oxford; Arnab Ghosh, University of Oxford; Vinay P. Namboodiri, Indian Institute of Technology Kanpur; Phil Torr, Oxford; Puneet Kumar Dokania, University of Oxford"2347,Poster,Multi-Agent Diverse Generative Adversarial Networks,"Viveka Kulharia, University of Oxford; Arnab Ghosh, University of Oxford; Vinay P. Namboodiri, Indian Institute of Technology Kanpur; Phil Torr, Oxford; Puneet Kumar Dokania, University of Oxford"2351,Spotlight,Light field intrinsics with a deep encoder-decoder network,"Anna Alperovich, University of Konstanz; Ole Johannsen, University of Konstanz; Michael Strecke, University of Konstanz; Bastian Goldluecke,"2351,Poster,Light field intrinsics with a deep encoder-decoder network,"Anna Alperovich, University of Konstanz; Ole Johannsen, University of Konstanz; Michael Strecke, University of Konstanz; Bastian Goldluecke,"2354,Oral,Density Adaptive Point Set Registration,"Felix Järemo Lawin, Linköping University; Martin Danelljan, ; Fahad Khan, Computer Vision Laboratory, Linkoping University , Sweden; Per-Erik Forssen, Linkoping University; Michael Felsberg, Linköping University"2354,Poster,Density Adaptive Point Set Registration,"Felix Järemo Lawin, Linköping University; Martin Danelljan, ; Fahad Khan, Computer Vision Laboratory, Linkoping University , Sweden; Per-Erik Forssen, Linkoping University; Michael Felsberg, Linköping University"2355,Spotlight,Learning Monocular 3D Human Pose estimation on weakly-supervised Multi-view Images,"Helge Rhodin, epfl.ch; Jörg Spörri, Balgrist; Isinsu Katircioglu, EPFL Lausanne, Switzerland; Victor Constantin, EPFL; Frédéric Meyer, ; Erich Müller, ; Mathieu Salzmann, EPFL; Pascal Fua,"2355,Poster,Learning Monocular 3D Human 
Pose estimation on weakly-supervised Multi-view Images,"Helge Rhodin, epfl.ch; Jörg Spörri, Balgrist; Isinsu Katircioglu, EPFL Lausanne, Switzerland; Victor Constantin, EPFL; Frédéric Meyer, ; Erich Müller, ; Mathieu Salzmann, EPFL; Pascal Fua,"2361,Spotlight,Emotional Attention: A Study of Image Sentiment and Visual Attention,"Shaojing Fan, National University of Singapo; Zhiqi Shen, National University of Singapore; Ming Jiang, University of Minnesota; Bryan Koenig, Southern Utah University; Juan Xu, University of Minnesota; Mohan Kankanhalli, National University of Singapore; Qi Zhao,"2361,Poster,Emotional Attention: A Study of Image Sentiment and Visual Attention,"Shaojing Fan, National University of Singapo; Zhiqi Shen, National University of Singapore; Ming Jiang, University of Minnesota; Bryan Koenig, Southern Utah University; Juan Xu, University of Minnesota; Mohan Kankanhalli, National University of Singapore; Qi Zhao,"2368,Poster,Geometry-Guided CNN for Self-supervised Video Representation learning,"Chuang Gan, Tsinghua University; Boqing Gong, University of Central Florida; Kun Liu, Beijing University of Posts and Telecommunications; hao Su, ; Leonidas J. 
Guibas,"2380,Poster,Multi-Level Fusion based 3D Object Detection from Monocular Images,"Bin Xu, ; Zhenzhong Chen, Wuhan University"2383,Poster,Explicit Loss-Error-Aware Quantization for Deep Neural Networks,"Aojun Zhou, Intel labs china; Anbang Yao,"2387,Poster,Generative Adversarial Perturbations,"Omid Poursaeed, Cornell University; Isay Katsman, Cornell University; Bicheng Gao, Shanghai Jiao Tong University; Serge Belongie,"2391,Poster,A Hybrid L1-L0 Layer Decomposition Model for Tone Mapping,"Zhetong Liang, PolyU; Jun Xu, Hong Kong Polytechnic U; David Zhang, Hong Kong Polytechnic University; Zisheng Cao, ; Lei Zhang, The Hong Kong Polytechnic University"2403,Poster,Learning Deep Correspondence through Prior and Posterior Feature Constancy,"Zhengfa Liang, NUDT; Yiliu Feng, NUDT; Yulan Guo, NUDT; Hengzhu Liu, NUDT; Wei Chen, ; Linbo Qiao, ; Li Zhou, NUDT; Jianfeng Zhang, NUDT"2406,Spotlight,Through-Wall Human Pose Estimation Using Radio Signals,"Mingmin Zhao, MIT; Tianhong Li, MIT; Mohammad Abu Alsheikh, MIT; Yonglong Tian, Massachusetts Institute of Technology; Hang Zhao, MIT; Antonio Torralba, MIT; Dina Katabi, MIT"2406,Poster,Through-Wall Human Pose Estimation Using Radio Signals,"Mingmin Zhao, MIT; Tianhong Li, MIT; Mohammad Abu Alsheikh, MIT; Yonglong Tian, Massachusetts Institute of Technology; Hang Zhao, MIT; Antonio Torralba, MIT; Dina Katabi, MIT"2411,Poster,End-to-end learning of keypoint detector and descriptor for pose invariant 3D matching,"Georgios Georgakis, George Mason University; Srikrishna Karanam, Siemens Corporate Technology; Ziyan Wu, Siemens Corporation; Jan Ernst, Siemens Corporation; Jana Kosecka, George Mason University"2425,Spotlight,Learning Multi-grid Generative ConvNets by Minimal Contrastive Divergence,"Ruiqi Gao, UCLA; Yang Lu, University of California Los Angeles; Junpei Zhou, ; Song-Chun Zhu, ; Yingnian Wu,"2425,Poster,Learning Multi-grid Generative ConvNets by Minimal Contrastive Divergence,"Ruiqi Gao, UCLA; Yang Lu, 
University of California Los Angeles; Junpei Zhou, ; Song-Chun Zhu, ; Yingnian Wu,"2427,Poster,Matching Adversarial Networks,"Gellert Mattyus, UBER ATG; Raquel Urtasun, University of Toronto"2434,Poster,Stochastic Variational Inference with Gradient Linearization,"Tobias Plötz, TU Darmstadt; Anne Wannenwetsch, TU Darmstadt; Stefan Roth,"2437,Poster,Geometry-aware Deep Network for Single-Image Novel View Synthesis,"Miaomiao Liu, Data61,CSIRO; Xuming He, ShanghaiTech; Mathieu Salzmann, EPFL"2443,Poster,Robust Depth Estimation from Auto Bracketed Images,"Sunghoon Im, KAIST; Hae-Gon Jeon, KAIST; In So Kweon, KAIST"2447,Poster,Document Enhancement using Visibility Detection,"Nati Kligler, Technion; Sagi Katz, Technion; Ayellet Tal, Technion"2450,Poster,Co-Occurrence Template Matching,"Shai Avidan, ; rotal kat, Tel-Aviv University; roy jevnisek, Tel-Aviv University"2451,Poster,Intrinsic Image Transformation via Scale Space Decomposition,"Lechao Cheng, ; Chengyi Zhang, Zhejiang University; Zicheng Liao,"2455,Poster,Depth and Transient Imaging with Compressive SPAD Array Cameras,"Qilin Sun, KAUST; Xiong Dun, KAUST; Yifan (Evan) Peng, UBC; Wolfgang Heidrich,"2457,Poster,Efficient and Deep Person Re-Identification using Multi-Level Similarity,"Yiluan Guo, SUTD; Ngai-Man Cheung,"2462,Oral,Hybrid Camera Pose Estimation,"Federico Camposeco, ETH; Andrea Cohen, ETH Zurich; Marc Pollefeys, ETH; Torsten Sattler, ETH Zurich"2462,Poster,Hybrid Camera Pose Estimation,"Federico Camposeco, ETH; Andrea Cohen, ETH Zurich; Marc Pollefeys, ETH; Torsten Sattler, ETH Zurich"2463,Poster,SoS-RSC: A Sum-of-Squares Polynomial Approach to Robustifying Subspace Clustering Algorithms,"Octavia Camps, Northeastern University, USA; Mario Sznaier,"2464,Spotlight,Alive Caricature from 2D to 3D,"Qianyi Wu, USTC; Juyong Zhang, University of Science and Technology of China; Yu-Kun Lai, Cardiff University; Jianmin Zheng, Nanyang Technological University; Jianfei Cai,"2464,Poster,Alive Caricature 
from 2D to 3D,"Qianyi Wu, USTC; Juyong Zhang, University of Science and Technology of China; Yu-Kun Lai, Cardiff University; Jianmin Zheng, Nanyang Technological University; Jianfei Cai,"2484,Poster,Arbitrary Style Transfer with Deep Feature Reshuffle,"Shuyang Gu, USTC; Congliang Chen, Peking University; Jing Liao, ; Lu Yuan, Microsoft Research Asia"2485,Spotlight,Self-Supervised Feature Learning by Learning to Spot Artifacts,"Simon Jenni, Universität Bern; Paolo Favaro, Bern University, Switzerland"2485,Poster,Self-Supervised Feature Learning by Learning to Spot Artifacts,"Simon Jenni, Universität Bern; Paolo Favaro, Bern University, Switzerland"2486,Poster,Multi-Label Zero-Shot Learning with Structured Knowledge Graphs,"Chung-Wei Lee, National Taiwan University; Wei Fang, National Taiwan University; Chih-Kuan Yeh, Carnegie Mellon University; Yu-Chiang Frank Wang, Academia Sinica"2494,Spotlight,Towards High Performance Video Object Detection,"Xizhou Zhu, ; Jifeng Dai, Microsoft Research; Lu Yuan, Microsoft Research Asia; Yichen Wei, Microsoft Research Asia"2494,Poster,Towards High Performance Video Object Detection,"Xizhou Zhu, ; Jifeng Dai, Microsoft Research; Lu Yuan, Microsoft Research Asia; Yichen Wei, Microsoft Research Asia"2496,Spotlight,Mesoscopic Facial Geometry inference Using Deep Neural Networks,"Loc Huynh, USC ICT; Weikai Chen, USC ICT; Shunsuke Saito, ; Jun Xing, ICT; Koki Nagano, Pinscreen, Inc; Andrew Jones, USC ICT; Paul Debevec, USC ICT; Hao Li,"2496,Poster,Mesoscopic Facial Geometry inference Using Deep Neural Networks,"Loc Huynh, USC ICT; Weikai Chen, USC ICT; Shunsuke Saito, ; Jun Xing, ICT; Koki Nagano, Pinscreen, Inc; Andrew Jones, USC ICT; Paul Debevec, USC ICT; Hao Li,"2498,Oral,Relation Networks for Object Detection,"Han Hu, ; Jiayuan Gu, Microsoft; Zheng Zhang, Microsoft; Jifeng Dai, Microsoft Research; Yichen Wei, Microsoft Research Asia"2498,Poster,Relation Networks for Object Detection,"Han Hu, ; Jiayuan Gu, Microsoft; Zheng 
Zhang, Microsoft; Jifeng Dai, Microsoft Research; Yichen Wei, Microsoft Research Asia"2499,Poster,Mobile Video Object Detection with Temporally-Aware Feature Maps,"Menglong Zhu, ; Mason Liu, Georgia Tech"2501,Poster,Free supervision from video games,"Philipp Krahenbuhl,"2503,Poster,Adversarially Learned One-Class Classifier for Novelty Detection,"Mohammad Sabokrou, Institute for Research in Fundamental Sciences (IPM); Mohammad Khalooie, ; Mahmood Fathi, ; Ehsan Adeli, Stanford University"2508,Poster,"Fast, Simple, and Effective Resource-Constrained Structure Learning of Deep Networks","Ariel Gordon, Google; Elad Eban, Google; Bo Chen, Google; ofir Nachum, Google; Tien-Ju Yang, Massachusetts Institute of Technology; Edward Choi, Georgia Institute of Technology"2509,Poster,Resource Aware Person Re-identification across Multiple Resolutions,"Yan Wang, Cornell university; Lequn Wang, Cornell University; yurong you, shang hai jiao tong university; xu zou, tsinghua university; Vincent Chen, cornell university; Serena Li, CORNELL UNIVERSITY; Bharath Hariharan, Cornell University; Gao Huang, ; Kilian Weinberger, Cornell University"2514,Spotlight,Deep Learning under Privileged Information Using Heteroscedastic Dropout,"Ozan Sener, Stanford University; Silvio Savarese, ; John Lambert, Stanford University"2514,Poster,Deep Learning under Privileged Information Using Heteroscedastic Dropout,"Ozan Sener, Stanford University; Silvio Savarese, ; John Lambert, Stanford University"2517,Poster,Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Networks,"Long Chen, ZJU; Hanwang Zhang, Columbia University; Jun Xiao, ZJU; Wei Liu, ; Shih-Fu Chang,"2518,Poster,Learning and Using the Arrow of Time,"Donglai Wei, MIT; Andrew Zisserman, Oxford; William Freeman, MIT/Google; Joseph Lim, University of Southern California"2519,Poster,Optimizing Filter Size in Convolutional Neural Networks for Facial Action Unit Recognition,"Shizhong Han, 1986; zibo Meng, ; 
Zhiyuan Li, University of South Carolina; JAMES O'REILLY, University of South Carolina; Jie Cai, University of South Carolina; Xiaofeng Wang, University of South Carolina; Yan Tong, University of South Carolina"2523,Oral,"Revisiting Salient Object Detection: Simultaneous Detection, Ranking, and Subitizing of Multiple Salient Objects","Md Amirul Islam, University of Manitoba; Mahmoud Kalash, University of Manitoba; Neil D. B. Bruce, University of Manitoba"2523,Poster,"Revisiting Salient Object Detection: Simultaneous Detection, Ranking, and Subitizing of Multiple Salient Objects","Md Amirul Islam, University of Manitoba; Mahmoud Kalash, University of Manitoba; Neil D. B. Bruce, University of Manitoba"2529,Poster,Gaze Prediction in Dynamic $360^\circ$ Immersive Videos,"Yanyu Xu, Shanghaitech University; Yanbing Dong, ; Junru Wu, ; Zhengzhong Sun, ; Zhiru Shi, ; Jingyi Yu, ; Shenghua Gao, ShanghaiTech University"2541,Poster,Weakly-Supervised Semantic Segmentation Network with Deep Seeded Region Growing,"Zilong Huang, HUST; Xinggang Wang, ; Jiasi Wang, HUST; Wenyu Liu, ; Jingdong Wang, Microsoft Research"2542,Poster,Modulated Convolutional Networks,"Xiaodi Wang, Beihang University; Baochang Zhang, ; Ce Li, CUMTB; Rongrong Ji, ; jungong han, ; Xianbin Cao, Beihang University; jianzhuang liu,"2548,Poster,V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map,"Gyeongsik Moon, Seoul National University; Ju Yong Chang, Kwangwoon University; Kyoung Mu Lee,"2561,Poster,SeedNet : Automatic Seed Generation with Deep Reinforcement Learning for Robust Interactive Segmentation,"Gwangmo Song, Seoul National University; Heesoo Myeong, Samsung; Kyoung Mu Lee,"2565,Spotlight,Deep Parametric Continuous Convolutional Neural Networks,"Shenlong Wang, ; Shun Da Suo, ; Wei-Chiu Ma, MIT; Raquel Urtasun, University of Toronto"2565,Poster,Deep Parametric Continuous Convolutional Neural Networks,"Shenlong Wang, ; Shun Da 
Suo, ; Wei-Chiu Ma, MIT; Raquel Urtasun, University of Toronto"2566,Poster,Optical Flow Guided Feature: A Motion Representation for Video Action Recognition,"Shuyang Sun, The University of Sydney; Zhanghui Kuang, Sense Time; Wanli Ouyang, The University of Sydney; Lu Sheng, The Chinese University of HK; Wei Zhang,"2567,Oral,Im2Pano3D: Extrapolating 360 Structure and Semantics Beyond the Field of View,"Shuran Song, Princeton ; Andy Zeng, Princeton; Angel Chang, Stanford University; Manolis Savva, ; Silvio Savarese, ; Thomas Funkhouser, Princeton"2567,Poster,Im2Pano3D: Extrapolating 360 Structure and Semantics Beyond the Field of View,"Shuran Song, Princeton ; Andy Zeng, Princeton; Angel Chang, Stanford University; Manolis Savva, ; Silvio Savarese, ; Thomas Funkhouser, Princeton"2578,Poster,Preserving Semantic Relations for Zero-Shot Learning,"Yashas Annadani, NITK; Soma Biswas, Indian Institute of Science"2582,Poster,What have we learned from deep representations for action recognition?,"Christoph Feichtenhofer, ; Axel Pinz, Graz University of Technology; Richard Wildes, York University; Andrew Zisserman, Oxford"2583,Spotlight,AdaDepth: Unsupervised Content Congruent Adaptation for Depth Estimation,"Jogendra Kundu, Indian Institute of Science; Phani Krishna Uppala, Indian Institute of Science; Anuj Pahuja, Indian Institute of Science; Venkatesh Babu Radhakrishnan, Indian Institute of Science"2583,Poster,AdaDepth: Unsupervised Content Congruent Adaptation for Depth Estimation,"Jogendra Kundu, Indian Institute of Science; Phani Krishna Uppala, Indian Institute of Science; Anuj Pahuja, Indian Institute of Science; Venkatesh Babu Radhakrishnan, Indian Institute of Science"2588,Poster,Neural Style Transfer via Meta Networks,"Falong Shen, Peking University; Shuicheng Yan, ; Gang Zeng, Peking University"2589,Spotlight,FeaStNet: Feature-Steered Graph Convolutions for 3D Shape Analysis,"Nitika Verma, INRIA; Edmond Boyer, ; Jakob Verbeek,"2589,Poster,FeaStNet: 
Feature-Steered Graph Convolutions for 3D Shape Analysis,"Nitika Verma, INRIA; Edmond Boyer, ; Jakob Verbeek,"2599,Poster,InverseFaceNet: Deep Monocular Inverse Face Rendering at over 250 Hz,"Hyeongwoo Kim, MPII; Michael Zollhöfer, MPI Informatics; Ayush Tewari, MPI Informatics; Justus Thies, Technical University of Munich; Christian Richardt, University of Bath; Christian Theobalt, MPI Informatics"2601,Poster,"People, Penguins and Petri Dishes: Adapting Object Counting Models To New Visual Domains And Object Types Without Forgetting","Mark Marsden, Dublin City University; Kevin McGuinness, DCU ; Suzanne Little, DCU; Ciara Keogh, University College Dublin, Ireland; Noel O'Connor, DCU"2602,Poster,Multi-Frame Quality Enhancement for Compressed Video,"Ren Yang, Beihang University; Mai Xu, Beihang University; Zulin Wang, Beihang University; Tianyi Li, Beihang University"2603,Spotlight,Cascade R-CNN: Delving into High Quality Object Detection,"Zhaowei Cai, UC San Diego; Nuno Vasconcelos, UCSD, USA"2603,Poster,Cascade R-CNN: Delving into High Quality Object Detection,"Zhaowei Cai, UC San Diego; Nuno Vasconcelos, UCSD, USA"2604,Poster,DiverseNet: When One Right Answer Is Not Enough,"Michael Firman, UCL; Neill Campbell, University of Bath; Lourdes Agapito, University College London; Gabriel Brostow, University College London UK"2605,Poster,Beyond the Pixel-Wise Loss for Topology-Aware Delineation,"Agata Mosinska, EPFL; Pablo Marquez Neila, EPFL; Mateusz Kozinski, ; Pascal Fua,"2612,Oral,Polarimetric Dense Monocular SLAM,"Luwei Yang, Simon Fraser University; Feitong Tan, Simon Fraser University; Ao Li, Simon Fraser University; Zhaopeng Cui, Simon Fraser University; Yasutaka Furukawa, ; Ping Tan,"2612,Poster,Polarimetric Dense Monocular SLAM,"Luwei Yang, Simon Fraser University; Feitong Tan, Simon Fraser University; Ao Li, Simon Fraser University; Zhaopeng Cui, Simon Fraser University; Yasutaka Furukawa, ; Ping Tan,"2618,Poster,A Perceptual Measure for Deep Single 
Image Camera Calibration,"Yannick Hold-Geoffroy, Université Laval; Kalyan Sunkavalli, Adobe Systems Inc.; Jonathan Eisenmann, Adobe Systems; Matthew Fisher, Adobe; Emiliano Gambaretto, Adobe Systems; Sunil Hadap, ; Jean-Francois Lalonde, Laval University"2619,Poster,Show Me a Story: Towards Coherent Neural Story Illustration,"Hareesh Ravi, Rutgers University; Lezi Wang, Rutgers; Carlos Muniz, Rutgers University; Leonid Sigal, University of British Columbia; Mubbasir Kapadia, Rutgers University"2624,Poster,Towards Universal Representation for Unseen Action Recognition,"Yi Zhu, University of California Merced; Yang Long, Newcastle University; Yu Guan, Newcastle University; Shawn Newsam, ; Ling Shao, University of East Anglia"2628,Poster,A Causal And-Or Graph Model for Visibility Fluent Reasoning in Tracking Interacting Objects,"Yuanlu Xu, University of California, Los Angeles; Lei Qin, Institute of Computing Technology, Chinese Academy of Sciences; Xiaobai Liu, San Diego State University; Song-Chun Zhu,"2629,Spotlight,LEGO: Learning Edge with Geometry all at Once by Watching Videos,"Zhenheng Yang, ; Peng Wang, Baidu; Yang Wang, Baidu USA; Wei Xu, ; Ram Nevatia,"2629,Poster,LEGO: Learning Edge with Geometry all at Once by Watching Videos,"Zhenheng Yang, ; Peng Wang, Baidu; Yang Wang, Baidu USA; Wei Xu, ; Ram Nevatia,"2630,Spotlight,Iterative Learning with Open-set Noisy Labels,"Yisen Wang, Tsinghua University; Xingjun Ma, The University of Melbourne; Weiyang Liu, Georgia Tech; James Bailey, The University of Melbourne; Hongyuan Zha, Georgia Institute of Technology; Le Song, Georgia Institute of Technology; Shu-Tao Xia, Tsinghua University"2630,Poster,Iterative Learning with Open-set Noisy Labels,"Yisen Wang, Tsinghua University; Xingjun Ma, The University of Melbourne; Weiyang Liu, Georgia Tech; James Bailey, The University of Melbourne; Hongyuan Zha, Georgia Institute of Technology; Le Song, Georgia Institute of Technology; Shu-Tao Xia, Tsinghua University"
2633,Poster,Sparse Photometric 3D Face Reconstruction Guided by Morphable Models,"Xuan Cao, ShanghaiTech University; Zhang Chen, ShanghaiTech University; jingyi Yu, Shanghai Tech University; Anpei Chen,"2635,Poster,Deep Adversarial Subspace Clustering,"Pan Zhou, National university of singapo; Yunqing Hou, NUS; Jiashi Feng,"2636,Poster,Multimodal Visual Concept Learning with Weakly Supervised Techniques,"Giorgos Bouritsas, NTUA; Petros Koutras, NTUA; Athanasia Zlatintsi, NTUA; Petros Maragos, NTUA"2640,Poster,"ICE-BA: Incremental, Consistent and Efficient Bundle Adjustment for Visual-Inertial SLAM","Haomin Liu, Baidu; Mingyu Chen, Baidu; Guofeng Zhang, Zhejiang University; Hujun Bao, Zhejiang University; Yingze Bao, Baidu LLC"2644,Poster,KIPPI: KInetic Polygonal Partitioning of Images,"Jean-Philippe Bauchet, Inria; Florent Lafarge,"2647,Poster,Planar Shape Detection at Structural Scales,"Hao Fang, Inria; Florent Lafarge, ; Mathieu Desbrun, Caltech"2648,Poster,A Closer Look at Spatiotemporal Convolutions for Action Recognition,"Du Tran, Dartmouth College; heng Wang, ; Lorenzo Torresani, Dartmouth College, USA; Jamie Ray, Facebook; Manohar Paluri,"2650,Oral,Wasserstein Introspective Neural Networks,"Kwonjoon Lee, UC San Diego; Weijian Xu, UC San Diego; Fan Fan, UC San Diego; Zhuowen Tu, UCSD, USA"2650,Poster,Wasserstein Introspective Neural Networks,"Kwonjoon Lee, UC San Diego; Weijian Xu, UC San Diego; Fan Fan, UC San Diego; Zhuowen Tu, UCSD, USA"2657,Spotlight,Learning Globally Optimized Object Detector via Policy Gradient,"Yongming Rao, ; Dahua Lin, CUHK; Jiwen Lu, Tsinghua University"2657,Poster,Learning Globally Optimized Object Detector via Policy Gradient,"Yongming Rao, ; Dahua Lin, CUHK; Jiwen Lu, Tsinghua University"2662,Poster,Reconstruction Network for Video Captioning,"Bairui Wang, ; Lin Ma, Tencent AI Lab; Wei Zhang, ; Wei Liu,"2666,Poster,DOTA: A Large-scale Dataset for Object Detection in Aerial Images,"Gui-Song Xia, Wuhan University; Xiang 
Bai, Huazhong University of Science and Technology; Jian Ding, Wuhan University; Zhen Zhu, Huazhong University of Science and Technology; Serge Belongie, ; Jiebo Luo, University of Rochester; Mihai Datcu, ; Marcello Pelillo, University of Venice; Liangpei Zhang, Wuhan University"2672,Spotlight,Person Transfer GAN to Bridge Domain Gap for Person Re-Identification,"Longhui Wei, Peking University; Shiliang Zhang, Peking University; Wen Gao, ; Qi Tian,"2672,Poster,Person Transfer GAN to Bridge Domain Gap for Person Re-Identification,"Longhui Wei, Peking University; Shiliang Zhang, Peking University; Wen Gao, ; Qi Tian,"2695,Poster,Tight Nonconvex Relaxation of MAP Inference,"D. Khuê Lê-Huu, Inria & CentraleSupélec, Université Paris-Saclay; Nikos Paragios, Ecole Centrale de Paris"2697,Poster,Weakly Supervised Phrase Localization with Multi-Scale Anchored Transformer Network,"Fang Zhao, National University of Singapore; Jianshu Li, National University of Singapo; Jian Zhao, NUS; Jiashi Feng,"2701,Poster,Variational Autoencoders for Deforming 3D Mesh Models,"Qingyang Tan, UCAS; Lin Gao, Chinese Academy of Sciences; Yu-Kun Lai, Cardiff University; Shihong Xia, Institute of Computing Technology, CAS, Beijing, China"2703,Poster,DeepMVS: Learning Multi-View Stereopsis,"Po-Han Huang, University of Illinois, U-C; Kevin Matzen, Facebook; Johannes Kopf, Facebook; Narendra Ahuja, University of Illinois at Urbana-Champaign, USA; Jia-Bin Huang, Virginia Tech"2705,Poster,HydraNets: Specialized Dynamic Architectures for Efficient Inference,"Ravi Teja Mullapudi, Carnegie Mellon University; Noam Shazeer, Google; William Mark, Google; Kayvon Fatahalian, Stanford"2708,Spotlight,Multimodal Explanations: Justifying Decisions and Pointing to the Evidence,"Lisa Anne Hendricks, UC Berkeley; Trevor Darrell, UC Berkeley, USA; Anna Rohrbach, UC Berkeley; Zeynep Akata, University of Amsterdam; Bernt Schiele, MPI Informatics Germany; Marcus Rohrbach, UC Berkeley; Dong Huk Park, UC Berkeley"
2708,Poster,Multimodal Explanations: Justifying Decisions and Pointing to the Evidence,"Lisa Anne Hendricks, UC Berkeley; Trevor Darrell, UC Berkeley, USA; Anna Rohrbach, UC Berkeley; Zeynep Akata, University of Amsterdam; Bernt Schiele, MPI Informatics Germany; Marcus Rohrbach, UC Berkeley; Dong Huk Park, UC Berkeley"2709,Poster,Feature Generating Networks for Zero-Shot Learning,"Yongqin Xian, Max Planck Institute; Tobias Lorenz, Max Planck Institute for Informatics; Bernt Schiele, MPI Informatics Germany; Zeynep Akata, University of Amsterdam"2711,Poster,Deep Image Prior,"Dmitry Ulyanov, Skoltech; Andrea Vedaldi, U Oxford; Victor Lempitsky,"2715,Poster,Pix3D: Dataset and Methods for 3D Object Modeling from a Single Image,"Xingyuan Sun, Shanghai Jiao Tong University; Jiajun Wu, MIT; Xiuming Zhang, MIT; Zhoutong Zhang, MIT; Tianfan Xue, Google; Joshua Tenenbaum, ; William Freeman, MIT/Google"2722,Poster,Defense against Universal Adversarial Perturbations,"NAVEED AKHTAR, UNIVERSITY OF WESTERN AUSTRALI; Jian Liu, UWA; Ajmal Mian, UWA"2723,Poster,Structure from Recurrent Motion: From Rigidity to Recurrency,"Xiu Li, Tsinghua University; Hongdong Li, Australian National University; Hanbyul Joo, CMU; Yebin Liu, Tsinghua University; Yaser Sheikh,"2726,Spotlight,Divide and Grow: Capturing Huge Diversity in Crowd Images with Incrementally Growing CNN,"Deepak Babu Sam, Indian Institute of Science; Neeraj Sajjan, Indian Institute of Science; Venkatesh Babu Radhakrishnan, Indian Institute of Science; Mukundhan Srinivasan, NVIDIA"2726,Poster,Divide and Grow: Capturing Huge Diversity in Crowd Images with Incrementally Growing CNN,"Deepak Babu Sam, Indian Institute of Science; Neeraj Sajjan, Indian Institute of Science; Venkatesh Babu Radhakrishnan, Indian Institute of Science; Mukundhan Srinivasan, NVIDIA"2730,Poster,Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking,"Filip Radenovic, CTU Prague; Ahmet Iscen, Inria; Giorgos Tolias, Czech Technical 
University in Prague; Yannis Avrithis, Inria; Ondrej Chum, Czech Technical University in Prague"2732,Spotlight,Structured Set Matching Networks for One-Shot Part Labeling,"Jonghyun Choi, ; Jayant Krishnamurthy, Semantic Machines; Aniruddha Kembhavi, Allen Institute for Artificial Intelligence; Ali Farhadi,"2732,Poster,Structured Set Matching Networks for One-Shot Part Labeling,"Jonghyun Choi, ; Jayant Krishnamurthy, Semantic Machines; Aniruddha Kembhavi, Allen Institute for Artificial Intelligence; Ali Farhadi,"2735,Poster,DecideNet: Counting Varying Density Crowds Through Attention Guided Detection and Density Estimation,"Jiang Liu, Carnegie Mellon University; Chenqiang Gao, Chongqing University of Posts and Telecommunications; Deyu Meng, Xi'an Jiaotong University; Alexander Hauptmann,"2738,Poster,Deeply Learned Filter Response Functions for Hyperspectral Reconstruction,"Shijie Nie, NII, Japan; Lin Gu, National Institute of Informatics; Yinqiang Zheng, National Institute of Informatics, Japan; Antony Lam, Saitama University; Nobutaka Ono, Tokyo Metropolitan University; Imari Sato, National Institute of Informatics, Japan"2742,Poster,Learning Strict Identity Mappings in Deep Residual Networks,"Xin Yu, University of Utah; Srikumar Ramalingam, ; Zhiding Yu, Carnegie Mellon University"2744,Poster,Face Detector Adaptation without Negative Transfer or Catastrophic Forgetting,"Muhammad Abdullah Jamal, University of Central Florida; Haoxiang Li, Adobe Research; Boqing Gong, University of Central Florida"2752,Poster,"Multi-Evidence Fusion and Filtering for Weakly Supervised Object Recognition, Detection and Segmentation","Weifeng Ge, The University of Hong Kong; Yizhou Yu, The University of Hong Kong"2763,Poster,SketchMate: Deep Hashing for Million-Scale Human Sketch Retrieval,"Peng Xu, Beijing University of Posts an; Yongye Huang, Beijing University of Posts and Telecommunications; Tongtong Yuan, Beijing University of Posts and Telecommunications; Kaiyue Pang, 
QMUL; Yi-Zhe Song, ; Tao Xiang, Queen Mary University of London; Timothy Hospedales, University of Edinburgh; Zhanyu Ma, Beijing University of Posts and Telecommunications ; Jun Guo, Beijing University of Posts and Telecommunications"2764,Poster,Dynamic Graph Generation Network: Generating Relational Knowledge from Diagrams,"Daesik Kim, Seoul National University; YoungJoon Yoo, ; JeeSoo Kim, Seoul national university; SangKuk Lee, Seoul National University; Nojun Kwak, Seoul National University"2765,Oral,The Perception-Distortion Tradeoff,"Yochai Blau, Technion; Tomer Michaeli, Technion"2765,Poster,The Perception-Distortion Tradeoff,"Yochai Blau, Technion; Tomer Michaeli, Technion"2767,Poster,Jerk-Aware Video Acceleration Magnification,"Shoichiro Takeda, NTT Media Intelligence Lab.; Kazuki Okami, NTT Media Intelligence Lab.; Dan Mikami, NTT Media Intelligence Lab.; Megumi Isogai, NTT Media Intelligence Lab.; Hideaki Kimata, NTT Media Intelligence Lab."2769,Spotlight,Video Based Reconstruction of 3D People Models,"Thiemo Alldieck, TU Braunschweig; Marcus Magnor, TU Braunschweig; Weipeng Xu, MPI Informatics; Christian Theobalt, MPI Informatics; Gerard Pons-Moll, Max Planck for Informatics"2769,Poster,Video Based Reconstruction of 3D People Models,"Thiemo Alldieck, TU Braunschweig; Marcus Magnor, TU Braunschweig; Weipeng Xu, MPI Informatics; Christian Theobalt, MPI Informatics; Gerard Pons-Moll, Max Planck for Informatics"2777,Poster,Appearance-and-Relation Networks for Video Classification,"Limin Wang, ETH Zurich; Wei Li, Google; Wen Li, ETH; Luc Van Gool, KTH"2778,Poster,Fast Spectral Ranking for Similarity Search,"Ahmet Iscen, Inria; Yannis Avrithis, Inria; Giorgos Tolias, Czech Technical University in Prague; Teddy Furon, ; Ondrej Chum, Czech Technical University in Prague"2779,Poster,Mining on Manifolds: Metric Learning without Labels,"Ahmet Iscen, Inria; Giorgos Tolias, Czech Technical University in Prague; Yannis Avrithis, Inria; Ondrej Chum, Czech 
Technical University in Prague"2781,Poster,From source to target and back: Symmetric Bi-Directional Adaptive GAN,"Paolo Russo, University of Rome La Sapienza; Fabio Carlucci, University of Rome La Sapienza; Tatiana Tommasi, Italian Institute of Technology; Barbara Caputo, University of Rome La Sapienza, Italy"2784,Spotlight,Path Aggregation Network for Instance Segmentation,"Shu Liu, CUHK; Lu Qi, CUHK; Haifang Qin, ; Jianping Shi, SenseTime; Jiaya Jia, Chinese University of Hong Kong"2784,Poster,Path Aggregation Network for Instance Segmentation,"Shu Liu, CUHK; Lu Qi, CUHK; Haifang Qin, ; Jianping Shi, SenseTime; Jiaya Jia, Chinese University of Hong Kong"2788,Poster,Referring Image Segmentation via Recurrent Refinement Networks,"Ruiyu Li, CUHK; Kaican Li, CUHK; Yi-Chun Kuo, CUHK; Michelle Shu, ; Xiaojuan Qi, CUHK; Xiaoyong Shen, CUHK; Jiaya Jia, Chinese University of Hong Kong"2790,Poster,Defense against adversarial attacks using guided denoiser,"Fangzhou Liao, Tsinghua University; Ming Liang, ; Yinpeng Dong, Tsinghua University; Tianyu Pang, Tsinghua University; Jun Zhu, Tsinghua University; Xiaolin Hu, Tsinghua University"2792,Spotlight,Neural 3D Mesh Renderer,"Hiroharu Kato, Univ. Tokyo; Tatsuya Harada, University of Tokyo"2792,Poster,Neural 3D Mesh Renderer,"Hiroharu Kato, Univ. 
Tokyo; Tatsuya Harada, University of Tokyo"2797,Poster,Disentangling Factors of Variation by Mixing Them,"Qiyang HU, University of bern; Attila Szabo, University of Bern; Tiziano Portenier, ; Matthias Zwicker, ; Paolo Favaro, Bern University, Switzerland"2798,Poster,LSTM Pose Machines,"Yue Luo, SenseTime; Jimmy Ren, SenseTime Group Limited; Zhouxia Wang, SenseTime; Wenxiu Sun, SenseTime Group Limited; Jinshan Pan, UC Merced; Jianbo Liu, SenseTime; Jiahao Pang, SenseTime Group Limited; Liang Lin,"2799,Poster,CNN based Learning using Reflection and Retinex Models for Intrinsic Image Decomposition,"Anil Baslamisli, University of Amsterdam; Hoang-An Le, University of Amsterdam; Theo Gevers, University of Amsterdam"2801,Spotlight,Learning Semantic Concepts and Order for Image and Sentence Matching,"Yan Huang, ; Qi Wu, University of Adelaide; Liang Wang, unknown"2801,Poster,Learning Semantic Concepts and Order for Image and Sentence Matching,"Yan Huang, ; Qi Wu, University of Adelaide; Liang Wang, unknown"2805,Spotlight,Modifying Non-Local Variations Across Multiple Views,"Tal Tlusty, Technion; Tomer Michaeli, Technion; Tali Dekel, Google; Lihi Zelnik-Manor,"2805,Poster,Modifying Non-Local Variations Across Multiple Views,"Tal Tlusty, Technion; Tomer Michaeli, Technion; Tali Dekel, Google; Lihi Zelnik-Manor,"2812,Oral,Discriminative Learning of Latent Features for Zero-Shot Recognition,"Yan Li, CASIA; Junge Zhang, ; jianguo Zhang, ; Kaiqi Huang,"2812,Poster,Discriminative Learning of Latent Features for Zero-Shot Recognition,"Yan Li, CASIA; Junge Zhang, ; jianguo Zhang, ; Kaiqi Huang,"2813,Poster,Learning Rich Features for Image Manipulation Detection,"Peng Zhou, University of Maryland, Colleg; Xintong Han, University of Maryland; Vlad Morariu, University of Maryland; Larry Davis, University of Maryland, USA"2819,Poster,"GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose","Zhichao Yin, Sensetime Group Limited; Jianping Shi, SenseTime"
2821,Poster,Translating and Segmenting Multimodal Medical Volumes with Cycle- and Shape-Consistency Generative Adversarial Network,"Zizhao Zhang, University of Florida; Lin Yang, ; Yefeng Zheng, Siemens"
2823,Spotlight,Photographic Text-to-Image Synthesis with a Hierarchically-nested Adversarial Network,"Zizhao Zhang, University of Florida; Yuanpu Xie, University of Florida; Lin Yang,"
2823,Poster,Photographic Text-to-Image Synthesis with a Hierarchically-nested Adversarial Network,"Zizhao Zhang, University of Florida; Yuanpu Xie, University of Florida; Lin Yang,"
2827,Spotlight,Self-supervised Learning of Geometrically Stable Features Through Probabilistic Introspection,"David Novotny, Oxford University; Samuel Albanie, Oxford University; Diane Larlus, NAVER LABS Europe; Andrea Vedaldi, U Oxford"
2827,Poster,Self-supervised Learning of Geometrically Stable Features Through Probabilistic Introspection,"David Novotny, Oxford University; Samuel Albanie, Oxford University; Diane Larlus, NAVER LABS Europe; Andrea Vedaldi, U Oxford"
2828,Poster,Human Semantic Parsing for Person Re-identification,"Mahdi Kalayeh, UCF; Emrah Basaran, ; Mubarak Shah, UCF"
2833,Poster,Ring loss: Convex Feature Normalization for Face Recognition,"Yutong Zheng, Carnegie Mellon University; Dipan Pal, Carnegie Mellon University; Marios Savvides,"
2834,Poster,Learned Shape-Tailored Descriptors for Segmentation,"Naeemullah Khan, KAUST; Ganesh Sundaramoorthi,"
2835,Poster,ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans,"Angela Dai, ; Daniel Ritchie, Brown University; Martin Bokeloh, Google; Scott Reed, Google; Juergen Sturm, Google; Matthias Nießner, Technical University of Munich"
2837,Poster,GeoNet: Geometric Neural Network for Joint Depth and Surface Normal Estimation,"Xiaojuan Qi, CUHK; Renjie Liao, ; Zhengzhe Liu, CUHK; Raquel Urtasun, University of Toronto; Jiaya Jia, Chinese University of Hong Kong"
2839,Poster,Learning Compressible 360 Video Isomers,"Yu-Chuan Su, UT Austin; Kristen Grauman,"
2844,Poster,Geometric robustness of deep networks: analysis and improvement,"Can Kanbak, EPFL; Seyed-Mohsen Moosavi-Dezfooli, ; Pascal Frossard,"
2848,Poster,Weakly Supervised Facial Action Unit Recognition through Adversarial Training,"Guozhu Peng, USTC; Shangfei Wang,"
2851,Spotlight,Empirical study of the topology and geometry of deep networks,"Alhussein Fawzi, ; Seyed-Mohsen Moosavi-Dezfooli, ; Pascal Frossard, ; Stefano Soatto, UCLA"
2851,Poster,Empirical study of the topology and geometry of deep networks,"Alhussein Fawzi, ; Seyed-Mohsen Moosavi-Dezfooli, ; Pascal Frossard, ; Stefano Soatto, UCLA"
2852,Poster,Disentangling Features in 3D Face Shapes for Joint Face Reconstruction and Recognition,"Feng Liu, Sichuan University; Dan Zeng, Sichuan University; Qijun Zhao, Sichuan University; Xiaoming Liu, Michigan State University"
2859,Poster,View Extrapolation of Human Body from a Single Image,"Hao Zhu, Nanjing University; Hao Su, ; Peng Wang, Baidu; Xun Cao, EE Department, Nanjing Univ; Ruigang Yang, University of Kentucky"
2864,Poster,Adversarially Occluded Samples for Person Re-identification,"Houjing Huang, CASIA; Dangwei Li, ; Zhang Zhang, ; Xiaotang Chen, ; Kaiqi Huang,"
2871,Oral,Photometric Stereo in Participating Media Considering Shape-Dependent Forward Scatter,"Yuki Fujimura, Kyoto University; Masaaki Iiyama, Kyoto University; Atsushi Hashimoto, Kyoto University; Michihiko Minoh, Kyoto University"
2871,Poster,Photometric Stereo in Participating Media Considering Shape-Dependent Forward Scatter,"Yuki Fujimura, Kyoto University; Masaaki Iiyama, Kyoto University; Atsushi Hashimoto, Kyoto University; Michihiko Minoh, Kyoto University"
2873,Poster,Single-Image Depth Estimation Based on Fourier Domain Analysis,"Jaehan Lee, Korea University; Minhyeok Heo, Korea University; Kyung-Rae Kim, Korea University; Chang-Su Kim,"
2895,Spotlight,Future Person Localization in First-Person Videos,"Takuma Yagi, The University of Tokyo; Karttikeya Mangalam, IIT Kanpur; Ryo Yonetani, The University of Tokyo; Yoichi Sato, Univ of Tokyo"
2895,Poster,Future Person Localization in First-Person Videos,"Takuma Yagi, The University of Tokyo; Karttikeya Mangalam, IIT Kanpur; Ryo Yonetani, The University of Tokyo; Yoichi Sato, Univ of Tokyo"
2896,Poster,Classifier Learning with Prior Probabilities for Facial Action Unit Recognition,"Yong Zhang, CASIA; Weiming Dong, ; Bao-Gang Hu, CASIA; Qiang Ji, RPI"
2905,Spotlight,Smooth Neighbors on Teacher Graphs for Semi-supervised Learning,"Yucen Luo, Tsinghua University; Jun Zhu, Tsinghua University; Mengxi Li, Tsinghua University; Yong Ren, Tsinghua University; Bo Zhang,"
2905,Poster,Smooth Neighbors on Teacher Graphs for Semi-supervised Learning,"Yucen Luo, Tsinghua University; Jun Zhu, Tsinghua University; Mengxi Li, Tsinghua University; Yong Ren, Tsinghua University; Bo Zhang,"
2906,Poster,Stacked Conditional Generative Adversarial Networks for Jointly Learning Shadow Detection and Shadow Removal,"Jifeng Wang, NJUST; Xiang Li, NJUST; Jian Yang, Nanjing University of Science and Technology"
2914,Poster,Image Restoration by Estimating Frequency Distribution of Local Patches,"Jaeyoung Yoo, Seoul National University; Sang ho Lee, Seoul National University; Nojun Kwak, Seoul National University"
2916,Poster,Fully Convolutional Attention Network for Multimodal Reasoning,"Haoqi Fan, Carnegie Mellon University; Jiatong Zhou,"
2918,Spotlight,A PID Controller Approach for Stochastic Optimization of Deep Networks,"An Wangpeng, Tsinghua University; Haoqian Wang, Tsinghua University, Shenzhen Graduate School; Qingyun Sun, Stanford University; Jun Xu, Hong Kong Polytechnic U; Qionghai Dai, Tsinghua University; Lei Zhang, The Hong Kong Polytechnic University"
2918,Poster,A PID Controller Approach for Stochastic Optimization of Deep Networks,"An Wangpeng, Tsinghua University; Haoqian Wang, Tsinghua University, Shenzhen Graduate School; Qingyun Sun, Stanford University; Jun Xu, Hong Kong Polytechnic U; Qionghai Dai, Tsinghua University; Lei Zhang, The Hong Kong Polytechnic University"
2932,Poster,Domain Generalization with Adversarial Feature Learning,"Haoliang Li, Nanyang Technological Universi; Sinno Jialin Pan, Nanyang Technological University, Singapore; Shiqi Wang, City University of Hong Kong; Alex Kot,"
2936,Poster,Camera Pose Estimation with Unknown Principal Point,"Viktor Larsson, Lund University; Zuzana Kukelova, Czech Technical University in Prague; Yinqiang Zheng, National Institute of Informatics, Japan"
2937,Spotlight,Deformation Aware Image Compression,"Tamar Rott Shaham, Technion; Tomer Michaeli, Technion"
2937,Poster,Deformation Aware Image Compression,"Tamar Rott Shaham, Technion; Tomer Michaeli, Technion"
2946,Poster,AMNet: Memorability Estimation with Attention,"Jiri Fajtl, Kingston University; Vasileios Argyriou, Kingston University; Dorothy Monekosso, Leeds Beckett; Paolo Remagnino, Kingston University"
2951,Spotlight,High Performance Visual Tracking with Siamese Region Proposal Network,"Bo Li, SenseTime; Wei Wu, ; Zheng Zhu, Institute of Automation, CAS; Junjie Yan,"
2951,Poster,High Performance Visual Tracking with Siamese Region Proposal Network,"Bo Li, SenseTime; Wei Wu, ; Zheng Zhu, Institute of Automation, CAS; Junjie Yan,"
2954,Poster,Image Blind Denoising With Generative Adversarial Network Based Noise Modeling,"Jingwen Chen, Sun Yat-sen University; Jiawei Chen, Sun Yat-sen University; Hongyang Chao, Sun Yat-sen University; Ming Yang,"
2956,Spotlight,Single View Stereo Matching,"Yue Luo, SenseTime; Jimmy Ren, SenseTime Group Limited; Mude Lin, Sun Yat-Sen University; Jiahao Pang, SenseTime Group Limited; Wenxiu Sun, SenseTime Group Limited; Hongsheng Li, ; Liang Lin,"
2956,Poster,Single View Stereo Matching,"Yue Luo, SenseTime; Jimmy Ren, SenseTime Group Limited; Mude Lin, Sun Yat-Sen University; Jiahao Pang, SenseTime Group Limited; Wenxiu Sun, SenseTime Group Limited; Hongsheng Li, ; Liang Lin,"
2959,Poster,Pyramid Stereo Matching Network,"Jia-Ren Chang, National Chiao Tung University; Yong-Sheng Chen, National Chiao Tung University"
2964,Spotlight,Interpret Neural Networks by Identifying Critical Data Routing Paths,"Yulong Wang, Tsinghua University; Hang Su, Tsinghua University; Xiaolin Hu, tsinghua"
2964,Poster,Interpret Neural Networks by Identifying Critical Data Routing Paths,"Yulong Wang, Tsinghua University; Hang Su, Tsinghua University; Xiaolin Hu, tsinghua"
2967,Poster,Geometry-Aware Network for Non-Rigid Shape Prediction from a Single View,"Albert Pumarola, IRI (CSIC-UPC); Antonio Agudo, IRI (CSIC-UPC); Lorenzo Porzi, Mapillary Research; Alberto Sanfeliu, IRI (CSIC-UPC); Vincent Lepetit, University of Bordeaux; Francesc Moreno-Noguer, Institut de Robotica i Informatica Industrial (UPC/CSIC)"
2970,Poster,Event-based Vision meets Deep Learning on Steering Prediction for Self-driving Cars,"Antonio Loquercio, University of Zurich; Ana Maqueda, Universidad Politecnica de Madrid; Guillermo Gallego, University of Zurich; Narciso Garcia, Universidad Politecnica de Madrid; Davide Scaramuzza, University of Zurich"
2972,Spotlight,Beyond Gröbner Bases: Basis Selection for Minimal Solvers,"Viktor Larsson, Lund University; Magnus Oskarsson, Lund University Sweden; Kalle Astroem, Lund University; Alge Wallis, ; Zuzana Kukelova, Czech Technical University in Prague; Tomas Pajdla,"
2972,Poster,Beyond Gröbner Bases: Basis Selection for Minimal Solvers,"Viktor Larsson, Lund University; Magnus Oskarsson, Lund University Sweden; Kalle Astroem, Lund University; Alge Wallis, ; Zuzana Kukelova, Czech Technical University in Prague; Tomas Pajdla,"
2974,Spotlight,"A Unifying Contrast Maximization Framework for Event Cameras, with Applications to Motion, Depth, and Optical Flow Estimation","Guillermo Gallego, University of Zurich; Henri Rebecq, University of Zurich; Davide Scaramuzza, University of Zurich"
2974,Poster,"A Unifying Contrast Maximization Framework for Event Cameras, with Applications to Motion, Depth, and Optical Flow Estimation","Guillermo Gallego, University of Zurich; Henri Rebecq, University of Zurich; Davide Scaramuzza, University of Zurich"
2979,Poster,PoTion: Pose MoTion Representation for Action Recognition,"Vasileios Choutas, Naver Labs Europe; Philippe Weinzaepfel, Xerox; Jerome Revaud, Naver Labs Europe; Cordelia Schmid, INRIA Grenoble, France"
2980,Spotlight,Fight ill-posedness with ill-posedness: Single-shot variational depth super-resolution from shading,"Bjoern Haefner, TU Munich; Yvain Queau, Technical University Munich; Thomas Möllenhoff, Technical University of Munich; Daniel Cremers,"
2980,Poster,Fight ill-posedness with ill-posedness: Single-shot variational depth super-resolution from shading,"Bjoern Haefner, TU Munich; Yvain Queau, Technical University Munich; Thomas Möllenhoff, Technical University of Munich; Daniel Cremers,"
2986,Poster,Deep Lesion Graph in the Wild: Relationship Learning and Organization of Significant Radiology Image Findings in a Diverse Large-scale Lesion Database,"Ke Yan, National Institute of Health; Xiaosong Wang, NIH; Le Lu, Nvidia Corp; Ling Zhang, NIH; Adam Harrison, National Institutes of Health; MOHAMMADHADI Bagheri, NIH; Ronald Summers,"
2988,Poster,Inverse Composition Discriminative Optimization for Point Cloud Registration,"Jayakorn Vongkulbhisal, Carnegie Mellon University; Beñat Irastorza Ugalde, ; Fernando de la Torre, ; João Costeira,"
2989,Spotlight,Manifold Learning in Quotient Spaces,"Éloi Mehr, LIP6; André Lieutier, ; Fernando Sanchez Bermudez, ; Vincent Guitteny, ; Nicolas Thome, Conservatoire national des arts et métiers; Matthieu Cord,"
2989,Poster,Manifold Learning in Quotient Spaces,"Éloi Mehr, LIP6; André Lieutier, ; Fernando Sanchez Bermudez, ; Vincent Guitteny, ; Nicolas Thome, Conservatoire national des arts et métiers; Matthieu Cord,"
2992,Poster,Dense Decoder Shortcut Connections for Single-Pass Semantic Segmentation,"Piotr Bilinski, University of Oxford; Victor Prisacariu, Oxford"
2998,Poster,Logo Synthesis and Manipulation with Clustered Generative Adversarial Networks,"Alexander Sage, ETH Zurich; Eirikur Agustsson, ETH Zurich; Radu Timofte, ETH Zurich; Luc Van Gool, KTH"
3001,Spotlight,Seeing Voices and Hearing Faces: Cross-modal biometric matching,"Arsha Nagrani, University of Oxford; Samuel Albanie, Oxford University; Andrew Zisserman, Oxford"
3001,Poster,Seeing Voices and Hearing Faces: Cross-modal biometric matching,"Arsha Nagrani, University of Oxford; Samuel Albanie, Oxford University; Andrew Zisserman, Oxford"
3005,Poster,"OLÉ: Orthogonal Low-rank Embedding, A Plug and Play Geometric Loss for Deep Learning","Jose Lezama, Universidad de la Republica, Uruguay; Qiang Qiu, ; Pablo Musé, Universidad de la Republica, Uruguay; Guillermo Sapiro, Duke"
3006,Spotlight,Compressed Video Action Recognition,"Chao-Yuan Wu, UT Austin; Manzil Zaheer, Carnegie Mellon University; Hexiang Hu, ; R. Manmatha, A9; Alexander Smola, ; Philipp Krahenbuhl,"
3006,Poster,Compressed Video Action Recognition,"Chao-Yuan Wu, UT Austin; Manzil Zaheer, Carnegie Mellon University; Hexiang Hu, ; R. Manmatha, A9; Alexander Smola, ; Philipp Krahenbuhl,"
3009,Poster,Efficient parametrization of multi-domain deep neural networks,"Sylvestre-Alvise Rebuffi, University of Oxford; Hakan Bilen, University of Oxford; Andrea Vedaldi, U Oxford"
3010,Poster,Learning Answer Embeddings for Visual Question Answering,"Hexiang Hu, ; Wei-Lun Chao, USC; Fei Sha, University of Southern California"
3012,Poster,PIXOR: Real-time 3D Object Detection from Point Clouds,"Bin Yang, Uber ATG, UofT; Wenjie Luo, Uber ATG, UofT; Raquel Urtasun, University of Toronto"
3013,Oral,"Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net","Wenjie Luo, Uber ATG, UofT; Bin Yang, Uber ATG, UofT; Raquel Urtasun, University of Toronto"
3013,Poster,"Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net","Wenjie Luo, Uber ATG, UofT; Bin Yang, Uber ATG, UofT; Raquel Urtasun, University of Toronto"
3021,Spotlight,Creating Capsule Wardrobes from Fashion Images,"Wei-Lin Hsiao, UT-Austin; Kristen Grauman,"
3021,Poster,Creating Capsule Wardrobes from Fashion Images,"Wei-Lin Hsiao, UT-Austin; Kristen Grauman,"
3022,Spotlight,Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics,"Alex Kendall, ; Yarin Gal, University of Cambridge; Roberto Cipolla,"
3022,Poster,Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics,"Alex Kendall, ; Yarin Gal, University of Cambridge; Roberto Cipolla,"
3025,Spotlight,Image Collection Pop-up: 3D Reconstruction and Clustering of Rigid and Non-Rigid Categories,"Antonio Agudo, IRI (CSIC-UPC); Francesc Moreno-Noguer, Institut de Robotica i Informatica Industrial (UPC/CSIC)"
3025,Poster,Image Collection Pop-up: 3D Reconstruction and Clustering of Rigid and Non-Rigid Categories,"Antonio Agudo, IRI (CSIC-UPC); Francesc Moreno-Noguer, Institut de Robotica i Informatica Industrial (UPC/CSIC)"
3026,Poster,Sim2Real View Invariant Visual Servoing by Recurrent Control,"Fereshteh Sadeghi, University of Washington; Alexander Toshev, Google; Sergey Levine, UC Berkeley"
3027,Poster,Spanning Patches: Deep Patch Selection for Fast Multi-View Stereo,"Alex Poms, Carnegie Mellon University; Shoou-I Yu, Oculus; Chenglei Wu, Oculus; Yaser Sheikh,"
3028,Poster,Efficient Large-scale Approximate Nearest Neighbor Search on OpenCL FPGA,"Jialiang Zhang, University of Wisconsin-Madiso; Soroosh Khoram, UW-Madison; Jing Li, University of Wisconsin-Madison"
3035,Poster,Learning distributions of shape trajectories from longitudinal datasets: a hierarchical model on a manifold of diffeomorphisms,"Alexandre Bône, Brain and Spine Institute; Olivier Colliot, Institut du Cerveau et de la Moelle épinière; Stanley Durrleman, Institut du Cerveau et de la Moelle épinière"
3054,Oral,Trapping Light for Time of Flight,"Ruilin Xu, Columbia University; Mohit Gupta, Wisconsin; Shree Nayar, Columbia University"
3054,Poster,Trapping Light for Time of Flight,"Ruilin Xu, Columbia University; Mohit Gupta, Wisconsin; Shree Nayar, Columbia University"
3057,Poster,Geometry Aware Optimization for Deep Learning: The Good Practice,"SOUMAVA KUMAR ROY, AUSTRALIAN NATIONAL UNIVERSITY; Zakaria Mhammedi, Data61, CSIRO; Mehrtash Harandi, Australian National University"
3063,Poster,Good View Hunting: Learning Photo Composition from 1 Million View Pairs,"Zijun Wei, Stony Brook University; Jianming Zhang, Adobe Research; Minh Hoai, Stony Brook University; Xiaohui Shen, Adobe Research; Zhe Lin, Adobe Systems, Inc.; Radomír Mech, ; Dimitris Samaras,"
3073,Poster,Analyzing Filters Toward Efficient ConvNet,"Takumi Kobayashi,"
3074,Oral,Feature Space Transfer for Data Augmentation,"Bo Liu, UCSD; Xudong Wang, UCSD; Mandar Dixit, UC San Diego; Roland Kwitt, ; Nuno Vasconcelos, UCSD, USA"
3074,Poster,Feature Space Transfer for Data Augmentation,"Bo Liu, UCSD; Xudong Wang, UCSD; Mandar Dixit, UC San Diego; Roland Kwitt, ; Nuno Vasconcelos, UCSD, USA"
3075,Poster,Bilateral Ordinal Relevance Multi-instance Regression for Facial Action Unit Intensity Estimation,"Yong Zhang, CASIA; Rui Zhao, Rensselaer Polytechnic Institu; Weiming Dong, ; Bao-Gang Hu, CASIA; Qiang Ji, RPI"
3099,Oral,Self-supervised Multi-level Face Model Learning for Monocular Reconstruction at over 250Hz,"Ayush Tewari, MPI Informatics; Michael Zollhöfer, MPI Informatics; Pablo Garrido, ; Florian Bernard, ; Hyeongwoo Kim, MPII; Patrick Perez, Technicolor Research; Christian Theobalt, MPI Informatics"
3099,Poster,Self-supervised Multi-level Face Model Learning for Monocular Reconstruction at over 250Hz,"Ayush Tewari, MPI Informatics; Michael Zollhöfer, MPI Informatics; Pablo Garrido, ; Florian Bernard, ; Hyeongwoo Kim, MPII; Patrick Perez, Technicolor Research; Christian Theobalt, MPI Informatics"
3102,Poster,Interpretable Video Captioning via Trajectory Structured Localization,"Xian Wu, Sysu; Guanbin Li, ; Liang Lin,"
3107,Poster,Joint Optimization Framework for Learning with Noisy Labels,"Daiki Tanaka, The University of Tokyo; Daiki Ikami, The University of Tokyo; Toshihiko Yamasaki, The University of Tokyo; Kiyoharu Aizawa,"
3108,Spotlight,3D Semantic Segmentation with Submanifold Sparse Convolutional Networks,"Benjamin Graham, Facebook AI Research; Laurens van der Maaten, Facebook; Martin Engelcke, University of Oxford"
3108,Poster,3D Semantic Segmentation with Submanifold Sparse Convolutional Networks,"Benjamin Graham, Facebook AI Research; Laurens van der Maaten, Facebook; Martin Engelcke, University of Oxford"
3112,Poster,Learning a Complete Image Indexing Pipeline,"Himalaya Jain, Inria, Technicolor; Joaquin Zepeda, ; Patrick Perez, Technicolor Research; Rémi Gribonval, Inria"
3117,Poster,Real-Time Seamless Single Shot 6D Object Pose Prediction,"Bugra Tekin, ; Sudipta Sinha, Microsoft Research; Pascal Fua,"
3118,Poster,Inferring Co-Attention in Social Scene Videos,"Lifeng Fan, VCLA@UCLA; Yixin Chen, VCLA@UCLA; Ping Wei, Xi'an Jiaotong University; Song-Chun Zhu,"
3120,Poster,A Network Architecture for Point Cloud Classification via Automatic Depth Images Generation,"Lukas Rahmann, ; Riccardo Roveri, ETH Zurich; Cengiz Oztireli, ; Markus Gross,"
3123,Poster,Blind Predicting Similar Quality Map for Image Quality Assessment,"Da Pan, Communication University of CN; Ping Shi, ; Ming Hou, ; Zefeng Ying, ; Sizhe Fu, ; Yuan Zhang,"
3124,Oral,"CodeSLAM --- Learning a Compact, Optimisable Representation for Dense Visual SLAM","Michael Bloesch, Imperial College London; Jan Czarnowski, Imperial College London; Ronald Clark, Imperial College London; Stefan Leutenegger, Imperial College London; Andrew Davison, Imperial College London UK"
3124,Poster,"CodeSLAM --- Learning a Compact, Optimisable Representation for Dense Visual SLAM","Michael Bloesch, Imperial College London; Jan Czarnowski, Imperial College London; Ronald Clark, Imperial College London; Stefan Leutenegger, Imperial College London; Andrew Davison, Imperial College London UK"
3141,Poster,Image Correction via Deep Reciprocating HDR Transformation,"Xin Yang, Dalian University of Technology, City University of Hong Kong; Ke Xu, Dalian University of Technology, City University of Hong Kong; Yibing Song, Tencent AI Lab; Qiang Zhang, Dalian University of Technology; Xiaopeng Wei, Dalian University of Technology; Rynson Lau, City University of Hong Kong"
3145,Poster,Towards Human-Machine Cooperation: Evolving Active Learning with Self-supervised Process for Object Detection,"Keze Wang, ; Liang Lin, ; Xiaopeng Yan, Sun Yat-sen University; Lei Zhang, The Hong Kong Polytechnic University"
3156,Spotlight,Modeling Facial Geometry using Compositional VAEs,"Timur Bagautdinov, ; Chenglei Wu, Oculus; Jason Saragih, Oculus Research; Pascal Fua, ; Yaser Sheikh,"
3156,Poster,Modeling Facial Geometry using Compositional VAEs,"Timur Bagautdinov, ; Chenglei Wu, Oculus; Jason Saragih, Oculus Research; Pascal Fua, ; Yaser Sheikh,"
3164,Poster,PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition,"Mikaela Angelina Uy, NUS; Gim Hee Lee, National University of Singapore"
3170,Poster,PoseFlow: A Deep Motion Representation for Understanding Human Behaviors in Videos,"Dingwen Zhang, ; Guangyu Guo, ; Dong Huang, Carnegie Mellon University; Fernando de la Torre, ; Junwei Han, Northwestern Polytechnical U."
3180,Poster,Leveraging Unlabeled Data for Crowd Counting by Learning to Rank,"Xialei Liu, Computer Vision Center of UAB; Joost van de Weijer, Computer Vision Center Barcelona; Andrew Bagdanov, Computer Vision Center, Barcelona"
3208,Poster,Deep Density Clustering of Unconstrained Faces,"Wei-An Lin, UMD; Jun-Cheng Chen, ; Carlos Castillo, ; Rama Chellappa, University of Maryland, USA"
3214,Poster,Learning Steerable Filters for Rotation Equivariant CNNs,"Maurice Weiler, Heidelberg University; Fred Hamprecht, Heidelberg University, Germany; Martin Storath,"
3217,Poster,Radially-Distorted Conjugate Translations,"James Pritts, Czech Technical University; Zuzana Kukelova, Czech Technical University in Prague; Viktor Larsson, Lund University; Ondrej Chum, Czech Technical University in Prague"
3218,Spotlight,Deep Spatio-Temporal Random Fields for Efficient Video Segmentation,"Siddhartha Chandra, INRIA; Camille Couprie, Facebook Artificial Intelligence Research; Iasonas Kokkinos, FAIR/UCL"
3218,Poster,Deep Spatio-Temporal Random Fields for Efficient Video Segmentation,"Siddhartha Chandra, INRIA; Camille Couprie, Facebook Artificial Intelligence Research; Iasonas Kokkinos, FAIR/UCL"
3219,Poster,Progressively Complementarity-aware Fusion Network for RGB-D Salient Object Detection,"Hao Chen, City University of Hong Kong; You fu Li, City University of Hong Kong"
3220,Poster,Regularizing Deep Networks by Modeling and Predicting Label Structure,"Mohammadreza Mostajabi, TTI-Chicago; Michael Maire, ; Greg Shakhnarovich,"
3225,Poster,Image Super-resolution via Dual-state Recurrent Neural Networks,"Wei Han, UIUC; Shiyu Chang, ; Ding Liu, UIUC; Michael Witbrock, ; Thomas Huang,"
3232,Spotlight,Robust Hough Transform Based 3D Reconstruction from Circular Light Fields,"Alessandro Vianello, Robert Bosch GmbH; Jens Ackermann, Robert Bosch GmbH; Maximilian Diebold, Heidelberg University; Bernd Jähne, University of Heidelberg"
3232,Poster,Robust Hough Transform Based 3D Reconstruction from Circular Light Fields,"Alessandro Vianello, Robert Bosch GmbH; Jens Ackermann, Robert Bosch GmbH; Maximilian Diebold, Heidelberg University; Bernd Jähne, University of Heidelberg"
3236,Poster,Probabilistic Joint Face-Skull Modelling for Facial Reconstruction,"Dennis Madsen, University of Basel; Marcel Lüthi, ; Andreas Schneider, ; Thomas Vetter, U. Basel"
3241,Poster,Making Convolutional Networks Recurrent for Visual Sequence Learning,"Xiaodong Yang, NVIDIA; Pavlo Molchanov, NVIDIA Research; Jan Kautz, NVIDIA"
3242,Poster,Zero-Shot Kernel Learning.,"Hongguang Zhang, Data61; Piotr Koniusz, Data61/CSIRO"
3244,Poster,Deeper Look at Power Normalizations.,"Piotr Koniusz, Data61/CSIRO; Hongguang Zhang, Data61; Fatih Porikli, NICTA, Australia"
3247,Poster,Deformable GANs for Pose-based Human Image Generation,"Aliaksandr Siarohin, DISI, University of Trento; Enver Sangineto, University of Trento; Stéphane Lathuilière, Inria; Nicu Sebe, University of Trento"
3249,Poster,On the Duality Between Retinex and Image Dehazing,"Adrian Galdran, INESC TEC Porto; Aitor Alvarez-Gila, Tecnalia / CVC-Universitat Autonoma de Barcelona; Alessandro Bria, University of Cassino and L.M.; Javier Vazquez-Corral, Universitat Pompeu Fabra; Marcelo Bertalmio,"
3254,Spotlight,LDMNet: Low Dimensional Manifold Regularized Neural Networks,"Wei Zhu, Duke University; Qiang Qiu, ; Jiaji Huang, Baidu Silicon Valley AI Lab; Robert Calderbank, Duke University; Guillermo Sapiro, Duke; Ingrid Daubechies, Duke University"
3254,Poster,LDMNet: Low Dimensional Manifold Regularized Neural Networks,"Wei Zhu, Duke University; Qiang Qiu, ; Jiaji Huang, Baidu Silicon Valley AI Lab; Robert Calderbank, Duke University; Guillermo Sapiro, Duke; Ingrid Daubechies, Duke University"
3259,Poster,Pulling Actions out of Context: Explicit Separation for Effective Combination,"Yang Wang, Stony Brook University; Minh Hoai, Stony Brook University"
3262,Poster,An Unsupervised Learning Model for Deformable Medical Image Registration,"Guha Balakrishnan, MIT; Adrian Dalca, ; Amy Zhao, MIT; Mert Sabuncu, Cornell; John Guttag,"
3264,Poster,Discrete-Continuous ADMM for Transductive Inference in Higher-Order MRFs,"Emanuel Laude, TUM; Jan-Hendrik Lange, ; Jonas Schuepfer, ; Csaba Domokos, ; Laura Leal-Taixe, Technical University of Munich; Frank Schmidt, BCAI; Bjoern Andres, ; Daniel Cremers,"
3269,Poster,Multi-Scale Weighted Nuclear Norm Image Restoration,"Noam Yair, Technion; Tomer Michaeli, Technion"
3272,Poster,Finding beans in burgers: Deep semantic-visual embedding with localization,"Patrick Perez, Technicolor Research; Matthieu Cord, ; Louis Chevallier, technicolor; Martin Engilberge, technicolor"
3275,Oral,FlipDial: A Generative Model for Two-Way Visual Dialogue,"Daniela Massiceti, University of Oxford; Siddharth Narayanaswamy, University of Oxford; Puneet Kumar Dokania, University of Oxford; Phil Torr, Oxford"
3275,Poster,FlipDial: A Generative Model for Two-Way Visual Dialogue,"Daniela Massiceti, University of Oxford; Siddharth Narayanaswamy, University of Oxford; Puneet Kumar Dokania, University of Oxford; Phil Torr, Oxford"
3279,Poster,Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning,"David Mascharka, MIT Lincoln Laboratory; Philip Tran, Planck Aerosystems; Ryan Soklaski, MIT Lincoln Laboratory; Arjun Majumdar, MIT Lincoln Laboratory"
3280,Spotlight,Webly Supervised Learning Meets Zero-shot Learning: A Hybrid Approach for Fine-grained Classification,"Li Niu, Rice University; Ashok Veeraraghavan, Rice University; Ashutosh Sabharwal,"
3280,Poster,Webly Supervised Learning Meets Zero-shot Learning: A Hybrid Approach for Fine-grained Classification,"Li Niu, Rice University; Ashok Veeraraghavan, Rice University; Ashutosh Sabharwal,"
3284,Spotlight,Cross-modal Deep Variational Hand Pose Estimation,"Adrian Spurr, ETH Zurich; Jie Song, ETHZ; Seonwook Park, ETH Zurich; Otmar Hilliges, ETH Zurich"
3284,Poster,Cross-modal Deep Variational Hand Pose Estimation,"Adrian Spurr, ETH Zurich; Jie Song, ETHZ; Seonwook Park, ETH Zurich; Otmar Hilliges, ETH Zurich"
3285,Poster,Occlusion Aware Unsupervised Learning of Optical Flow,"Yang Wang, Baidu USA; Yi Yang, ; Zhenheng Yang, ; Liang Zhao, Baidu USA; Wei Xu,"
3286,Poster,PAD-Net: Multi-Tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing,"Dan Xu, ; Wanli Ouyang, The University of Sydney; Xiaogang Wang, Chinese University of Hong Kong; Nicu Sebe, University of Trento"
3289,Oral,OATM: Occlusion Aware Template Matching by Consensus Set Maximization,"Simon Korman, Weizmann Institute; Mark Milam, NGC; Stefano Soatto, UCLA"
3289,Poster,OATM: Occlusion Aware Template Matching by Consensus Set Maximization,"Simon Korman, Weizmann Institute; Mark Milam, NGC; Stefano Soatto, UCLA"
3292,Spotlight,MX-LSTM: mixing tracklets and vislets to jointly forecast trajectories and head poses,"Irtiza Hasan, University of Verona; Francesco Setti, ; Theodore Tsesmelis, ; Alessio Del Bue, Istituto Italiano di Tecnologia (IIT); Fabio Galasso, ; Marco Cristani, U. Verona"
3292,Poster,MX-LSTM: mixing tracklets and vislets to jointly forecast trajectories and head poses,"Irtiza Hasan, University of Verona; Francesco Setti, ; Theodore Tsesmelis, ; Alessio Del Bue, Istituto Italiano di Tecnologia (IIT); Fabio Galasso, ; Marco Cristani, U. Verona"
3295,Poster,Fooling Vision and Language Models Despite Localization and Attention Mechanism,"Xiaojun Xu, Shanghai Jiao Tong University; Xinyun Chen, UC Berkeley; Chang Liu, UC Berkeley; Anna Rohrbach, UC Berkeley; Trevor Darrell, UC Berkeley, USA; Dawn Song, UC Berkeley"
3298,Spotlight,Learning Transferable Architectures for Scalable Image Recognition,"Barret Zoph, Google; Vijay Vasudevan, Google; Jonathon Shlens, Google; Quoc Le, Google"
3298,Poster,Learning Transferable Architectures for Scalable Image Recognition,"Barret Zoph, Google; Vijay Vasudevan, Google; Jonathon Shlens, Google; Quoc Le, Google"
3299,Poster,4DFAB: A Large Scale 4D Database for Facial Expression Analysis and Biometric Applications,"Shiyang Cheng, Imperial College London; Irene Kotsia, Middlesex University London; Maja Pantic, Imperial College London, UK; Stefanos Zafeiriou, Imperial College London"
3303,Poster,An Efficient and Provable Approach for Mixture Proportion Estimation Using Linear Independence Assumption,"Xiyu Yu, The University of Sydney; Tongliang Liu, The University of Sydney; Mingming Gong, ; Kayhan Batmanghelich, University of Pittsburgh; Dacheng Tao, University of Sydney"
3308,Spotlight,MovieGraphs: Towards Understanding Human-Centric Situations from Videos,"Paul Vicol, University of Toronto; Makarand Tapaswi, University of Toronto; Lluís Castrejón, ; Sanja Fidler,"
3308,Poster,MovieGraphs: Towards Understanding Human-Centric Situations from Videos,"Paul Vicol, University of Toronto; Makarand Tapaswi, University of Toronto; Lluís Castrejón, ; Sanja Fidler,"
3311,Poster,Nonlocal Low-Rank Tensor Factor Analysis for Image Restoration,"Xinyuan Zhang, Duke University; Xin Yuan, Nokia Bell Labs; Lawrence Carin,"
3315,Poster,Hierarchical Recurrent Attention Networks for Structured Online Maps,"Namdar Homayounfar, Uber ATG; Wei-Chiu Ma, MIT; Shrinidhi Kowshika Lakshmikanth, Uber ATG; Raquel Urtasun, University of Toronto"
3318,Oral,Surface Networks,"Ilya Kostrikov, NYU; Joan Bruna, New York University; Daniele Panozzo, NYU; Denis Zorin, NYU"
3318,Poster,Surface Networks,"Ilya Kostrikov, NYU; Joan Bruna, New York University; Daniele Panozzo, NYU; Denis Zorin, NYU"
3324,Spotlight,Deep Depth Completion of a Single RGB-D Image,"Yinda Zhang, Princeton; Thomas Funkhouser, Princeton"
3324,Poster,Deep Depth Completion of a Single RGB-D Image,"Yinda Zhang, Princeton; Thomas Funkhouser, Princeton"
3325,Poster,Convolutional Image Captioning,"Jyoti Aneja, UIUC; Aditya Deshpande, University of Illinois at UC; Alex Schwing,"
3327,Poster,Geometric Multi-Model Fitting with a Convex Relaxation Algorithm,"Paul Amayo, Oxford; Pedro Pinies, University of Oxford; Lina Paz, University of Oxford; Paul Newman, University of Oxford"
3329,Spotlight,Lose The Views: Limited Angle CT Reconstruction via Implicit Sinogram Completion,"Rushil Anirudh, Lawrence Livermore National La; Hyojin Kim, Lawrence Livermore National Laboratory; Jayaraman J. Thiagarajan, LLNL; K. Aditya Mohan, Lawrence Livermore National Laboratory; Kyle Champley, Lawrence Livermore National Laboratory; Timo Bremer, Lawrence Livermore National Laboratory"
3329,Poster,Lose The Views: Limited Angle CT Reconstruction via Implicit Sinogram Completion,"Rushil Anirudh, Lawrence Livermore National La; Hyojin Kim, Lawrence Livermore National Laboratory; Jayaraman J. Thiagarajan, LLNL; K. Aditya Mohan, Lawrence Livermore National Laboratory; Kyle Champley, Lawrence Livermore National Laboratory; Timo Bremer, Lawrence Livermore National Laboratory"
3330,Spotlight,Now You Shake Me: Towards Automatic 4D Cinema,"Yuhao Zhou, University of Toronto; Makarand Tapaswi, University of Toronto; Sanja Fidler,"
3330,Poster,Now You Shake Me: Towards Automatic 4D Cinema,"Yuhao Zhou, University of Toronto; Makarand Tapaswi, University of Toronto; Sanja Fidler,"
3333,Poster,VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection,"Yin Zhou, Lawrence Berkeley National Lab; Oncel Tuzel,"
3335,Poster,Image to Image Translation for Domain Adaptation,"Zak Murez, UCSD; Soheil Kolouri, HRL Laboratories, LLC; David Kriegman, University of California at San Diego; Ravi Ramamoorthi, University of California, San Diego; Kyungnam Kim, HRL Laboratories"
3340,Spotlight,"""Learning-Compression"" algorithms for neural net pruning","Miguel Carreira-perpinan, UC Merced; Yerlan Idelbayev, UC Merced"
3340,Poster,"""Learning-Compression"" algorithms for neural net pruning","Miguel Carreira-perpinan, UC Merced; Yerlan Idelbayev, UC Merced"
3342,Spotlight,Reinforcement Cutting-Agent Learning for Video Object Segmentation,"Junwei Han, Northwestern Polytechnical U.; Le Yang, Northwestern Polytechnical Uni; Dingwen Zhang, ; Xiaojun Chang, Carnegie Mellon University; Xiaodan Liang, Carnegie Mellon University"
3342,Poster,Reinforcement Cutting-Agent Learning for Video Object Segmentation,"Junwei Han, Northwestern Polytechnical U.; Le Yang, Northwestern Polytechnical Uni; Dingwen Zhang, ; Xiaojun Chang, Carnegie Mellon University; Xiaodan Liang, Carnegie Mellon University"
3345,Poster,CNN Driven Sparse Multi-Level B-spline Image Registration,"Pingge Jiang, Drexel University; James Shackleford, Drexel University"
3349,Poster,DocUNet: Document Image Unwarping via A Stacked U-Net,"Ke Ma, Stony Brook University; Zhixin Shu, Stony Brook University; Xue Bai, Megvii Inc; Jue Wang, 
Megvii; Dimitris Samaras,"3350,Poster,Texture Mapping for 3D Reconstruction with RGB-D Sensor,"Yanping Fu, WuHan University; Qingan Yan, JD.com; Long Yang, Northwest A&F University; Jie Liao , WuHan University; Chunxia Xiao, Wuhan University"3352,Poster,Sliced Wasserstein Distance for Learning Gaussian Mixture Models,"Soheil Kolouri, HRL Laboratories, LLC; Gustavo Rohde, University Virginia ; Heiko Hoffmann, HRL Laboratories, LLC"3362,Poster,Convolutional Sequence to Sequence Model for Human Dynamics,"Chen Li, ; Zhen Zhang, National University of Singapore; Wee Sun Lee, ; Gim Hee Lee, National Univeristy of Singapore"3373,Poster,Composing Two Objects of Interest for Flying Camera Photography,"ZIQUAN LAN, NUS; David Hsu, NUS; Gim Hee Lee, National University of SIngapore"3375,Poster,Time-resolved Light Transport Decomposition for Thermal Photometric Stereo,"Nobuhiro Ikeya, NAIST; Kenichiro Tanaka, NAIST; Tsuyoshi Takatani, NAIST; Hiroyuki Kubo, ; Takuya Funatomi, NAIST; Yasuhiro Mukaigawa, NAIST"3381,Spotlight,Large-scale Distance Metric Learning with Uncertainty,"Qi Qian, Alibaba Group; Shenghuo Zhu, Alibaba Group; Rong Jin, Alibaba Group; Jiasheng Tang, Alibaba Group; Hao Li, Alibaba Group"3381,Poster,Large-scale Distance Metric Learning with Uncertainty,"Qi Qian, Alibaba Group; Shenghuo Zhu, Alibaba Group; Rong Jin, Alibaba Group; Jiasheng Tang, Alibaba Group; Hao Li, Alibaba Group"3383,Poster,Aligning Infinite-Dimensional Covariance Matrices in Reproducing Kernel Hilbert Spaces for Domain Adaptation,"Zhen Zhang, WASHINGTON UNIVERSITY IN ST.LO; Mianzhi Wang, WASHINGTON UNIVERSITY IN ST.LOUIS; Yan Huang, ; Arye Nehorai, WASHINGTON UNIVERSITY IN ST.LOUIS"3392,Poster,R-FCN-3000 at 30fps: Decoupling Detection and Classification,"Bharat Singh, ; Hengduo Li, ; Abhishek Sharma, ; Larry Davis, University of Maryland, USA"3394,Spotlight,Distributable Consistent Multi-Graph Matching,"Nan Hu, Stanford Unviversity; Boris Thibert, ; Leonidas J. 
Guibas,"3394,Poster,Distributable Consistent Multi-Graph Matching,"Nan Hu, Stanford Unviversity; Boris Thibert, ; Leonidas J. Guibas,"3399,Oral,VirtualHome: Simulating Household Activities via Programs,"Xavier Puig, MIT; Kevin Ra, ; Marko Boben, ; Jiaman Li, University of Toronto; Tingwu Wang, ; Sanja Fidler, ; Antonio Torralba, MIT"3399,Poster,VirtualHome: Simulating Household Activities via Programs,"Xavier Puig, MIT; Kevin Ra, ; Marko Boben, ; Jiaman Li, University of Toronto; Tingwu Wang, ; Sanja Fidler, ; Antonio Torralba, MIT"3407,Poster,Robust Physical-World Attacks on Deep Learning Visual Classification,"Ivan Evtimov, University of Washington; Kevin Eykholt, University of Michigan; Earlence Fernandes, University of Washington; Tadayoshi Kohno, University of Washington; Bo Li, UC Berkeley; Atul Prakash, University of Michigan; Amir Rahmati, University of Michigan; Chaowei Xiao, University of Michigan; Dawn Song, UC Berkeley"3408,Poster,Feature Super-Resolution: Make Machine See More Clearly,"Weimin Tan, Fudan University; Bo Yan, Fudan University; Bahetiyaer Bare, Fudan University"3409,Poster,Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++,"David Acuna, University of Toronto; Huan Ling, UofT; Amlan Kar, University of Toronto; Sanja Fidler,"3412,Poster,CLEAR: Cumulative LEARning for One-Shot One-Class Image Recognition,"Jedrzej Kozerawski, UCSB; Matthew Turk, UC Santa Barbara USA"3427,Poster,"Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation","Mark Sandler, Google; Andrew Howard, Google; Menglong Zhu, ; Andrey Zhmoginov, Google; Liang-Chieh Chen,"3430,Spotlight,Learning Nested Structures in Deep Neural Networks,"Eunwoo Kim, Seoul National University; Chanho Ahn, Seoul National University; Songhwai Oh, Seoul National University"3430,Poster,Learning Nested Structures in Deep Neural Networks,"Eunwoo Kim, Seoul National University; Chanho Ahn, Seoul National University; 
Songhwai Oh, Seoul National University"3435,Poster,CleanNet: Transfer Learning for Scalable Image Classifier Training with Label Noise,"Kuang-Huei Lee, Microsoft; Xiaodong He, ; Lei Zhang, Microsoft; Linjun Yang, Facebook"3436,Poster,Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN,"Shuai Li, University of Wollongong; Wanqing Li, ; Chris Cook, University of Wollongong; Ce Zhu, University of Electronic Science and Technology of China; Yanbo Gao, University of Electronic Science and Technology of China"3438,Spotlight,"Lions and Tigers and Bears: Capturing Non-Rigid, 3D, Articulated Shape from Images","Silvia Zuffi, IMATI-CNR; Angjoo Kanazawa, University of Maryland; Michael Black, Max Planck Institute for Intelligent Systems"3438,Poster,"Lions and Tigers and Bears: Capturing Non-Rigid, 3D, Articulated Shape from Images","Silvia Zuffi, IMATI-CNR; Angjoo Kanazawa, University of Maryland; Michael Black, Max Planck Institute for Intelligent Systems"3442,Poster,MoNet: Moments Embedding Network,"Mengran Gou, Northeastern University; Fei Xiong, University of Southern California ; Octavia Camps, Northeastern University, USA; Mario Sznaier,"3446,Poster,Self-calibrating polarising radiometric calibration,"Daniel Teo, SUTD; Boxin Shi, Peking University; Yinqiang Zheng, National Institute of Informatics, Japan; Sai-Kit Yeung,"3448,Poster,Representing and Learning High Dimensional Data with the Optimal Transport Map from a Probabilistic Viewpoint,"Serim Park, Oath; Matthew Thorpe,"3451,Poster,Differential Attention for Visual Question Answering,"Badri Patro, IIT Kanpur; Vinay P. 
Namboodiri, Indian Institute of Technology Kanpur"
3454,Poster,Deep Ordinal Regression Network for Monocular Depth Estimation,"Huan Fu, The University of Sydney; Mingming Gong, ; Chaohui Wang, Université Paris-Est; Kayhan Batmanghelich, University of Pittsburgh; Dacheng Tao, University of Sydney"
3460,Poster,ClusterNet: Detecting Small Objects in Large Scenes by Exploiting Spatio-Temporal Information,"Rodney LaLonde, University of Central Florida; Dong Zhang, University of Central Florida; Mubarak Shah, UCF"
3468,Poster,Seeing Small Faces from Robust Anchor's Perspective,"Chenchen Zhu, Carnegie Mellon University; Ran Tao, Carnegie Mellon University; Khoa Luu, ; Marios Savvides,"
3470,Poster,Gesture Recognition: Focus on the Hands,"Pradyumna Narayana, Colorado State University; Ross Beveridge, Colorado State University; Bruce Draper, Colorado State University"
3482,Spotlight,A Common Framework for Interactive Texture Transfer,"Yifang Men, Peking University; Zhouhui Lian, ; Jianguo Xiao, Peking University"
3482,Poster,A Common Framework for Interactive Texture Transfer,"Yifang Men, Peking University; Zhouhui Lian, ; Jianguo Xiao, Peking University"
3483,Poster,PieAPP: Perceptual Image-Error Assessment through Pairwise Preference,"Ekta Prashnani, UCSB; Hong Cai, University of California, Santa Barbara; Yasamin Mostofi, UCSB; Pradeep Sen, University of California, Santa Barbara"
3489,Spotlight,Guide Me: Interacting with Deep Networks,"Christian Rupprecht, Technische Universität München; Iro Laina, ; Federico Tombari, ; Nassir Navab, Technical University of Munich; Gregory Hager, Johns Hopkins University"
3489,Poster,Guide Me: Interacting with Deep Networks,"Christian Rupprecht, Technische Universität München; Iro Laina, ; Federico Tombari, ; Nassir Navab, Technical University of Munich; Gregory Hager, Johns Hopkins University"
3491,Spotlight,Structured Attention Guided Convolutional Neural Fields for Monocular Depth Estimation,"Dan Xu, ; Wei Wang, University of 
Trento; Hao Tang, University of Trento; Nicu Sebe, University of Trento; Elisa Ricci, U. Perugia"3491,Poster,Structured Attention Guided Convolutional Neural Fields for Monocular Depth Estimation,"Dan Xu, ; Wei Wang, University of Trento; Hao Tang, University of Trento; Nicu Sebe, University of Trento; Elisa Ricci, U. Perugia"3494,Poster,FFNet: Video Fast-Forwarding via Reinforcement Learning,"Shuyue Lan, Northwestern University; Rameswar Panda, UC Riverside; Qi Zhu, UC Riverside; Amit Roy-Chowdhury, UC Riverside"3495,Poster,Two can play this Game: Visual Dialog with Discriminative Visual Question Generation and Visual Question Answering,"Unnat Jain, UIUC; Lana Lazebnik, ; Alex Schwing,"3502,Poster,A Prior-Less Method for Multi-Face Tracking in Unconstrained Videos,"CHUNG-CHING LIN, IBM Research; Ying Hung, Rutgers University"3508,Poster,Analytical Modeling of Vanishing Points and Curves in Catadioptric Cameras,"Pedro Miraldo, Instituto Superior Tcnico, Lisboa; Francisco Girbal Eiras, University of Oxford; Srikumar Ramalingam,"3516,Oral,Egocentric Activity Recognition on a Budget,"Rafael Possas, University of Sydney; Sheila Maricela Pinto Caceres, University of Sydney; Fabio Ramos, University of Sydney"3516,Poster,Egocentric Activity Recognition on a Budget,"Rafael Possas, University of Sydney; Sheila Maricela Pinto Caceres, University of Sydney; Fabio Ramos, University of Sydney"3523,Spotlight,TextureGAN: Controlling Deep Image Synthesis with Texture Patches,"Wenqi Xian, ; Patsorn Sangkloy, Georgia Institute of Technology; Varun Agrawal, ; Amit Raj, Georgia Institute of Technolog; Jingwan Lu, Adobe Research; Chen Fang, Adobe Research; Fisher Yu, UC Berkeley; James Hays, Georgia Tech"3523,Poster,TextureGAN: Controlling Deep Image Synthesis with Texture Patches,"Wenqi Xian, ; Patsorn Sangkloy, Georgia Institute of Technology; Varun Agrawal, ; Amit Raj, Georgia Institute of Technolog; Jingwan Lu, Adobe Research; Chen Fang, Adobe Research; Fisher Yu, UC 
Berkeley; James Hays, Georgia Tech"3524,Spotlight,Rethinking Feature Distribution for Loss Functions in Image Classification,"Weitao Wan, Tsinghua University; Yuanyi Zhong, UIUC; Tianpeng Li, Tsinghua University; Jiansheng Chen, Tsinghua University"3524,Poster,Rethinking Feature Distribution for Loss Functions in Image Classification,"Weitao Wan, Tsinghua University; Yuanyi Zhong, UIUC; Tianpeng Li, Tsinghua University; Jiansheng Chen, Tsinghua University"3531,Poster,Coding Kendall's Shape Trajectories for 3D Action Recognition,"Amor Ben Tanfous, IMT Lille Douai; Hassen Drira, IMT Lille Douai; Boulbaba Ben Amor, IMT Lille Douai"3535,Poster,Latent RANSAC,"Simon Korman, Weizmann Institute; Roee Litman, Tel-Aviv University"3536,Spotlight,Connecting Pixels to Privacy and Utility: Automatic Redaction of Private Information in Images,"Tribhuvanesh Orekondy, MPI-INF; Mario Fritz, MPI, Saarbrucken, Germany; Bernt Schiele, MPI Informatics Germany"3536,Poster,Connecting Pixels to Privacy and Utility: Automatic Redaction of Private Information in Images,"Tribhuvanesh Orekondy, MPI-INF; Mario Fritz, MPI, Saarbrucken, Germany; Bernt Schiele, MPI Informatics Germany"3548,Spotlight,Boosting Domain Adaptation by Discovering Latent Domains,"Massimiliano Mancini, Sapienza University of Rome; Lorenzo Porzi, Mapillary Research; Samuel Rota Bul, Mapillary Research; Barbara Caputo, University of Rome La Sapienza, Italy; Elisa Ricci, U. Perugia"3548,Poster,Boosting Domain Adaptation by Discovering Latent Domains,"Massimiliano Mancini, Sapienza University of Rome; Lorenzo Porzi, Mapillary Research; Samuel Rota Bul, Mapillary Research; Barbara Caputo, University of Rome La Sapienza, Italy; Elisa Ricci, U. 
Perugia"3552,Poster,Fast and Robust Estimation for Unit-Norm Constrained Linear Fitting Problems,"Daiki Ikami, The University of Tokyo; Toshihiko Yamasaki, The University of Tokyo; Kiyoharu Aizawa,"3558,Poster,Local and Global Optimization Techniques in Graph-based Clustering,"Daiki Ikami, The University of Tokyo; Toshihiko Yamasaki, The University of Tokyo; Kiyoharu Aizawa,"3569,Poster,Generating a Fusion Image: One' s Identity and Another's Shape,"DongGyu Joo, KAIST; Doyeon Kim, KAIST; Junmo Kim, KAIST"3576,Poster,Categorizing Concepts with Basic Level for Vision-to-Language,"Hanzhang Wang, Tongji University; Hanli Wang, Tongji University; Kaisheng Xu, Tongji University"3577,Poster,Importance Weighted Adversarial Nets for Partial Domain Adaptation,"Jing Zhang, University of Wollongong; Zewei Ding, University of Wollongong; Wanqing Li, ; Philip Ogunbona, University of Wollongong"3580,Poster,AON: Towards Arbitrarily-Oriented Text Recognition,"Zhanzhan Cheng, Hikvision Research Institute; Yangliu Xu, Tongji University; Fan Bai, Fudan University; Yi Niu, Hikvision Research Institute; Shiliang Pu, ; Shuigeng Zhou, Fudan University"3583,Poster,Towards dense object tracking in a 2D honeybee hive,"Katarzyna Bozek, Okinawa Institute of Science a; Laetitia Hebert, ; Alexander Mikheyev, ; Greg Stephens, OIST Graduate University and Vrije Universiteit Amsterdam"3586,Oral,Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering,"Nguyen Duy Kien, Tohoku University; Takayuki Okatani, Tohoku University/RIKEN AIP"3586,Poster,Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering,"Nguyen Duy Kien, Tohoku University; Takayuki Okatani, Tohoku University/RIKEN AIP"3595,Oral,Efficient Optimization for Rank-based Loss Functions,"Pritish Mohapatra, IIIT Hyderabad; Michal Rolinek, Max Planck Institute for Intelligent Systems, Tuebingen; C.V. 
Jawahar, IIIT Hyderabad; Vladimir Kolmogorov, Institute of Science and Technology, Austria; M. Pawan Kumar,"3595,Poster,Efficient Optimization for Rank-based Loss Functions,"Pritish Mohapatra, IIIT Hyderabad; Michal Rolinek, Max Planck Institute for Intelligent Systems, Tuebingen; C.V. Jawahar, IIIT Hyderabad; Vladimir Kolmogorov, Institute of Science and Technology, Austria; M. Pawan Kumar,"3599,Poster,Learning Less is More - 6D Camera Localization via 3D Surface Regression,"Eric Brachmann, TU Dresden; Carsten Rother, University of Heidelberg"3602,Spotlight,xUnit: Learning a Spatial Activation Function for Efficient Image Restoration,"Idan Kligvasser, Technion; Tamar Rott Shaham, Technion; Tomer Michaeli, Technion"3602,Poster,xUnit: Learning a Spatial Activation Function for Efficient Image Restoration,"Idan Kligvasser, Technion; Tamar Rott Shaham, Technion; Tomer Michaeli, Technion"3604,Poster,Multi-task Learning by Maximizing Statistical Dependence,"Youssef Alami Mejjati, University of Bath; Darren Cosker, University of Bath; Kwang In Kim, University of Bath"3606,Poster,Deep Back-Projection Networks For Super-Resolution,"Muhammad Haris, Toyota Technological Institute; Greg Shakhnarovich, ; Norimichi Ukita, NAIST"3617,Poster,Encoder-Decoder Alignment for Zero-Pair Image-to-Image Translation,"Yaxing Wang, Computer vision center; Joost van de Weijer, Computer Vision Center Barcelona; Luis Herranz, Computer Vision Center"3618,Poster,Dynamic Feature Learning for Partial Face Recognition,"Lingxiao He, Institute of AutomationChines; Haiqing Li, ; qi Zhang, ; Zhenan Sun, CRIPAC"3623,Oral,MakeupGAN: Makeup Transfer via Cycle-Consistent Adversarial Networks,"Huiwen Chang, ; Jingwan Lu, Adobe Research; Fisher Yu, UC Berkeley; Adam Finkelstein, Princeton University"3623,Poster,MakeupGAN: Makeup Transfer via Cycle-Consistent Adversarial Networks,"Huiwen Chang, ; Jingwan Lu, Adobe Research; Fisher Yu, UC Berkeley; Adam Finkelstein, Princeton University"
3626,Oral,Revisiting Deep Intrinsic Image Decompositions,"Qingnan Fan, Shandong University; David Wipf, Microsoft Research Asia; Jiaolong Yang, Microsoft Research Asia; Gang Hua, Microsoft Research; Baoquan Chen,"
3626,Poster,Revisiting Deep Intrinsic Image Decompositions,"Qingnan Fan, Shandong University; David Wipf, Microsoft Research Asia; Jiaolong Yang, Microsoft Research Asia; Gang Hua, Microsoft Research; Baoquan Chen,"
3629,Poster,Multi-Image Semantic Matching by Mining Consistent Features,"Qianqian Wang, Zhejiang University; Xiaowei Zhou, Zhejiang University; Kostas Daniilidis, University of Pennsylvania"
3630,Poster,Indoor RGB-D Compass from a Single Line and Plane,"Pyojin Kim, Seoul National University; Brian Coltin, NASA Ames Research Center; H. Jin Kim,"
3633,Poster,Learning Face Age Progression: A Pyramid Architecture of GANs,"Hongyu Yang, BEIHANG UNIVERSITY; Di Huang, ; Yunhong Wang, ; Anil Jain, MSU"
3636,Poster,Multispectral Image Intrinsic Decomposition via Low Rank Constraint,"Qian Huang, Nanjing University; Zhu Weixin, Nanjing university; Yang Zhao, Nanjing University; Linsen Chen, Nanjing University; yao wang, new york university; Tao Yue, Nanjing Univ.; Xun Cao, EE Department, Nanjing Univ"
3646,Poster,Non-Linear Temporal Subspace Representations for Activity Recognition,"Anoop Cherian, ; Suvrit Sra, MIT; Stephen Gould, Australian National University; Richard Hartley, Australian National University Australia"
3656,Spotlight,CondenseNet: An Efficient DenseNet using Learned Group Convolutions,"Gao Huang, ; Shichen Liu, Tsinghua University; Laurens van der Maaten, Facebook; Kilian Weinberger, Cornell University"
3656,Poster,CondenseNet: An Efficient DenseNet using Learned Group Convolutions,"Gao Huang, ; Shichen Liu, Tsinghua University; Laurens van der Maaten, Facebook; Kilian Weinberger, Cornell University"
3659,Spotlight,Link and code: Fast indexing with graphs and compact regression codes,"Matthijs Douze, ; Herve Jegou, Facebook AI Research"
3659,Poster,Link and code: Fast indexing with graphs and compact regression codes,"Matthijs Douze, ; Herve Jegou, Facebook AI Research"
3662,Oral,StarGAN: Unified Generative Adversarial Networks for Controllable Multi-Domain Image-to-Image Translation,"Jaegul Choo, Korea University; Jung-Woo Ha, NAVER Corp; Munyoung Kim, The College of New Jersey; Yunjey Choi, Korea University; Minje Choi, Korea University; Sunghun Kim, HKUST"
3662,Poster,StarGAN: Unified Generative Adversarial Networks for Controllable Multi-Domain Image-to-Image Translation,"Jaegul Choo, Korea University; Jung-Woo Ha, NAVER Corp; Munyoung Kim, The College of New Jersey; Yunjey Choi, Korea University; Minje Choi, Korea University; Sunghun Kim, HKUST"
3673,Spotlight,Learning Deep Descriptors with Scale-Aware Triplet Networks,"Michel Keller, ETH Zürich; Zetao Chen, ETH Zurich; Fabiola Maffra, ETH Zürich; Patrik Schmuck, ETH Zurich; Margarita Chli, ETH Zurich"
3673,Poster,Learning Deep Descriptors with Scale-Aware Triplet Networks,"Michel Keller, ETH Zürich; Zetao Chen, ETH Zurich; Fabiola Maffra, ETH Zürich; Patrik Schmuck, ETH Zurich; Margarita Chli, ETH Zurich"
3680,Poster,MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features,"Liang-Chieh Chen, ; Alexander Hermans, RWTH Aachen University; George Papandreou, Google Inc.; Florian Schroff, Google Inc.; Peng Wang, Baidu; Hartwig Adam, Google"
3687,Poster,Robust Classification with Convolutional Prototype Learning,"Hong-Ming Yang, Institute of Automation, Chinese Academy of Sciences; Xu-Yao Zhang, Institute of Automation, Chinese Academy of Sciences; Fei Yin, Institute of Automation, Chinese Academy of Sciences; cheng-lin Liu,"
3695,Poster,Normalized Cut Loss for Weakly Supervised CNN Segmentation,"Meng Tang, UWO; Federico Perazzi, Disney Research Zurich; Abdelaziz Djelouah, The Walt Disney Company; Yuri Boykov, University of Western Ontario; Christopher Schroers, Disney Research Zurich"
3702,Poster,Structured 
Uncertainty Prediction Networks,"Garoe Dorta, University of Bath; Sara Vicente, Anthropics Technology Ltd; Lourdes Agapito, University College London; Neill Campbell, University of bath; Ivor Simpson, Anthropics Technology Ltd"3711,Poster,CLIP-Q: Deep Network Compression Learning by In-Parallel Pruning-Quantization,"Frederick Tung, Simon Fraser University; Greg Mori,"3713,Poster,Inference in Higher Order MRF-MAP Problems with Small and Large Cliques,"Ishant Shanu, Iiit delhi; Chetan Arora, Indraprastha Institute of Information Technology Delhi; S.N. Maheshwari, IIT Delhi"3718,Oral,Ordinal Depth Supervision for 3D Human Pose Estimation,"Georgios Pavlakos, ; Xiaowei Zhou, Zhejiang University; Kostas Daniilidis, University of Pennsylvania"3718,Poster,Ordinal Depth Supervision for 3D Human Pose Estimation,"Georgios Pavlakos, ; Xiaowei Zhou, Zhejiang University; Kostas Daniilidis, University of Pennsylvania"3722,Poster,Generative Modeling using the Sliced Wasserstein Distance,"Ishan Deshpande, UIUC; Ziyu Zhang, Snap Research; Alex Schwing,"3723,Oral,Multi-Cell Classification by Convolutional Dictionary Learning with Class Proportion Priors,"Florence Yellin, Johns Hopkins University; Benjamin Haeffele, Johns Hopkins University; Rene Vidal, Johns Hopkins University"3723,Poster,Multi-Cell Classification by Convolutional Dictionary Learning with Class Proportion Priors,"Florence Yellin, Johns Hopkins University; Benjamin Haeffele, Johns Hopkins University; Rene Vidal, Johns Hopkins University"3727,Poster,CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes,"Yuhong Li, Beijing Univ. 
of Posts & Tels; Xiaofan Zhang, UIUC; deming Chen, UIUC"3736,Poster,Learning to Estimate 3D Human Pose and Shape from a Single Color Image,"Georgios Pavlakos, ; Luyang Zhu, Peking University; Xiaowei Zhou, Zhejiang University; Kostas Daniilidis, University of Pennsylvania"3739,Poster,Revisiting knowledge transfer for training object class detectors,"Jasper Uijlings, Google; Stefan Popov, Google; Vitto Ferrari,"3741,Spotlight,Learning Intelligent Dialogs for Bounding Box Annotation,"Ksenia Konyushkova, Google; Jasper Uijlings, Google; Christoph Lampert, ; Vittorio Ferrari, google"3741,Poster,Learning Intelligent Dialogs for Bounding Box Annotation,"Ksenia Konyushkova, Google; Jasper Uijlings, Google; Christoph Lampert, ; Vittorio Ferrari, google"3751,Poster,Aperture Supervision for Monocular Depth Estimation,"Pratul Srinivasan, Berkeley; Rahul Garg, ; Neal Wadhwa, ; Ren Ng, Berkeley; Jonathan Barron, Google"3761,Spotlight,Burst Denoising with Kernel Prediction Networks,"Ben Mildenhall, UC Berkeley; Jiawen Chen, Google; Jonathan Barron, Google; Robert Carroll, Google; Dillon Sharlet, ; Ren Ng, Berkeley"3761,Poster,Burst Denoising with Kernel Prediction Networks,"Ben Mildenhall, UC Berkeley; Jiawen Chen, Google; Jonathan Barron, Google; Robert Carroll, Google; Dillon Sharlet, ; Ren Ng, Berkeley"3771,Spotlight,Art of singular vectors and universal adversarial perturbations,"Valentin Khrulkov, Skoltech; Ivan Oseledets, Skoltech"3771,Poster,Art of singular vectors and universal adversarial perturbations,"Valentin Khrulkov, Skoltech; Ivan Oseledets, Skoltech"3780,Poster,A Weighted Sparse Sampling and Smoothing Frame Transition Approach for Semantic Fast-Forward First-Person Videos,"Michel Silva, Universidade de Minas Gerais; Washington Luis Ramos, Universidade Federal de Minas Gerais; Joo Pedro Ferreira, Universidade Federal de Minas Gerais; Felipe Chamone, Universidade Federal de Minas Gerais; Mario F Campos, Universidade Federal de Minas Gerais; Erickson 
Nascimento, Universidade Federal de Minas Gerais"3791,Spotlight,"A Low Power, High Throughput, Fully Event-Based Stereo System","Alexander Andreopoulos, IBM Research; Hirak Kashyap, UC Irvine and IBM; Tapan Nayak, IBM; Arnon Amir, IBM; Myron Flickner, IBM"3791,Poster,"A Low Power, High Throughput, Fully Event-Based Stereo System","Alexander Andreopoulos, IBM Research; Hirak Kashyap, UC Irvine and IBM; Tapan Nayak, IBM; Arnon Amir, IBM; Myron Flickner, IBM"3795,Poster,Learning Latent Super-Events to Detect Multiple Activities in Videos,"AJ Piergiovanni, Indiana University; Michael Ryoo, Indiana University"3798,Poster,Active Fixation Control to Predict Saccade Sequences,"Calden Wloka, York University; Iuliia Kotseruba, York University; John Tsotsos, York University Canada"3805,Poster,Two-Stream Convolutional Networks for Dynamic Texture Synthesis,"Matthew Tesfaldet, York University; Marcus Brubaker, York University; Konstantinos Derpanis, Ryerson University"3807,Spotlight,Jointly Localizing and Describing Events for Dense Video Captioning,"Yehao Li, Sun Yat-Sen University; Ting Yao, Microsoft Research Asia; Yingwei Pan, University of Science and Technology of China; Hongyang Chao, Sun Yat-sen University; Tao Mei, Microsoft Research Asia"3807,Poster,Jointly Localizing and Describing Events for Dense Video Captioning,"Yehao Li, Sun Yat-Sen University; Ting Yao, Microsoft Research Asia; Yingwei Pan, University of Science and Technology of China; Hongyang Chao, Sun Yat-sen University; Tao Mei, Microsoft Research Asia"3809,Poster,Learning Time/Memory-Efficient Deep Architectures with Budgeted Super Networks,"Tom Veniat, Lip6 - MLIA; Ludovic Denoyer, UPMC"3815,Spotlight,Customized Image Narrative Generation via Interactive Visual Question Generation and Answering,Andr3815,Poster,Customized Image Narrative Generation via Interactive Visual Question Generation and Answering,"Andrew Shin, The University of Tokyo; Yoshitaka Ushiku, ; Tatsuya Harada, University of 
Tokyo"3817,Spotlight,Good Appearance Features for Multi-Target Multi-Camera Tracking,"Ergys Ristani, Duke University; Carlo Tomasi, Duke University"3817,Poster,Good Appearance Features for Multi-Target Multi-Camera Tracking,"Ergys Ristani, Duke University; Carlo Tomasi, Duke University"3818,Poster,Depth-Aware Stereo Video Retargeting,"Bing Li, University of Southern Califor; Chia-Wen Lin, ; Tiejun Huang, ; Boxin Shi, Peking University; Wen Gao, ; C.-C. Jay Kuo, University of Southern California"3822,Poster,Learning from Noisy Web Data with Category-level Supervision,"Li Niu, Rice University; Qingtao Tang, ; Ashok Veeraraghavan, Rice University; Ashutosh Sabharwal,"3826,Poster,"Pixels, voxels, and views: A study of shape representations for single view 3D object shape prediction","Daeyun Shin, UC Irvine; Charless Fowlkes, University of California, Irvine, USA; Derek Hoiem,"3827,Poster,SplineCNN: Fast Geometric Deep Learning with Continuous B-Spline Kernels,"Matthias Fey, TU Dortmund; Jan Lenssen, TU Dortmund; Frank Weichert, TU Dortmund; Heinrich Mller, TU Dortmund"3838,Poster,Learning Depth from Monocular Videos using Direct Methods,"Chaoyang Wang, Carnegie Mellon University; Jose Buenaposada, Universidad Rey Juan Carlos; Rui Zhu, Carnegie Mellon University; Simon Lucey,"3855,Poster,Toward Driving Scene Understanding: A Dataset for Learning Driver Behavior and Causal Reasoning,"Vasili Ramanishka, Boston University; Yi-Ting Chen, Honda Research Institute USA; Teruhisa Misu, Honda Research Institute; Kate Saenko,"3857,Poster,Generative Adversarial Image Synthesis with Decision Tree Latent Controller,"Takuhiro Kaneko, NTT Corporation; Kaoru Hiramatsu, NTT Corporation; Kunio Kashino, NTT"3867,Poster,Cross-View Image Synthesis using Conditional Generative Adversarial Nets,"Krishna Regmi, Ucf; Ali Borji, UCF"3869,Poster,Focus Manipulation Detection via Photometric Histogram Analysis,"Can Chen, University of Delaware; Scott McCloskey, Honeywell; Jingyi Yu, 
University of Delaware, USA"3870,Poster,"Efficient, sparse representation of manifold distance matrices for classical scaling","Alexander Huth, University of Texas at Austin; Javier Turek, Intel Corporation"3871,Poster,A Robust Method for Strong Rolling Shutter Effects Correction Using Lines with Automatic Feature Selection,"Yizhen Lao, Institut Pascal; Omar Ait-Aider, Institut Pascal"3878,Poster,Learning Attribute Representations with Localization for Flexible Fashion Search,"Kenan Ak, National University of Singapo; Joo Hwee Lim, I2R, Astar; Ashraf Kassim, ; JO YEW THAM,"3882,Poster,Analysis of Hand Segmentation in the Wild,"Aisha Urooj, University of Central Florida; Ali Borji, UCF"3887,Poster,Long-Term On-Board Prediction of People in Traffic Scenes under Uncertainty,"Apratim Bhattacharyya, MPI Informatics; Bernt Schiele, MPI Informatics Germany; Mario Fritz, MPI, Saarbrucken, Germany"3890,Oral,Accurate and Diverse Sampling of Sequences based on a ``Best of Many'' Sample Objective,"Apratim Bhattacharyya, MPI Informatics; Mario Fritz, MPI, Saarbrucken, Germany; Bernt Schiele, MPI Informatics Germany"3890,Poster,Accurate and Diverse Sampling of Sequences based on a ``Best of Many'' Sample Objective,"Apratim Bhattacharyya, MPI Informatics; Mario Fritz, MPI, Saarbrucken, Germany; Bernt Schiele, MPI Informatics Germany"3903,Poster,Wrapped Gaussian Process Regression on Riemannian Manifolds,"Anton Mallasto, University of Copenhagen; Aasa Feragen, University of Copenhagen"3905,Poster,Between-class Learning for Image Classification,"Yuji Tokozume, The University of Tokyo; Yoshitaka Ushiku, ; Tatsuya Harada, University of Tokyo"3916,Spotlight,Unsupervised Person Image Synthesis in Arbitrary Poses,"Albert Pumarola, IRI (CSIC-UPC); Antonio Agudo, IRI (CSIC-UPC); Alberto Sanfeliu, IRI (CSIC-UPC); Francesc Moreno-Noguer, Institut de Robotica i Informatica Industrial (UPC/CSIC)"3916,Poster,Unsupervised Person Image Synthesis in Arbitrary Poses,"Albert Pumarola, 
IRI (CSIC-UPC); Antonio Agudo, IRI (CSIC-UPC); Alberto Sanfeliu, IRI (CSIC-UPC); Francesc Moreno-Noguer, Institut de Robotica i Informatica Industrial (UPC/CSIC)"3918,Poster,Visual Feature Attribution using Wasserstein GANs,"Christian Baumgartner, ETH Zurich; Lisa Koch, ETH Zurich; Kerem Tezcan, ETH Zurich; Jia Xi Ang, ETH Zurich; Ender Konukoglu, ETH Zurich"3923,Poster,ROAD: Reality Oriented Adaptation for Semantic Segmentation of Urban Scenes,"Yuhua Chen, CVL@ETHZ; Wen Li, ETH; Luc Van Gool, KTH"3941,Poster,Im2Struct: Recovering 3D Shape Structure from a Single RGB Image,"Chengjie Niu, National University of Defense Technology; Jun Li, ; Kai Xu, NUDT & Princeton Univeristy"3966,Oral,MapNet: An Allocentric Spatial Memory for Mapping Environments,"Joao Henriques, ; Andrea Vedaldi, U Oxford"3966,Poster,MapNet: An Allocentric Spatial Memory for Mapping Environments,"Joao Henriques, ; Andrea Vedaldi, U Oxford"3968,Oral,A Globally Optimal Solution to the Non-Minimal Relative Pose Problem,"Jesus Briales, University of Malaga; Laurent Kneip, ; Javier Gonzalez-Jimenez,"3968,Poster,A Globally Optimal Solution to the Non-Minimal Relative Pose Problem,"Jesus Briales, University of Malaga; Laurent Kneip, ; Javier Gonzalez-Jimenez,"3970,Spotlight,Robust Video Content Alignment and Compensation for Rain Removal in a CNN Framework,"Jie Chen, Nanyang Technological University; Cheen-Hau Tan, ; Junhui Hou, City University of Hong Kong; Lap-Pui Chau, Nanyang Technological University; He Li,"3970,Poster,Robust Video Content Alignment and Compensation for Rain Removal in a CNN Framework,"Jie Chen, Nanyang Technological University; Cheen-Hau Tan, ; Junhui Hou, City University of Hong Kong; Lap-Pui Chau, Nanyang Technological University; He Li,"3971,Spotlight,Unsupervised Learning and Segmentation of Complex Activities from Video,"Fadime Sener, University of Bonn; Angela Yao, University of Bonn"3971,Poster,Unsupervised Learning and Segmentation of Complex Activities from 
Video,"Fadime Sener, University of Bonn; Angela Yao, University of Bonn"3977,Spotlight,Inferring Light Fields from Shadows,"Manel Baradad, MIT; Vickie Ye, MIT; Adam Yedida, MIT; Fredo Durand, ; William Freeman, MIT/Google; Gregory Wornell, ; Antonio Torralba, MIT"3977,Poster,Inferring Light Fields from Shadows,"Manel Baradad, MIT; Vickie Ye, MIT; Adam Yedida, MIT; Fredo Durand, ; William Freeman, MIT/Google; Gregory Wornell, ; Antonio Torralba, MIT"3979,Poster,Bidirecional Retrieval Made Simple,"Jnatas Wehrmann, PUCRS; Rodrigo Barros, PUCRS"3980,Poster,A Twofold Siamese Network for Real-Time Object Tracking,"Anfeng He, USTC; Chong Luo, Microsoft Research Asia; Xinmei Tian, USTC; Wenjun Zeng,"3981,Poster,Multi-shot Pedestrian Re-identification via Sequential Decision Making,"Jianfu Zhang, Shanghai Jiaotong University; Naiyan Wang, tusimple; Liqing Zhang, Shanghai Jiaotong University"3989,Poster,Wide Compression: Tensor Ring Nets,"Wenqi Wang, Purdue University; YIfan Sun, Technicolor Research; Brian Eriksson, Adobe; Wenlin Wang, Duke University; Vaneet Aggarwal, Purdue University"3990,Spotlight,"Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions","Bichen Wu, UC Berkeley; Xiangyu Yue, UC Berkeley; Alvin Wan, UC Berkeley; Peter Jin, UC Berkeley; Sicheng Zhao, UC Berkeley; Noah Golmant, UC Berkeley; Amir Gholaminejad, UC Berkeley; Joseph Gonzalez, UC Berkeley; Kurt Keutzer, UC Berkeley"3990,Poster,"Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions","Bichen Wu, UC Berkeley; Xiangyu Yue, UC Berkeley; Alvin Wan, UC Berkeley; Peter Jin, UC Berkeley; Sicheng Zhao, UC Berkeley; Noah Golmant, UC Berkeley; Amir Gholaminejad, UC Berkeley; Joseph Gonzalez, UC Berkeley; Kurt Keutzer, UC Berkeley"3991,Poster,Improvements to context based self-supervised learning,"Terrell Mundhenk, LLNL; Daniel Ho, LLNL; Barry Chen, LLNL"3993,Poster,On the convergence of PatchMatch and its variants,"Thibaud EHRET, CMLA, ENS Cachan; Pablo Arias, 
CMLA, ENS Cachan"3995,Poster,Adversarial Feature Augmentation for Unsupervised Domain Adaptation,"Riccardo Volpi, IIT (Italy); Pietro Morerio, Istituto Italiano di Tecnologi; Silvio Savarese, ; Vittorio Murino, Istituto Italiano di Tecnologia"4001,Poster,Fast Monte-Carlo Localization on Aerial Vehicles using Approximate Continuous Belief Representations,"Aditya Dhawale, Carnegie Mellon University; Kumar Shaurya Shankar, Carnegie Mellon University; Nathan Michael, Carnegie Mellon University"4014,Poster,Learning Multi-Instance Enriched Image Representation via Non-Greedy Simultaneous L1 -Norm Minimization and Maximization,"Hua Wang, Colorado School of Mines"4023,Poster,Automatic Map Inference from Aerial Images,"Favyen Bastani, MIT CSAIL; Songtao He, MIT CSAIL; Mohammad Alizadeh, MIT CSAIL; Hari Balakrishnan, MIT CSAIL; Sam Madden, MIT CSAIL; Sanjay Chawla, Qatar Computing Research Institute; Sofiane Abbar, Qatar Computing Research Institute; David DeWitt, MIT CSAIL"4057,Spotlight,Captioning Images with Style Transfer from Unaligned Text Corpora,"Alexander Mathews, Australian National University; Xuming He, ShanghaiTech; Lexing Xie, Australian National University, Data61"4057,Poster,Captioning Images with Style Transfer from Unaligned Text Corpora,"Alexander Mathews, Australian National University; Xuming He, ShanghaiTech; Lexing Xie, Australian National University, Data61"4066,Poster,Exploiting Transitivity for Learning Person Re-identification Models on a Budget,"Sourya Roy, UC Riverside ; Sujoy Paul, UC Riverside; Neal Young, UC Riverside ; Amit Roy-Chowdhury, UC Riverside"4075,Poster,Salience Guided Depth Calibration for Perceptually Optimized Compressive Light Field 3D Display,"WENJUAN LIAO, NTU, Singapore"4083,Spotlight,Extreme 3D Face Reconstruction: Looking Past Occlusions,"Anh Tran, USC; Tal Hassner, Open Univ Israel; Iacopo Masi, USC; Grard Medioni,"4083,Poster,Extreme 3D Face Reconstruction: Looking Past Occlusions,"Anh Tran, USC; Tal Hassner, 
Open Univ Israel; Iacopo Masi, USC; Grard Medioni,"4093,Spotlight,Deflecting Adversarial Attacks with Pixel Deflection,"Aaditya Prakash, Brandeis University; Nick Moran, Bradeis University; Solomon Garber, Brandeis University; Antonella DiLillo, Brandeis University; James Storer, Brandeis University"4093,Poster,Deflecting Adversarial Attacks with Pixel Deflection,"Aaditya Prakash, Brandeis University; Nick Moran, Bradeis University; Solomon Garber, Brandeis University; Antonella DiLillo, Brandeis University; James Storer, Brandeis University"4098,Spotlight,Boosting Adversarial Attacks with Momentum,"Yinpeng Dong, Tsinghua Univeristy; Fangzhou Liao, Tsinghua University; Tianyu Pang, Tsinghua University; Hang Su, Tsinghua University; Jun Zhu, Tsinghua University; Xiaolin Hu, tsinghua; Jianguo Li, Intel Lab"4098,Poster,Boosting Adversarial Attacks with Momentum,"Yinpeng Dong, Tsinghua Univeristy; Fangzhou Liao, Tsinghua University; Tianyu Pang, Tsinghua University; Hang Su, Tsinghua University; Jun Zhu, Tsinghua University; Xiaolin Hu, tsinghua; Jianguo Li, Intel Lab"4099,Poster,A Robust Generative Framework for Generalized Zero-Shot Learning,"Vinay Verma, IIT Kanpur; Gundeep Arora, IIT Kanpur; Ashish Mishra, IIT MADRAS; Piyush Rai, IIT Kanpur"4109,Poster,3D Registration of Curves and Surfaces using Local Differential Information,"Carolina Raposo, Institute of Systems and Robot; Joao Barreto, University of Coimbra, Portugal"4110,Poster,Trust your Model: Light Field Depth Estimation with inline Occlusion Handling,"Hendrik Schilling, Universitt Heidelberg; Maximilian Diebold, Heidelberg University; Carsten Rother, University of Heidelberg; Bernd Jhne, University of Heidelberg"4112,Poster,Partially Shared Multi-Task Convolutional Neural Network with Local Constraint for Face Attribute Learning,"Jiajiong Cao, ; Yingming Li, Zhejiang University; Zhongfei Zhang,"4133,Spotlight,Decoupled Networks,"Weiyang Liu, Georgia Tech; Zhen Liu, ; Zhiding Yu, Carnegie Mellon 
University; Bo Dai, ; Yisen Wang, Tsinghua University; Thomas Breuel, ; James Rehg, Georgia Institute of Technology; Jan Kautz, NVIDIA; Le Song, Georgia Institute of Technology"4133,Poster,Decoupled Networks,"Weiyang Liu, Georgia Tech; Zhen Liu, ; Zhiding Yu, Carnegie Mellon University; Bo Dai, ; Yisen Wang, Tsinghua University; Thomas Breuel, ; James Rehg, Georgia Institute of Technology; Jan Kautz, NVIDIA; Le Song, Georgia Institute of Technology"4149,Poster,Learning Structure and Strength of CNN Filters for Small Sample Size Training,"Rohit Keshari, IIIT Delhi; Mayank Vatsa, IIIT Dehli; Richa Singh, IIT Dehli; Afzel Noore, WVU"4172,Poster,Motion Segmentation by Exploiting Complementary Geometric Models,"Xun Xu, National University of Singapore; Loong Fah Cheong, National University of Singapore; Zhuwen Li, Intel Labs"4186,Poster,Unsupervised Learning of Single View Depth Estimation and Visual Odometry with Deep Feature Reconstruction,"Huangying Zhan, The University of Adelaide; Ravi Garg, The University of Adelaide; Chamara Weerasekera, The University of Adelaide; Kejie Li, The University of Adelaide; Harsh Agarwal, Indian Institute of Technology (BHU); Ian Reid,"4208,Poster,GAGAN: Geometry Aware Generative Adverserial Networks,"Jean Kossaifi, Imperial College London; Linh Tran, Imperial College London; Yannis Panagakis, ; Maja Pantic, Imperial College London, UK"4217,Spotlight,Net2Vec: Quantifying and Explaining how Concepts are Encoded by Filters in Deep Neural Networks,"Ruth Fong, University of Oxford; Andrea Vedaldi, U Oxford"4217,Poster,Net2Vec: Quantifying and Explaining how Concepts are Encoded by Filters in Deep Neural Networks,"Ruth Fong, University of Oxford; Andrea Vedaldi, U Oxford"4227,Poster,SYQ: Learning Symmetric Quantization For Efficient Deep Neural Networks,"Julian Faraone, University of Sydney; Nicholas Fraser, Xilinx; Michaela Blott, Xilinx; Philip Leong,"4238,Poster,Learning Representations for Single Cells in Microscopy 
Images,"Juan Caicedo, Broad Institute of Harvard and; Claire Mcquin, Broad Institute of Harvard and MIT; Allen Goodman, Broad Institute of Harvard and MIT; Shantanu Singh, Broad Institute of Harvard and MIT; Anne Carpenter, Broad Institute of Harvard and MIT"4239,Poster,"Estimation of Camera Locations in Highly Corrupted Scenarios: All About the Base, No Shape Trouble","Yunpeng Shi, University of Minnesota; Gilad Lerman, University of Minnesota"4244,Poster,Deep Spatial Feature Reconstruction for Partial Person Re-identification,"Lingxiao He, Institute of AutomationChines; Jian Liang, CASIA; Haiqing Li, ; Zhenan Sun, CRIPAC"4249,Poster,Cross-Dataset Adaptation for Visual Question Answering,"Wei-Lun Chao, USC; Hexiang Hu, ; Fei Sha, University of Southern California"4254,Poster,Eye In-Painting with Exemplar Generative Adversarial Networks,"Brian Dolhansky, Facebook; Cristian Canton Ferrer, Facebook"4255,Poster,Learning Visual Knowledge Memory Networks for Visual Question Answering,"Zhou Su, ; Jianguo Li, Intel Lab; Zhiqiang Shen, Fudan University; Yurong Chen,"4265,Poster,Compassionately Conservative Balanced Cuts for Image Segmentation,"Nathan Cahill, Rochester Institute of Technol; Tyler Hayes, Rochester Institute of Tech; Renee Meinhold, Rochester Institute of Technology; John Hamilton, RIT"4272,Poster,Neural Motifs: Scene Graph Parsing with Global Context,"Rowan Zellers, University of Washington; Mark Yatskar, University of Washington; Samuel Thomson, Carnegie Mellon University; Yejin Choi, University of Washington"4286,Poster,Alternating-Stereo VINS: Observability Analysis and Performance Evaluation,"Mrinal Kanti Paul, Google; Stergios Roumeliotis, Google"4296,Poster,clcNet: Improving the Efficiency of Convolutional Neural Network using Channel Local Convolutions,"Dongqing Zhang, ImaginationAI LLC"4304,Spotlight,Unsupervised Sparse Dirichlet-Net for Hyperspectral Image Super-Resolution,"Ying Qu, The University of Tennessee; Hairong Qi, University of 
Tennessee; Chiman Kwan,"4304,Poster,Unsupervised Sparse Dirichlet-Net for Hyperspectral Image Super-Resolution,"Ying Qu, The University of Tennessee; Hairong Qi, University of Tennessee; Chiman Kwan,"4307,Oral,A Volumetric Descriptive Network for 3D Object Synthesis,"Jianwen Xie, UCLA; Zilong Zheng, ucla"4307,Poster,A Volumetric Descriptive Network for 3D Object Synthesis,"Jianwen Xie, UCLA; Zilong Zheng, ucla"4335,Poster,Anatomical Priors in Convolutional Networks for Unsupervised Biomedical Segmentation,"Adrian Dalca, ; John Guttag, ; Mert Sabuncu, Cornell"NaN,Oral,Learning Face Age Progression: A Pyramid Architecture of GANs,"Hongyu Yang, BEIHANG UNIVERSITY; Di Huang, ; Yunhong Wang, ; Anil Jain, MSU"


================================================
FILE: 2018-Paper.md
================================================
[2018-12-31](2018/12/31.md): 16 papers, covering CNNs, semantic segmentation, GANs, 3D, and salient object detection.

[2018-12-24~12-28](2018/12/24-28.md): Covers CNNs, object detection, object tracking, GANs, pose estimation, SLAM, super-resolution, and zero-shot learning.

[2018-12-17~12-21](2018/12/17-21.md): Covers CNNs, GANs, pose estimation, and meta-learning.

[2018-12-10](2018/12/10.md): 12 papers, covering image classification, object detection, image segmentation, GANs, and 3D reconstruction.

[2018-11-20](2018/11/20.md): 20 papers, covering CNNs, face, image classification, object detection, image segmentation, GANs, Re-ID, SLAM, and transfer learning.

[2018-11-19](2018/11/19.md): 12 papers, covering CNNs, face, 3D, OCR, GANs, and object detection.

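[2018-11-05~11-09](2018/11/05-09.md): 43 papers, covering CNNs, image classification, data augmentation, face, image segmentation, OCR, GANs, style transfer, object tracking, datasets, and pose estimation.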

[2018-10-17](2018/10/17.md): 2 papers, both from ECCV 2018 and both on semantic segmentation. One proposes the Bilateral Segmentation Network (BiSeNet), which reaches real-time inference speed without sacrificing spatial resolution; the other proposes the UDA and CBST frameworks and introduces spatial priors to refine the generated labels.

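[2018-10-12](2018/10/12.md): 2 papers, both from ECCV 2018. One proposes IoU-Net, which learns to predict the IoU between each detected bounding box and its matched ground truth; the other proposes DetNet, a new backbone network designed specifically for object detection.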

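[2018-08-25](2018/08/25.md): 2 papers, both from ECCV 2018. One proposes a new weakly supervised and semi-supervised framework for semantic segmentation with an unlimited number of labels; the other uses a stereo matching network as a proxy to learn depth from synthetic data, then uses the predicted stereo disparity maps to supervise a monocular depth estimation network.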

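[2018-08-15](2018/08/15.md): 2 papers, both from ECCV 2018. One proposes a novel Motion Transformation Variational Auto-Encoder (MT-VAE) for learning motion sequence generation; the other uses FiLM to condition the computation of an image-based convolutional network on language, addressing visual reasoning.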

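[2018-08-11](2018/08/11.md): 2 papers, both from ECCV 2018. One proposes a new network based on disentangled representations for image-to-image translation; the other proposes new SPG masks, which effectively generate high-quality object localization maps.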

[2018-08-07](2018/08/07.md): 2 papers, both from ECCV 2018. One proposes a new mesh-autoencoding convolutional neural network for generating 3D faces; the other proposes the new RFNet for image captioning.

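[2018-08-03](2018/08/03.md): 2 papers, both from ECCV 2018. One proposes a new CNN-based density estimation method for crowd counting in images; the other proposes StereoNet, an end-to-end deep architecture for real-time stereo matching that achieves depth prediction with sub-pixel matching precision.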

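[2018-07-31](2018/07/31.md): 2 papers, both from ECCV 2018. One proposes semi-convolutional operators and other innovations to improve Mask R-CNN; the other proposes CrossNet, an end-to-end fully convolutional deep neural network that uses cross-scale warping for super-resolution.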

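[2018-07-27](2018/07/27.md): 2 papers, both from ECCV 2018. One models the visual context around objects to augment object detection datasets; the other proposes a comprehensive Bayesian model that coherently reasons about the observed images, identities, partial knowledge of names, and the situational context of each observation.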

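[2018-07-23](2018/07/23.md): 2 papers, both from ECCV 2018. One proposes the convolutional block attention module, which can be integrated seamlessly into any CNN architecture; the other uses GAN techniques for multi-view 3D reconstruction.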

[2018-07-19](2018/07/19.md): 2 papers, both from ECCV 2018, one on semantic segmentation and the other on depth prediction.

[2018-07-07](2018/07/07.md): 2 papers, both on image segmentation. One proposes the CCB-Cut loss; the other improves the FCN network. Note that both are CVPR 2018 papers.

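[2018-07-06](2018/07/06.md): 2 papers, both on object detection. One is RefineNet, which combines the SSD algorithm, the RPN network, and the FPN algorithm; the other is DES, an improvement built on the SSD network. Note that both are CVPR 2018 papers.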

[2018-07-05](2018/07/05.md): 4 papers, all on GANs, including text-to-image generation and multi-domain image generation. One of them is from IJCAI 2018.

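[2018-07-02](2018/07/02.md): 2 papers, both on image segmentation: semantic segmentation of motion capture images, and sclera segmentation that combines FCN and GAN. One is from ACM SIGGRAPH 2018 and the other from BTAS 2018.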

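[2018-06-29](2018/06/29.md): 4 papers, all on faces, including face recognition, facial expression recognition, facial emotion classification, and facial attribute prediction. One of them is from a CVPR 2018 workshop.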

[2018-06-23](2018/06/23.md): 4 papers, all from CVPR 2018, covering zero-shot learning, image synthesis, and image translation.

[2018-06-19](2018/06/19.md): 4 papers, all on object detection, including pedestrian detection, vehicle detection, fingerprint detection, and object tracking.

[2018-06-15](2018/06/15.md): 4 papers, all on faces, including face recognition, face detection, and facial expression recognition. One of them is from CVPR 2018.

[2018-06-13](2018/06/13.md): 4 papers, all on image segmentation; 3 of them improve on the U-Net network.

[2018-06-11](2018/06/11.md): 4 papers, covering CNN pruning, a new face recognition dataset, forest tree classification, and traffic sign detection.

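[2018-06-08](2018/06/08.md): 4 papers, covering capsule networks, transfer learning, CNN optimization, and finger detection (including one NIPS 2017 paper, one ICMR 2018 paper, and one VCIP 2017 paper).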

[2018-06-06](2018/06/06.md): 4 papers, covering object tracking, GANs, zero-shot learning, video classification, and person re-identification (including one IJCAI 2018 paper and one IROS 2018 submission).

[2018-05-29](2018/05/29.md): 4 papers, covering image classification, video classification, and semantic segmentation (including one ICLR 2018 paper and one CVPR 2018 paper).

[2018-05-24](2018/05/24.md): 4 papers, covering liveness detection, SfM, disparity estimation, zero-shot learning, and 3D shape.

[2018-05-22](2018/05/22.md): 4 papers, covering image segmentation, video segmentation, object tracking, and anomaly detection.

[2018-05-19](2018/05/19.md): 4 papers, covering face recognition (a survey), face detection, 3D object detection, pose estimation, and object detection (including 2 CVPR 2018 papers).

[2018-05-16](2018/05/16.md): 4 papers, covering monocular depth estimation, 6-DoF tracking, image synthesis, and motion capture (including 1 CVPR 2018 paper and 1 ICRA 2018 paper).

================================================
FILE: 2019/01/01-04.md
================================================
[Computer Vision Paper Digest] 2019-01-01~01-04

- [x] 2019-01-01
- [ ] 2019-01-02
- [x] 2019-01-03
- [x] 2019-01-04

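This post shares 52 papers in total, covering face recognition, image classification, object detection, semantic segmentation, GANs, pose estimation, style transfer, and text detection.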

[TOC]

# Face

**《Support Vector Guided Softmax Loss for Face Recognition》**

arXiv：https://arxiv.org/abs/1812.11317

> Face recognition has witnessed significant progresses due to the advances of deep convolutional neural networks (CNNs), the central challenge of which, is feature discrimination. To address it, one group tries to exploit mining-based strategies (e.g., hard example mining and focal loss) to focus on the informative examples. The other group devotes to designing margin-based loss functions (e.g., angular, additive and additive angular margins) to increase the feature margin from the perspective of ground truth class. Both of them have been well-verified to learn discriminative features. However, they suffer from either the ambiguity of hard examples or the lack of discriminative power of other classes. In this paper, we design a novel loss function, namely support vector guided softmax loss (SV-Softmax), which adaptively emphasizes the mis-classified points (support vectors) to guide the discriminative features learning. So the developed SV-Softmax loss is able to eliminate the ambiguity of hard examples as well as absorb the discriminative power of other classes, and thus results in more discriminative features. To the best of our knowledge, this is the first attempt to inherit the advantages of mining-based and margin-based losses into one framework. Experimental results on several benchmarks have demonstrated the effectiveness of our approach over state-of-the-arts.

**《Improving Face Anti-Spoofing by 3D Virtual Synthesis》**

arXiv：https://arxiv.org/abs/1901.00488

> Face anti-spoofing is crucial for the security of face recognition systems. Learning based methods especially deep learning based methods need large-scale training samples to reduce overfitting. However, acquiring spoof data is very expensive since the live faces should be re-printed and re-captured in many views. In this paper, we present a method to synthesize virtual spoof data in 3D space to alleviate this problem. Specifically, we consider a printed photo as a flat surface and mesh it into a 3D object, which is then randomly bent and rotated in 3D space. Afterward, the transformed 3D photo is rendered through perspective projection as a virtual sample. The synthetic virtual samples can significantly boost the anti-spoofing performance when combined with a proposed data balancing strategy. Our promising results open up new possibilities for advancing face anti-spoofing using cheap and large-scale synthetic data.
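
The pipeline described in this abstract (treat the photo as a flat surface, bend it, rotate it, then render it by perspective projection) can be illustrated for a single point. The following is only a rough sketch, not the paper's implementation: the cylindrical bending model, the function name `bend_rotate_project`, and every parameter are assumptions made for illustration.

```python
import math

def bend_rotate_project(x, y, radius=2.0, yaw=0.0, focal=1.0, depth=3.0):
    """Map a point (x, y) on a flat photo to image coordinates.

    Assumed steps (hypothetical, for illustration only):
    1. bend: wrap the photo around a vertical cylinder of the given radius
    2. rotate: turn the bent surface about the vertical (y) axis by `yaw`
    3. project: pinhole projection with the camera `depth` units away
    """
    # 1. Bend the flat sheet onto a cylinder (x becomes arc length).
    theta = x / radius
    bx = radius * math.sin(theta)
    bz = radius * (1.0 - math.cos(theta))
    # 2. Rotate about the y axis.
    rx = bx * math.cos(yaw) + bz * math.sin(yaw)
    rz = -bx * math.sin(yaw) + bz * math.cos(yaw)
    # 3. Perspective projection.
    z = rz + depth
    return (focal * rx / z, focal * y / z)

center = bend_rotate_project(0.0, 0.0)
left = bend_rotate_project(-0.5, 0.2)
right = bend_rotate_project(0.5, 0.2)
```

With `yaw=0.0` the bent sheet stays symmetric about its vertical center line, so `left` and `right` mirror each other horizontally. Sampling random values of `radius` and `yaw` would give different virtual views of the same photo.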

**《Face Recognition: A Novel Multi-Level Taxonomy based Survey》**

arXiv：https://arxiv.org/abs/1901.00713

> In a world where security issues have been gaining growing importance, face recognition systems have attracted increasing attention in multiple application areas, ranging from forensics and surveillance to commerce and entertainment. To help understanding the landscape and abstraction levels relevant for face recognition systems, face recognition taxonomies allow a deeper dissection and comparison of the existing solutions. This paper proposes a new, more encompassing and richer multi-level face recognition taxonomy, facilitating the organization and categorization of available and emerging face recognition solutions; this taxonomy may also guide researchers in the development of more efficient face recognition solutions. The proposed multi-level taxonomy considers levels related to the face structure, feature support and feature extraction approach. Following the proposed taxonomy, a comprehensive survey of representative face recognition solutions is presented. The paper concludes with a discussion on current algorithmic and application related challenges which may define future research directions for face recognition.

# Image Classification

**《Multi-class Classification without Multi-class Labels》**

ICLR 2019

arXiv：https://arxiv.org/abs/1901.00544

> This work presents a new strategy for multi-class classification that requires no class-specific labels, but instead leverages pairwise similarity between examples, which is a weaker form of annotation. The proposed method, meta classification learning, optimizes a binary classifier for pairwise similarity prediction and through this process learns a multi-class classifier as a submodule. We formulate this approach, present a probabilistic graphical model for it, and derive a surprisingly simple loss function that can be used to learn neural network-based models. We then demonstrate that this same framework generalizes to the supervised, unsupervised cross-task, and semi-supervised settings. Our method is evaluated against state of the art in all three learning paradigms and shows a superior or comparable accuracy, providing evidence that learning multi-class classification without multi-class labels is a viable learning option.
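
The abstract describes optimizing a binary pairwise-similarity classifier that contains the multi-class classifier as a submodule. A minimal sketch of one way to form such a loss follows. It assumes the similarity prediction is the inner product of the two class-probability vectors and that it is scored with binary cross-entropy; the function names are illustrative and not taken from the authors' code.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution over classes."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pairwise_similarity_loss(logits_a, logits_b, same_class, eps=1e-7):
    """Binary cross-entropy on the predicted chance that two examples share a class.

    Assumption of this sketch: the similarity prediction is the inner product
    of the two class-probability vectors. `same_class` is the weak pairwise
    label; no class-specific label is used anywhere.
    """
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    s_hat = sum(x * y for x, y in zip(p_a, p_b))
    s_hat = min(max(s_hat, eps), 1.0 - eps)  # keep log() finite
    if same_class:
        return -math.log(s_hat)
    return -math.log(1.0 - s_hat)

# Two examples the classifier assigns to the same class vs. to different classes.
agree = ([5.0, 0.0, 0.0], [4.0, 0.0, 0.0])
disagree = ([5.0, 0.0, 0.0], [0.0, 4.0, 0.0])
loss_similar_agree = pairwise_similarity_loss(*agree, same_class=True)
loss_similar_disagree = pairwise_similarity_loss(*disagree, same_class=True)
```

For a pair labeled similar, the loss is small when both examples are assigned to the same class and large otherwise, so the multi-class predictions are shaped only by pairwise labels.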


# Object Detection

**《Large-Scale Object Detection of Images from Network Cameras in Variable Ambient Lighting Conditions》**

MIPR 2019

arXiv：https://arxiv.org/abs/1812.11901

> Computer vision relies on labeled datasets for training and evaluation in detecting and recognizing objects. The popular computer vision program, YOLO ("You Only Look Once"), has been shown to accurately detect objects in many major image datasets. However, the images found in those datasets are independent of one another and cannot be used to test YOLO's consistency at detecting the same object as its environment (e.g. ambient lighting) changes. This paper describes a novel effort to evaluate YOLO's consistency for large-scale applications. It does so by working (a) at large scale and (b) by using consecutive images from a curated network of public video cameras deployed in a variety of real-world situations, including traffic intersections, national parks, shopping malls, university campuses, etc. We specifically examine YOLO's ability to detect objects in different scenarios (e.g., daytime vs. night), leveraging the cameras' ability to rapidly retrieve many successive images for evaluating detection consistency. Using our camera network and advanced computing resources (supercomputers), we analyzed more than 5 million images captured by 140 network cameras in 24 hours. Compared with labels marked by humans (considered as "ground truth"), YOLO struggles to consistently detect the same humans and cars as their positions change from one frame to the next; it also struggles to detect objects at night time. Our findings suggest that state-of-the-art vision solutions should be trained by data from network camera with contextual information before they can be deployed in applications that demand high consistency on object detection.

# Image Segmentation

**《Impact of Ground Truth Annotation Quality on Performance of Semantic Image Segmentation of Traffic Conditions》**

arXiv：https://arxiv.org/abs/1901.00001

> Preparation of high-quality datasets for the urban scene understanding is a labor-intensive task, especially, for datasets designed for the autonomous driving applications. The application of the coarse ground truth (GT) annotations of these datasets without detriment to the accuracy of semantic image segmentation (by the mean intersection over union - mIoU) could simplify and speedup the dataset preparation and model fine tuning before its practical application. Here the results of the comparative analysis for semantic segmentation accuracy obtained by PSPNet deep learning architecture are presented for fine and coarse annotated images from Cityscapes dataset. Two scenarios were investigated: scenario 1 - the fine GT images for training and prediction, and scenario 2 - the fine GT images for training and the coarse GT images for prediction. The obtained results demonstrated that for the most important classes the mean accuracy values of semantic image segmentation for coarse GT annotations are higher than for the fine GT ones, and the standard deviation values are vice versa. It means that for some applications some unimportant classes can be excluded and the model can be tuned further for some classes and specific regions on the coarse GT dataset without loss of the accuracy even. Moreover, this opens the perspectives to use deep neural networks for the preparation of such coarse GT datasets.
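
The abstract measures segmentation accuracy by mean intersection over union (mIoU). As a reminder of how that metric is computed, here is a small sketch on flattened label maps. The function name and the choice to skip classes absent from both maps are assumptions of this sketch, not details from the paper.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union across the classes that occur.

    pred, target: equal-length sequences of integer class labels, one per pixel.
    Classes absent from both maps are skipped (a convention assumed here).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3 and 1/2, so `score` is 0.5.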

**《Flow Based Self-supervised Pixel Embedding for Image Segmentation》**

arXiv：https://arxiv.org/abs/1901.00520

> We propose a new self-supervised approach to image feature learning from motion cue. This new approach leverages recent advances in deep learning in two directions: 1) the success of training deep neural network in estimating optical flow in real data using synthetic flow data; and 2) emerging work in learning image features from motion cues, such as optical flow. Building on these, we demonstrate that image features can be learned in self-supervision by first training an optical flow estimator with synthetic flow data, and then learning image features from the estimated flows in real motion data. We demonstrate and evaluate this approach on an image segmentation task. Using the learned image feature representation, the network performs significantly better than the ones trained from scratch in few-shot segmentation tasks.


# Visual Tracking

**《SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks》**

arXiv：https://arxiv.org/abs/1812.11703

> Siamese network based trackers formulate tracking as convolutional feature cross-correlation between target template and searching region. However, Siamese trackers still have accuracy gap compared with state-of-the-art algorithms and they cannot take advantage of feature from deep networks, such as ResNet-50 or deeper. In this work we prove the core reason comes from the lack of strict translation invariance. By comprehensive theoretical analysis and experimental validations, we break this restriction through a simple yet effective spatial aware sampling strategy and successfully train a ResNet-driven Siamese tracker with significant performance gain. Moreover, we propose a new model architecture to perform depth-wise and layer-wise aggregations, which not only further improves the accuracy but also reduces the model size. We conduct extensive ablation studies to demonstrate the effectiveness of the proposed tracker, which obtains currently the best results on four large tracking benchmarks, including OTB2015, VOT2018, UAV123, and LaSOT. Our model will be released to facilitate further studies based on this problem.
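
The abstract frames tracking as cross-correlation between template and search-region features, and mentions depth-wise aggregation. The sketch below shows a plain depth-wise cross-correlation on nested lists, where each template channel is slid over the matching search channel. It is a didactic illustration under that reading of the abstract, not the released model, and `depthwise_xcorr` is a name chosen here.

```python
def depthwise_xcorr(search, template):
    """Correlate each template channel with the matching search channel.

    search:   list of C channels, each an Hs x Ws grid (list of rows)
    template: list of C channels, each an Ht x Wt grid with Ht <= Hs, Wt <= Ws
    Returns C response maps of size (Hs - Ht + 1) x (Ws - Wt + 1).
    """
    responses = []
    for s_ch, t_ch in zip(search, template):
        hs, ws = len(s_ch), len(s_ch[0])
        ht, wt = len(t_ch), len(t_ch[0])
        out = []
        for i in range(hs - ht + 1):
            row = []
            for j in range(ws - wt + 1):
                # Dot product between the template and the window at (i, j).
                row.append(sum(s_ch[i + di][j + dj] * t_ch[di][dj]
                               for di in range(ht) for dj in range(wt)))
            out.append(row)
        responses.append(out)
    return responses

# One channel: the template pattern sits in the lower-right corner of the search grid.
search = [[[0, 0, 0],
           [0, 1, 2],
           [0, 3, 4]]]
template = [[[1, 2],
             [3, 4]]]
resp = depthwise_xcorr(search, template)
```

The response peaks where the search window matches the template. Keeping one response map per channel, rather than summing over channels, is what makes the operation depth-wise.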

# GAN

**《Generating Multiple Objects at Spatially Distinct Locations》**

ICLR 2019
 
arXiv：https://arxiv.org/abs/1901.00686

> Recent improvements to Generative Adversarial Networks (GANs) have made it possible to generate realistic images in high resolution based on natural language descriptions such as image captions. Furthermore, conditional GANs allow us to control the image generation process through labels or even natural language descriptions. However, fine-grained control of the image layout, i.e. where in the image specific objects should be located, is still difficult to achieve. This is especially true for images that should contain multiple distinct objects at different spatial locations. We introduce a new approach which allows us to control the location of arbitrarily many objects within an image by adding an object pathway to both the generator and the discriminator. Our approach does not need a detailed semantic layout but only bounding boxes and the respective labels of the desired objects are needed. The object pathway focuses solely on the individual objects and is iteratively applied at the locations specified by the bounding boxes. The global pathway focuses on the image background and the general image layout. We perform experiments on the Multi-MNIST, CLEVR, and the more complex MS-COCO data set. Our experiments show that through the use of the object pathway we can control object locations within images and can model complex scenes with multiple objects at various locations. We further show that the object pathway focuses on the individual objects and learns features relevant for these, while the global pathway focuses on global image characteristics and the image background.


# 3D

**《Fast and Globally Optimal Rigid Registration of 3D Point Sets by Transformation Decomposition》**

arXiv：https://arxiv.org/abs/1812.11307

> The rigid registration of two 3D point sets is a fundamental problem in computer vision. The current trend is to solve this problem globally using the BnB optimization framework. However, the existing global methods are slow for two main reasons: the computational complexity of BnB is exponential to the problem dimensionality (which is six for 3D rigid registration), and the bound evaluation used in BnB is inefficient. In this paper, we propose two techniques to address these problems. First, we introduce the idea of translation invariant vectors, which allows us to decompose the search of a 6D rigid transformation into a search of 3D rotation followed by a search of 3D translation, each of which is solved by a separate BnB algorithm. This transformation decomposition reduces the problem dimensionality of BnB algorithms and substantially improves its efficiency. Then, we propose a new data structure, named 3D Integral Volume, to accelerate the bound evaluation in both BnB algorithms. By combining these two techniques, we implement an efficient algorithm for rigid registration of 3D point sets. Extensive experiments on both synthetic and real data show that the proposed algorithm is three orders of magnitude faster than the existing state-of-the-art global methods.
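
The decomposition in this abstract relies on vectors that do not change when a point set is translated, so rotation can be searched before translation. One simple family with that property is the set of pairwise differences between points. The sketch below checks the invariance numerically; taking pairwise differences as the construction is an assumption of this sketch, not a claim about the paper's exact definition.

```python
def pairwise_differences(points):
    """Difference vectors p_i - p_j for all i < j; unaffected by translation."""
    return [tuple(a - b for a, b in zip(p, q))
            for i, p in enumerate(points) for q in points[i + 1:]]

def translate(points, t):
    """Shift every point by the vector t."""
    return [tuple(a + b for a, b in zip(p, t)) for p in points]

points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
moved = translate(points, (5.0, -3.0, 2.0))

original_vectors = pairwise_differences(points)
moved_vectors = pairwise_differences(moved)
```

Since `(p_i + t) - (p_j + t) = p_i - p_j`, both lists are identical, which means a rotation estimated from these vectors is independent of the unknown translation.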

**《Skeleton Transformer Networks: 3D Human Pose and Skinned Mesh from Single RGB Image》**

ACCV

arXiv：https://arxiv.org/abs/1812.11328

> In this paper, we present Skeleton Transformer Networks (SkeletonNet), an end-to-end framework that can predict not only 3D joint positions but also 3D angular pose (bone rotations) of a human skeleton from a single color image. This in turn allows us to generate skinned mesh animations. Here, we propose a two-step regression approach. The first step regresses bone rotations in order to obtain an initial solution by considering skeleton structure. The second step performs refinement based on heatmap regressor using a 3D pose representation called cross heatmap which stacks heatmaps of xy and zy coordinates. By training the network using the proposed 3D human pose dataset that is comprised of images annotated with 3D skeletal angular poses, we showed that SkeletonNet can predict a full 3D human pose (joint positions and bone rotations) from a single image in-the-wild.

**《Learning Generalizable Physical Dynamics of 3D Rigid Objects》**

arXiv：https://arxiv.org/abs/1901.00466

> Humans have a remarkable ability to predict the effect of physical interactions on the dynamics of objects. Endowing machines with this ability would allow important applications in areas like robotics and autonomous vehicles. In this work, we focus on predicting the dynamics of 3D rigid objects, in particular an object's final resting position and total rotation when subjected to an impulsive force. Different from previous work, our approach is capable of generalizing to unseen object shapes - an important requirement for real-world applications. To achieve this, we represent object shape as a 3D point cloud that is used as input to a neural network, making our approach agnostic to appearance variation. The design of our network is informed by an understanding of physical laws. We train our model with data from a physics engine that simulates the dynamics of a large number of shapes. Experiments show that we can accurately predict the resting position and total rotation for unseen object geometries.

**《GeoNet: Deep Geodesic Networks for Point Cloud Analysis》**

arXiv：https://arxiv.org/abs/1901.00680

> Surface-based geodesic topology provides strong cues for object semantic analysis and geometric modeling. However, such connectivity information is lost in point clouds. Thus we introduce GeoNet, the first deep learning architecture trained to model the intrinsic structure of surfaces represented as point clouds. To demonstrate the applicability of learned geodesic-aware representations, we propose fusion schemes which use GeoNet in conjunction with other baseline or backbone networks, such as PU-Net and PointNet++, for down-stream point cloud analysis. Our method improves the state-of-the-art on multiple representative tasks that can benefit from understandings of the underlying surface topology, including point upsampling, normal estimation, mesh reconstruction and non-rigid shape classification.
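
The gap between Euclidean and geodesic distance that motivates the abstract above can be seen on a toy point set. The sketch below is not GeoNet: it is the classical approximation of geodesic distance by shortest paths on a k-nearest-neighbour graph, and the helper names (`knn_graph`, `geodesic_distances`) and the U-shaped point set are assumptions made for illustration.

```python
import heapq
import math

def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph with Euclidean edge weights."""
    graph = {i: [] for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)[:k]
        for d, j in nearest:
            graph[i].append((j, d))
            graph[j].append((i, d))  # add the reverse edge to keep the graph symmetric
    return graph

def geodesic_distances(graph, source):
    """Dijkstra shortest-path lengths from `source` to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Points sampled along a U-shaped curve: down the left arm, across, up the right arm.
points = [(0, 3), (0, 2), (0, 1), (0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
graph = knn_graph(points, k=2)
geodesic = geodesic_distances(graph, source=0)[9]
euclidean = math.dist(points[0], points[9])
print(euclidean, geodesic)
```

The two tips of the U are close in Euclidean terms but far apart along the curve, which is the kind of connectivity information a raw point cloud does not expose directly.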


# Text Recognition

**《Detecting Text in the Wild with Deep Character Embedding Network》**

ACCV 2018

arXiv：https://arxiv.org/abs/1901.00363

> Most text detection methods hypothesize texts are horizontal or multi-oriented and thus define quadrangles as the basic detection unit. However, text in the wild is usually perspectively distorted or curved, which can not be easily tackled by existing approaches. In this paper, we propose a deep character embedding network (CENet) which simultaneously predicts the bounding boxes of characters and their embedding vectors, thus making text detection a simple clustering task in the character embedding space. The proposed method does not require strong assumptions of forming a straight line on general text detection, which provides flexibility on arbitrarily curved or perspectively distorted text. For character detection task, a dense prediction subnetwork is designed to obtain the confidence score and bounding boxes of characters. For character embedding task, a subnet is trained with contrastive loss to project detected characters into embedding space. The two tasks share a backbone CNN from which the multi-scale feature maps are extracted. The final text regions can be easily achieved by a thresholding process on character confidence and embedding distance of character pairs. We evaluated our method on ICDAR13, ICDAR15, MSRA-TD500, and Total-Text. The proposed method achieves state-of-the-art or comparable performance on all these datasets, and shows substantial improvement in the irregular-text datasets, i.e. Total-Text.
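
The grouping step described above, where characters become text regions by thresholding pairwise embedding distance, can be sketched as a small union-find clustering. This is a minimal illustration rather than the CENet code: the function name, the 2-D toy embeddings and the threshold value are assumptions, and low-confidence character detections are assumed to have been filtered out beforehand.

```python
import math

def cluster_characters(embeddings, distance_threshold):
    """Group characters whose embedding distance is below the threshold."""
    parent = list(range(len(embeddings)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if math.dist(embeddings[i], embeddings[j]) < distance_threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(len(embeddings)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy embeddings: characters 0-2 belong to one word, 3-4 to another.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0)]
words = cluster_characters(embeddings, distance_threshold=0.5)
print(words)  # [[0, 1, 2], [3, 4]]
```

Because grouping depends only on embedding distance, nothing in this step assumes the characters lie on a straight line.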


**《Accurate, Data-Efficient, Unconstrained Text Recognition with Convolutional Neural Networks》**

arXiv：https://arxiv.org/abs/1812.11894

> Unconstrained text recognition is an important computer vision task, featuring a wide variety of different sub-tasks, each with its own set of challenges. One of the biggest promises of deep neural networks has been the convergence and automation of feature extractors from input raw signals, allowing for the highest possible performance with minimum required domain knowledge. To this end, we propose a data-efficient, end-to-end neural network model for generic, unconstrained text recognition. In our proposed architecture we strive for simplicity and efficiency without sacrificing recognition accuracy. Our proposed architecture is a fully convolutional network without any recurrent connections trained with the CTC loss function. Thus it operates on arbitrary input sizes and produces strings of arbitrary length in a very efficient and parallelizable manner. We show the generality and superiority of our proposed text recognition architecture by achieving state of the art results on seven public benchmark datasets, covering a wide spectrum of text recognition tasks, namely: Handwriting Recognition, CAPTCHA recognition, OCR, License Plate Recognition, and Scene Text Recognition. Our proposed architecture has won the ICFHR2018 Competition on Automated Text Recognition on a READ Dataset.
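
A network trained with the CTC loss emits one label distribution per time step, and a string is recovered by collapsing that sequence. The sketch below shows standard greedy (best-path) CTC decoding on hand-written toy scores; it is not the authors' code, and the blank index and alphabet layout are assumptions of this example.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol

def ctc_greedy_decode(frame_scores, alphabet):
    """Best-path decoding: argmax per frame, merge repeats, drop blanks.

    Class 0 is the blank; class k (k >= 1) maps to alphabet[k - 1].
    """
    best_path = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    collapsed = [label for label, _ in groupby(best_path)]
    return "".join(alphabet[label - 1] for label in collapsed if label != BLANK)

# Toy scores over 6 frames for the classes [blank, 'a', 'b'].
scores = [
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.7, 0.2],    # a (repeat, merged with the previous frame)
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.7, 0.1],    # a (kept, because a blank separates it)
    [0.1, 0.1, 0.8],    # b
    [0.6, 0.2, 0.2],    # blank
]
decoded = ctc_greedy_decode(scores, "ab")
print(decoded)  # aab
```

The output length is decoupled from the number of frames, which is why such a model can handle inputs and strings of arbitrary length.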

**《A High-Performance CNN Method for Offline Handwritten Chinese Character Recognition and Visualization》**

arXiv：https://arxiv.org/abs/1812.11489

> Recent research has introduced fast, compact and efficient convolutional neural networks (CNNs) for offline handwritten Chinese character recognition (HCCR). However, many of these works did not address the problem of network interpretability. We propose a new architecture of a deep CNN with a high recognition performance which is capable of learning deep features for visualization. A special characteristic of our model is the bottleneck layers which enable us to retain its expressiveness while reducing the number of multiply-accumulate operations and the required storage. We introduce a modification of global weighted average pooling (GWAP) - global weighted output average pooling (GWOAP). This paper demonstrates how they allow us to calculate class activation maps (CAMs) in order to indicate the most relevant input character image regions used by our CNN to identify a certain class. Evaluating on the ICDAR-2013 offline HCCR competition dataset, we show that our model enables a relative 0.83% error reduction while having 49% fewer parameters and the same computational cost compared to the current state-of-the-art single-network method trained only on handwritten data. Our solution outperforms even recent residual learning approaches.

**《Lipi Gnani - A Versatile OCR for Documents in any Language Printed in Kannada Script》**

submitted to ACM Transactions

arXiv：https://arxiv.org/abs/1901.00413

> A Kannada OCR, named Lipi Gnani, has been designed and developed from scratch, with the motivation of it being able to convert printed text or poetry in Kannada script, without any restriction on vocabulary. The training and test sets have been collected from over 35 books published between 1970 and 2002, and this includes books written in Halegannada and pages containing Sanskrit slokas written in Kannada script. The coverage of the OCR is nearly complete in the sense that it recognizes all the punctuation marks, special symbols, Indo-Arabic and Kannada numerals and also the interspersed English words. Several minor and major original contributions have been made in developing this OCR at the different processing stages such as binarization, line and character segmentation, recognition and Unicode mapping. This has created a Kannada OCR that performs as well as, and in some cases better than, Google's Tesseract OCR, as shown by the results. To the knowledge of the authors, this is the maiden report of a complete Kannada OCR, handling all the issues involved. Currently, there is no dictionary based postprocessing, and the obtained results are due solely to the recognition process. Four benchmark test databases containing scanned pages from books in Kannada, Sanskrit, Konkani and Tulu languages, but all of them printed in Kannada script, have been created. The word level recognition accuracy of Lipi Gnani is 4% higher on the Kannada dataset than that of Google's Tesseract OCR, 8% higher on the datasets of Tulu and Sanskrit, and 25% higher on the Konkani dataset.

**《Handwritten Indic Character Recognition using Capsule Networks》**

ASPCON 2018

arXiv：https://arxiv.org/abs/1901.00166

> Convolutional neural networks (CNNs) have become one of the primary algorithms for various computer vision tasks. Handwritten character recognition is a typical example of such a task that has also attracted attention. CNN architectures such as LeNet and AlexNet have become very prominent over the last two decades; however, the spatial invariance of the different kernels has been a prominent issue till now. With the introduction of capsule networks, kernels can work together in consensus with one another with the help of dynamic routing, which combines individual opinions of multiple groups of kernels called capsules to employ equivariance among kernels. In the current work, we have implemented a capsule network on handwritten Indic digit and character datasets to show its superiority over networks like LeNet. Furthermore, it has also been shown that they can boost the performance of other networks like LeNet and AlexNet.


# Super-Resolution

**《Image Super-Resolution via RL-CSC: When Residual Learning Meets Convolutional Sparse Coding》**

arXiv：https://arxiv.org/abs/1812.11950

github：https://github.com/axzml/RL-CSC

> We propose a simple yet effective model for Single Image Super-Resolution (SISR), by combining the merits of Residual Learning and Convolutional Sparse Coding (RL-CSC). Our model is inspired by the Learned Iterative Shrinkage-Threshold Algorithm (LISTA). We extend LISTA to its convolutional version and build the main part of our model by strictly following the convolutional form, which improves the network's interpretability. Specifically, the convolutional sparse codings of input feature maps are learned in a recursive manner, and high-frequency information can be recovered from these CSCs. More importantly, residual learning is applied to alleviate the training difficulty when the network goes deeper. Extensive experiments on benchmark datasets demonstrate the effectiveness of our method. RL-CSC (30 layers) outperforms several recent state-of-the-arts, e.g., DRRN (52 layers) and MemNet (80 layers) in both accuracy and visual qualities. Codes and more results are available at https://github.com/axzml/RL-CSC.
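
LISTA, which the abstract above cites as its inspiration, unrolls the classical Iterative Shrinkage-Thresholding Algorithm (ISTA) and learns its matrices. The sketch below shows plain, non-convolutional ISTA with a fixed dictionary so the shrinkage step is visible; it is not the RL-CSC model, and the helper names and toy problem are assumptions.

```python
def soft_threshold(values, theta):
    """Shrink each entry toward zero by theta (proximal operator of the L1 norm)."""
    shrunk = []
    for v in values:
        if v > theta:
            shrunk.append(v - theta)
        elif v < -theta:
            shrunk.append(v + theta)
        else:
            shrunk.append(0.0)
    return shrunk

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def ista(dictionary, signal, lam, step, iterations):
    """ISTA for min_z 0.5 * ||signal - D z||^2 + lam * ||z||_1."""
    d_transposed = [list(col) for col in zip(*dictionary)]
    code = [0.0] * len(d_transposed)
    for _ in range(iterations):
        residual = [s - r for s, r in zip(signal, matvec(dictionary, code))]
        # Gradient step on the data term, followed by shrinkage.
        stepped = [c + step * g for c, g in zip(code, matvec(d_transposed, residual))]
        code = soft_threshold(stepped, lam * step)
    return code

# With an identity dictionary the solution is the soft-thresholded signal itself.
identity = [[1.0, 0.0], [0.0, 1.0]]
code = ista(identity, signal=[3.0, 0.2], lam=0.5, step=1.0, iterations=10)
print(code)  # [2.5, 0.0]
```

In a learned variant each loop iteration becomes a network layer, which is where the interpretability argument in the abstract comes from.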


# Depth Estimation

**《High Quality Monocular Depth Estimation via Transfer Learning》**

arXiv：https://arxiv.org/abs/1812.11941

> Accurate depth estimation from images is a fundamental task in many applications including scene understanding and reconstruction. Existing solutions for depth estimation often produce blurry approximations of low resolution. This paper presents a convolutional neural network for computing a high-resolution depth map given a single RGB image with the help of transfer learning. Following a standard encoder-decoder architecture, we leverage features extracted using high performing pre-trained networks when initializing our encoder along with augmentation and training strategies that lead to more accurate results. We show how, even for a very simple decoder, our method is able to achieve detailed high-resolution depth maps. Our network, with fewer parameters and training iterations, outperforms state-of-the-art on two datasets and also produces qualitatively better results that capture object boundaries more faithfully. Code and corresponding pre-trained weights are made publicly available.

**《Unsupervised monocular stereo matching》**

arXiv：https://arxiv.org/abs/1812.11671

> At present, deep learning has been applied more and more in monocular image depth estimation and has shown promising results. The current more ideal method for monocular depth estimation is the supervised learning based on ground truth depth, but this method requires an abundance of expensive ground truth depth as the supervised labels. Therefore, researchers began to work on unsupervised depth estimation methods. Although the accuracy of unsupervised depth estimation method is still lower than that of supervised method, it is a promising research direction. 

> In this paper, based on the experimental result that stereo matching models outperform monocular depth estimation models under the same unsupervised depth estimation model, we propose an unsupervised monocular vision stereo matching method. In order to achieve monocular stereo matching, we constructed two unsupervised deep convolution network models: one reconstructs the right view from the left view, and the other estimates the depth map using the reconstructed right view and the original left view. The two network models are piped together during the test phase. The output of this method outperforms the current mainstream unsupervised depth estimation methods on the challenging KITTI dataset.

**《Epipolar Geometry based Learning of Multi-view Depth and Ego-Motion from Monocular Sequences》**

ICVGIP 2018 Best Paper

arXiv：https://arxiv.org/abs/1812.11922

> Deep approaches to predict monocular depth and ego-motion have grown in recent years due to their ability to produce dense depth from monocular images. The main idea behind them is to optimize the photometric consistency over image sequences by warping one view into another, similar to direct visual odometry methods. One major drawback is that these methods infer depth from a single view, which might not effectively capture the relation between pixels. Moreover, simply minimizing the photometric loss does not ensure proper pixel correspondences, which is a key factor for accurate depth and pose estimations. 

> In contrast, we propose a 2-view depth network to infer the scene depth from consecutive frames, thereby learning inter-pixel relationships. To ensure better correspondences, and thereby better geometric understanding, we propose incorporating epipolar constraints to make the learning more geometrically sound. We use the Essential matrix obtained using Nistér's Five Point Algorithm to enforce meaningful geometric constraints, rather than using it as training labels. This allows us to use fewer trainable parameters compared to state-of-the-art methods. The proposed method results in better depth images and pose estimates, which capture the scene structure and motion in a better way. Such geometrically constrained learning performs successfully even in cases where simply minimizing the photometric error would fail.
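
The epipolar constraint mentioned above says that corresponding normalized image points x1 and x2 satisfy x2ᵀ E x1 = 0, where E = [t]ₓ R for the camera motion X2 = R X1 + t. The sketch below evaluates that residual on a toy two-view setup; it is an illustration of the constraint, not the paper's loss, and the function names and the example scene are assumptions.

```python
def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def essential_matrix(rotation, translation):
    """E = [t]_x R for the convention X2 = R X1 + t."""
    return matmul(skew(translation), rotation)

def epipolar_residual(essential, x1, x2):
    """Algebraic residual x2^T E x1 for normalized homogeneous points."""
    return sum(a * b for a, b in zip(x2, matvec(essential, x1)))

# Toy motion: no rotation, unit translation along the x axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
E = essential_matrix(identity, (1.0, 0.0, 0.0))

# A 3D point (0.5, 0.2, 4.0) seen in both views, projected to z = 1.
x1 = (0.5 / 4.0, 0.2 / 4.0, 1.0)
x2 = (1.5 / 4.0, 0.2 / 4.0, 1.0)
good = epipolar_residual(E, x1, x2)
bad = epipolar_residual(E, x1, (1.5 / 4.0, 0.15, 1.0))  # wrong match, off the epipolar line
print(good, bad)
```

A true correspondence gives a residual of zero while the mismatched point does not, which is the signal a training objective can penalize without needing ground-truth depth.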

# Re-ID

**《EANet: Enhancing Alignment for Cross-Domain Person Re-identification》**

arXiv：https://arxiv.org/abs/1812.11369

github：https://github.com/huanghoujing/EANet

> Person re-identification (ReID) has achieved significant improvement under the single-domain setting. However, directly exploiting a model to new domains is always faced with huge performance drop, and adapting the model to new domains without target-domain identity labels is still challenging. In this paper, we address cross-domain ReID and make contributions for both model generalization and adaptation. First, we propose Part Aligned Pooling (PAP) that brings significant improvement for cross-domain testing. Second, we design a Part Segmentation (PS) constraint over ReID feature to enhance alignment and improve model generalization. Finally, we show that applying our PS constraint to unlabeled target domain images serves as effective domain adaptation. We conduct extensive experiments between three large datasets, Market1501, CUHK03 and DukeMTMC-reID. Our model achieves state-of-the-art performance under both source-domain and cross-domain settings. For completeness, we also demonstrate the complementarity of our model to existing domain adaptation methods. The code is available at https://github.com/huanghoujing/EANet.

# Human Pose Estimation

**《CamLoc: Pedestrian Location Detection from Pose Estimation on Resource-constrained Smart-cameras》**

arXiv：https://arxiv.org/abs/1812.11209

> Recent advancements in energy-efficient hardware technology are driving the exponential growth we are experiencing in the Internet of Things (IoT) space, with more pervasive computations being performed near to data generation sources. A range of intelligent devices and applications performing local detection is emerging (activity recognition, fitness monitoring, etc.) bringing with them obvious advantages such as reducing detection latency for improved interaction with devices and safeguarding user data by not leaving the device. Video processing holds utility for many emerging applications and data labelling in the IoT space. However, performing this video processing with deep neural networks at the edge of the Internet is not trivial. In this paper we show that pedestrian location estimation using deep neural networks is achievable on fixed cameras with limited compute resources. Our approach uses pose estimation from key body point detection to extend the pedestrian skeleton when the whole body is not in the image (occluded by obstacles or partially outside of frame), which achieves better location estimation performance (inference time and memory footprint) compared to fitting a bounding box over the pedestrian and scaling. We collect a sizable dataset comprising over 2100 frames in videos from one and two surveillance cameras pointing from different angles at the scene, and annotate each frame with the exact position of the person in the image, in 42 different scenarios of activity and occlusion. We compare our pose estimation based location detection with a popular detection algorithm, YOLOv2, for overlapping bounding-box generation, our solution achieving faster inference time (15x speedup) at half the memory footprint, within resource capabilities on embedded devices, which demonstrates that CamLoc is an efficient solution for location estimation in videos on smart-cameras.

**《Rethinking on Multi-Stage Networks for Human Pose Estimation》**

arXiv：https://arxiv.org/abs/1901.00148

Note：from Face++

> Existing pose estimation approaches can be categorized into single-stage and multi-stage methods. While a multi-stage architecture is seemingly more suitable for the task, the performance of current multi-stage methods is not as competitive as single-stage ones. This work studies this issue. We argue that the current unsatisfactory performance comes from various insufficient design in current methods. We propose several improvements on the architecture design, feature flow, and loss function. The resulting multi-stage network outperforms all previous works and obtains the best performance on COCO keypoint challenge 2018. The source code will be released.


# 6DoF Pose Estimation

**《PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation》**

arXiv：https://arxiv.org/abs/1812.11788

homepage：https://zju-3dv.github.io/pvnet/

> This paper addresses the challenge of 6DoF pose estimation from a single RGB image under severe occlusion or truncation. Many recent works have shown that a two-stage approach, which first detects keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable performance. However, most of these methods only localize a set of sparse keypoints by regressing their image coordinates or heatmaps, which are sensitive to occlusion and truncation. Instead, we introduce a Pixel-wise Voting Network (PVNet) to regress pixel-wise unit vectors pointing to the keypoints and use these vectors to vote for keypoint locations using RANSAC. This creates a flexible representation for localizing occluded or truncated keypoints. Another important feature of this representation is that it provides uncertainties of keypoint locations that can be further leveraged by the PnP solver. Experiments show that the proposed approach outperforms the state of the art on the LINEMOD, Occlusion LINEMOD and YCB-Video datasets by a large margin, while being efficient for real-time pose estimation. We further create a Truncation LINEMOD dataset to validate the robustness of our approach against truncation. The code will be available at https://zju-3dv.github.io/pvnet/.
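
The voting idea above can be sketched in a few lines: pairs of pixels propose a keypoint where their predicted rays intersect, and every pixel votes for a proposal if its own predicted direction agrees with it. This is a hedged toy version, not the released PVNet code; the function names, the cosine threshold, the number of rounds and the hand-made predictions are all assumptions.

```python
import math
import random

def intersect(p1, v1, p2, v2):
    """Intersection of the lines p1 + a*v1 and p2 + b*v2, or None if parallel."""
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    if abs(cross) < 1e-12:
        return None
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    a = (dx * v2[1] - dy * v2[0]) / cross
    return (p1[0] + a * v1[0], p1[1] + a * v1[1])

def vote_keypoint(pixels, directions, rounds=50, cos_threshold=0.99, seed=0):
    """RANSAC-style voting: keep the hypothesis most pixels point towards."""
    rng = random.Random(seed)
    best, best_votes = None, -1
    for _ in range(rounds):
        i, j = rng.sample(range(len(pixels)), 2)
        hypothesis = intersect(pixels[i], directions[i], pixels[j], directions[j])
        if hypothesis is None:
            continue
        votes = 0
        for p, v in zip(pixels, directions):
            dx, dy = hypothesis[0] - p[0], hypothesis[1] - p[1]
            norm = math.hypot(dx, dy)
            # A pixel votes if its unit vector points at the hypothesis.
            if norm > 0 and (dx * v[0] + dy * v[1]) / norm >= cos_threshold:
                votes += 1
        if votes > best_votes:
            best, best_votes = hypothesis, votes
    return best, best_votes

# Four pixels point at the true keypoint; the last prediction is an outlier.
keypoint = (10.0, 5.0)
pixels = [(0.0, 0.0), (4.0, 1.0), (2.0, 9.0), (7.0, 7.0), (12.0, 0.0)]
directions = []
for px, py in pixels[:-1]:
    dx, dy = keypoint[0] - px, keypoint[1] - py
    n = math.hypot(dx, dy)
    directions.append((dx / n, dy / n))
directions.append((1.0, 0.0))  # outlier direction

estimate, votes = vote_keypoint(pixels, directions)
print(estimate, votes)
```

Because the keypoint is recovered from directions rather than from a visible peak, it can lie behind an occluder or outside the image, which matches the robustness argument in the abstract.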

# Style Transfer

**《Ancient Painting to Natural Image: A New Solution for Painting Processing》**

WACV 2019

arXiv：https://arxiv.org/abs/1901.00224

> Collecting a large-scale and well-annotated dataset for image processing has become a common practice in computer vision. However, in the ancient painting area, this task is not practical as the number of paintings is limited and their style is greatly diverse. We, therefore, propose a novel solution for the problems that come with ancient painting processing. This is to use domain transfer to convert ancient paintings to photo-realistic natural images. By doing so, the ancient painting processing problems become natural image processing problems and models trained on natural images can be directly applied to the transferred paintings. Specifically, we focus on Chinese ancient flower, bird and landscape paintings in this work. A novel Domain Style Transfer Network (DSTN) is proposed to transfer ancient paintings to natural images which employ a compound loss to ensure that the transferred paintings still maintain the color composition and content of the input paintings. The experiment results show that the transferred paintings generated by the DSTN have a better performance in both the human perceptual test and other image processing tasks than other state-of-art methods, indicating the authenticity of the transferred paintings and the superiority of the proposed method.

# Feature Point

**《Interest Point Detection based on Adaptive Ternary Coding》**

arXiv：https://arxiv.org/abs/1901.00031

> In this paper, an adaptive pixel ternary coding mechanism is proposed and a contrast invariant and noise resistant interest point detector is developed on the basis of this mechanism. Every pixel in a local region is adaptively encoded into one of the three statuses: bright, uncertain and dark. The blob significance of the local region is measured by the spatial distribution of the bright and dark pixels. Interest points are extracted from this blob significance measurement. By labeling the statuses of ternary bright, uncertain, and dark, the proposed detector shows more robustness to image noise and quantization errors. Moreover, the adaptive strategy for the ternary coding, which relies on two thresholds that automatically converge to the median of the local region in measurement, enables this coding to be insensitive to the image local contrast. As a result, the proposed detector is invariant to illumination changes. The state-of-the-art results are achieved on the standard datasets, and also in the face recognition application.

**《DCI: Discriminative and Contrast Invertible Descriptor》**

arXiv：https://arxiv.org/abs/1901.00027

> Local feature descriptors have been widely used in fine-grained visual object search thanks to their robustness in scale and rotation variation and cluttered background. However, the performance of such descriptors drops under severe illumination changes. In this paper, we proposed a Discriminative and Contrast Invertible (DCI) local feature descriptor. In order to increase the discriminative ability of the descriptor under illumination changes, a Laplace gradient based histogram is proposed. A robust contrast flipping estimate is proposed based on the divergence of a local region. Experiments on fine-grained object recognition and retrieval applications demonstrate the superior performance of DCI descriptor to others.


# Crowd Counting

**《Mask-aware networks for crowd counting》**

arXiv：https://arxiv.org/abs/1901.00039

> Crowd counting problem aims to count the number of objects within an image or a frame in the videos and is usually solved by estimating the density map generated from the object location annotations. The values in the density map, by nature, take two possible states: zero indicating no object around, a non-zero value indicating the existence of objects and the value denoting the local object density. In contrast to traditional methods which do not differentiate the density prediction of these two states, we propose to use a dedicated network branch to predict the object/non-object mask and then combine its prediction with the input image to produce the density map. Our rationale is that the mask prediction could be better modeled as a binary segmentation problem and the difficulty of estimating the density could be reduced if the mask is known. A key to the proposed scheme is the strategy of incorporating the mask prediction into the density map estimator. To this end, we study five possible solutions, and via analysis and experimental validation we identify the most effective one. Through extensive experiments on five public datasets, we demonstrate the superior performance of the proposed approach over the baselines and show that our network could achieve the state-of-the-art performance.
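
Two facts from the abstract above are easy to make concrete: the count is the sum of the density map, and a predicted object/non-object mask can suppress density in background regions. The sketch below uses simple element-wise gating, which is only one conceivable way to combine the two and is not claimed to be the strategy the paper selects; the function names, threshold and toy maps are assumptions.

```python
def gate_density(density, mask_prob, threshold=0.5):
    """Zero out density wherever the mask branch predicts 'no object'."""
    return [
        [d if m >= threshold else 0.0 for d, m in zip(density_row, mask_row)]
        for density_row, mask_row in zip(density, mask_prob)
    ]

def crowd_count(density):
    """The estimated count is the integral (sum) of the density map."""
    return sum(sum(row) for row in density)

# Toy 2x3 density map with some spurious density in background cells.
density = [[0.5, 0.5, 0.125],
           [0.0, 1.0, 0.125]]
mask_prob = [[0.9, 0.8, 0.1],
             [0.2, 0.95, 0.3]]

raw_count = crowd_count(density)
gated_count = crowd_count(gate_density(density, mask_prob))
print(raw_count, gated_count)  # 2.25 2.0
```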

# Datasets

**《A Remote Sensing Image Dataset for Cloud Removal》**

arXiv：https://arxiv.org/abs/1901.00600

datasets：https://github.com/BUPTLdy/RICE_DATASET

> Cloud-based overlays are often present in optical remote sensing images, thus limiting the application of acquired data. Removing clouds is an indispensable pre-processing step in remote sensing image analysis. Deep learning has achieved great success in the field of remote sensing in recent years, including scene classification and change detection. However, deep learning is rarely applied to cloud removal in remote sensing images. The reason is the lack of data sets for training neural networks. In order to solve this problem, this paper proposes the Remote sensing Image Cloud rEmoving dataset (RICE). The proposed dataset consists of two parts: RICE1 contains 500 pairs of images, each pair consisting of a cloudy image and a cloudless image of size 512*512; RICE2 contains 450 sets of images, each set containing three 512*512 images: the reference picture without clouds, the picture with clouds, and the mask of its clouds. The dataset is freely available at https://github.com/BUPTLdy/RICE_DATASET.


# Other

**《Fast Perceptual Image Enhancement》**

arXiv：https://arxiv.org/abs/1812.11852

> The vast majority of photos taken today are by mobile phones. While their quality is rapidly growing, due to physical limitations and cost constraints, mobile phone cameras struggle to compare in quality with DSLR cameras. This motivates us to computationally enhance these images. We extend upon the results of Ignatov et al., where they are able to translate images from compact mobile cameras into images with comparable quality to high-resolution photos taken by DSLR cameras. However, the neural models employed require large amounts of computational resources and are not lightweight enough to run on mobile devices. We build upon the prior work and explore different network architectures targeting an increase in image quality and speed. With an efficient network architecture which does most of its processing in a lower spatial resolution, we achieve a significantly higher mean opinion score (MOS) than the baseline while speeding up the computation by 6.3 times on a consumer-grade CPU. This suggests a promising direction for neural-network-based photo enhancement using the phone hardware of the future.


**《Actor Conditioned Attention Maps for Video Action Detection》**

arXiv：https://arxiv.org/abs/1812.11631

> Interactions with surrounding objects and people contain important information towards understanding human actions. In order to model such interactions explicitly, we propose to generate attention maps that rank each spatio-temporal region's importance to a detected actor. We refer to these as Actor-Conditioned Attention Maps (ACAM), and these maps serve as weights to the features extracted from the whole scene. These resulting actor-conditioned features help focus the learned model on regions that are important/relevant to the conditioned actor. Another novelty of our approach is in the use of pre-trained object detectors, instead of region proposals, that generalize better to videos from different sources. Detailed experimental results on the AVA 2.1 datasets demonstrate the importance of interactions, with a performance improvement of 5 mAP with respect to state of the art published results.

**《BNN+: Improved Binary Network Training》**

arXiv：https://arxiv.org/abs/1812.11800

> Deep neural networks (DNN) are widely used in many applications. However, their deployment on edge devices has been difficult because they are resource hungry. Binary neural networks (BNN) help to alleviate the prohibitive resource requirements of DNN, where both activations and weights are limited to 1-bit. We propose an improved binary training method (BNN+), by introducing a regularization function that encourages training weights around binary values. In addition to this, to enhance model performance we add trainable scaling factors to our regularization functions. Furthermore, we use an improved approximation of the derivative of the sign activation function in the backward computation. These additions are based on linear operations that are easily implementable into the binary training framework. We show experimental results on CIFAR-10, obtaining an accuracy of 86.7% with AlexNet and 91.3% with a VGG network. On ImageNet, our method also outperforms the traditional BNN method and XNOR-net, using AlexNet by a margin of 4% and 2% top-1 accuracy respectively.
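
A regularizer that "encourages training weights around binary values" can be illustrated with a penalty that vanishes exactly at ±α. The sketch below uses an L1-style form |α − |w|| as an assumed example; the exact functions, the trainable scaling factors and the sign-derivative approximation used in BNN+ are not reproduced here, and all names are hypothetical.

```python
def binarize(weights):
    """Forward-pass binarization with the sign function (0 maps to +1)."""
    return [1.0 if w >= 0 else -1.0 for w in weights]

def binary_regularizer(weights, alpha=1.0):
    """Assumed L1-style penalty that is zero exactly when |w| == alpha."""
    return sum(abs(alpha - abs(w)) for w in weights)

def regularizer_step(weights, alpha=1.0, lr=0.05):
    """One (sub)gradient descent step on the penalty alone."""
    updated = []
    for w in weights:
        sign = 1.0 if w >= 0 else -1.0
        if abs(w) < alpha:
            grad = -sign  # push |w| up toward alpha
        elif abs(w) > alpha:
            grad = sign   # pull |w| down toward alpha
        else:
            grad = 0.0
        updated.append(w - lr * grad)
    return updated

weights = [0.9, -0.2, 0.05, -1.0]
before = binary_regularizer(weights)
after = binary_regularizer(regularizer_step(weights))
print(binarize(weights), before, after)
```

Each step moves the latent full-precision weights closer to the values they are binarized to, which reduces the mismatch between the forward and backward passes.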

**《Kymatio: Scattering Transforms in Python》**

arXiv：https://arxiv.org/abs/1812.11214

homepage：https://www.kymat.io/

> The wavelet scattering transform is an invariant signal representation suitable for many signal processing and machine learning applications. We present the Kymatio software package, an easy-to-use, high-performance Python implementation of the scattering transform in 1D, 2D, and 3D that is compatible with modern deep learning frameworks. All transforms may be executed on a GPU (in addition to CPU), offering a considerable speed up over CPU implementations. The package also has a small memory footprint, resulting in efficient memory usage. The source code, documentation, and examples are available under a BSD license at https://www.kymat.io/.

**《An introduction to domain adaptation and transfer learning》**

arXiv：https://arxiv.org/abs/1812.11806

> In machine learning, if the training data is an unbiased sample of an underlying distribution, then the learned classification function will make accurate predictions for new samples. However, if the training data is not an unbiased sample, then there will be differences between how the training data is distributed and how the test data is distributed. Standard classifiers cannot cope with changes in data distributions between training and test phases, and will not perform well. Domain adaptation and transfer learning are sub-fields within machine learning that are concerned with accounting for these types of changes. Here, I present an introduction to these fields, guided by the question: when and how can a classifier generalize from a source to a target domain? I will start with a brief introduction into risk minimization, and how transfer learning and domain adaptation expand upon this framework. Following that, I discuss three special cases of data set shift, namely prior, covariate and concept shift. For more complex domain shifts, there are a wide variety of approaches. These are categorized into: importance-weighting, subspace mapping, domain-invariant spaces, feature augmentation, minimax estimators and robust algorithms. A number of points will arise, which I will discuss in the last section. I conclude with the remark that many open questions will have to be addressed before transfer learners and domain-adaptive classifiers become practical.
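
Importance weighting, the first family of approaches listed above, corrects for covariate shift by reweighting each source sample with w(x) = p_T(x) / p_S(x). The sketch below assumes both densities are known and discrete, which is rarely true in practice and is done here only to keep the arithmetic visible; the names and numbers are illustrative.

```python
def importance_weighted_risk(samples, losses, p_source, p_target):
    """Estimate the target risk from source samples via density-ratio weights."""
    weights = [p_target[x] / p_source[x] for x in samples]
    return sum(w * loss for w, loss in zip(weights, losses)) / len(samples)

# Source data over-represents "a"; the target domain is balanced.
p_source = {"a": 0.8, "b": 0.2}
p_target = {"a": 0.5, "b": 0.5}

# Five source samples and the classifier's loss on each (it only fails on "b").
samples = ["a", "a", "a", "a", "b"]
losses = [0.0, 0.0, 0.0, 0.0, 1.0]

naive_risk = sum(losses) / len(losses)
weighted_risk = importance_weighted_risk(samples, losses, p_source, p_target)
print(naive_risk, weighted_risk)  # 0.2 0.5
```

The unweighted average understates the error the classifier would make on the target domain, while the weighted estimate recovers the true target risk of 0.5 in this toy case.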

**《Learning Efficient Detector with Semi-supervised Adaptive Distillation》**

arXiv：https://arxiv.org/abs/1901.00366

github：https://github.com/Tangshitao/Semi-supervised-Adaptive-Distillation

> Knowledge Distillation (KD) has been used in image classification for model compression. However, few studies apply this technology to single-stage object detectors. Focal loss shows that the accumulated errors of easily-classified samples dominate the overall loss in the training process. This problem is also encountered when applying KD in the detection task. For KD, the teacher-defined hard samples are far more important than any others. We propose ADL to address this issue by adaptively mimicking the teacher's logits, with more attention paid to two types of hard samples: hard-to-learn samples predicted by the teacher with low certainty and hard-to-mimic samples with a large gap between the teacher's and the student's prediction. ADL enlarges the distillation loss for hard-to-learn and hard-to-mimic samples and reduces distillation loss for the dominant easy samples, enabling distillation to work on the single-stage detector for the first time, even if the student and the teacher are identical. Besides, ADL is effective in both the supervised setting and the semi-supervised setting, even when the labeled data and unlabeled data are from different distributions. For distillation on unlabeled data, ADL achieves better performance than existing data distillation which simply utilizes hard targets, making the student detector surpass its teacher. On the COCO database, semi-supervised adaptive distillation (SAD) makes a student detector with a backbone of ResNet-50 surpass its teacher with a backbone of ResNet-101, while the student has half of the teacher's computation complexity. The code is available at https://github.com/Tangshitao/Semi-supervised-Adaptive-Distillation
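
The weighting behaviour described above, larger loss for hard-to-mimic and hard-to-learn samples and smaller loss for easy ones, can be sketched with a focal-style modulating factor on the KL divergence. The specific form below, (1 − exp(−(KL + β·H(teacher))))^γ · KL, together with the values of β and γ, is an assumption chosen for illustration and should be checked against the paper and repository before being relied on.

```python
import math

def kl_divergence(teacher, student):
    """KL(teacher || student) for discrete distributions with positive student mass."""
    return sum(q * math.log(q / p) for q, p in zip(teacher, student) if q > 0)

def entropy(dist):
    return -sum(q * math.log(q) for q in dist if q > 0)

def adaptive_weight(teacher, student, beta=1.5, gamma=1.0):
    """Modulating factor: grows with the mimic gap (KL) and teacher uncertainty (entropy)."""
    difficulty = kl_divergence(teacher, student) + beta * entropy(teacher)
    return (1.0 - math.exp(-difficulty)) ** gamma

def adaptive_distillation_loss(teacher, student, beta=1.5, gamma=1.0):
    return adaptive_weight(teacher, student, beta, gamma) * kl_divergence(teacher, student)

# Easy: confident teacher, student already close.
w_easy = adaptive_weight([0.99, 0.01], [0.98, 0.02])
# Hard to mimic: confident teacher, student far away.
w_mimic = adaptive_weight([0.99, 0.01], [0.5, 0.5])
# Hard to learn: the teacher itself is uncertain.
w_learn = adaptive_weight([0.55, 0.45], [0.5, 0.5])
print(w_easy, w_mimic, w_learn)
```

Under this assumed form the easy sample receives the smallest weight, so the many easy samples no longer dominate the summed distillation loss.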

**《On Minimum Discrepancy Estimation for Deep Domain Adaptation》**

Accepted in Joint IJCAI/ECAI/AAMAS/ICML 2018 Workshop

arXiv：https://arxiv.org/abs/1901.00282

> In the presence of large sets of labeled data, Deep Learning (DL) has accomplished extraordinary triumphs in the avenue of computer vision, particularly in object classification and recognition tasks. However, DL cannot always perform well when the training and testing images come from different distributions or in the presence of domain shift between training and testing images. DL models also suffer in the absence of labeled input data. Domain adaptation (DA) methods have been proposed to make up for the poor performance due to domain shift. In this paper, we present a new unsupervised deep domain adaptation method based on the alignment of second order statistics (covariances) as well as maximum mean discrepancy of the source and target data with a two stream Convolutional Neural Network (CNN). We demonstrate the ability of the proposed approach to achieve state-of-the-art performance for image classification on three benchmark domain adaptation datasets: Office-31 [27], Office-Home [37] and Office-Caltech [8].
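
The two discrepancy terms mentioned above can be sketched in a few lines. This is an illustrative sketch only: the covariance term follows the commonly used CORAL form (squared Frobenius distance scaled by 1/(4d²)), and the MMD term uses a linear kernel, which reduces to the squared distance between feature means. Both choices, and all function names, are assumptions rather than details taken from the paper.

```python
def mean_vector(features):
    """Column-wise mean of a list of equally sized feature vectors."""
    n = len(features)
    return [sum(row[j] for row in features) / n for j in range(len(features[0]))]


def covariance(features):
    """Unbiased sample covariance matrix (d x d) of the feature vectors."""
    n, d = len(features), len(features[0])
    mu = mean_vector(features)
    return [
        [sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in features) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]


def coral_loss(source, target):
    """Second-order alignment: squared Frobenius distance between covariances."""
    d = len(source[0])
    cs, ct = covariance(source), covariance(target)
    sq = sum((cs[i][j] - ct[i][j]) ** 2 for i in range(d) for j in range(d))
    return sq / (4 * d * d)


def linear_mmd(source, target):
    """Maximum mean discrepancy with a linear kernel (first-order alignment)."""
    ms, mt = mean_vector(source), mean_vector(target)
    return sum((a - b) ** 2 for a, b in zip(ms, mt))
```

The two terms are complementary: shifting every target feature by a constant changes `linear_mmd` but not `coral_loss`, while rescaling the features changes the covariance term.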

**《EdgeConnect: Generative Image Inpainting with Adversarial Edge Learning》**

arXiv：https://arxiv.org/abs/1901.00212

github：https://github.com/knazeri/edge-connect

> Over the last few years, deep learning techniques have yielded significant improvements in image inpainting. However, many of these techniques fail to reconstruct reasonable structures as they are commonly over-smoothed and/or blurry. This paper develops a new approach for image inpainting that does a better job of reproducing filled regions exhibiting fine details. We propose a two-stage adversarial model EdgeConnect that comprises an edge generator followed by an image completion network. The edge generator hallucinates edges of the missing region (both regular and irregular) of the image, and the image completion network fills in the missing regions using hallucinated edges as a prior. We evaluate our model end-to-end over the publicly available datasets CelebA, Places2, and Paris StreetView, and show that it outperforms current state-of-the-art techniques quantitatively and qualitatively.

**《Not All Words are Equal: Video-specific Information Loss for Video Captioning》**

arXiv：https://arxiv.org/abs/1901.00097

> An ideal description for a given video should fix its gaze on salient and representative content, which is capable of distinguishing this video from others. However, the distribution of different words is unbalanced in video captioning datasets, where distinctive words for describing video-specific salient objects are far less than common words such as 'a' 'the' and 'person'. The dataset bias often results in recognition error or detail deficiency of salient but unusual objects. To address this issue, we propose a novel learning strategy called Information Loss, which focuses on the relationship between the video-specific visual content and corresponding representative words. Moreover, a framework with hierarchical visual representations and an optimized hierarchical attention mechanism is established to capture the most salient spatial-temporal visual information, which fully exploits the potential strength of the proposed learning strategy. Extensive experiments demonstrate that the ingenious guidance strategy together with the optimized architecture outperforms state-of-the-art video captioning methods on MSVD with CIDEr score 87.5, and achieves superior CIDEr score 47.7 on MSR-VTT. We also show that our Information Loss is generic which improves various models by significant margins.


**《Extreme Relative Pose Estimation for RGB-D Scans via Scene Completion》**

arXiv：https://arxiv.org/abs/1901.00063

> Estimating the relative rigid pose between two RGB-D scans of the same underlying environment is a fundamental problem in computer vision, robotics, and computer graphics. Most existing approaches allow only limited maximum relative pose changes since they require considerable overlap between the input scans. We introduce a novel deep neural network that extends the scope to extreme relative poses, with little or even no overlap between the input scans. The key idea is to infer more complete scene information about the underlying environment and match on the completed scans. In particular, instead of only performing scene completion from each individual scan, our approach alternates between relative pose estimation and scene completion. This allows us to perform scene completion by utilizing information from both input scans at late iterations, resulting in better results for both scene completion and relative pose estimation. Experimental results on benchmark datasets show that our approach leads to considerable improvements over state-of-the-art approaches for relative pose estimation. In particular, our approach provides encouraging relative pose estimates even between non-overlapping scans.

**《A Survey on Multi-output Learning》**

arXiv：https://arxiv.org/abs/1901.00248

> Multi-output learning aims to simultaneously predict multiple outputs given an input. It is an important learning problem due to the pressing need for sophisticated decision making in real-world applications. Inspired by big data, the 4Vs characteristics of multi-output imposes a set of challenges to multi-output learning, in terms of the volume, velocity, variety and veracity of the outputs. Increasing number of works in the literature have been devoted to the study of multi-output learning and the development of novel approaches for addressing the challenges encountered. However, it lacks a comprehensive overview on different types of challenges of multi-output learning brought by the characteristics of the multiple outputs and the techniques proposed to overcome the challenges. This paper thus attempts to fill in this gap to provide a comprehensive review on this area. We first introduce different stages of the life cycle of the output labels. Then we present the paradigm on multi-output learning, including its myriads of output structures, definitions of its different sub-problems, model evaluation metrics and popular data repositories used in the study. Subsequently, we review a number of state-of-the-art multi-output learning methods, which are categorized based on the challenges.

**《Deep Frame Prediction for Video Coding》**

submitted to IEEE Trans

arXiv：https://arxiv.org/abs/1901.00062

> We propose a novel frame prediction method using a deep neural network (DNN), with the goal of improving video coding efficiency. The proposed DNN makes use of decoded frames, at both encoder and decoder, to predict textures of the current coding block. Unlike conventional inter-prediction, the proposed method does not require any motion information to be transferred between the encoder and the decoder. Still, both uni-directional and bi-directional prediction are possible using the proposed DNN, which is enabled by the use of the temporal index channel, in addition to color channels. In this study, we developed a jointly trained DNN for both uni- and bi-directional prediction, as well as separate networks for uni- and bi-directional prediction, and compared the efficacy of both approaches. The proposed DNNs were compared with the conventional motion-compensated prediction in the latest video coding standard, HEVC, in terms of BD-Bitrate. The experiments show that the proposed joint DNN (for both uni- and bi-directional prediction) reduces the luminance bitrate by about 3.9%, 2.2%, and 2.2% in the Low delay P, Low delay, and Random access configurations, respectively. In addition, using the separately trained DNNs brings further bit savings of about 0.4%-0.8%.

**《CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions》**

arXiv：https://arxiv.org/abs/1901.00850

github & datasets：https://cs.jhu.edu/~cxliu/2019/clevr-ref+

> Referring object detection and referring image segmentation are important tasks that require joint understanding of visual information and natural language. Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process. To address these issues and complement similar efforts in visual question answering, we build CLEVR-Ref+, a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through sampling strategy), and the modular programs enable intermediate reasoning ground truth without human annotators. 
In addition to evaluating several state-of-the-art models on CLEVR-Ref+, we also propose IEP-Ref, a module network approach that significantly outperforms other models on our dataset. In particular, we present two interesting and important findings using IEP-Ref: (1) the module trained to transform feature maps into segmentation masks can be attached to any intermediate module to reveal the entire reasoning process step-by-step; (2) even if all training data has at least one object referred, IEP-Ref can correctly predict no-foreground when presented with false-premise referring expressions. To the best of our knowledge, this is the first direct and quantitative proof that neural modules behave in the way they are intended.

**《A Hierarchical Grocery Store Image Dataset with Visual and Semantic Labels》**

WACV 2019

arXiv：https://arxiv.org/abs/1901.00711

> Image classification models built into visual support systems and other assistive devices need to provide accurate predictions about their environment. We focus on an application of assistive technology for people with visual impairments, for daily activities such as shopping or cooking. In this paper, we provide a new benchmark dataset for a challenging task in this application - classification of fruits, vegetables, and refrigerated products, e.g. milk packages and juice cartons, in grocery stores. To enable the learning process to utilize multiple sources of structured information, this dataset not only contains a large volume of natural images but also includes the corresponding information of the product from an online shopping website. Such information encompasses the hierarchical structure of the object classes, as well as an iconic image of each type of object. This dataset can be used to train and evaluate image classification models for helping visually impaired people in natural environments. Additionally, we provide benchmark results evaluated on pretrained convolutional neural networks often used for image understanding purposes, and also a multi-view variational autoencoder, which is capable of utilizing the rich product information in the dataset.

**《Active Learning with TensorBoard Projector》**

arXiv：https://arxiv.org/abs/1901.00675

> An ML-based system for interactive labeling of image datasets is contributed in TensorBoard Projector to speed up image annotation performed by humans. The tool visualizes feature spaces and makes it directly editable by online integration of applied labels, and it is a system for verifying and managing machine learning data pertaining to labels. We propose realistic annotation emulation to evaluate the system design of interactive active learning, based on our improved semi-supervised extension of t-SNE dimensionality reduction. Our active learning tool can significantly increase labeling efficiency compared to uncertainty sampling, and we show that less than 100 labeling actions are typically sufficient for good classification on a variety of specialized image datasets. Our contribution is unique given that it needs to perform dimensionality reduction, feature space visualization and editing, interactive label propagation, low-complexity active learning, human perceptual modeling, annotation emulation and unsupervised feature extraction for specialized datasets in a production-quality implementation.

**《Edge-Semantic Learning Strategy for Layout Estimation in Indoor Environment》**

arXiv：https://arxiv.org/abs/1901.00621

> Visual cognition of the indoor environment can benefit from the spatial layout estimation, which is to represent an indoor scene with a 2D box on a monocular image. In this paper, we propose to fully exploit the edge and semantic information of a room image for layout estimation. More specifically, we present an encoder-decoder network with shared encoder and two separate decoders, which are composed of multiple deconvolution (transposed convolution) layers, to jointly learn the edge maps and semantic labels of a room image. We combine these two network predictions in a scoring function to evaluate the quality of the layouts, which are generated by ray sampling and from a predefined layout pool. Guided by the scoring function, we apply a novel refinement strategy to further optimize the layout hypotheses. Experimental results show that the proposed network can yield accurate estimates of edge maps and semantic labels. By fully utilizing the two different types of labels, the proposed method achieves state-of-the-art layout estimation performance on benchmark datasets.

**《Photo-Sketching: Inferring Contour Drawings from Images》**

homepage (includes code)：http://www.cs.cmu.edu/~mengtial/proj/sketch/

WACV 2019

arXiv：https://arxiv.org/abs/1901.00542

> Edges, boundaries and contours are important subjects of study in both computer graphics and computer vision. On one hand, they are the 2D elements that convey 3D shapes, on the other hand, they are indicative of occlusion events and thus separation of objects or semantic concepts. In this paper, we aim to generate contour drawings, boundary-like drawings that capture the outline of the visual scene. Prior art often cast this problem as boundary detection. However, the set of visual cues presented in the boundary detection output are different from the ones in contour drawings, and also the artistic style is ignored. We address these issues by collecting a new dataset of contour drawings and proposing a learning-based method that resolves diversity in the annotation and, unlike boundary detectors, can work with imperfect alignment of the annotation and the actual ground truth. Our method surpasses previous methods quantitatively and qualitatively. Surprisingly, when our model fine-tunes on BSDS500, we achieve the state-of-the-art performance in salient boundary detection, suggesting contour drawing might be a scalable alternative to boundary annotation, which at the same time is easier and more interesting for annotators to draw.

**《Visualizing Deep Similarity Networks》**

arXiv：https://arxiv.org/abs/1901.00536

> For convolutional neural network models that optimize an image embedding, we propose a method to highlight the regions of images that contribute most to pairwise similarity. This work is a corollary to the visualization tools developed for classification networks, but applicable to the problem domains better suited to similarity learning. The visualization shows how similarity networks that are fine-tuned learn to focus on different features. We also generalize our approach to embedding networks that use different pooling strategies and provide a simple mechanism to support image similarity searches on objects or sub-regions in the query image.
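
One simple way to see how image regions can be credited for a pairwise similarity score is to assume an embedding obtained by average pooling and compared with a plain dot product. Under that assumption the similarity decomposes exactly into per-location contributions. The sketch below illustrates this decomposition; it is not the authors' implementation, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def average_pool(features):
    """Average a list of per-location feature vectors into one embedding."""
    n = len(features)
    return [sum(f[k] for f in features) / n for k in range(len(features[0]))]


def similarity_contributions(features_a, features_b):
    """Contribution of each spatial location of image A to dot(pool(A), pool(B)).

    Because pooling is linear, the contributions sum to the full similarity.
    """
    pooled_b = average_pool(features_b)
    n = len(features_a)
    return [dot(f, pooled_b) / n for f in features_a]
```

Reshaping the returned list to the feature map's height and width gives a heat map over image A. Other pooling strategies or normalised embeddings would need a different decomposition.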

**《Resource-Scalable CNN Synthesis for IoT Applications》**

arXiv：https://arxiv.org/abs/1901.00738

> State-of-the-art image recognition systems use sophisticated Convolutional Neural Networks (CNNs) that are designed and trained to identify numerous object classes. Such networks are fairly resource intensive to compute, prohibiting their deployment on resource-constrained embedded platforms. On one hand, the ability to classify an exhaustive list of categories is excessive for the demands of most IoT applications. On the other hand, designing a new custom-designed CNN for each new IoT application is impractical, due to the inherent difficulty in developing competitive models and time-to-market pressure. To address this problem, we investigate the question of: "Can one utilize an existing optimized CNN model to automatically build a competitive CNN for an IoT application whose objects of interest are a fraction of categories that the original CNN was designed to classify, such that the resource requirement is proportionally scaled down?" We use the term resource scalability to refer to this concept, and develop a methodology for automated synthesis of resource scalable CNNs from an existing optimized baseline CNN. The synthesized CNN has sufficient learning capacity for handling the given IoT application requirements, and yields competitive accuracy. The proposed approach is fast, and unlike the presently common practice of CNN design, does not require iterative rounds of training trial and error.

**《Linear colour segmentation revisited》**

Note：colour segmentation, an interesting research direction

arXiv：https://arxiv.org/abs/1901.00534

github：https://github.com/visillect/segmentation

> In this work we discuss the known algorithms for linear colour segmentation based on a physical approach and propose a new modification of segmentation algorithm. This algorithm is based on a region adjacency graph framework without a pre-segmentation stage. Proposed edge weight functions are defined from linear image model with normal noise. The colour space projective transform is introduced as a novel pre-processing technique for better handling of shadow and highlight areas. The resulting algorithm is tested on a benchmark dataset consisting of the images of 19 natural scenes selected from the Barnard's DXC-930 SFU dataset and 12 natural scene images newly published for common use. The dataset is provided with pixel-by-pixel ground truth colour segmentation for every image. Using this dataset, we show that the proposed algorithm modifications lead to qualitative advantages over other model-based segmentation algorithms, and also show the positive effect of each proposed modification. The source code and datasets for this work are available for free access at https://github.com/visillect/segmentation.


================================================
FILE: 2019/03/12.md
================================================
[Computer Vision Paper Express] 2019-03-12

- [x] 2019-03-12

This post shares 10 papers (including 5 from CVPR 2019), covering object detection, face detection, semantic segmentation, and other topics.

[TOC]

# Object Detection

**《ScratchDet: Exploring to Train Single-Shot Object Detectors from Scratch》**

Date：20190309

Author：JD.com et al. (CVPR 2019)

> Abstract：Current state-of-the-art object detectors are fine-tuned from the off-the-shelf networks pretrained on large-scale classification dataset ImageNet, which incurs some additional problems: 1) The classification and detection have different degrees of sensitivity to translation, resulting in the learning objective bias; 2) The architecture is limited by the classification network, leading to the inconvenience of modification. To cope with these problems, training detectors from scratch is a feasible solution. However, the detectors trained from scratch generally perform worse than the pretrained ones, even suffer from the convergence issue in training. In this paper, we explore to train object detectors from scratch robustly. By analysing the previous work on optimization landscape, we find that one of the overlooked points in current trained-from-scratch detector is the BatchNorm. Resorting to the stable and predictable gradient brought by BatchNorm, detectors can be trained from scratch stably while keeping the favourable performance independent to the network architecture. Taking this advantage, we are able to explore various types of networks for object detection, without suffering from the poor convergence. By extensive experiments and analysis on downsampling factor, we propose the Root-ResNet backbone network, which makes full use of the information from original images. Our ScratchDet achieves the state-of-the-art accuracy on PASCAL VOC 2007, 2012 and MS COCO among all the train-from-scratch detectors and even performs better than several one-stage pretrained methods.

arXiv：https://arxiv.org/abs/1810.08425v3

github：https://github.com/KimSoybean/ScratchDet

# Face Detection

**《MSFD:Multi-Scale Receptive Field Face Detector》(ICPR 2018)**

Date：20190311

Author：Beijing University of Posts and Telecommunications

> Abstract：We aim to study the multi-scale receptive fields of a single convolutional neural network to detect faces of varied scales. This paper presents our Multi-Scale Receptive Field Face Detector (MSFD), which has superior performance on detecting faces at different scales and enjoys real-time inference speed. MSFD agglomerates context and texture by hierarchical structure. More additional information and rich receptive field bring significant improvement but generate marginal time consumption. We simultaneously propose an anchor assignment strategy which can cover faces with a wide range of scales to improve the recall rate of small faces and rotated faces. To reduce the false positive rate, we train our detector with focal loss which keeps the easy samples from overwhelming. As a result, MSFD reaches superior results on the FDDB, Pascal-Faces and WIDER FACE datasets, and can run at 31 FPS on GPU for VGA-resolution images.

arXiv：https://arxiv.org/abs/1903.04147
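
The abstract above relies on focal loss to stop easy samples from dominating training. A minimal sketch of the standard binary focal loss follows. The defaults `alpha=0.25` and `gamma=2.0` are the values commonly quoted for focal loss in general; the settings MSFD actually uses are not stated in the abstract, so treat them as assumptions.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive (face) class.
    y: ground-truth label, 1 for positive and 0 for negative.
    """
    p = min(max(p, eps), 1.0 - eps)           # avoid log(0)
    p_t = p if y == 1 else 1.0 - p            # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified (easy) samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

Setting `gamma=0` recovers alpha-weighted cross-entropy. With `gamma=2`, a sample already classified with `p_t = 0.9` contributes about 1% of its cross-entropy loss.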

# Semantic Segmentation

**《Structured Knowledge Distillation for Semantic Segmentation》(CVPR 2019)**

Date：20190311

Author：University of Adelaide & Microsoft Research Asia & Beihang University

> Abstract：In this paper, we investigate the knowledge distillation strategy for training small semantic segmentation networks by making use of large networks. We start from the straightforward scheme, pixel-wise distillation, which applies the distillation scheme adopted for image classification and performs knowledge distillation for each pixel separately. We further propose to distill the structured knowledge from large networks to small networks, which is motivated by that semantic segmentation is a structured prediction problem. We study two structured distillation schemes: (i) pair-wise distillation that distills the pairwise similarities, and (ii) holistic distillation that uses GAN to distill holistic knowledge. The effectiveness of our knowledge distillation approaches is demonstrated by extensive experiments on three scene parsing datasets: Cityscapes, Camvid and ADE20K.

arXiv：https://arxiv.org/abs/1903.04197
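
Two of the distillation schemes named in the abstract, pixel-wise and pair-wise, can be illustrated with small helper functions. This is a sketch under assumptions: pixel-wise distillation is written as the mean KL divergence between teacher and student class distributions at each pixel, and pair-wise distillation as the mean squared difference between cosine-similarity matrices of per-location features. The holistic (GAN-based) scheme is omitted, and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_distillation(student_logits, teacher_logits):
    """Mean over pixels of KL(teacher || student); inputs are lists of logit lists."""
    loss = 0.0
    for s, t in zip(student_logits, teacher_logits):
        ps, pt = softmax(s), softmax(t)
        loss += sum(q * math.log(q / p) for q, p in zip(pt, ps) if q > 0)
    return loss / len(student_logits)


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two feature vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / max(nu * nv, eps)


def pairwise_distillation(student_feats, teacher_feats):
    """Mean squared difference between student and teacher similarity matrices."""
    n = len(student_feats)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            diff = (cosine(student_feats[i], student_feats[j])
                    - cosine(teacher_feats[i], teacher_feats[j]))
            loss += diff * diff
    return loss / (n * n)
```

Because the pair-wise term compares similarity structure rather than raw features, the student and teacher feature dimensions do not need to match.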

# Depth Estimation

**《Group-wise Correlation Stereo Network》(CVPR 2019)**

Date：20190310

Author：The Chinese University of Hong Kong & SenseTime

> Abstract：Stereo matching estimates the disparity between a rectified image pair, which is of great importance to depth sensing, autonomous driving, and other related tasks. Previous works built cost volumes with cross-correlation or concatenation of left and right features across all disparity levels, and then a 2D or 3D convolutional neural network is utilized to regress the disparity maps. In this paper, we propose to construct the cost volume by group-wise correlation. The left features and the right features are divided into groups along the channel dimension, and correlation maps are computed among each group to obtain multiple matching cost proposals, which are then packed into a cost volume. Group-wise correlation provides efficient representations for measuring feature similarities and will not lose too much information like full correlation. It also preserves better performance when reducing parameters compared with previous methods. The 3D stacked hourglass network proposed in previous works is improved to boost the performance and decrease the inference computational cost. Experiment results show that our method outperforms previous methods on Scene Flow, KITTI 2012, and KITTI 2015 datasets.

arXiv：https://arxiv.org/abs/1903.04025

github：https://github.com/xy-guo/GwcNet
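
The group-wise correlation step described in the abstract can be sketched with plain nested lists. This is an illustrative, unoptimised version rather than the GwcNet code: the `[channel][row][col]` layout, the zero fill for columns without a valid match, and the function name are assumptions.

```python
def groupwise_correlation_volume(left, right, num_groups, max_disp):
    """Build a cost volume indexed as cost[d][g][row][col].

    left, right: feature maps indexed [channel][row][col] from a rectified pair.
    For each disparity d and channel group g, the cost is the mean over the
    group's channels of left[c][y][x] * right[c][y][x - d].
    """
    channels, height, width = len(left), len(left[0]), len(left[0][0])
    if channels % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    per_group = channels // num_groups

    volume = []
    for d in range(max_disp):
        disp_slice = []
        for g in range(num_groups):
            group = range(g * per_group, (g + 1) * per_group)
            plane = [[0.0] * width for _ in range(height)]
            for y in range(height):
                # Columns x < d have no counterpart in the right image; keep 0.
                for x in range(d, width):
                    plane[y][x] = sum(
                        left[c][y][x] * right[c][y][x - d] for c in group
                    ) / per_group
            disp_slice.append(plane)
        volume.append(disp_slice)
    return volume
```

With `num_groups=1` this reduces to a full-correlation volume with a single value per disparity, and a larger number of groups keeps more of the per-channel matching information.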

**《Refine and Distill: Exploiting Cycle-Inconsistency and Knowledge Distillation for Unsupervised Monocular Depth Estimation》(CVPR 2019)**

Date：20190311

Author：University of Trento (Italy) & Huawei Technologies Ireland

> Abstract：Nowadays, the majority of state of the art monocular depth estimation techniques are based on supervised deep learning models. However, collecting RGB images with associated depth maps is a very time consuming procedure. Therefore, recent works have proposed deep architectures for addressing the monocular depth prediction task as a reconstruction problem, thus avoiding the need of collecting ground-truth depth. Following these works, we propose a novel self-supervised deep model for estimating depth maps. Our framework exploits two main strategies: refinement via cycle-inconsistency and distillation. Specifically, first a *student* network is trained to predict a disparity map such as to recover from a frame in a camera view the associated image in the opposite view. Then, a backward cycle network is applied to the generated image to re-synthesize back the input image, estimating the opposite disparity. A third network exploits the inconsistency between the original and the reconstructed input frame in order to output a refined depth map. Finally, knowledge distillation is exploited, such as to transfer information from the refinement network to the student. Our extensive experimental evaluation demonstrates the effectiveness of the proposed framework which outperforms state of the art unsupervised methods on the KITTI benchmark.

arXiv：https://arxiv.org/abs/1903.04202

# 6D Object Pose Estimation

**《Instance- and Category-level 6D Object Pose Estimation》**

Date：20190311

Author：Imperial College London

> Abstract：6D object pose estimation is an important task that determines the 3D position and 3D rotation of an object in camera-centred coordinates. By utilizing such a task, one can propose promising solutions for various problems related to scene understanding, augmented reality, control and navigation of robotics. Recent developments on visual depth sensors and low-cost availability of depth data significantly facilitate object pose estimation. Using depth information from RGB-D sensors, substantial progress has been made in the last decade by the methods addressing the challenges such as viewpoint variability, occlusion and clutter, and similar looking distractors. Particularly, with the recent advent of convolutional neural networks, RGB-only based solutions have been presented. However, improved results have only been reported for recovering the pose of known instances, i.e., for the instance-level object pose estimation tasks. More recently, state-of-the-art approaches target to solve object pose estimation problem at the level of categories, recovering the 6D pose of unknown instances. To this end, they address the challenges of the category-level tasks such as distribution shift among source and target domains, high intra-class variations, and shape discrepancies between objects.

arXiv：https://arxiv.org/abs/1903.04229

# GAN

**《Video Generation from Single Semantic Label Map》(CVPR 2019)**

Date：20190311

Author：SenseTime

> Abstract：This paper proposes the novel task of video generation conditioned on a SINGLE semantic label map, which provides a good balance between flexibility and quality in the generation process. Different from typical end-to-end approaches, which model both scene content and dynamics in a single step, we propose to decompose this difficult task into two sub-problems. As current image generation methods do better than video generation in terms of detail, we synthesize high quality content by only generating the first frame. Then we animate the scene based on its semantic meaning to obtain the temporally coherent video, giving us excellent results overall. We employ a cVAE for predicting optical flow as a beneficial intermediate step to generate a video sequence conditioned on the initial single frame. A semantic label map is integrated into the flow prediction module to achieve major improvements in the image-to-video generation process. Extensive experiments on the Cityscapes dataset show that our method outperforms all competing methods.

arXiv：https://arxiv.org/abs/1903.04480

github：https://github.com/junting/seg2vid

# Scene Text

**《MTRNet: A Generic Scene Text Eraser》**

Date：20190311

Author：Queensland University of Technology

> Abstract：Text removal algorithms have been proposed for uni-lingual scripts with regular shapes and layouts. However, to the best of our knowledge, a generic text removal method which is able to remove all or user-specified text regions regardless of font, script, language or shape is not available. Developing such a generic text eraser for real scenes is a challenging task, since it inherits all the challenges of multi-lingual and curved text detection and inpainting. To fill this gap, we propose a mask-based text removal network (MTRNet). MTRNet is a conditional adversarial generative network (cGAN) with an auxiliary mask. The introduced auxiliary mask not only makes the cGAN a generic text eraser, but also enables stable training and early convergence on a challenging large-scale synthetic dataset, initially proposed for text detection in real scenes. What's more, MTRNet achieves state-of-the-art results on several real-world datasets including ICDAR 2013, ICDAR 2017 MLT, and CTW1500, without being explicitly trained on this data, outperforming previous state-of-the-art methods trained directly on these datasets.

arXiv：https://arxiv.org/abs/1903.04092

# Fashion Landmark Detection

**《Spatial-Aware Non-Local Attention for Fashion Landmark Detection》**

Date：20190311

Author：Peking University & Xi'an Jiaotong University & JD.com

> Abstract：Fashion landmark detection is a challenging task even using the current deep learning techniques, due to the large variation and non-rigid deformation of clothes. In order to tackle these problems, we propose Spatial-Aware Non-Local (SANL) block, an attentive module in deep neural network which can utilize spatial information while capturing global dependency. Actually, the SANL block is constructed from the non-local block in the residual manner which can learn the spatial related representation by taking a spatial attention map from Grad-CAM. We then establish our fashion landmark detection framework on feature pyramid network, equipped with four SANL blocks in the backbone. It is demonstrated by the experimental results on two large-scale fashion datasets that our proposed fashion landmark detection approach with the SANL blocks outperforms the current state-of-the-art methods considerably. Some supplementary experiments on fine-grained image classification also show the effectiveness of the proposed SANL block.

arXiv：https://arxiv.org/abs/1903.04104

# Ear Recognition

**《The Unconstrained Ear Recognition Challenge 2019》**

Date：20190311

Author：University of Ljubljana et al.

> Abstract：This paper presents a summary of the 2019 Unconstrained Ear Recognition Challenge (UERC), the second in a series of group benchmarking efforts centered around the problem of person recognition from ear images captured in uncontrolled settings. The goal of the challenge is to assess the performance of existing ear recognition techniques on a challenging large-scale ear dataset and to analyze performance of the technology from various viewpoints, such as generalization abilities to unseen data characteristics, sensitivity to rotations, occlusions and image resolution and performance bias on sub-groups of subjects, selected based on demographic criteria, i.e. gender and ethnicity. Research groups from 12 institutions entered the competition and submitted a total of 13 recognition approaches ranging from descriptor-based methods to deep-learning models. The majority of submissions focused on deep learning approaches and hybrid techniques combining hand-crafted and learned image descriptors. Our analysis shows that hybrid and deep-learning-based approaches significantly outperform traditional hand-crafted approaches. We argue that this is a good indicator of where ear recognition will be heading in the future. Furthermore, the results in general improve upon the UERC 2017 and display the steady advancement of the ear recognition.

arXiv：https://arxiv.org/abs/1903.04143

================================================
FILE: 2019-Paper.md
================================================
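[2019-03-12](2019/03/12.md): 10 papers in this issue (including 5 from CVPR 2019), covering object detection, semantic segmentation, and more.

[2019-01-01~01-04](2019/01/01-04.md): 52-paper digest, covering face recognition, image classification, and more.

2019-11-23: [8 recent object detection papers (EfficientDet/EdgeNet/ASFF, etc.)](https://mp.weixin.qq.com/s/qRr0199V1X-E5kTlXTcvig)

2019-12-09: [16 recent object detection papers (ATSS/MnasFPN/SAPD/CSPNet, etc.)](https://mp.weixin.qq.com/s/q_0NntaL04zh5GYPC55oqw)

2019-12-18: [4 recent real-time semantic segmentation papers (MSFNet/LiteSeg/FDDWNet/RGPNet)](https://mp.weixin.qq.com/s/-WD5adiSWOxRIT3nZf6R_Q)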


================================================
FILE: 2020-Paper.md
================================================
- 2020-12-10: [These three object detection papers have just been open-sourced! AutoAssign/Deformable DETR/DeFCN](https://mp.weixin.qq.com/s/nAp6O1KDew7FSYSlcfIJOg)
- 2020-09-24: [ECCV 2020 object detection paper roundup (49 papers)](https://mp.weixin.qq.com/s/SRj6H4pK1BdzHjiDAC_2NA)
- 2020-09-22: [ECCV 2020 open-source paper project collection](https://github.com/amusi/ECCV2020-Code)
- 2020-09-22: [CVPR 2020 open-source paper project collection](https://github.com/amusi/CVPR2020-Code)
- 2020-09-22: [300+ CVPR 2020 papers with open-source code, all in one place!](https://mp.weixin.qq.com/s/6ns4tktWhAbW2Ru_CkWu4Q)
- 2020-06-11: [270 CVPR 2020 papers with open-source code, all in one place!](https://mp.weixin.qq.com/s/9tIrqcJsF_P-4JZag6cCGw)
- 2020-02-21: [7 recent object tracking papers at a glance (ABCTracker/MAST/L1DPF-M, etc.)](https://mp.weixin.qq.com/s/I9Eq3RnIT0XQvWc5GILmVA)
- 2020-02-19: [10 recent object detection papers at a glance (MetaOD/P-RSDet/MatrixNets, etc.)](https://mp.weixin.qq.com/s/x0b73c_5CYCUw4zUaei75g)
- 2020-02-20: [9 recent semantic segmentation papers at a glance (GPSNet/Graph-FCN/HMANet, etc.)](https://mp.weixin.qq.com/s/E687mJnB-y8BSDjT5EhqvQ)

================================================
FILE: 2021-Paper.md
================================================
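- 2021-02-22
  - [A deeper, lighter-weight Transformer! Facebook proposes DeLighT](https://mp.weixin.qq.com/s/WzpCfog3iSqQLlra5CB4pA)
  - [SmartMore (思谋科技) spring campus and experienced-hire referrals](https://mp.weixin.qq.com/s/oi7ZVDtHUZFpL3YvYKjpKw)
- 2021-02-21
  - [Beijing Institute for General Artificial Intelligence is recruiting worldwide (top-tier/leading/young talent and postdocs)](https://mp.weixin.qq.com/s/aQsc_9GJzFPOIeT24GURig)
  - [Why has the Transformer broken into CV and crushed CNNs?](https://mp.weixin.qq.com/s/eSOabk8eVE5j6bOXUGkrgQ)
  - [Big news! Google open-sources an AI model "search engine" that works for both NLP and CV](https://mp.weixin.qq.com/s/IciE5TuVJ-yeqIcNSe6bcA)
- 2021-02-20
  - [I set up a job-hunting group for AI algorithm roles!](https://mp.weixin.qq.com/s/HMgNghCmvJfSsYW7-hpZkw)
  - [Microsoft Research Asia Machine Learning Group is hiring research interns and full-time staff](https://mp.weixin.qq.com/s/vSPv4EbBzsyWqojrmIG0dQ)
  - [OpenCV upgraded again! Change one line of code to improve image matching by 14%!](https://mp.weixin.qq.com/s/Sz6ZWEi6tkP82f5mv0ZgMw)
- 2021-02-19
  - [Pseudo-labels can be used like this? A close look at UPS, a semi-supervised learning work (ICLR 2021)!](https://mp.weixin.qq.com/s/yTvrJ7kEYVJEfGoh07peCA)
  - [Chance of a full-time offer! Alibaba DAMO Academy multimodal understanding group is recruiting research interns](https://mp.weixin.qq.com/s/WaatXxcsZ4mAn1-y1-coZA)
- 2021-02-18
  - [Latest! Global academic ranking released: 21 Chinese universities in the world top 100](https://mp.weixin.qq.com/s/AVSdMgc7Y1lSV8t5neTwQg)
  - [How to do video instance segmentation in occluded scenes? Oxford and Alibaba newly open-source the OVIS dataset!](https://mp.weixin.qq.com/s/1X03y8qWsfaFj7BSkuH7Ug)
  - [Baidu Visual Technology Department is hiring face detection interns](https://mp.weixin.qq.com/s/JeST0MXntfa-yBKQ4G9pTw)
- 2021-02-17
  - [ETHZ Computer Vision Lab is hiring a postdoctoral researcher in medical image analysis](https://mp.weixin.qq.com/s/BIKxE1FTFb2Pjmqf8lBHzw)
  - [A learning path from a beginner without a CS degree who landed a job](https://mp.weixin.qq.com/s/KYp26YsLlwHLBboO8BoXzQ)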
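- 2021-02-16
  - [2021 Fall application memoir (North America CS PhD)](https://mp.weixin.qq.com/s/ElQ4Ncmzc4JFfeyPq5VNOA)
  - [Huawei cloud&AI fellow pre-research team is hiring computer vision research interns](https://mp.weixin.qq.com/s/QA4d-DtItRto374Ur0mrgQ)
  - [FastFormers: 223x inference speedup for Transformers on CPU](https://mp.weixin.qq.com/s/GK0K8EdUkaFTPdFgyDBOvQ)
- 2021-02-15
  - [An overview of deep-learning-based image matching techniques](https://mp.weixin.qq.com/s/q3jKLKpLBSWtdC7b0nG5yw)
  - [Qualcomm (Shanghai) is hiring machine learning research engineers](https://mp.weixin.qq.com/s/4jlTWmKHoY7yeKfgwDDl8Q)
- 2021-02-14
  - [A Valentine's Day "confession generator" is here! This AI makes your idol say sweet nothings to you, and it is almost too real!](https://mp.weixin.qq.com/s/97GKzeC45ZrN29sOv6XEKw)
  - [Vision Transformer makes the list! DeepMind scientist: top 10 AI research advances of 2020](https://mp.weixin.qq.com/s/r02ZBtEySw6TeSx1Z_UM-w)
  - [More accurate caption generation! Columbia & Facebook propose Vx2Text: multimodal fusion, stronger performance!](https://mp.weixin.qq.com/s/DwQTTV8iuDQWYybDIJXwgA)
  - [vivo 2021 spring campus recruitment officially launched!](https://mp.weixin.qq.com/s/VzjsYxjJk6Z0MDvXSATi5A)
- 2021-02-13
  - [The differences between a bachelor's, a master's, and a PhD (ultimate edition)](https://mp.weixin.qq.com/s/5zoivx6aEroIl56fpXCrQg)
  - [The strongest ResNet variant! Goodbye, normalization! DeepMind proposes NFNet, code open-sourced!](https://mp.weixin.qq.com/s/RvcSxw91TJHM52SfbPO-YQ)
  - [OPPO 2021 spring recruitment for new graduates begins! Referral code included](https://mp.weixin.qq.com/s/hRqA6ByCS7ffUbCGnLl2CQ)
  - [How to stabilize video with AI? National Taiwan University and Google propose NeRViS: a full-frame video stabilization algorithm with no cropping](https://mp.weixin.qq.com/s/OgwlA_6i7MRgGng_Zl1EaA)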
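- 2021-02-12
  - [CAS and Microsoft propose RelationNet++: a Transformer-based object detection network](https://mp.weixin.qq.com/s/C2YaMlOOpQXFNSdfpEcapg)
  - [Microsoft 2021 summer internship recruitment begins!](https://mp.weixin.qq.com/s/lPgmuvNLFH44fsUClR4p0A)
- 2021-02-11
  - [Code open-sourced! Far outperforms Transformer! AAAI 2021 Best Paper Informer: the strongest and fastest sequence forecasting tool](https://mp.weixin.qq.com/s/PqfRD8YsKHVVDNRUt0zrmw)
  - [Big news! 21 JD.com papers accepted at the top AI conference AAAI 2021](https://mp.weixin.qq.com/s/iyyhuocIx6SbfQHvmXzTjg)
- 2021-02-10
  - [Transformer takes another stronghold! Leading across all ReID tasks: Alibaba & Zhejiang University propose TransReID](https://mp.weixin.qq.com/s/keSckWmEpIEo8iarL5cdoQ)
  - [Facebook et al. propose a new real-time 3D face pose estimation method, code open-sourced!](https://mp.weixin.qq.com/s/tRMY-fZg9pQlZZRPObC4Xw)
  - [Baidu Research IDL is hiring CV algorithm researchers/senior researchers](https://mp.weixin.qq.com/s/yCYQzZlVDc5MCgjSkBaMog)
- 2021-02-09
  - [ByteDance is hiring CV algorithm interns](https://mp.weixin.qq.com/s/EQullDBEdAPcskk7GLzIkg)
  - [A developer questions whether a Google top-conference CV paper is wrong, and provides reproduction code to prove it](https://mp.weixin.qq.com/s/jPcSjRv8P8Xt8HM47tPbMQ)
- 2021-02-08
  - [The 2021 AI algorithm job-hunting group is here!](https://mp.weixin.qq.com/s/AeBzpHJ1lvYSCBM4iG4dOA)
  - [A generalization booster! Mu Li et al. propose two regularization techniques with large gains in both CV and NLP](https://mp.weixin.qq.com/s/erJXAd6tlDzC1Hl66_cfJA)
  - [2020-2021 CV algorithm internship interview notes (JD.com/SenseTime/SmartMore/Yitu/TuSimple/ByteDance/Tencent)](https://mp.weixin.qq.com/s/zWgySSJb0u4g30qHP7FSlA)
  - [CNN training up close! 360-degree visualization; netizens: unrealistically beautiful](https://mp.weixin.qq.com/s/-96w8C-6Kdpb71BD3CptKA)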
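- 2021-02-07
  - [Trending on GitHub! A visualization tool with 3.6k stars is open-sourced!](https://mp.weixin.qq.com/s/fydJ7ceX_vNykAQpONkJGQ)
  - [DeepMind redesigns a high-performance ResNet! No activation normalization layers needed](https://mp.weixin.qq.com/s/nqlPI2wKiKJ4WHLdjqBmew)
  - [To be held in Chengdu, China! ACM MM 2021 Call for Papers](https://mp.weixin.qq.com/s/D6yHytR0y2DhJ-LkUOucOw)
  - [Alibaba is hiring research interns (video action detection and recognition)](https://mp.weixin.qq.com/s/A_OCYdBXVFikN7ALvUj7Jw)
- 2021-02-06
  - [Far outperforms Transformer! AAAI 2021 Best Paper Informer: the strongest and fastest sequence forecasting tool](https://mp.weixin.qq.com/s/89URM73C8_I6bJrxbHgjpw)
  - [Nanjing University proposes a shuffle attention mechanism! SA-Net: fusing spatial and channel attention](https://mp.weixin.qq.com/s/vXcWm0YIVk9ufQokfLd6cA)
  - [Genki Forest (元气森林) is hiring CV algorithm engineers (internship + experienced hires)](https://mp.weixin.qq.com/s/g0CvlJdf_16xAUf3Wj_N0Q)
- 2021-02-05
  - [All AAAI 2021 awards announced! Chinese researchers dominate! Beihang and HUST alumni win Best Paper, South China University of Technology wins Outstanding Paper](https://mp.weixin.qq.com/s/GbRVtP8ssAKblxpLd48gQQ)
  - [CUHK proposes SMCA: speeding up DETR convergence](https://mp.weixin.qq.com/s/xFCDsJfZQsF2ZFYgU0YOHg)
  - [Baidu Institute of Deep Learning (IDL) is hiring algorithm interns](https://mp.weixin.qq.com/s/srmBnOHj6FF2IXj1qaD1Sg)
- 2021-02-04
  - [Object detection competitions: approaches, a collection of tricks, and a resource roundup](https://mp.weixin.qq.com/s/wO5jo20AFd66bko3aGFUpg)
  - [The new Honor company is hiring AI algorithm experts and engineers (campus/experienced hires)](https://mp.weixin.qq.com/s/EIzeULuoPZ7vugV-FDR9xQ)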
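- 2021-02-03
  - [780,000 in prize money! Tianchi's latest CV competition is here](https://mp.weixin.qq.com/s/gQv9erTcXE08n742YDgH5g)
  - [When the frequency domain (DCT) meets CNN](https://mp.weixin.qq.com/s/7-S_OysxXiDS_kU4kqOfjQ)
  - [ByteDance AI Lab robotics research group is hiring (full-time & internship)](https://mp.weixin.qq.com/s/p5EbatmgKLJaDV2JDARJcA)
  - [Using Transformers instead of MSA to learn contact maps from protein sequences](https://mp.weixin.qq.com/s/IrK9p7BHuANNo5xyJ1T-bQ)
- 2021-02-02
  - [On CVPR 2021 reviewing](https://mp.weixin.qq.com/s/wO4sWonhW93ra--fP0rgbg)
  - [Rethinking the semantic segmentation paradigm: SETR](https://mp.weixin.qq.com/s/csq0E2E6Xf9uLHSHLzybdA)
  - [Explosive! NVIDIA A100 deep learning performance benchmarked: training up to 3.5x faster than V100](https://mp.weixin.qq.com/s/3pRxVGco-0ZTWUHVVZiKWg)
- 2021-02-01
  - [A brief summary of vision Transformers](https://mp.weixin.qq.com/s/E1wSmEB7bKRiS-DqIo6Oqw)
  - [Nanjing University proposes CPD: a video pre-training model based on video-text pair matching](https://mp.weixin.qq.com/s/Hr9r39K_gFT9YvWv1VNhxg)
  - [The government affairs lab of the Shenzhen Research Institute of Big Data is hiring research scientists and data engineers](https://mp.weixin.qq.com/s/zz9tKe9paNpjVrrsPCRSLQ)
- 2021-01-31
  - [Big news! WeChat's QR code engine is open-sourced in OpenCV! 3 lines of code give you WeChat's scanning capability](https://mp.weixin.qq.com/s/7372HCgVSNgoqRgmYBN-zw)
  - [Lingyang Chu of McMaster University's computer science department is recruiting master's and PhD students!](https://mp.weixin.qq.com/s/6fU6GYsThh1-E3panB7dJg)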
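- 2021-01-30
  - [ResNet has been thoroughly surpassed, and the Transformer did it! Yitu open-sources the scalable T2T-ViT; the lightweight version beats MobileNet](https://mp.weixin.qq.com/s/2FkikfslpawtT1YTBUxy_Q)
  - [Prof. Yixuan Yuan of City University of Hong Kong is hiring CV postdocs](https://mp.weixin.qq.com/s/7ZOedcBp3MzdkLoO2d7xUw)
  - [3 secrets of deep learning: ensembles, knowledge distillation, and self-distillation](https://mp.weixin.qq.com/s/YyLTd8B7M4f3hBTybrnUSQ)
- 2021-01-29
  - [No NMS! Alibaba and the University of Adelaide propose PSS: simpler and more effective end-to-end object detection](https://mp.weixin.qq.com/s/rir6WllUNIbc3ynJ84i08w)
  - [A conversation with Dr. Zhengyou Zhang, a Level-17 employee at Tencent: on dreams, growth, and anxiety](https://mp.weixin.qq.com/s/DHvSrHyvYlAgGAe_w1q7-Q)
  - [Stop using arXiv links! To make paper citations more standardized, an SJTU graduate and Chinese PhD at USC built a small tool](https://mp.weixin.qq.com/s/6FN4TR08sIgys3b6ecksbg)
- 2021-01-28
  - [CNN+Transformer! Google proposes BoTNet: a new backbone! 84.7% accuracy on ImageNet!](https://mp.weixin.qq.com/s/oQf5KioEOTG_UvzR3sVCzA)
  - [A 2840-page computer science dissertation! What did this Chinese PhD at UT Austin write?](https://mp.weixin.qq.com/s/wFmUpjKpb9DBbBhy-RVYOw)
  - [Anomaly detection with computer vision](https://mp.weixin.qq.com/s/KJ6eLE693uXw9I9U_opN-w)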
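- 2021-01-27
  - [2021 Microsoft Research PhD Fellowship list announced! Three Chinese PhD students selected, 42,000 USD each](https://mp.weixin.qq.com/s/r3nLeWr6mg2YrbpzYqIppg)
  - [Comparable to LSTM in its day! Transformer sets the AI community ablaze: it can do everything](https://mp.weixin.qq.com/s/0fuECJKMVY65R2ouTzq40g)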
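- 2021-01-26
  - [Topping GitHub trending: another powerful OCR tool arrives!](https://mp.weixin.qq.com/s/4qOx63DbZn0HSKo248M2zQ)
  - [Turing Award winner Yann LeCun's sixty years](https://mp.weixin.qq.com/s/JZohY_7VjZeBTOAkW979pA)
  - [Tencent YouTu is hiring CV algorithm interns](https://mp.weixin.qq.com/s/ys4ELCVMB_LKUhfWhCDUnw)
- 2021-01-25
  - [Venerable Xianchao of Longquan Monastery: using AI to recognize, punctuate, and translate ancient scriptures](https://mp.weixin.qq.com/s/lirORSFfvdWjxyMuYJEfyA)
  - [I used YOLO-V5 to implement pedestrian social-distancing risk alerts, code open-sourced!](https://mp.weixin.qq.com/s/ItlJ8UFHfpZZvqqh3FC-xw)
  - [Alibaba DAMO Academy is hiring (full-time/interns)](https://mp.weixin.qq.com/s/FQjj5ZhLzNewA7uFJ8ShOQ)
- 2021-01-24
  - [Megvii proposes MomentumBN: easing the large-batch requirement of self-supervised learning, with clear gains!](https://mp.weixin.qq.com/s/nj03Y7Zjs2tJsGmDBd8XgA)
  - [CenterFusion: a radar-camera fusion method for 3D object detection, code open-sourced!](https://mp.weixin.qq.com/s/fQUxgwK-E72IQvpZuTjmkQ)
- 2021-01-23
  - [Topped GitHub trending twice in a row! This AI development tool is truly powerful, with a GUI and a huge set of algorithms!](https://mp.weixin.qq.com/s/0Gk11vqhY_UL9rtryAcsQw)
  - [The 100 most influential people in AI! Hinton ranks only 31st, and Bengio is not even on the list?](https://mp.weixin.qq.com/s/2xTRpOS7kMzmhbQ-eXF3vQ)
  - [ByteDance is hiring CV algorithm interns](https://mp.weixin.qq.com/s/krbNXaARh81KwVBAlx-__A)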
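- 2021-01-22
  - [A 2021 summary of domestic and overseas PhD applications from someone with an ordinary background](https://mp.weixin.qq.com/s/ga1qUE7DdunCQw40tL3Svw)
  - [NeurIPS 2021 | An object detector based on fine-grained dynamic networks](https://mp.weixin.qq.com/s/fSGT9VEWFMTuLF88FsAj5Q)
- 2021-01-21
  - [CVPR 2021 reviews are out! Dismal scores; how should you respond to bizarre reviews?](https://mp.weixin.qq.com/s/qBdZ48GwsIRH-FPkvr8BCw)
  - [A performance booster! Re-labeling ImageNet gives CNNs clear gains! Code open-sourced](https://mp.weixin.qq.com/s/Gh6ofA3XOFT5Of7eUnD6cw)
  - [SenseTime Research is hiring! Internships, campus hires, and experienced hires all available](https://mp.weixin.qq.com/s/XhDxrTvv2SDtCBgGFgdpCA)
- 2021-01-20
  - [A Berkeley star lands 16 papers alone! ICLR 2021 paper acceptance statistics released](https://mp.weixin.qq.com/s/1M0J2zkqwhEb1F1ncZ3JOg)
  - [Fully upgraded! FastReID V1.0 officially open-sourced: Beyond reID](https://mp.weixin.qq.com/s/NSoSRpxgVVnn0NzeDGnwig)
  - [ICLR 2021 | Self-explaining neural networks: Shapley Explanation Networks](https://mp.weixin.qq.com/s/yuDNGBihMOrXDyetxbzYwQ)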
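- 2021-01-19
  - [Top journal TPAMI 2021 | Data augmentation just by changing the loss function?](https://mp.weixin.qq.com/s/x90x9zJ4V7TzNjD2UrrQgQ)
  - [Alibaba ICBU technology department is hiring](https://mp.weixin.qq.com/s/ItYUBeOsBsm7KRPr_5F46g)
- 2021-01-18
  - [An overview of defect detection methods in one article](https://mp.weixin.qq.com/s/lpNSgrQOFtSeeSK7IIJBMQ)
  - [AAAI 2021 3D object detection paper roundup (CIA-SSD/Voxel R-CNN, etc.)](https://mp.weixin.qq.com/s/sXJI5MBsHL4IiAOXhhYYcQ)
  - [Cornell University's requirements for PhD students](https://mp.weixin.qq.com/s/sBSK_jlNfx24qz-rgWAIbg)
- 2021-01-17
  - [Speed up PyTorch training! Master these 17 methods to save time and effort!](https://mp.weixin.qq.com/s/MiNGHhcY7qFScRBwUKirkQ)
  - [Top journal TPAMI 2021 | Unsupervised multi-class domain adaptation in one article: theory, algorithms, and practice](https://mp.weixin.qq.com/s/FlBC0AZ7rP9mTIX0Vkv5NQ)
  - [Model ensembling methods for object detection, with experiments](https://mp.weixin.qq.com/s/R_qWBAOAC5vmOOB0KhMDRg)
- 2021-01-16
  - [Sharing experience on academic paper submission and rebuttal](https://mp.weixin.qq.com/s/qbW33ff-gV815rSHsHgOoA)
  - [AAAI 2021 object detection paper roundup (YOLObile/R3Det/StarNet, etc.)](https://mp.weixin.qq.com/s/XIZDK-hMZrvdHlZG5Bf9AQ)
  - [9 lines of code to improve few-shot learning generalization! ICLR2021 Oral | Estimating a class's data distribution from a single sample](https://mp.weixin.qq.com/s/C1V57cVBHPnTAh2dFSzXUA)
- 2021-01-15
  - [An Alibaba CV interviewer's views on hiring](https://mp.weixin.qq.com/s/0q5vc8cIi17F5tNipteLRQ)
  - [ICLR 2021 | SEED: self-supervised distillation that significantly boosts small-model performance!](https://mp.weixin.qq.com/s/3u3JvPwAkgcduVH_krikAg)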
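- 2021-01-14
  - [What I learned in 10 years on a top team at Alibaba](https://mp.weixin.qq.com/s/hu2obo45uIduPpJ_mKJzmg)
  - [80GB medical imaging dataset released! OCTA-500 available for public download](https://mp.weixin.qq.com/s/UO0l3dbBkB8mN8X9ag7ejw)
  - [2021 First International Challenge on Intelligent Perception of Marine Targets: champion solution shared](https://mp.weixin.qq.com/s/uUIJBxM0PATHSRxDbWbyTg)
- 2021-01-13
  - [Tsinghua & Megvii propose RepVGG: convolutions all the way through your CNN!](https://mp.weixin.qq.com/s/taVab243g5GY8zjjPYUo2A)
  - [SenseTime is hiring object detection and object tracking algorithm interns](https://mp.weixin.qq.com/s/hvN_eSxJeS-wrySIhr1bIg)
  - [Precisely generated fake faces! Amazon's new GAN model gives you all-around, flawless beautification](https://mp.weixin.qq.com/s/l-eShMV9Avcz1TPH61t5hw)
- 2021-01-12
  - [Facebook proposes a bag of tricks for panoptic segmentation gains! New training, data sampling, and augmentation strategies, with strong cross-scale generalization](https://mp.weixin.qq.com/s/wJSKs3lsDZhIqYVNLd59Gg)
  - [TPAMI 2021 | Deep learning for person re-identification: survey and outlook](https://mp.weixin.qq.com/s/-iBKS-q7QiTbOsxLAwrktQ)
- 2021-01-11
  - [A performance booster! SoftPool: a new pooling method that helps you take off!](https://mp.weixin.qq.com/s/g_1oUfCGoYFB1ydnZ6mDqw)
  - [ICLR top-scoring paper sheds light on model generalization; GNNs show great potential!](https://mp.weixin.qq.com/s/zVpPBPf8HGuz3IbpQ5okVA)
- 2021-01-10
  - [Some thoughts on AI algorithm positions](https://mp.weixin.qq.com/s/vG-l9zdmdDJ8nVJo6HIKDw)
  - [Huawei cloud&AI is hiring computer vision research interns](https://mp.weixin.qq.com/s/YDq7Jmd3pvZahT-ludqu4w)
- 2021-01-09
  - [Tencent's first Level-17 Distinguished Scientist is officially named! Dr. Zhengyou Zhang, head of Tencent AI Lab, receives the honor](https://mp.weixin.qq.com/s/piV6sYkvhrJnDEO9EtLPnw)
  - [Defect detection competition: Top-3 solutions shared](https://mp.weixin.qq.com/s/qqv12LyAvM5m8nAoe2ELrw)
  - [SenseTime SCG is hiring algorithm interns](https://mp.weixin.qq.com/s/7h6YRrPjHZWPNqhiVfcjvQ)
  - [Best Paper! SenseTime proposes a real-time monocular 3D reconstruction system for mobile phones | ISMAR 2021](https://mp.weixin.qq.com/s/IJ7c9B6Nq3qqD2Opsk77Fw)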

================================================
FILE: 2023-Paper.md
================================================
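- 2023-07-07
  - [Asked in an interview: hand-code a Transformer...](https://mp.weixin.qq.com/s/4Yg3Vj2p0PFpfqJyjf3BMA)
  - [First survey! Open vocabulary learning: a comprehensive review](https://mp.weixin.qq.com/s/StAv8FqQQj5HEnCZlu7V2A)
  - [Explosive! Microsoft's new work LongNet: scaling Transformers to 1 billion tokens](https://mp.weixin.qq.com/s/HryMQN75QZH90ClcBrm56g)
- 2023-07-06
  - [New work from Jun Zhu's team at Tsinghua! Training Transformers with 4-bit integers, 35.1% speedup!](https://mp.weixin.qq.com/s/78gsgYlzAxk-1_xp99ccdA)
  - [Nature sub-journal! SJTU & Shanghai AI Lab propose a foundation model for chest X-ray disease diagnosis](https://mp.weixin.qq.com/s/IOKc4-tyf_uVmQbQnSujWw)
  - [Huawei's large model published in Nature! Reviewer: it makes people re-examine the future of forecasting models](https://mp.weixin.qq.com/s/jkCWwZFsTr0XiXyFxCpanA)
- 2023-07-05
  - [USTC and Tencent release the first "Survey on Multimodal Large Language Models"](https://mp.weixin.qq.com/s/VVCfIbWNqkyGb9QXDuAaeg)
  - [CUHK and SenseTime propose HPS v2: a more reliable evaluation metric for text-to-image models](https://mp.weixin.qq.com/s/ndRaGn5dszcADjFbPzolRQ)
- 2023-07-03
  - [RSPrompter: a powerful tool for remote sensing image instance segmentation! Automatic segmentation based on SAM!](https://mp.weixin.qq.com/s/8dtTS0ByGxYTLAOnie9Qdg)
- 2023-07-02
  - [Asked in an interview: hand-code NMS in C++...](https://mp.weixin.qq.com/s/m0iBEXY2jGLCtb7sOL5xgw)
  - [Solving vision tasks by "eloquence" alone! SenseTime proposes Shikra: a next-generation multimodal large model](https://mp.weixin.qq.com/s/QqNy2Mqc_OL4uZa6SAZWkg)
  - [CVPR 2023 | NVIDIA proposes BundleSDF: 6D tracking and 3D reconstruction of unknown objects](https://mp.weixin.qq.com/s/0WnqmiK7CTBC7wH0oJ7xPA)
- 2023-07-01
  - [CVPR 2023 | Tsinghua University proposes GAM: a generalizable first-order flatness optimizer](https://mp.weixin.qq.com/s/JkQM3RzqvNiLPgJJXuv8zw)
  - [The ImageNet of the AIGC era! A million generated images to help develop AI-generated image detectors](https://mp.weixin.qq.com/s/64Fpten46cZcpqJ-mdkWMA)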

================================================
FILE: README.md
================================================
# daily-paper-computer-vision
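**A daily record of curated papers in computer vision, deep learning, machine learning, and related areas**

- [CV Quality Paper Digest](#PaperDaily)
- [CV Top Conferences/Journals (2017-2023)](#TopPaper)

<a name="PaperDaily"></a>

## CV Quality Paper Digest

- [2023 (updated daily)](2023-Paper.md)

To make the content easier to archive and search, daily updates of **quality CV/AI papers, projects, and applications** are now posted in [【CVer计算机视觉】](https://github.com/amusi/CVPR2023-Papers-with-Code/blob/master/CVer%E5%AD%A6%E6%9C%AF%E4%BA%A4%E6%B5%81%E7%BE%A4.png). All CVers are welcome to join! Let's learn from each other and improve together~

[【CVer计算机视觉】](https://github.com/amusi/CVPR2023-Papers-with-Code/blob/master/CVer%E5%AD%A6%E6%9C%AF%E4%BA%A4%E6%B5%81%E7%BE%A4.png) is the largest computer vision AI community on Knowledge Planet (知识星球)! Updated daily! Topics shared as soon as they appear include: object detection, semantic segmentation, object tracking, Transformer, multimodal learning, large models, NeRF, diffusion models, depth estimation, super-resolution, 3D object detection, CNN, GAN, competition solutions, face recognition, data augmentation, face detection, datasets, NAS, AutoML, image segmentation, SLAM, instance segmentation, human pose estimation, video object segmentation, Re-ID, medical image segmentation, salient object detection, autonomous driving, crowd density estimation, PyTorch, faces, lane detection, dehazing, panoptic segmentation, pedestrian detection, text detection, OCR, 6D pose estimation, edge detection, scene text detection, video instance segmentation, 3D point clouds, model compression, face alignment, denoising, reinforcement learning, action recognition, OpenCV, scene text recognition, deraining, machine learning, style transfer, video object detection, deblurring, saliency detection, pruning, liveness detection, facial landmark detection, 3D object tracking, video restoration, facial expression recognition, temporal action detection, image retrieval, anomaly detection, and more.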

![CVer academic exchange group](./CVer学术交流群.png)

<a name="TopPaper"></a>

## CV Top Conferences/Journals

### 2023

**CVPR 2023**

- Paper list: https://openaccess.thecvf.com/CVPR2023?day=all
- Papers and code: https://github.com/amusi/CVPR2023-Papers-with-Code

**IJCAI 2023**

- Paper list: https://ijcai-23.org/main-track-accepted-papers/

**ICLR 2023**

- Paper list: https://openreview.net/group?id=ICLR.cc/2023/Conference#notable-top-5-

### 2022

**NIPS 2022**

- Paper list: https://nips.cc/Conferences/2022/Schedule?type=Poster and https://openreview.net/group?id=NeurIPS.cc/2022/Conference

**CVPR 2022**

- Paper list: https://openaccess.thecvf.com/CVPR2022?day=all
- Papers and code: https://github.com/amusi/CVPR2023-Papers-with-Code/blob/master/CVPR2022-Papers-with-Code.md

**ECCV 2022**

- Paper list: https://www.ecva.net/papers.php and https://eccv2022.ecva.net/program/accepted-papers/
- Papers and code: https://github.com/amusi/ECCV2022-Papers-with-Code

**ACM MM 2022**

- Paper list: https://2022.acmmm.org/accepted-papers/

**WACV 2022**

- Paper list: https://openaccess.thecvf.com/WACV2023

**MICCAI 2022**

- Paper list: https://conferences.miccai.org/2022/papers/ and https://link.springer.com/book/10.1007/978-3-031-16431-6

**AAAI 2022**

- Paper list: https://aaai-2022.virtualchair.net/papers.html?filter=keywords&search=Poster+Session+12&cluster=Red+3

**ICLR 2022**

- Paper list: https://openreview.net/group?id=ICLR.cc/2022/Conference#oral-submissions

### 2021

**ICLR 2021**

- Paper list: https://docs.google.com/spreadsheets/d/1n58O0lgGI5kI0QQY9f4BDDpNB4oFjb5D51yMr9fHAK4/edit#gid=1546418007
- OpenReview data: https://github.com/evanzd/ICLR2021-OpenReviewData
- [ICLR 2021 Stats & Graphs](https://github.com/sharonzhou/ICLR2021-Stats)

**AAAI 2021**

- Paper list: https://aaai.org/Conferences/AAAI-21/wp-content/uploads/2020/12/AAAI-21_Accepted-Paper-List.Main_.Technical.Track_.pdf

**WACV 2021**

- Paper list: http://wacv2021.thecvf.com/program

### 2020

**CVPR 2020**

- [List of all accepted CVPR 2020 papers](http://openaccess.thecvf.com/CVPR2020.py)
- CVPR 2020 paper PDF download (1467 papers): [Baidu Cloud link](https://pan.baidu.com/s/1DoPNWXpwEkzQdPOrLsO21w) password: te6h
- [CVPR 2020 open-source paper code collection](https://github.com/amusi/CVPR2020-Code)

**ECCV 2020**

- [ECCV 2020 open-source paper code collection](https://github.com/amusi/ECCV2020-Code)

**NIPS 2020**

- Paper collection: https://neurips.cc/Conferences/2020/AcceptedPapersInitial
- Paper collection with code: https://www.paperdigest.org/2020/11/neurips-2020-papers-with-code-data/

**ACM MM 2020**

- Paper collection: https://dblp.org/db/conf/mm/mm2020.html
- Paper collection: https://2020.acmmm.org/main-track-list.html

**MICCAI 2020**

- Paper collection: https://drive.google.com/drive/folders/1GDKe2raJf4ylWqb1jxGmnsR384kmjYBb?usp=sharing

### 2019

**CVPR 2019**

- [List of all accepted CVPR 2019 papers](<http://openaccess.thecvf.com/CVPR2019.py>)
- CVPR 2019 paper PDF download (1294 papers): [Baidu Cloud link](https://pan.baidu.com/s/19ef0HOz4hduDpcEK2PY9Kw) password: mwgv
- [CVPR 2019 open-source code collection](<https://github.com/amusi/CVPR2019-Code>)

**ICCV 2019**

- [List of all accepted ICCV 2019 papers](<http://openaccess.thecvf.com/ICCV2019.py>)
- ICCV 2019 paper PDF download (1075 papers): [Baidu Cloud link](https://pan.baidu.com/s/1snDhED1Y-6qbV1ImQoYIPA) password: h7c2

**NeurIPS 2019**

- NeurIPS 2019 accepted paper list (1427 papers): [Baidu Cloud link](https://pan.baidu.com/s/1TxD263qqXmja3fBZVwtP3g) password: 04wn

**IJCAI 2019**

- List of all accepted IJCAI 2019 papers (847 papers): [Baidu Cloud link](https://pan.baidu.com/s/1mVEowSZLBcz3X-_CZt7svA) password: v6ps

### 2018

**CVPR 2018**

- [List of all accepted CVPR 2018 papers](2018/cvpr2018-paper-list.csv)
- CVPR 2018 paper PDF download (979 papers): [Baidu Cloud link](https://pan.baidu.com/s/1lYEM_kkw1PWTkQzUvjG2pw) password: 6pgk

**ECCV 2018**

- [List of all accepted ECCV 2018 papers](http://openaccess.thecvf.com/ECCV2018.py)
- ECCV 2018 paper PDF download: [Baidu Cloud link](https://pan.baidu.com/s/1Mg0Kw9bepUK6_vqqVSOjNQ) password: mh97

### 2017

**CVPR 2017**

- CVPR 2017 paper PDF download: [Baidu Cloud link](https://pan.baidu.com/s/1RP1wQBFxs8BT0KBLiukxBw) password: hnzg