Learning Deep Representation for Face Alignment with Auxiliary Attributes
IEEE Transactions on Pattern Analysis and Machine Intelligence

About

Authors
Zhanpeng Zhang, Ping Luo, Chen Change Loy, Xiaoou Tang
Year
2015
DOI
10.1109/TPAMI.2015.2469286
Subject
Computational Theory and Mathematics / Software / Applied Mathematics / Artificial Intelligence / Computer Vision and Pattern Recognition

Similar

Learning deep representations via extreme learning machines

Authors:
Wenchao Yu, Fuzhen Zhuang, Qing He, Zhongzhi Shi
2015

Face recognition: a novel deep learning approach

Authors:
Sh. Ch. Pang, Zh. Zh. Yu
2015

Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations

Authors:
Honglak Lee, Roger Grosse, Rajesh Ranganath, Andrew Y. Ng
2009

PREVENTION OF SERUM HEPATITIS

Authors:
Torben K. With
1960

EARLY DETECTION OF LEAD TOXICITY

Authors:
T. With
1970

Text




Learning Deep Representation for Face Alignment with Auxiliary Attributes

Zhanpeng Zhang, Ping Luo, Chen Change Loy, Member, IEEE and Xiaoou Tang, Fellow, IEEE

Abstract—In this study, we show that the face alignment task, i.e., facial landmark detection, is not a single and independent problem. Instead, its robustness can be greatly improved with auxiliary information. Specifically, we jointly optimize landmark detection together with the recognition of heterogeneous but subtly correlated facial attributes, such as gender, expression, and appearance attributes. This is non-trivial, since different attribute inference tasks have different learning difficulties and convergence rates. To address this problem, we formulate a novel tasks-constrained deep model, which not only learns the inter-task correlation but also employs dynamic task coefficients to facilitate the optimization convergence when learning multiple complex tasks. Extensive evaluations show that the proposed tasks-constrained learning (i) outperforms existing face alignment methods, especially in dealing with faces with severe occlusion and pose variation, and (ii) reduces model complexity drastically compared to state-of-the-art methods based on cascaded deep models.

Index Terms—Face Alignment, Face Landmark Detection, Deep Learning, Convolutional Network

1 INTRODUCTION

Face alignment, or detecting semantic facial landmarks (e.g., eyes, nose, mouth corners) is a fundamental component in many face analysis tasks, such as facial attribute inference [1], face verification [2], [3], and face recognition [4]. Though great strides have been made in this field (see Sec. 2), robust facial landmark detection remains a formidable challenge in the presence of partial occlusion and large head pose variations (Fig. 1).

Landmark detection is traditionally approached as a single and independent problem. Popular approaches include template fitting methods [5], [6], [7], [8] and regression-based methods [9], [10], [11], [12], [13]. More recently, deep models have also been applied to this problem. For example, Sun et al. [14] propose to detect facial landmarks by coarse-to-fine regression using a cascade of deep convolutional neural networks (CNN). This method shows superior accuracy compared to previous methods [10], [15] and existing commercial systems. Nevertheless, it requires a complex and unwieldy cascaded architecture of deep models.

We believe that facial landmark detection is not a standalone problem; rather, its estimation can be influenced by a number of heterogeneous and subtly correlated factors. Changes on a face are often governed by the same rules determined by the intrinsic facial structure. For instance, when a kid is smiling, his mouth is widely opened (the second image in Fig. 1(a)). Effectively discovering and exploiting such an intrinsically correlated facial attribute would help to detect the mouth corners more accurately. Also, the inter-ocular distance is smaller in faces with large yaw rotation (the first image in Fig. 1(a)). Such pose information can be leveraged as an additional source of information to constrain the solution space of landmark estimation. Indeed, the input and solution spaces of face alignment can be effectively divided given auxiliary face attributes. In a small experiment, we average a set of face images grouped by different attributes, as shown in Fig. 1(b): the mean image of frontal, smiling faces clearly shows the mouth corners, while the image averaged over the whole dataset shows no such specific details. Given the rich auxiliary information, treating facial landmark detection in isolation is counterproductive.

• The authors are with the Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong. E-mail: {zz013, lp011, ccloy, xtang}@ie.cuhk.edu.hk
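The averaging experiment above is straightforward to reproduce. Below is a minimal sketch in NumPy, assuming aligned, equally sized grayscale face crops and a dictionary of binary attribute masks; the names (mean_face_by_attribute, attrs) are illustrative and not part of the paper's code.

```python
import numpy as np

def mean_face_by_attribute(images, attributes, attribute_name):
    """Average aligned face crops that share a given binary attribute.

    images:     array of shape (N, H, W), grayscale face crops of equal size
    attributes: dict mapping attribute name -> boolean mask of shape (N,)
    Returns the pixel-wise mean image over the selected subset.
    """
    mask = attributes[attribute_name]
    subset = images[mask]
    if subset.size == 0:
        raise ValueError("no images carry attribute %r" % attribute_name)
    return subset.mean(axis=0)

# Example: compare the mean over all faces with the mean over smiling faces;
# the latter preserves sharper mouth-corner structure, as in Fig. 1(b).
# mean_all     = images.mean(axis=0)
# mean_smiling = mean_face_by_attribute(images, attrs, "smiling")
```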

This study aims to investigate the possibility of optimizing facial landmark detection (the main task) by leveraging auxiliary information from attribute inference tasks. Potential auxiliary tasks include head pose estimation, gender classification, age estimation [16], facial expression recognition, or facial attribute inference [17]. Given the multiple tasks, a deep convolutional network appears to be a viable model choice, since it allows for joint feature learning and multi-objective inference. Typically, one can formulate a cost function that encompasses all the tasks and use it to drive back-propagation learning of the network.
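To make this conventional joint objective concrete, the sketch below combines a least-squares loss on the landmark coordinates (the main task) with weighted cross-entropy losses on the auxiliary attribute tasks. It is a minimal single-sample illustration in NumPy under assumed names (joint_cost, aux_logits, aux_weights), not the exact formulation used in this paper.

```python
import numpy as np

def joint_cost(pred_landmarks, true_landmarks,
               aux_logits, aux_labels, aux_weights):
    """Combined objective over a shared representation (single sample).

    Main task: least-squares loss on the (x, y) landmark coordinates.
    Auxiliary tasks: cross-entropy losses on attribute predictions, each
    scaled by a task weight and summed into one scalar that can be
    back-propagated through the shared network.
    """
    # Landmark regression term (main task).
    cost = 0.5 * np.sum((pred_landmarks - true_landmarks) ** 2)

    # One weighted cross-entropy term per auxiliary attribute task.
    for name, logits in aux_logits.items():
        probs = np.exp(logits - logits.max())   # stable softmax
        probs /= probs.sum()
        cost += aux_weights[name] * -np.log(probs[aux_labels[name]])
    return cost
```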

We show that this conventional multi-task learning scheme is challenging in our problem, for several reasons. First, the tasks of face alignment and attribute inference are inherently different in learning difficulty. For instance, learning to identify the “wearing glasses” attribute is easier than determining if one is smiling. Second, we rarely have auxiliary tasks with a similar number of positive/negative cases. For instance, male/female classification enjoys more balanced samples than facial expression recognition. As a result, different tasks have different convergence rates.
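One simple way to realize the dynamic task coefficients mentioned in the abstract is to decay an auxiliary task's weight once its validation error stops improving, so that a task which has converged (or begun to overfit) contributes less to the shared gradient. The heuristic below is only an illustrative sketch, not the paper's actual update rule; update_task_coefficient, the window size, and the 1% improvement scale are assumptions.

```python
import numpy as np

def update_task_coefficient(val_errors, window=5, floor=0.0):
    """Heuristic schedule for one auxiliary task's coefficient.

    val_errors: list of the task's validation errors, one per epoch.
    The coefficient decays toward `floor` once the recent validation
    error stops improving relative to the preceding window.
    """
    if len(val_errors) < 2 * window:
        return 1.0                      # not enough history: full weight
    recent = np.mean(val_errors[-window:])
    earlier = np.mean(val_errors[-2 * window:-window])
    improvement = (earlier - recent) / max(earlier, 1e-12)
    # Map relative improvement to a weight in [floor, 1]; roughly 1%
    # improvement per window keeps the task at full weight.
    return float(np.clip(improvement / 0.01, floor, 1.0))
```

In the joint cost sketched earlier, each auxiliary cross-entropy term would simply be multiplied by its current coefficient at every epoch.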
