doi: 10.3934/mfc.2022018
Online First

Online First articles are published articles within a journal that have not yet been assigned to a formal issue. This means they do not yet have a volume number, issue number, or page numbers assigned to them; however, they can still be found and cited using their DOI (Digital Object Identifier). Online First publication benefits the research community by making new scientific discoveries known as quickly as possible.

Readers can access Online First articles via the “Online First” tab for the selected journal.

Expression recognition method combining convolutional features and Transformer

1. School of Electronic Information Engineering, Beihang University, China
2. Elite Digital Technology Co., Beijing, China

*Corresponding author: Xiaoning Zhu

Received: December 2021. Revised: March 2022. Early access: June 2022.

Expression recognition has long been an important research direction in psychology: human feelings are conveyed through the muscles around the mouth, eyes, and face, and recognizing them has applications in traffic, medical, security, and criminal investigation settings. Most existing work uses convolutional neural networks (CNNs) to recognize face images and thereby classify expressions, and this does achieve good results, but CNNs are limited in their ability to extract global features. The Transformer is strong at global feature extraction, but it is computationally expensive and requires a large amount of training data. In this paper, we therefore adopt a hierarchical Transformer, the Swin Transformer, for the expression recognition task, which greatly reduces the computational cost. We further fuse it with a CNN model and propose a network architecture that combines the Transformer and CNN; to the best of our knowledge, we are the first to combine the Swin Transformer with a CNN for an expression recognition task. We evaluate the proposed method on several publicly available expression datasets and obtain competitive results.
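The reduced computation comes from Swin's window-based self-attention, which replaces the quadratic cost of global attention with a cost linear in image size. A minimal sketch of the two FLOP estimates, following the complexity formulas in the Swin Transformer paper [11] (the feature-map size, channel count, and window size below are illustrative, not this paper's exact configuration):

```python
def attention_flops(h: int, w: int, c: int) -> int:
    """Approximate FLOPs of global self-attention over an h*w feature map
    with c channels: 4*h*w*c^2 for projections + 2*(h*w)^2*c for attention."""
    return 4 * h * w * c * c + 2 * (h * w) ** 2 * c

def window_attention_flops(h: int, w: int, c: int, m: int) -> int:
    """Approximate FLOPs of Swin-style window attention with m*m local
    windows: the attention term shrinks from (h*w)^2 to m^2*h*w."""
    return 4 * h * w * c * c + 2 * m * m * h * w * c

# Example: a 56x56 feature map with 96 channels and 7x7 windows.
global_cost = attention_flops(56, 56, 96)
window_cost = window_attention_flops(56, 56, 96, 7)
print(global_cost / window_cost)
```

At this size the windowed variant is roughly an order of magnitude cheaper, and the gap widens with resolution, since the global term grows as (hw)^2 while the windowed term grows only linearly in hw.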

Citation: Xiaoning Zhu, Zhongyi Li, Jian Sun. Expression recognition method combining convolutional features and Transformer. Mathematical Foundations of Computing, doi: 10.3934/mfc.2022018
References:
[1] T. Pang and A. Hussain, Constants across cultures in the face and emotion, Journal of Personality and Social Psychology, 17 (1971).
[2] X. S. Wei, C. L. Zhang, H. Zhang and J. Wu, Deep bimodal regression of apparent personality traits from short video sequences, IEEE Transactions on Affective Computing, 9 (2017), 303-315.
[3] X. S. Wei, Y. Z. Song, O. Mac Aodha, J. X. Wu, Y. Peng, J. Tang, J. Yang and S. Belongie, Fine-grained image analysis with deep learning: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, (2021).
[4] K. He, X. Zhang, S. Ren and J. Sun, Deep residual learning for image recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition, (2016), 770-778.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems, (2017).
[6] T. Ma, M. Mao, H. Zheng, P. Gao, X. Wang, S. Han, E. Ding, B. Zhang and D. Doermann, Oriented object detection with transformer, preprint, (2021), arXiv: 2106.03146.
[7] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit and N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, preprint, (2020), arXiv: 2010.11929.
[8] C. Sun, A. Shrivastava, S. Singh and A. Gupta, Revisiting unreasonable effectiveness of data in deep learning era, Proceedings of the IEEE International Conference on Computer Vision, (2017).
[9] W. Wang, E. Xie, X. Li, D. P. Fan, K. Song, D. Liang, T. Lu, P. Luo and L. Shao, Pyramid vision transformer: A versatile backbone for dense prediction without convolutions, preprint, (2021), arXiv: 2102.12122. doi: 10.1109/ICCV48922.2021.00061.
[10] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan and L. Zhang, CvT: Introducing convolutions to vision transformers, preprint, (2021), arXiv: 2103.15808. doi: 10.1109/ICCV48922.2021.00009.
[11] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin and B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, preprint, (2021), arXiv: 2103.14030. doi: 10.1109/ICCV48922.2021.00986.
[12] T. Baltrusaitis, M. Mahmoud and P. Robinson, Cross-dataset learning and person-specific normalisation for automatic Action Unit detection, IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, (2015), 1-6.
[13] C. Shan, S. Gong and P. W. McOwan, Facial expression recognition based on Local Binary Patterns: A comprehensive study, Image and Vision Computing, 27 (2009), 803-816.
[14] B. Jiang, B. Martinez, M. F. Valstar and M. Pantic, Decision level fusion of domain specific regions for facial action recognition, 2014 22nd International Conference on Pattern Recognition, (2014), 1776-1781.
[15] B. Fasel, Robust face analysis using convolutional neural networks, International Conference on Pattern Recognition, (2002).
[16] C. Pramerdorfer and M. Kampel, Facial expression recognition using convolutional neural networks: State of the art, preprint, (2016), arXiv: 1612.02903.
[17] A. Krizhevsky, I. Sutskever and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, 25 (2012), 1097-1105.
[18] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, preprint, (2014), arXiv: 1409.1556.
[19] A. F. Agarap, Deep learning using rectified linear units (ReLU), preprint, (2018), arXiv: 1803.08375.
[20] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, International Conference on Machine Learning, (2015), 448-456.
[21] B. Li and D. Lima, Facial expression recognition via ResNet-50, International Journal of Cognitive Computing in Engineering, 2 (2021), 57-64.
[22] D. Orozco, C. Lee, Y. Arabadzhi and D. Gupta, Transfer learning for facial expression recognition, Florida State Univ.: Tallahassee, (2018).
[23] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar and I. Matthews, The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, (2010).
[24] M. Lyons, S. Akamatsu, M. Kamachi and J. Gyoba, Coding facial expressions with Gabor wavelets, Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, (1998), 200-205.
[25] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li and F.-F. Li, ImageNet: A large-scale hierarchical image database, 2009 IEEE Conference on Computer Vision and Pattern Recognition, (2009).
[26] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov and S. Zagoruyko, End-to-end object detection with transformers, European Conference on Computer Vision, (2020).
[27] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles and H. Jegou, Training data-efficient image transformers and distillation through attention, International Conference on Machine Learning, (2021).
[28] B. Sun, L. Li, G. Zhou, X. Wu, J. He, L. Yu, D. Li and Q. Wei, Combining multimodal features within a fusion network for emotion recognition in the wild, Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, (2015), 497-502.
[29] R. Girshick, J. Donahue, T. Darrell and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, IEEE Computer Society, (2013).
[30] J. Li, D. Zhang, J. Zhang, J. Zhang, T. Li, Y. Xia, Q. Yan and L. Xun, Facial expression recognition with Faster R-CNN, Procedia Computer Science, 107 (2017), 135-140.
[31] S. Ren, K. He, R. Girshick and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 39 (2017), 1137-1149.
[32] A. Mollahosseini, D. Chan and M. H. Mahoor, Going deeper in facial expression recognition using deep neural networks, 2016 IEEE Winter Conference on Applications of Computer Vision, (2016).
[33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, Going deeper with convolutions, IEEE Computer Society, (2014).
[34] G. E. Hinton, S. Osindero and Y. W. Teh, A fast learning algorithm for deep belief nets, Neural Computation, 18 (2006), 1527-1554. doi: 10.1162/neco.2006.18.7.1527.
[35] G. E. Hinton and R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, 313 (2006), 504-507. doi: 10.1126/science.1127647.
[36] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, 9 (1997), 1735-1780.
[37] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio, Generative adversarial nets, Advances in Neural Information Processing Systems, (2014), 2672-2680.
[38] Z. Peng, W. Huang, S. Gu, L. Xie, Y. Wang, J. Jiao and Q. Ye, Conformer: Local features coupling global representations for visual recognition, Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), 367-376.
[39] Y. Chen, X. Dai, D. Chen, M. Liu, X. Dong, L. Yuan and Z. Liu, Mobile-Former: Bridging MobileNet and transformer, preprint, (2021), arXiv: 2108.05895.
[40] A. Stergiou, R. Poppe and G. Kalliatakis, Refining activation downsampling with SoftPool, Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), 10357-10366.
[41] R. Müller, S. Kornblith and G. E. Hinton, When does label smoothing help?, preprint, (2019), arXiv: 1906.02629.
[42] E. Barsoum, C. Zhang, C. C. Ferrer and Z. Zhang, Training deep networks for facial expression recognition with crowd-sourced label distribution, Proceedings of the 18th ACM International Conference on Multimodal Interaction, (2016).
[43] I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D. H. Lee, Y. Zhou, C. Ramaiah, F. Feng, R. Li, X. Wang, D. Athanasakis, J. Shawe-Taylor, M. Milakov, J. Park, R. Ionescu, M. Popescu, C. Grozea, J. Bergstra, J. Xie, L. Romaszko, B. Xu, Z. Chuang and Y. Bengio, Challenges in representation learning: A report on three machine learning contests, International Conference on Neural Information Processing, (2013), 117-124.
[44] A. Mollahosseini, B. Hasani and M. H. Mahoor, AffectNet: A database for facial expression, valence, and arousal computing in the wild, IEEE Transactions on Affective Computing, 10 (2017), 18-31.
[45] S. Li and W. Deng, Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition, IEEE Transactions on Image Processing, 28 (2018), 356-370. doi: 10.1109/TIP.2018.2868382.
[46] A. Bulat and G. Tzimiropoulos, How far are we from solving the 2D and 3D face alignment problem? (And a dataset of 230,000 3D facial landmarks), Proceedings of the IEEE International Conference on Computer Vision, (2017), 1021-1030.
[47] S. Miao, H. Xu and Z. Han, Recognizing facial expressions using a shallow convolutional neural network, IEEE Access, 7 (2019), 78000-78011.
[48] K. Wang, X. Peng, J. Yang, D. Meng and Y. Qiao, Region attention networks for pose and occlusion robust facial expression recognition, IEEE Transactions on Image Processing, 29 (2020), 4057-4069.
[49] K. Wang, X. Peng, J. Yang, S. Lu and Y. Qiao, Suppressing uncertainties for large-scale facial expression recognition, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), 6897-6906.
[50] X. Fan, Z. Deng, K. Wang, X. Peng and Y. Qiao, Learning discriminative representation for facial expression recognition from uncertainties, 2020 IEEE International Conference on Image Processing, (2020), 903-907.
[51] H. Li, M. Sui, F. Zhao, Z. Zha and F. Wu, MViT: Mask vision transformer for facial expression recognition in the wild, preprint, (2021), arXiv: 2106.04520.
[52] X. Zhao, X. Liang, L. Liu, T. Li, Y. Han, N. Vasconcelos and S. Yan, Peak-piloted deep network for facial expression recognition, European Conference on Computer Vision, (2016), 425-442.
[53] H. Ding, S. K. Zhou and R. Chellappa, FaceNet2ExpNet: Regularizing a deep face recognition net for expression recognition, 2017 12th IEEE International Conference on Automatic Face and Gesture Recognition, (2017), 118-126.
[54] S. Minaee, M. Minaei and A. Abdolrashidi, Deep-Emotion: Facial expression recognition using attentional convolutional network, Sensors, 21 (2021), 3046.
[55] Z. Cui, T. Song, Y. Wang and Q. Ji, Knowledge augmented deep neural networks for joint facial expression and action unit recognition, Advances in Neural Information Processing Systems, 33 (2020).
[56] M. Aouayeb, W. Hamidouche, C. Soladie, K. Kpalma and R. Seguier, Learning vision transformer with squeeze and excitation for facial expression recognition, preprint, (2021), arXiv: 2107.03107.
[57] T. H. Vo, G. S. Lee, H. J. Yang and S. H. Kim, Pyramid with super resolution for in-the-wild facial expression recognition, IEEE Access, 8 (2020), 131988-132001.
[58] F. Ma, B. Sun and S. Li, Robust facial expression recognition with convolutional visual transformers, preprint, (2021), arXiv: 2103.16854.

Figure 1.  Network structure diagram combining Swin Transformer Block and CNN Block
Figure 2.  Network architecture diagram of Swin Transformer Block
Figure 3.  Network architecture diagram of CNN Block
Figure 4.  C-T Module
Figure 5.  T-C Module
Table 1.  Classification accuracy on the FERPlus
Method Year Network Accuracy
SHCNN[47] 2019 CNN 0.8654
RAN[48] 2019 ResNet+Self-Attention 0.8855
SCN[49] 2020 CNN+Self-Attention 0.8801
LDR[50] 2020 ResNet 0.876
MViT[51] 2021 Mask Vision Transformer 0.8922
CVT[10] 2021 ResNet+Transformer 0.8881
Ours 2021 Transformer+CNN 0.874
Table 2.  Classification accuracy on the CK+
Method Year Network Accuracy
PPDN[52] 2017 CNN 0.973
FN2EN[53] 2016 CNN 0.986
Deep-Emotion[54] 2019 CNN 0.98
Knowledge augmented DNN[55] 2020 CNN+knowledge model 0.9759
ViT+SE[56] 2021 Transformer+SE 0.9980
Ours 2021 Transformer+CNN 0.982
Table 3.  Classification accuracy on the AffectNet-8
Method Year Network Accuracy
RAN[48] 2020 ResNet+Self-Attention 0.595
SCN[49] 2020 CNN+Self-Attention 0.6063
PSR[57] 2020 CNN 0.6068
CVT[10] 2021 ResNet+Transformer 0.6125
MViT[51] 2021 Mask Vision Transformer 0.6457
Ours 2021 Transformer+CNN 0.607
Table 4.  Classification accuracy on the RAF-DB
Method Year Network Accuracy
RAN[48] 2020 ResNet+Self-Attention 0.8690
SCN[49] 2020 CNN+Self-Attention 0.8703
CVT[10] 2021 ResNet+Transformer 0.8814
MViT[51] 2021 Mask Vision Transformer 0.8862
Robust CVT[58] 2021 Transformer+CNN 0.8814
Ours 2021 Transformer+CNN 0.878
Table 5.  Effect of the presence or absence of CNN Block on the experimental results
Method CNN Block FERPlus CK+ AffectNet-8 RAF-DB
Swin no 0.855 0.975 0.587 0.855
Swin+CNN(Ours) yes 0.874 0.982 0.607 0.878
Table 6.  Effect of face alignment operation on experimental results
Method Face alignment FERPlus CK+ AffectNet-8 RAF-DB
Swin+CNN(Ours) no 0.860 0.965 0.594 0.864
Swin+CNN(Ours) yes 0.874 0.982 0.607 0.878
Table 7.  Effect of pre-training on experimental results
Method pre-training FERPlus CK+ AffectNet-8 RAF-DB
Swin+CNN(Ours) no 0.862 0.977 0.589 0.868
Swin+CNN(Ours) yes 0.874 0.982 0.607 0.878
Table 8.  Consumption of GPU resources
Method GPU Memory Usage
Swin Transformer 5705MiB
Ours 6505MiB