Cross-Modality Distillation for Multi-View Action Recognition

Authors

  • Siyuan Liu College of Data Science and Application, Inner Mongolia University of Technology, China
  • Wenjing Liu College of Data Science and Application, Inner Mongolia University of Technology, China
  • Jie Tian Department of Computer Science, New Jersey Institute of Technology, USA

DOI:

https://doi.org/10.15837/ijccc.2025.5.6883

Keywords:

Action recognition, Cross-modal knowledge distillation, Contrastive learning, Multi-modality data, Edge intelligence

Abstract

Behavior recognition provides important support in medical care, security, and intelligent transportation, and has therefore received wide attention in practical intelligent applications. However, behavior recognition from distributed multi-view video still faces many challenges, such as lighting changes across viewpoints, variations in body posture, and background noise, all of which seriously degrade recognition accuracy. To address these challenges, a multi-view cross-modal distillation method for behavior recognition is proposed. Data from two modalities, skeleton points and RGB frames, are used to construct the teacher and student networks respectively, and KL divergence is used to guide cross-modal knowledge transfer, enabling behavior recognition under multiple views. In addition, a semi-supervised learning framework is designed to improve the learning performance of the student network through pseudo-labeling. The introduced multiple student networks learn the consistency of behaviors across viewpoints, which effectively improves the stability and accuracy of multi-view behavior recognition. Experimental results on the NTU RGB+D 60 and NTU RGB+D 120 datasets show that the method outperforms several current popular methods in recognition accuracy. Furthermore, experiments conducted in a testbed built with real edge devices validate the feasibility of deploying the method in distributed environments.
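As a rough illustration (not the authors' implementation), the two core ingredients described in the abstract — a temperature-softened KL-divergence distillation loss between teacher and student predictions, and confidence-thresholded pseudo-label selection for the semi-supervised stage — can be sketched as follows; the temperature of 4.0 and the threshold of 0.9 are illustrative values, not ones reported in the paper:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of class logits."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as is standard in knowledge distillation.
    Here the teacher would be the skeleton branch and the student
    the RGB branch (hypothetical mapping for illustration)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return kl * temperature ** 2

def select_pseudo_labels(probabilities, threshold=0.9):
    """Keep only unlabeled samples whose maximum class probability
    exceeds the confidence threshold; returns (index, class) pairs."""
    picked = []
    for i, probs in enumerate(probabilities):
        c = max(range(len(probs)), key=lambda k: probs[k])
        if probs[c] >= threshold:
            picked.append((i, c))
    return picked
```

When the student's predictive distribution matches the teacher's, the distillation loss is zero; otherwise it is positive and pushes the student toward the teacher's softened output, which is the mechanism the abstract attributes to the cross-modal transfer step.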

References

AlFayez, F., & Bouhamed, H. (2023). Machine learning and uLBP histograms for posture recognition of dependent people via Big Data Hadoop and Spark platform. International Journal of Computers Communications & Control, 18(1). https://doi.org/10.15837/ijccc.2023.1.4981

Ashraf, N., Sun, C., & Foroosh, H. (2014). View invariant action recognition using projective depth. Computer Vision and Image Understanding, 123, 41-52. https://doi.org/10.1016/j.cviu.2014.03.005

Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the Kinetics dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299-6308. https://doi.org/10.1109/CVPR.2017.502

Crasto, N., Weinzaepfel, P., Alahari, K., & Schmid, C. (2019). Mars: Motion-augmented rgb stream for action recognition. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 7882-7891. https://doi.org/10.1109/CVPR.2019.00807

Das, S., Dai, R., Yang, D., & Bremond, F. (2021). Vpn++: Rethinking video-pose embeddings for understanding activities of daily living. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44 (12), 9703-9717. https://doi.org/10.1109/TPAMI.2021.3127885

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. 2009 IEEE conference on computer vision and pattern recognition, 248-255. https://doi.org/10.1109/CVPR.2009.5206848

Dhiman, C., & Vishwakarma, D. K. (2020). View-invariant deep architecture for human action recognition using two-stream motion and shape temporal dynamics. IEEE Transactions on Image Processing, 29, 3835-3844. https://doi.org/10.1109/TIP.2020.2965299

Duan, H., Zhao, Y., Chen, K., Lin, D., & Dai, B. (2022). Revisiting skeleton-based action recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2969-2978. https://doi.org/10.1109/CVPR52688.2022.00298

Garcia, N. C., Morerio, P., & Murino, V. (2018). Modality distillation with multiple stream networks for action recognition. Proceedings of the European Conference on Computer Vision (ECCV), 103-118. https://doi.org/10.1007/978-3-030-01237-3_7

Guo, T., Liu, H., Chen, Z., Liu, M., Wang, T., & Ding, R. (2022). Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 36 (1), 762-770. https://doi.org/10.1609/aaai.v36i1.19957

Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Hong, J., Fisher, M., Gharbi, M., & Fatahalian, K. (2021). Video pose distillation for few-shot, fine-grained sports action recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision, 9254-9263. https://doi.org/10.1109/ICCV48922.2021.00912

Hornos, M. J., & Quinde, M. (2024). Development methodologies for IoT-based systems: Challenges and research directions. Journal of Reliable Intelligent Environments, 1-30. https://doi.org/10.1007/s40860-024-00229-9

Jiang, Y., Wang, Z., & Jin, Z. (2023). IoT data processing and scheduling based on deep reinforcement learning. International Journal of Computers Communications & Control, 18(6). https://doi.org/10.15837/ijccc.2023.6.5998

Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al. (2017). The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.

Li, X., Kang, J., Yang, Y., & Zhao, F. (2023). A lightweight attentional shift graph convolutional network for skeleton-based action recognition. International Journal of Computers Communications & Control, 18(3). https://doi.org/10.15837/ijccc.2023.3.5061

Li, Y., Xia, R., & Liu, X. (2020). Learning shape and motion representations for view invariant skeleton-based action recognition. Pattern Recognition, 103, 107293. https://doi.org/10.1016/j.patcog.2020.107293

Liu, J., & Xu, D. (2021). Geometrymotion-net: A strong two-stream baseline for 3d action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 31 (12), 4711-4721. https://doi.org/10.1109/TCSVT.2021.3101847

Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L. Y., & Kot, A. C. (2019). NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10), 2684-2701. https://doi.org/10.1109/TPAMI.2019.2916873

Modupe, O. T., Otitoola, A. A., Oladapo, O. J., Abiona, O. O., Oyeniran, O. C., Adewusi, A. O., Komolafe, A. M., & Obijuru, A. (2024). Reviewing the transformational impact of edge computing on real-time data processing and analytics. Computer Science & IT Research Journal, 5 (3), 693-702. https://doi.org/10.51594/csitrj.v5i3.929

Rani, S. S., et al. (2020). Self-similarity matrix and view invariant features assisted multi-view human action recognition. 2020 IEEE International Conference for Innovation in Technology (INOCON), 1-6. https://doi.org/10.1109/INOCON50539.2020.9298424

Saha, S., Perumal, I., Niveditha, V. R., Abbas, M., Manimozhi, I., & Bhat, C. R. (2024). Contextual Information Based Scheduling for Service Migration in Mobile Edge Computing. International Journal of Computers Communications & Control, 19(3). https://doi.org/10.15837/ijccc.2024.3.6143

Shahroudy, A., Liu, J., Ng, T.-T., & Wang, G. (2016). NTU RGB+D: A large-scale dataset for 3D human activity analysis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1010-1019. https://doi.org/10.1109/CVPR.2016.115

Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2019). Two-stream adaptive graph convolutional networks for skeleton-based action recognition. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12026-12035. https://doi.org/10.1109/CVPR.2019.01230

Shi, Y., Ong, H. R., Yang, S., & Fan, Y. (2024). Deep Multimodal Fusion of Visual and Auditory Features for Robust Material Recognition. International Journal of Computers Communications & Control, 19(5). https://doi.org/10.15837/ijccc.2024.5.6457

Song, Y.-F., Zhang, Z., Shan, C., & Wang, L. (2020). Richly activated graph convolutional network for robust skeleton-based action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 31 (5), 1915-1925. https://doi.org/10.1109/TCSVT.2020.3015051

Thoker, F. M., & Gall, J. (2019). Cross-modal knowledge distillation for action recognition. 2019 IEEE International Conference on Image Processing (ICIP), 6-10. https://doi.org/10.1109/ICIP.2019.8802909

Tsai, Y.-H., & Hsu, T.-C. (2024). An effective deep neural network in edge computing enabled internet of things for plant diseases monitoring. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 695-699. https://doi.org/10.1109/WACVW60836.2024.00081

Ullah, A., Muhammad, K., Hussain, T., & Baik, S. W. (2021). Conflux lstms network: A novel approach for multi-view action recognition. Neurocomputing, 435, 321-329. https://doi.org/10.1016/j.neucom.2019.12.151

Wang, G., Zhao, P., Shi, Y., Zhao, C., & Yang, S. (2024). Generative model-based feature knowledge distillation for action recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 38(14), 15474-15482. https://doi.org/10.1609/aaai.v38i14.29473

Wang, X., Lu, Y., Yu, W., Pang, Y., & Wang, H. (2024). Few-shot Action Recognition via Multiview Representation Learning. IEEE Transactions on Circuits and Systems for Video Technology. https://doi.org/10.1109/TCSVT.2024.3384875

Xiao, J., Jing, L., Zhang, L., He, J., She, Q., Zhou, Z., Yuille, A., & Li, Y. (2022). Learning from temporal gradient for semi-supervised action recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3252-3262. https://doi.org/10.1109/CVPR52688.2022.00325

Xiao, Y., Chen, J., Wang, Y., Cao, Z., Zhou, J. T., & Bai, X. (2019). Action recognition for depth video using multi-view dynamic images. Information Sciences, 480, 287-304. https://doi.org/10.1016/j.ins.2018.12.050

Xu, Y., Jiang, S., Cui, Z., & Su, F. (2024). Multi-view action recognition for distracted driver behavior localization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 7172-7179. https://doi.org/10.1109/CVPRW63382.2024.00712

Yan, S., Xiong, Y., & Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). https://doi.org/10.1609/aaai.v32i1.12328

Zhang, D., Du, C., Peng, Y., Liu, J., Mohammed, S., & Calvi, A. (2024). A multi-source dynamic temporal point process model for train delay prediction. IEEE Transactions on Intelligent Transportation Systems. https://doi.org/10.1109/TITS.2024.3430031

Zhang, D. (2017). High-speed train control system big data analysis based on the fuzzy RDF model and uncertain reasoning. International Journal of Computers Communications & Control, 12(4), 577-591. https://doi.org/10.15837/ijccc.2017.4.2914

Zhang, P., Lan, C., Zeng, W., Xing, J., Xue, J., & Zheng, N. (2020). Semantics-guided neural networks for efficient skeleton-based human action recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1112-1121. https://doi.org/10.1109/CVPR42600.2020.00119

Zhang, Y., Xiang, T., Hospedales, T. M., & Lu, H. (2018). Deep mutual learning. Proceedings of the IEEE conference on computer vision and pattern recognition, 4320-4328. https://doi.org/10.1109/CVPR.2018.00454

Zhou, T., Gao, X., Sun, X., & Han, L. (2024). Split Difference Weighting: An Enhanced Decision Tree Approach for Imbalanced Classification. International Journal of Computers Communications & Control, 19(6). https://doi.org/10.15837/ijccc.2024.6.6702

Zhu, X., Zhu, Y., Wang, H., Wen, H., Yan, Y., & Liu, P. (2022). Skeleton sequence and rgb frame based multi-modality feature fusion network for action recognition. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 18 (3), 1-24. https://doi.org/10.1145/3491228

Published

2025-11-05
