Cross-Modality Distillation for Multi-View Action Recognition
DOI: https://doi.org/10.15837/ijccc.2025.5.6883
Keywords: Deep action recognition, Cross-modal knowledge distillation, Contrastive learning, Multi-modal data, Edge intelligence
Abstract
Action recognition provides important support in fields such as medical care, security, and intelligent transportation, and has therefore received wide attention in practical intelligent applications. However, action recognition over distributed multi-view video still faces many challenges, such as lighting changes across viewpoints, variations in body posture, and background noise, all of which seriously degrade recognition accuracy. To address these challenges, a multi-view cross-modal distillation method for action recognition is proposed. Two data modalities, skeleton joints and RGB frames, are used to construct the teacher and student networks respectively, and KL divergence is used to guide cross-modal knowledge transfer, enabling action recognition across multiple views. In addition, a semi-supervised learning framework is designed to improve the student networks through pseudo-labeling. The multiple student networks learn the consistency of actions across viewpoints from one another, which effectively improves the stability and accuracy of multi-view action recognition. Experimental results on the NTU RGB+D 60 and NTU RGB+D 120 action recognition datasets show that the method outperforms several current popular methods in recognition accuracy. Further experiments in a testbed built from real edge devices validate the feasibility of deploying the method in distributed environments.
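The abstract describes three ingredients: KL-divergence distillation from a skeleton teacher to RGB students, confidence-based pseudo-labeling, and a cross-view consistency term among the student networks. The PyTorch sketch below illustrates one plausible form of these components under standard assumptions; the temperature, confidence threshold, and all function names are illustrative choices, not the authors' released code.

```python
# Hedged sketch of the three loss components named in the abstract.
# T, threshold, and all identifiers are assumptions for illustration.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Temperature-scaled KL divergence between teacher and student."""
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

def consistency_loss(view_logits):
    """Pull each per-view student prediction toward the views' mean distribution."""
    probs = [F.softmax(z, dim=1) for z in view_logits]
    mean_p = torch.stack(probs).mean(dim=0).detach()  # fixed target
    return sum(F.kl_div(F.log_softmax(z, dim=1), mean_p, reduction="batchmean")
               for z in view_logits) / len(view_logits)

def pseudo_labels(teacher_logits, threshold=0.9):
    """Keep only confident teacher predictions as hard pseudo-labels."""
    probs = F.softmax(teacher_logits, dim=1)
    conf, labels = probs.max(dim=1)
    mask = conf >= threshold  # low-confidence samples are ignored
    return labels, mask
```

In such a setup, the total student objective would typically combine cross-entropy on labeled and confidently pseudo-labeled samples with the distillation and consistency terms, weighted by tunable coefficients, e.g. `loss = ce + a * distillation_loss(s, t) + b * consistency_loss(views)`.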
License
Copyright (c) 2025 Siyuan Liu, Wenjing Liu, Jie Tian

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
ONLINE OPEN ACCESS: Access to the full text of each article and each issue is allowed for free under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
You are free to:
- Share: copy and redistribute the material in any medium or format;
- Adapt: remix, transform, and build upon the material.
The licensor cannot revoke these freedoms as long as you follow the license terms.
DISCLAIMER: The author(s) of each article appearing in International Journal of Computers Communications & Control is/are solely responsible for the content thereof; the publication of an article shall not constitute or be deemed to constitute any representation by the Editors or Agora University Press that the data presented therein are original, correct or sufficient to support the conclusions reached or that the experiment design or methodology is adequate.