Table 3. Effect of the two-stream module on the experimental results

Summary

Existing multi-label classification methods cannot effectively recognize unseen class labels that are absent from the training set. To address this, Tencent Youtu Lab proposes MKT, a general Open Vocabulary multi-label learning framework that transfers multimodal knowledge. The method transfers the strong image-text matching capability of a vision-language pretrained model: it introduces prompt learning and knowledge distillation to optimize the label embeddings and improve the consistency between image and label embeddings, and it employs a two-stream module to capture local and global features simultaneously, strengthening the model's multi-label recognition ability. Experimental results on the public NUS-WIDE and Open Images datasets show that the method effectively achieves Open Vocabulary multi-label learning.
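The two-stream idea described above can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's actual implementation): one stream scores each label embedding against a global image feature, the other against local patch features, and the two scores are fused. All function and variable names here are assumptions for illustration.

```python
import numpy as np

def two_stream_scores(patch_feats, global_feat, label_embs):
    """Hypothetical two-stream label scoring sketch.

    patch_feats: (P, D) local patch features from the image encoder
    global_feat: (D,)   global image feature (e.g. a CLS token)
    label_embs:  (C, D) label embeddings from a vision-language model
    Returns per-label scores of shape (C,).
    """
    # L2-normalize so dot products become cosine similarities
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    patch_feats = l2norm(patch_feats)
    global_feat = l2norm(global_feat)
    label_embs = l2norm(label_embs)

    # Global stream: similarity between each label and the whole-image feature
    global_scores = label_embs @ global_feat                 # (C,)

    # Local stream: each label takes its best-matching patch,
    # which lets small objects contribute to the label score
    local_scores = (label_embs @ patch_feats.T).max(axis=1)  # (C,)

    # Fuse the two streams (simple average here, as an assumption)
    return 0.5 * (global_scores + local_scores)
```

The max over patches in the local stream is one simple way to let a label fire on a small region even when the global feature is dominated by other content; the real module would learn this fusion rather than hard-code it.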