In this paper, we propose a task estimation method based on multiple subspaces extracted from multi-modal information: image objects in visual scenes and spoken words in the dialogs that occur during the same task. The subspaces are obtained using latent semantic analysis (LSA). In the proposed method, a task vector composed of the frequencies of spoken words and image-object appearances is first extracted, and the similarities between the input task vector and the reference subspaces of the different tasks are then compared. Experiments are conducted on the identification of game tasks. The results show that the proposed method with multi-modal information outperforms methods that use only a single modality, either image or spoken dialog. Moreover, the proposed method remains accurate even when less spoken dialog is available.
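The pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each task's training data is a count matrix whose columns mix hypothetical word-frequency and image-object-frequency features, builds a per-task LSA subspace via truncated SVD, and scores an input task vector by the norm of its projection onto each reference subspace.

```python
import numpy as np

def task_subspace(samples, k=2):
    """LSA-style reference subspace for one task.

    samples: (n_samples, n_features) matrix of multi-modal counts
    (e.g., spoken-word and image-object frequencies per session).
    Returns the top-k right singular vectors as an orthonormal
    (k, n_features) basis of the task's latent subspace.
    """
    _, _, vt = np.linalg.svd(samples, full_matrices=False)
    return vt[:k]

def subspace_similarity(task_vec, basis):
    """Norm of the unit task vector's projection onto the subspace.

    Ranges from 0 (orthogonal) to 1 (fully contained), so it acts
    as a cosine-like similarity between vector and subspace.
    """
    v = task_vec / np.linalg.norm(task_vec)
    return float(np.linalg.norm(basis @ v))

def estimate_task(task_vec, bases):
    """Pick the task whose reference subspace best matches the input."""
    scores = {name: subspace_similarity(task_vec, b)
              for name, b in bases.items()}
    return max(scores, key=scores.get)
```

A toy usage: two tasks with distinct feature profiles (features 0-2 standing in for word counts, 3-5 for image-object counts) yield subspaces that separate new task vectors drawn from either profile.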
|Journal||IEEJ Transactions on Electronics, Information and Systems|
|Publication status||Published - 2012|
- Latent Semantic Analysis
- Multi-modal Processing
- Task Estimation