We are a team within the Web Intelligence Group working on problems related to conversational systems. Conversational systems aim to satisfy users’ needs through conversation, using written or spoken language.
Our main research topics are:
- mixed-initiative interaction
- common-sense reasoning
- conversational question-answering
- dialogue-state tracking
- task-based conversational systems
- fairness in conversational systems
- conversational systems’ evaluation
People
Publications
2022
Kim, To Eun; Lipani, Aldo. A Multi-Task Based Neural Model to Simulate Users in Goal-Oriented Dialogue Systems. In Proceedings of SIGIR '22, 2022. https://www.researchgate.net/publication/360276605_A_Multi-Task_Based_Neural_Model_to_Simulate_Users_in_Goal-Oriented_Dialogue_Systems

Abstract: A human-like user simulator that anticipates users' satisfaction scores, actions, and utterances can help goal-oriented dialogue systems in evaluating the conversation and refining their dialogue strategies. However, little work has experimented with user simulators which can generate users' utterances. In this paper, we propose a deep learning-based user simulator that predicts users' satisfaction scores and actions while also jointly generating users' utterances in a multi-task manner. In particular, we show that 1) the proposed deep text-to-text multi-task neural model achieves state-of-the-art performance in the users' satisfaction score and action prediction tasks, and 2) in an ablation analysis, the user satisfaction score prediction, action prediction, and utterance generation tasks can boost each other's performance via positive transfer across the tasks. The source code and model checkpoints used for the experiments run in this paper are available at: https://github.com/kimdanny/user-simulation-t5.
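The multi-task setup described in the abstract can be pictured as a single text-to-text model that switches between the three tasks via a task prefix. The sketch below is illustrative only, with hypothetical prefixes and an off-the-shelf T5 checkpoint; the actual input/output format and trained checkpoints are in the linked repository.

```python
# Illustrative sketch of prefix-based multi-task prompting with a text-to-text model.
# The prefixes and dialogue format below are hypothetical, not the paper's exact format;
# see https://github.com/kimdanny/user-simulation-t5 for the real implementation.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

dialogue_context = "system: I booked a table for two at 7 pm. Anything else?"

# One shared model, three user-simulation tasks selected by a task prefix.
tasks = {
    "satisfaction": f"predict satisfaction: {dialogue_context}",
    "action": f"predict action: {dialogue_context}",
    "utterance": f"generate utterance: {dialogue_context}",
}

for name, text in tasks.items():
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=32)
    print(name, "->", tokenizer.decode(output_ids[0], skip_special_tokens=True))
```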
Shi, Zhengxiang; Zhang, Qiang; Lipani, Aldo. StepGame: A New Benchmark for Robust Multi-Hop Spatial Reasoning in Texts. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI '22), 2022. https://www.researchgate.net/publication/357159030_StepGame_A_New_Benchmark_for_Robust_Multi-Hop_Spatial_Reasoning_in_Texts

Abstract: Inferring spatial relations in natural language is a crucial ability an intelligent system should possess. The bAbI dataset tries to capture tasks relevant to this domain (tasks 17 and 19). However, these tasks have several limitations. Most importantly, they are limited to fixed expressions, they are limited in the number of reasoning steps required to solve them, and they fail to test the robustness of models to input that contains irrelevant or redundant information. In this paper, we present a new Question-Answering dataset called StepGame for robust multi-hop spatial reasoning in texts. Our experiments demonstrate that state-of-the-art models on the bAbI dataset struggle on the StepGame dataset. Moreover, we propose a Tensor-Product based Memory-Augmented Neural Network (TP-MANN) specialized for spatial reasoning tasks. Experimental results on both datasets show that our model outperforms all the baselines with superior generalization and robustness performance.
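To make the task concrete, here is a small hand-written instance in the spirit of StepGame (not taken from the dataset): a chain of spatial relations plus a distracting statement, where answering requires composing relations over several hops.

```python
# Hand-written, illustrative StepGame-style instance (not from the dataset).
example = {
    "story": [
        "A is to the left of B.",
        "B is above C.",
        "D is below E.",          # irrelevant (distracting) statement
        "C is to the left of F.",
    ],
    "question": "What is the relation of A to F?",
    "answer": "upper-left",       # composing left + above + left over three hops
}
```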
Shi, Zhengxiang; Feng, Yue; Lipani, Aldo. Learning to Execute Actions or Ask Clarification Questions. In Findings of NAACL '22, 2022. https://www.researchgate.net/publication/360050130_Learning_to_Execute_Actions_or_Ask_Clarification_Questions

Abstract: Collaborative tasks are ubiquitous activities where a form of communication is required in order to reach a joint goal. Collaborative building is one such task. We wish to develop an intelligent builder agent in a simulated building environment (Minecraft) that can build whatever users wish to build by just talking to the agent. In order to achieve this goal, such agents need to be able to take the initiative by asking clarification questions when further information is needed. Existing works on the Minecraft Corpus Dataset only learn to execute instructions, neglecting the importance of asking for clarifications. In this paper, we extend the Minecraft Corpus Dataset by annotating all builder utterances into eight types, including clarification questions, and propose a new builder agent model capable of determining when to ask or execute instructions. Experimental results show that our model achieves state-of-the-art performance on the collaborative building task with a substantial improvement. We also define two new tasks, the learning to ask task and the joint learning task. The latter consists of solving both the collaborative building and learning to ask tasks jointly.
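The ask-or-execute decision can be sketched as a simple control flow: classify what kind of builder response the dialogue calls for and branch accordingly. The snippet below is a minimal sketch with hypothetical helper functions, not the neural architecture used in the paper.

```python
# Minimal ask-or-execute sketch; `classify`, `ask`, and `execute` are hypothetical
# stand-ins for the trained components described in the paper.
def builder_turn(dialogue_history, classify, ask, execute):
    utterance_type = classify(dialogue_history)  # one of the eight annotated types
    if utterance_type == "clarification_question":
        return ask(dialogue_history)             # generate a clarification question
    return execute(dialogue_history)             # predict block placements/removals
```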
2021
Ye, Fanghua; Manotumruksa, Jarana; Zhang, Qiang; Li, Shenghui; Yilmaz, Emine. Slot Self-Attentive Dialogue State Tracking. In Proceedings of WWW, 2021. https://arxiv.org/abs/2101.09374
Lipani, Aldo; Carterette, Ben; Yilmaz, Emine. How Am I Doing?: Evaluating Conversational Search Systems Offline. ACM Transactions on Information Systems (TOIS), 2021. https://www.researchgate.net/publication/350640565_How_Am_I_Doing_Evaluating_Conversational_Search_Systems_Offline (PDF: https://aldolipani.com/wp-content/uploads/2021/04/How_Am_I_Doing-Evaluating_Conversational_Search_Systems_Offline.pdf)

Abstract: As conversational agents like Siri and Alexa gain in popularity and use, conversation is becoming a more and more important mode of interaction for search. Conversational search shares some features with traditional search, but differs in some important respects: conversational search systems are less likely to return ranked lists of results (a SERP), more likely to involve iterated interactions, and more likely to feature longer, well-formed user queries in the form of natural language questions. Because of these differences, traditional methods for search evaluation (such as the Cranfield paradigm) do not translate easily to conversational search. In this work, we propose a framework for offline evaluation of conversational search, which includes a methodology for creating test collections with relevance judgments, an evaluation measure based on a user interaction model, and an approach to collecting user interaction data to train the model. The framework is based on the idea of “subtopics”, often used to model novelty and diversity in search and recommendation, and the user model is similar to the geometric browsing model introduced by RBP and used in ERR. As far as we know, this is the first work to combine these ideas into a comprehensive framework for offline evaluation of conversational search.
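For reference, the geometric browsing model mentioned in the abstract is the stopping model behind Rank Biased Precision: after inspecting a result, the user continues to the next one with a fixed persistence probability p. This is the standard textbook model, not the paper's conversational measure itself.

```latex
% Geometric browsing model (standard RBP user model):
% the user continues from rank k to rank k+1 with persistence probability p,
% so the probability of stopping exactly at rank k is geometric.
P(\text{stop at rank } k) = (1 - p)\, p^{\,k-1}, \qquad
\mathbb{E}[\text{inspected depth}] = \frac{1}{1 - p}
```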
2019
Lipani, Aldo; Carterette, Ben; Yilmaz, Emine. From a User Model for Query Sessions to Session Rank Biased Precision (sRBP). In Proceedings of ICTIR, 2019. doi:10.1145/3341981.3344216. https://www.researchgate.net/publication/334725760_From_a_User_Model_for_Query_Sessions_to_Session_Rank_Biased_Precision_sRBP

Abstract: To satisfy their information needs, users usually carry out searches on retrieval systems by continuously trading off between the examination of search results retrieved by under-specified queries and the refinement of these queries through reformulation. In Information Retrieval (IR), a series of query reformulations is known as a query-session. Research in IR evaluation has traditionally been focused on the development of measures for the ad hoc task, for which a retrieval system aims to retrieve the best documents for a single query. Thus, most IR evaluation measures, with a few exceptions, are not suitable to evaluate retrieval scenarios that call for multiple refinements over a query-session. In this paper, by formally modeling a user's expected behaviour over query-sessions, we derive a session-based evaluation measure, which results in a generalization of the evaluation measure Rank Biased Precision (RBP). We demonstrate the quality of this new session-based evaluation measure, named Session RBP (sRBP), by evaluating its user model against the observed user behaviour over the query-sessions of the 2014 TREC Session track.
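The measure that sRBP generalizes is standard Rank Biased Precision, recalled below for context; the session-level derivation itself is given in the paper.

```latex
% Standard Rank Biased Precision (Moffat & Zobel), the single-query measure
% that sRBP generalizes to query-sessions; r_i \in \{0, 1\} is the relevance at rank i.
\mathrm{RBP} = (1 - p) \sum_{i=1}^{\infty} r_i \, p^{\,i-1}
```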