Comparing neural sentence encoders for topic segmentation across domains: not your typical text similarity task

Ghinassi, I; Wang, L; Newell, C; Purver, M

dc.contributor.author	Ghinassi, I
dc.contributor.author	Wang, L
dc.contributor.author	Newell, C
dc.contributor.author	Purver, M
dc.date.accessioned	2023-12-05T15:42:41Z
dc.date.available	2023-12-05T15:42:41Z
dc.date.issued	2023
dc.identifier.other	e1593
dc.identifier.uri	https://qmro.qmul.ac.uk/xmlui/handle/123456789/92635
dc.description.abstract	Neural sentence encoders (NSEs) are effective in many NLP tasks, including topic segmentation. However, no systematic comparison of their performance in topic segmentation has been performed. Here, we present such a comparison, using supervised and unsupervised segmentation models based on NSEs. We first compare results with baselines, showing that the use of NSEs often provides improvements, except in specific domains such as news shows. We then compare, over three different datasets, a range of existing NSEs and a new NSE based on an ad hoc pre-training strategy. We show that the existing literature documenting general performance gains of NSEs does not always agree with the results obtained by the same NSEs in topic segmentation. While Transformer-based encoders do improve over previous approaches, fine-tuning on sentence similarity tasks, or even on the same topic segmentation task we aim to solve, does not always equate to better performance, as results vary across the methods used and the domains of application. We aim to explain this phenomenon and the relatively poor performance of NSEs in news shows by considering how well different NSEs encode the underlying lexical cohesion of same-topic segments; to do so, we introduce a new metric, ARP. The results of this study suggest that good topic segmentation results do not always rely on good cohesion modelling on the part of the segmenter, and that this depends on the kind of text being segmented. It is also evident that traditional sentence encoders fail to create topically cohesive clusters of segments when used on conversational data. Overall, this work advances our understanding of the use of NSEs in topic segmentation and of the general factors determining the success (or failure) of a topic segmentation system. 
The newly proposed metric can quantify the lexical cohesion of a multi-topic document under different sentence encoders and, as such, might have many different uses in future research, some of which we suggest in our conclusions.	en_US
dc.format.extent	e1593 - e1593
dc.language	en
dc.publisher	PeerJ	en_US
dc.relation.ispartof	PeerJ Computer Science
dc.rights	This item is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
dc.rights	Attribution 3.0 United States	*
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/us/	*
dc.title	Comparing neural sentence encoders for topic segmentation across domains: not your typical text similarity task	en_US
dc.type	Article	en_US
dc.rights.holder	© 2023 The Author(s). Published by PeerJ
dc.identifier.doi	10.7717/peerj-cs.1593
pubs.notes	Not known	en_US
pubs.publication-status	Published online	en_US
pubs.publisher-url	http://dx.doi.org/10.7717/peerj-cs.1593	en_US
pubs.volume	9	en_US
rioxxterms.funder	Default funder	en_US
rioxxterms.identifier.project	Default project	en_US

Files in this item

Name:: license_rdf
Size:: 914bytes
Format:: application/rdf+xml

Name:: Ghinassi Comparing neural sentence ...
Size:: 2.946Mb
Format:: application/
Description:: Published version

This item appears in the following Collection(s)

Electronic Engineering and Computer Science [3424]
