
dc.contributor.author: Ghinassi, I
dc.contributor.author: Wang, L
dc.contributor.author: Newell, C
dc.contributor.author: Purver, M
dc.date.accessioned: 2023-12-05T15:42:41Z
dc.date.available: 2023-12-05T15:42:41Z
dc.date.issued: 2023
dc.identifier.other: e1593
dc.identifier.uri: https://qmro.qmul.ac.uk/xmlui/handle/123456789/92635
dc.description.abstract: Neural sentence encoders (NSEs) are effective in many NLP tasks, including topic segmentation. However, no systematic comparison of their performance in topic segmentation has been performed. Here, we present such a comparison, using supervised and unsupervised segmentation models based on NSEs. We first compare results with baselines, showing that the use of NSEs does often provide improvements, except for specific domains such as news shows. We then compare, over three different datasets, a range of existing NSEs and a new NSE based on an ad hoc pre-training strategy. We show that existing literature documenting general performance gains of NSEs does not always conform to the results obtained by the same NSEs in topic segmentation. Although Transformer-based encoders do improve over previous approaches, fine-tuning on sentence similarity tasks, or even on the same topic segmentation task we aim to solve, does not always yield better performance, as results vary with the method used and the domain of application. We aim to explain this phenomenon and the relatively poor performance of NSEs in news shows by considering how well different NSEs encode the underlying lexical cohesion of same-topic segments; to do so, we introduce a new metric, ARP. The results from this study suggest that good topic segmentation results do not always rely on good cohesion modelling on the part of the segmenter, and that this depends on the kind of text being segmented. It also appears that traditional sentence encoders fail to create topically cohesive clusters of segments when used on conversational data. Overall, this work advances our understanding of the use of NSEs in topic segmentation and of the general factors determining the success (or failure) of a topic segmentation system. The newly proposed metric can quantify the lexical cohesion of a multi-topic document under different sentence encoders and, as such, might have many different uses in future research, some of which we suggest in our conclusions. [en_US]
dc.format.extent: e1593 - e1593
dc.language: en
dc.publisher: PeerJ [en_US]
dc.relation.ispartof: PeerJ Computer Science
dc.rights: This item is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
dc.rights: Attribution 3.0 United States [*]
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/us/ [*]
dc.title: Comparing neural sentence encoders for topic segmentation across domains: not your typical text similarity task [en_US]
dc.type: Article [en_US]
dc.rights.holder: © 2023 The Author(s). Published by PeerJ
dc.identifier.doi: 10.7717/peerj-cs.1593
pubs.notes: Not known [en_US]
pubs.publication-status: Published online [en_US]
pubs.publisher-url: http://dx.doi.org/10.7717/peerj-cs.1593 [en_US]
pubs.volume: 9 [en_US]
rioxxterms.funder: Default funder [en_US]
rioxxterms.identifier.project: Default project [en_US]


