Researchers at HSE in St Petersburg Develop Superior Machine Learning Model for Determining Text Topics

They also revealed poor performance of neural networks on such tasks

Topic models are machine learning algorithms designed to analyse large text collections based on their topics. Scientists at HSE Campus in St Petersburg compared five topic models to determine which ones performed better. Two models, including GLDAW developed by the Laboratory for Social and Cognitive Informatics at HSE Campus in St Petersburg, made the lowest number of errors. The paper has been published in PeerJ Computer Science.

Determining the topic of a publication is usually not difficult for the human brain. For example, any editor can easily tag this article with science, artificial intelligence, and machine learning. However, the process of sorting information can be time-consuming for a person, which becomes critical when dealing with a large volume of data. A modern computer can perform this task much faster, but it requires solving a challenging problem: identifying the meaning of documents based on their content and categorising them accordingly.

This is achieved through topic modelling, a branch of machine learning that aims to categorise texts by topic. Topic modelling is used to facilitate information retrieval, analyse mass media, identify community topics in social networks, detect trends in scientific publications, and address various other tasks. For example, analysing financial news can accurately predict trading volumes on the stock exchange, which are significantly influenced by politicians' statements and economic events.

Here's how working with topic models typically unfolds: the algorithm takes a collection of text documents as input. At the output, each document is assessed for its degree of belonging to specific topics. These assessments are based on the frequency of word usage and the relationships between words and sentences. Thus, words such as ‘scientists,’ ‘laboratory,’ ‘analysis,’ ‘investigated,’ and ‘algorithms’ found in this text categorise it under the topic of ‘science.’

However, many words can appear in texts covering various topics. For example, the word ‘work’ is often used in texts about industrial production or the labour market, but in the phrase ‘scientific work’ it marks the text as pertaining to ‘science.’ Such relationships, expressed mathematically through probability matrices, form the core of these algorithms.
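The role of these probability matrices can be illustrated with a toy example. The sketch below is not taken from the paper: the topic names, words, and all probabilities are invented for illustration. It assumes the usual mixture view of topic models, in which the probability of a word in a document is the sum, over topics, of the word's probability in each topic weighted by that topic's share in the document.

```python
# Toy illustration of the two probability matrices behind a topic model.
# All names and numbers are made up; they do not come from the paper.

# p(word | topic): each topic is a probability distribution over words.
word_given_topic = {
    "science": {"laboratory": 0.5, "work": 0.3, "factory": 0.2},
    "industry": {"laboratory": 0.1, "work": 0.4, "factory": 0.5},
}

# p(topic | document): each document is a mixture of topics.
topic_given_doc = {
    "news_item": {"science": 0.9, "industry": 0.1},
}


def word_probability(word, doc):
    """p(word | doc) = sum over topics of p(word | topic) * p(topic | doc)."""
    return sum(
        word_given_topic[topic].get(word, 0.0) * weight
        for topic, weight in topic_given_doc[doc].items()
    )


def dominant_topic(doc):
    """Return the topic with the largest weight in the document's mixture."""
    mixture = topic_given_doc[doc]
    return max(mixture, key=mixture.get)


# 'work' is plausible under both topics, so its probability mixes both rows:
# 0.3 * 0.9 + 0.4 * 0.1 = 0.31
p_work = word_probability("work", "news_item")
print(p_work, dominant_topic("news_item"))
```

A real model estimates both matrices from the text collection; here they are simply written by hand to show how an ambiguous word such as ‘work’ contributes to more than one topic.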

Topic models can be enhanced by creating embeddings—fixed-length vectors that describe a specific entity based on various parameters. These embeddings serve as additional information acquired through training the model on millions of texts.

Any phrase or text, such as this news item, can be represented as a sequence of numbers: a vector in a vector space. In machine learning, these numerical representations are referred to as embeddings. The idea is that measuring distances between such vectors makes it easier to detect similarities and to compare two or more texts. If the embeddings describing the texts are sufficiently similar, the texts likely belong to the same category or cluster, that is, a specific topic.
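One common way to compare embeddings is cosine similarity, which measures the angle between two vectors. The sketch below is only an illustration: the three-dimensional vectors are hypothetical (real embeddings typically have hundreds of dimensions), and the article does not say which similarity measure the models in the study use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings of three short texts (values invented for illustration).
science_text = [0.9, 0.1, 0.2]
research_text = [0.8, 0.2, 0.1]
sports_text = [0.1, 0.9, 0.3]

# Texts on related subjects should score closer to 1 than unrelated ones.
print(cosine_similarity(science_text, research_text))
print(cosine_similarity(science_text, sports_text))
```

With these made-up vectors, the two science-related texts come out far more similar to each other than either is to the sports text, which is the signal a clustering step would use to group them under one topic.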

Scientists at the HSE Laboratory for Social and Cognitive Informatics in St Petersburg examined five topic models (ETM, GLDAW, GSM, WTM-GMM and W-LDA), which are based on different mathematical principles:

ETM is a model proposed by the prominent mathematician David M. Blei, who is one of the founders of the field of topic modelling in machine learning. His model is based on latent Dirichlet allocation and employs variational inference to calculate probability distributions, combined with embeddings.
Two models—GSM and WTM-GMM—are neural topic models.
W-LDA is based on Gibbs sampling and incorporates embeddings, but also uses latent Dirichlet allocation, similar to the Blei model.
GLDAW relies on a broader collection of embeddings to determine the association of words with topics.

For any topic model to perform effectively, it is crucial to determine the optimal number of categories or clusters into which the information should be divided. This is an additional challenge when tuning algorithms.

‘Typically, a person does not know in advance how many topics are present in the information flow, so the task of determining the number of topics must be delegated to the machine. To accomplish this, we proposed measuring the amount of information as the inverse of chaos. If there is a lot of chaos, then there is little information, and vice versa. This allows for estimating the number of clusters, or in our case, topics associated with the dataset. We applied these principles in the GLDAW model,’ says Sergey Koltsov, primary author of the paper and Leading Research Fellow at the Laboratory for Social and Cognitive Informatics.
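A standard way to put a number on ‘chaos’ in a probability distribution is Rényi entropy. The sketch below shows only the textbook definition applied to two invented distributions; it is not the authors' procedure for choosing the number of topics, which the article does not describe in detail. The function name and the choice of order 2 are assumptions made for illustration.

```python
import math


def renyi_entropy(probabilities, alpha=2.0):
    """Renyi entropy of order alpha: log(sum(p_i ** alpha)) / (1 - alpha).

    Defined here for alpha > 0 and alpha != 1; zero probabilities are skipped.
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(p ** alpha for p in probabilities if p > 0)
    return math.log(total) / (1 - alpha)


# Invented examples: a maximally 'chaotic' distribution and a sharply peaked one.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

# The uniform distribution has the higher entropy (more chaos, less information).
print(renyi_entropy(uniform), renyi_entropy(peaked))
```

For the uniform case the entropy equals the logarithm of the number of outcomes, the largest possible value, while the peaked distribution scores close to zero.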

The researchers evaluated the models on three measures: stability (number of errors), coherence (establishing connections), and Rényi entropy (measuring the degree of chaos). The algorithms' performance was tested on three datasets: materials from the Russian-language news resource Lenta.ru and two English-language datasets, 20 Newsgroups and WoS. These datasets were chosen because all their texts had already been assigned tags, which made it possible to evaluate how well the algorithms identified the topics.
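To give a sense of what a coherence score looks like in practice, the sketch below computes a UMass-style co-occurrence score: for each pair of a topic's top words, it takes the logarithm of how often the two words appear in the same document relative to how often the first one appears at all. This is an assumption for illustration only; the article does not say which coherence measure the paper used, and the documents and word lists are invented.

```python
import math


def document_frequency(words, doc_sets):
    """Number of documents that contain every word in `words`."""
    return sum(1 for doc in doc_sets if all(w in doc for w in words))


def umass_coherence(top_words, documents):
    """UMass-style coherence of a topic's top words.

    Sums log((D(w_m, w_l) + 1) / D(w_l)) over all pairs with l < m.
    Assumes every word in `top_words` occurs in at least one document.
    """
    doc_sets = [set(doc.split()) for doc in documents]
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = document_frequency([top_words[m], top_words[l]], doc_sets)
            single = document_frequency([top_words[l]], doc_sets)
            score += math.log((joint + 1) / single)
    return score


# Invented mini-corpus: two 'science' documents and two 'industry' documents.
documents = [
    "scientists tested the algorithm in the laboratory",
    "the laboratory published an analysis by scientists",
    "the factory hired workers for the night shift",
    "workers at the factory asked for higher wages",
]

coherent = umass_coherence(["scientists", "laboratory"], documents)
mixed = umass_coherence(["scientists", "factory"], documents)
print(coherent, mixed)
```

Words that tend to occur together (‘scientists’ and ‘laboratory’) receive a higher score than words that never share a document (‘scientists’ and ‘factory’), which is the intuition behind calling a topic coherent.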

The experiment showed that ETM outperformed other models in terms of coherence on the Lenta.ru and 20 Newsgroups datasets, while GLDAW ranked first for the WoS dataset. Additionally, GLDAW exhibited the highest stability among the tested models, effectively determined the optimal number of topics, and performed well on shorter texts typical of social networks.

‘We improved the GLDAW algorithm by incorporating a large collection of external embeddings derived from millions of documents. This enhancement enabled more accurate determination of semantic coherence between words and, consequently, more precise grouping of texts,’ Sergey Koltsov adds.

GSM, WTM-GMM and W-LDA demonstrated lower performance than ETM and GLDAW across all three measures. This finding surprised the researchers, as neural network models are generally considered superior to other types of models in many aspects of machine learning. The scientists have yet to determine the reasons for their poor performance in topic modelling.

Date

19 September 2024
