
The KODAQS Toolbox – Assessing and Mitigating Data Quality Issues – Part 2: Digital Behavioral Data

[Image: A graphic representation of the KODAQS Toolbox, featuring an open toolbox with labeled sections for 'Skills', 'Tech', and 'Training', alongside the KODAQS logo and the title 'The KODAQS Toolbox – Assessing and Mitigating Data Quality Issues'.]

In the first blog post of the KODAQS Toolbox series, we discussed how data quality issues can affect survey data. Similar challenges arise in digital behavioral data (DBD), though they often manifest differently. Researchers may encounter missing or deleted posts, inconsistent annotation schemes across datasets, or preprocessing decisions – such as text cleaning, stopword removal, or automated translation – that alter the data. In addition, digital traces may only imperfectly reflect the social constructs of interest. If ignored, these issues can quietly undermine even the most sophisticated analyses.


DOI: 10.34879/gesisblog.2026.118


Understanding Data Quality Issues with DBD Through the TED-On Framework

DBD are a promising resource for social science research and beyond. However, because DBD research is still relatively new, systematic work on data quality is less developed than for more established data sources. Unlike survey data, DBD consist of behavioral traces left by individuals (or algorithms) as they interact with digital platforms or systems. These nonreactive traces – generated as by-products – are often less prone to certain survey biases, but they also limit researchers’ control over data generation. Consequently, working with DBD poses distinct challenges for assessing and ensuring data quality.

Similar to survey research, data quality issues in DBD should not be treated as isolated problems but examined within a broader error framework. Building on the Total Survey Error (TSE) framework (Biemer, 2010), Sen et al. (2021) proposed the TED-On framework (Total Error Framework for Digital Traces of Human Behavior on Online Platforms), which adapts the core principles of TSE to the specific characteristics of digital trace data.1

The TED-On framework distinguishes between measurement errors and representation errors. Measurement errors occur when data do not accurately capture the construct of interest – for example, using sentiment analysis of social media posts as a proxy for presidential approval, where positive language may be sarcastic or refer to others. Representation errors arise when the observed data do not reflect the target population; even if all posts about a U.S. president were collected, the platform’s user base would still differ from the general population or eligible voters.
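To make the measurement-error example tangible, here is a deliberately naive lexicon-based sentiment scorer in Python (the lexicon and the post are invented for illustration): a sarcastic post about a president is scored as approval even though the author disapproves.

```python
# Deliberately naive lexicon-based sentiment scoring: illustrates why
# positive surface language is a noisy proxy for approval
# (a measurement error in TED-On terms). Lexicon and post are invented.
POSITIVE = {"great", "love", "fantastic", "wonderful"}
NEGATIVE = {"bad", "hate", "terrible", "awful"}

def lexicon_sentiment(post: str) -> int:
    """Return (#positive - #negative) lexicon hits; the sign is the predicted stance."""
    tokens = post.lower().replace(",", " ").replace(".", " ").split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

# A sarcastic post is scored as approval even though the author disapproves:
sarcastic = "Oh great, another fantastic decision by the president."
print(lexicon_sentiment(sarcastic))  # 2, i.e. misread as strong approval
```

The gap between the measured quantity (positive words) and the intended construct (approval) is exactly what the TED-On framework asks researchers to make explicit.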

Moreover, the TED-On framework organizes these errors across the main stages of the research process: construct definition, where researchers link DBD to theoretical concepts; platform, device, or application selection, which sets the context and observed population; data collection, where specific traces or indicators are chosen; data processing, where raw data are transformed or annotated; and data analysis, where measures are derived and inferences drawn. Measurement and representation errors can occur at each stage. These stages are often iterative rather than strictly sequential, as insights from later stages may prompt revisions to earlier decisions.

From Theory to Practice: Finding and Using Tools for Data Quality Assessment

As with any type of data, understanding the theoretical sources of error in DBD is an important first step. However, the real challenge lies in translating this knowledge into practical workflows. How can researchers systematically assess potential data quality issues in their datasets, and which tools or resources can support them in improving the quality of DBD in practice?

DBD is still a comparatively young field, so systematic research on assessing and standardizing data quality is less developed than in established areas like survey research. At the same time, DBD have specific characteristics that make it difficult to define universally applicable data quality measures and indicators. As a result, researchers often need to adapt existing indicators or develop project-specific approaches. Even when suitable indicators exist, practical questions remain: how should these measures be implemented, and how can researchers ensure they are correctly applied to a particular dataset? Moreover, identifying potential data quality problems is only the first step; the greater challenge often lies in determining how to address them effectively.

To support researchers in this process, the KODAQS Data Quality Toolbox was developed to translate theoretical knowledge about data quality into practical implementation. It guides users through case scenarios and offers recommendations for identifying and addressing specific data quality issues. For a detailed introduction, see our previous blog post.

Exploring KODAQS Tools for Digital Behavioral Data

How can the Toolbox be used in practice? The KODAQS Toolbox includes seven tools that focus on different quality aspects of DBD and offer a hands-on introduction to identifying and addressing common data quality issues.

Delab Trees

The Delab Trees tool is a Python library for analyzing network data, with a particular focus on social media conversation structures. Within the TED-On framework, it is especially relevant at the construct definition stage, where it helps researchers assess the validity of individual nodes in conversation trees. It can also be used during data preprocessing and data analysis, for example to handle large-scale conversation trees, account for deleted posts, study interaction dynamics between authors, and support discussion mining.
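The actual delab_trees API is not reproduced here; the following minimal Python sketch only illustrates one idea behind such conversation-tree quality checks, namely that replies pointing to a parent absent from the dataset indicate deleted or uncollected posts (all identifiers are invented):

```python
# NOT the delab_trees API: a minimal sketch of one idea behind
# conversation-tree validity checks. A reply whose parent id is absent
# from the dataset points at a deleted or uncollected post.
def find_orphaned_replies(posts: dict) -> set:
    """posts maps post_id -> parent_id (None for thread roots).
    Returns ids of replies whose parent is missing from the dataset."""
    return {pid for pid, parent in posts.items()
            if parent is not None and parent not in posts}

thread = {
    "t1": None,       # thread root
    "r1": "t1",       # reply to the root
    "r2": "deleted",  # its parent was removed before collection
}
print(find_orphaned_replies(thread))  # {'r2'}
```

Flagging such gaps early prevents deleted posts from silently distorting later analyses of interaction dynamics.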

SubData

The SubData tool, implemented as a Python library, evaluates the alignment between large language models (LLMs) and human perspectives in subjective annotation tasks. It is particularly relevant during data preprocessing, where it helps to improve data quality by harmonizing heterogeneous datasets, providing standardized keyword mappings and taxonomies, and enabling theory-driven analyses of perspective alignment. In addition, SubData includes ten curated datasets that can be used to identify model biases, test generalizability, and contextualize hate speech datasets within the broader literature.
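As a rough illustration of the kind of question SubData addresses (this is not its API; the labels and annotator groups are invented), alignment between an LLM and different human annotator groups can be quantified with simple agreement rates:

```python
# NOT the SubData API: a hedged sketch of quantifying how closely an
# LLM's labels align with different human annotator groups.
def agreement(model_labels, human_labels) -> float:
    """Share of items on which the model and a human group agree."""
    assert len(model_labels) == len(human_labels)
    return sum(m == h for m, h in zip(model_labels, human_labels)) / len(model_labels)

llm     = ["hate", "ok", "hate", "ok"]    # invented model annotations
group_a = ["hate", "ok", "hate", "hate"]  # invented annotator group A
group_b = ["ok",   "ok", "ok",   "ok"]    # invented annotator group B

print(agreement(llm, group_a))  # 0.75, closer alignment with group A
print(agreement(llm, group_b))  # 0.5
```

Systematic differences in such agreement rates across groups are one signal of the model biases that SubData's curated datasets are designed to surface.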

TES-D

The Total Errors Sheet for Datasets (TES-D) is a template-based approach for documenting datasets from online sources such as social media or Wikipedia, focusing on potential errors throughout the research process. It provides a structured catalogue of questions that guides researchers in reflecting critically on data collection and potential measurement or representation errors. Completed TES-D documentation is intended to accompany the dataset, enhancing transparency, reproducibility, and responsible reuse, as illustrated by the “Call me sexist, but…” dataset by Samory et al. (2021).
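The real TES-D is a question catalogue, not software, but its core check (has every stage of the research process been documented?) can be sketched as structured data; the stage names and example answers below are invented placeholders:

```python
# Hypothetical sketch only: the real TES-D is a question template, not a
# Python format. Stage names and example answers are invented.
TESD_STAGES = [
    "construct_definition",
    "platform_selection",
    "data_collection",
    "data_preprocessing",
    "data_analysis",
]

def undocumented_stages(doc: dict) -> list:
    """Return research stages the dataset documentation leaves unanswered."""
    return [s for s in TESD_STAGES if not doc.get(s, "").strip()]

doc = {
    "construct_definition": "Definition of the target construct",
    "platform_selection": "Single social media platform",
    "data_collection": "Keyword-based sampling",
}
print(undocumented_stages(doc))  # ['data_preprocessing', 'data_analysis']
```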

TextPrep

The TextPrep tool, implemented in R, provides text preprocessing and comparative strategies to improve the quality of social media data. It supports common techniques, such as automated translation, minor text operations, and stopword filtering, while allowing systematic comparisons of alternative approaches. The tool helps to assess how different procedures may affect analytical outcomes and provides metrics to quantify differences, helping researchers evaluate choices transparently. As an illustrative use case, a synthetic dataset of social media posts about the 2024 Summer Olympics is processed using different configurations to compare their impact on the resulting text.
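TextPrep itself is an R tool; the following Python sketch only illustrates the general idea of comparing preprocessing configurations, here with an invented stopword list and a token-overlap (Jaccard) metric:

```python
# Illustrative sketch (TextPrep is an R tool; this is not its API):
# quantify how much a preprocessing choice changes a text by comparing
# token sets. Stopword list and post are invented.
STOPWORDS = {"the", "a", "at", "in", "of", "and"}

def tokenize(text: str, remove_stopwords: bool) -> set:
    tokens = {t.strip(".,!?").lower() for t in text.split()}
    return tokens - STOPWORDS if remove_stopwords else tokens

def jaccard(a: set, b: set) -> float:
    """Token overlap between two preprocessing variants of the same text."""
    return len(a & b) / len(a | b)

post = "The relay team won gold at the 2024 Olympics in Paris!"
print(jaccard(tokenize(post, False), tokenize(post, True)))  # 0.7
```

A low overlap score signals that a preprocessing choice substantially alters the input and therefore deserves scrutiny before analysis.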

TopLing

The TopLing tool enables researchers to evaluate the quality of machine translation as a consolidation strategy for multilingual topic modeling in R. It provides a framework to assess how translation decisions affect downstream topic modeling, identify topics distorted by translation artifacts, and apply dedicated metrics to evaluate translation suitability. By making translation-induced inconsistencies visible, TopLing helps mitigate potential measurement errors during preprocessing and analysis. As a hands-on example, the accompanying tutorial demonstrates its use on a German-language United Nations corpus (Eisele and Chen, 2010).
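TopLing is likewise an R tool; this hedged Python sketch mirrors just one idea it embodies, flagging topics whose top terms are unstable across a machine translation step (both term lists are invented):

```python
# Illustrative sketch only (not the TopLing API): flag topics whose top
# terms are unstable across a machine translation step. Term lists invented.
def topic_stability(original_top, translated_top) -> float:
    """Share of a topic's original top terms that survive translation."""
    return len(set(original_top) & set(translated_top)) / len(set(original_top))

before = ["climate", "policy", "emissions", "treaty", "targets"]
after  = ["climate", "politics", "emission", "contract", "targets"]
print(topic_stability(before, after))  # 0.4, likely distorted by translation
```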

tubecleanR

The tubecleanR tool, implemented as an R package, provides functions for cleaning and preprocessing YouTube comment data collected using the R packages tuber or vosonSML. It addresses potential measurement errors by offering structured routines for handling typical challenges, such as separating text, emoticons, and paradata. This helps researchers prepare high-quality datasets for analysis. A tutorial demonstrates its use on a synthetic dataset generated with Google Gemini, replicating the structure of real YouTube comment data.
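tubecleanR's own functions are not reproduced here; the Python sketch below only mirrors the idea of splitting a comment into text, emoticons, and paradata, using deliberately simple patterns that cover ASCII emoticons and video timestamps only:

```python
import re

# Illustrative sketch only (tubecleanR is an R package; this is not its
# API): split a YouTube-style comment into text, emoticons, and paradata.
EMOTICON = re.compile(r"[:;]-?[)(DP]")        # simple ASCII emoticons only
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}\b")  # video timestamps like 1:23

def split_comment(comment: str) -> dict:
    emoticons = EMOTICON.findall(comment)
    timestamps = TIMESTAMP.findall(comment)
    text = TIMESTAMP.sub("", EMOTICON.sub("", comment))
    return {"text": " ".join(text.split()),
            "emoticons": emoticons,
            "timestamps": timestamps}

print(split_comment("Great moment at 1:23 :) :D"))
```

Keeping emoticons and timestamps as separate fields, rather than deleting them, preserves paradata that may itself be analytically useful.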

ValiText

The ValiText tool provides a framework and practical guidance for validating text-based measures of social constructs. It mitigates measurement errors by outlining key validation evidence, defining tailored validation steps, and offering a checklist for transparent documentation. By guiding researchers through a systematic process, ValiText helps ensure that text-derived indicators accurately capture the intended constructs. As an example, it has been applied to the “Call me sexist, but…” dataset by Samory et al. (2021) to demonstrate practical validation and documentation.
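ValiText defines a full validation framework; the following Python fragment sketches just one concrete piece of validation evidence it calls for, comparing a text-derived indicator against human-coded gold labels (the label vectors are invented):

```python
# NOT the ValiText framework itself: a sketch of one kind of validation
# evidence, comparing a text-derived indicator against human-coded gold
# labels. The label vectors are invented.
def precision_recall(predicted, gold):
    tp = sum(p == 1 and g == 1 for p, g in zip(predicted, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(predicted, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(predicted, gold))
    return tp / (tp + fp), tp / (tp + fn)

predicted = [1, 1, 0, 0, 1, 0]  # e.g. a keyword-based indicator
gold      = [1, 0, 0, 1, 1, 0]  # human coding
precision, recall = precision_recall(predicted, gold)
print(precision, recall)  # 2/3 each: the indicator needs refinement
```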

Outlook

While DBD offer considerable potential for social science research, their value increases when they are linked with additional data sources, such as survey data. Through linkage, researchers can combine complementary information and analyze relationships that cannot be investigated using a single dataset.

To learn more about the opportunities and challenges of combining DBD with other data sources, see our upcoming blog post on Linked Data.

About KODAQS: Improving Data Quality in the Social Sciences

The Competence Center for Data Quality in the Social Sciences (KODAQS) is a collaboration between GESIS – Leibniz Institute for the Social Sciences, the University of Mannheim, and LMU Munich. Its mission is to support researchers in evaluating and improving the quality of social science data through a combination of training opportunities at the KODAQS Data Quality Academy, open-educational resources like the KODAQS Data Quality Toolbox, and collaborative research as part of the Guest Researcher Program.

Curious to learn more or get involved? Explore the KODAQS Toolbox and our related KODAQS services and join us in advancing data quality in the social sciences.


Endnotes/Footnotes

  1. DBD include more than digital traces from online platforms; they also encompass, for example, data from smartwatches or car-sharing apps. While the TED-On framework was developed for online platform data, its core principles can also be applied to other forms of DBD with minor adaptations.