ChatGPT: A tutor for data analysis?

A picture created by DALL-E – Request: “a robot working on a laptop, on the screen you can see a bar chart, style Roy Lichtenstein”

ChatGPT is an interesting new tool to support data analysis in the social sciences. It can be used to get suggestions for data analysis methods and for writing code. In this post, I summarize my experience in using ChatGPT to conduct a simple analysis of turnout with the ESS 2018.

DOI: 10.34879/gesisblog.2023.67


ChatGPT and its implications for the social sciences

The OpenAI chatbot ChatGPT has been available for a few months now and has started a wide discussion about the impact of artificial intelligence (AI) in the coming years. ChatGPT generates human-like responses, answers questions, and allows for significantly better interaction with chatbots than before. It has passed standardized exams, generates texts on different topics, and summarizes content (OpenAI, 2023). Several other applications have been found for ChatGPT, which has led to an active discussion about the implications of new AI models for our future.

In this post, I explore the potential application of ChatGPT in social science data analysis. ChatGPT, with its sophisticated language processing capabilities, may be an important tool to support data analysis and enrich the writing of research outcomes. Thanks to the large amount of publicly accessible online resources on data analysis, ChatGPT can be trained to generate insightful textual responses. Consequently, it may serve as a supplementary tool for addressing common questions that arise during social science data analysis.

Social science data analysis

Data analysis in the social sciences follows a fairly standardized sequence of steps. Data is analyzed with software such as R, Stata, Python, or SPSS. The analysis typically includes the following steps:

  1. Data is read into the program.
  2. Data is explored, cleaned up, and prepared for analysis.
  3. A descriptive analysis describes and visualizes the variables of interest.
  4. A multivariate analysis is conducted to develop several statistical models to answer a question of interest.
  5. A report or a scientific paper is prepared to present and summarize the results of the data analysis.

Teaching data analysis to students of the social sciences follows the same path. Students learn statistical methods and the use of statistical software so that they can conduct data analysis independently. In their work, they use these skills to study social science research questions with quantitative data. This learning process includes many new topics and techniques: statistical concepts, indicators, and models as well as basic computer programming approaches. A wide range of learning materials exists to learn, apply, and improve statistical and software skills at different levels, from early beginners to advanced learners.

At German universities, statistical training in the social sciences is often done with large surveys that are available from GESIS or other international social science data archives. The European Social Survey (ESS), first conducted in 2002, is a prominent example that provides plenty of information about the social dynamics in Germany and Europe over the last decades. The ESS is also widely used in social science research. In teaching quantitative methods to social science students, the survey is often used to study public opinion, social inequality, and political participation.

ChatGPT meets ESS and R

I have used surveys such as the ESS regularly in teaching social science methods with R and Stata. So, I wanted to find out to what degree ChatGPT can serve as a tutor and teacher to conduct a simple statistical analysis of a survey with R. The point of reference is teaching material that introduces statistical techniques, survey analysis, and software for studies with data in the social sciences.

This should be favorable terrain for ChatGPT. There is a lot of information available online on data analysis (e.g., teaching material, research articles, and online exchanges on social media). The ESS has been broadly used in research publications, and the replication material, documentation, and code snippets are widely available. Much of this freely accessible material may even have been used to train ChatGPT.

The questions I was interested in: How well does ChatGPT answer and explain statistical questions? To what degree can it be used to write the code for a survey data analysis in R? Are its responses sound?

To answer these questions, I conducted a ChatGPT session of a little more than two hours. In the session, I interacted with ChatGPT in English to develop a simple analysis of voting behavior with the ESS. I started by asking ChatGPT questions about the steps of a data analysis. If the chatbot got stuck or if the provided answers were not helpful, I conducted the required steps independently and gave them as feedback to ChatGPT. The result was an R script for the ESS analysis.

ESS, R, and ChatGPT

Here, I study political behavior with a focus on voting by conducting data analysis of the ESS 2018. I was particularly interested in the question of turnout: Who participated in the last election? This is a classic question of democratic representation and many studies have been conducted in the social sciences to specify determinants of turnout (see Smets and van Ham, 2013).

As a social scientist, I have solid experience with R, RStudio, and the tidyverse (Wickham, Çetinkaya-Rundel, and Grolemund, 2023). I have also taught data analysis regularly to social science students. However, I still consult the documentation regularly when working with the tidyverse in a data analysis and hoped that ChatGPT would provide some support. For my exchange with ChatGPT, I got the ESS 2018 data from the ESS Data Portal (ESS Round 9, 2018). So, I was ready to be guided by ChatGPT through the analysis.

At my request, ChatGPT gave me an overview of voting-related variables in the ESS. That was a good start. However, when I next asked for the name of the party vote variable, the names it suggested were not in the ESS, although the descriptions sounded reasonable. This may have been a challenge for the chatbot because the ESS has a separate party vote variable for every country. I then asked ChatGPT for the variable on whether a respondent voted and for age. It suggested the correct variable names and gave a brief description. This saved me some time looking things up in the ESS codebook.

User

what is the ess variable if a respondent voted

ChatGPT

The ESS includes several variables that measure voting behavior, depending on how you define “voted.” Here are some of the key variables in the ESS 2018 that may be relevant:

– “vote”: This variable measures whether the respondent voted in the most recent national parliamentary election using a binary (yes/no) response. Respondents are asked “Did you vote in the most recent national parliamentary election?” If the respondent answers “yes,” the variable is coded as 1; otherwise, it is coded as 0.

– “vote_what”: This variable measures which political party the respondent voted for in the most recent national parliamentary election. Respondents who did not vote or did not remember their vote are coded as missing. The variable codes the responses with a numerical value, which can be used to measure the respondent’s party preference.

Note that the variable names may vary slightly depending on the format in which you downloaded the ESS data. Therefore, it’s important to consult the codebook or documentation to ensure that you are using the correct variable names for your analysis.

User

what is the variable for age?

ChatGPT

In the European Social Survey (ESS), the variable for age is typically named “agea” in the data. The variable represents the age of the respondent at the time of the survey, and is measured in years. Note that the variable name may differ depending on the version of the ESS you are using, so it’s always a good idea to consult the codebook or documentation to ensure that you are using the correct variable name for your analysis.
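
Because a chatbot can suggest plausible-sounding names that do not exist, suggested variable names are worth checking against the data before they are used. A minimal sketch in base R: the small data frame `ess` is a made-up stand-in for the loaded ESS file, and `suggested` simply collects the names from the chat excerpt above.

```r
# Toy stand-in for the loaded ESS data (values are made up for illustration)
ess <- data.frame(
  cntry = c("DE", "FR", "DE"),
  vote  = c(1, 2, 1),
  agea  = c(34, 58, 71)
)

# Variable names suggested in the chat
suggested <- c("vote", "vote_what", "agea")

# Names that exist in the data, and names that do not
found     <- intersect(suggested, names(ess))
not_found <- setdiff(suggested, names(ess))

found
not_found
```

Anything that ends up in `not_found` needs a look in the codebook before it goes into the analysis script.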

I could use ChatGPT to get R code suggestions to load the data and to conduct a descriptive analysis of the data. This started well. I could run the snippets and sometimes made minor code modifications. Some of the code was slightly redundant or did not use the most recent approach that is available in R and the tidyverse. Overall, it went well and was helpful. Creating a script for the data analysis was significantly faster with the help of ChatGPT responses and code snippets. It would have taken me more time to look up the information on the help pages or through an online search.

I supported ChatGPT by providing the names of the descriptive statistics I was interested in, and at my request it gave me the respective R code. ChatGPT also gave me a correct and reasonable summary of R packages that can be used for descriptive data analysis. The voting variable needed some data transformation, which did not work well with ChatGPT because it assumed a different coding than the variable actually has.
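
A transformation of this kind might look as follows. This is only a sketch on simulated values, not the code from the session. It assumes the ESS coding of `vote` as 1 = yes, 2 = no, 3 = not eligible to vote, and 7 to 9 for refusal, don't know, and no answer, and it assumes 999 as the missing code for `agea`. Both assumptions should be verified in the codebook of the data file in use.

```r
# Simulated stand-in for a few ESS respondents (not real data)
ess <- data.frame(
  vote = c(1, 2, 3, 1, 8, 2, 1, 7),
  agea = c(23, 45, 17, 67, 999, 38, 52, 29)
)

# Assumed codes: 1 = yes, 2 = no; everything else
# (not eligible, refusal, don't know, no answer) is set to missing
ess$voted <- ifelse(ess$vote == 1, 1L,
                    ifelse(ess$vote == 2, 0L, NA_integer_))

# Assumed missing code for age: 999
ess$age <- ifelse(ess$agea == 999, NA, ess$agea)

# Descriptive statistics: turnout counts, turnout share, and age summary
table(ess$voted, useNA = "ifany")
mean(ess$voted, na.rm = TRUE)
summary(ess$age)
```

Treating code 2 as "did not vote" and dropping the remaining codes is a design choice; whether respondents who were not eligible belong in the analysis depends on the research question.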

Afterwards, I moved to multivariate analysis and asked ChatGPT to suggest variables. It gave me this reasonable list of suggestions.

ChatGPT

1. `cntry`: Country of the respondent

2. `essround`: ESS round (useful for comparing results across rounds)

3. `agea`: Age of the respondent

4. `gender`: Gender of the respondent

5. `edulvla`: Level of education of the respondent

6. `emplnw`: Employment status of the respondent

7. `hinctnta`: Net household income of the respondent

8. `trstprl`: Trust in the national parliament

9. `stflife`: Satisfaction with life

10. `psppsgva`: General trust in people

11. `sclmeet`: Frequency of social meetings

I also asked for a statistical model to use. It rightly suggested logistic regression and multilevel logistic regression, but also Poisson regression, which is meant for count data and was not applicable here. It generated the R code for the logistic regression model but was not helpful when I needed to transform some of the data. I also wanted to visualize the model results in a graphic. ChatGPT pointed me in the right direction, but its code suggestions were verbose and did not work right away. Here I had to point ChatGPT towards a simpler solution with a more recent R package, ggeffects (Lüdecke, 2018). Once it knew the package, it used the approach in its code suggestions.
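
A logistic regression of turnout and the predicted probabilities behind such a graphic can be sketched in base R. The data below is simulated rather than taken from the ESS, the variable names are illustrative, and the sketch uses `predict()` in place of ggeffects so that it runs without additional packages.

```r
set.seed(42)

# Simulated data: turnout rises with age (not real ESS data)
n     <- 1000
age   <- sample(18:90, n, replace = TRUE)
voted <- rbinom(n, size = 1, prob = plogis(-1 + 0.04 * (age - 18)))
d     <- data.frame(voted, age)

# Logistic regression of turnout on age
m <- glm(voted ~ age, data = d, family = binomial)
summary(m)$coefficients

# Predicted probability of voting at selected ages
new_d <- data.frame(age = c(20, 40, 60, 80))
new_d$p_vote <- predict(m, newdata = new_d, type = "response")
new_d
```

With ggeffects installed, `ggpredict(m, terms = "age")` returns a comparable table of predicted probabilities with confidence intervals, and calling `plot()` on the result draws the graphic.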

For the multiple regression there were often smaller mistakes that I could catch thanks to my longtime experience with R. I was significantly faster in doing my analysis in R but had to correct some minor code mistakes and sometimes needed to guide ChatGPT towards the coding approaches I was interested in.

I also used ChatGPT to get some methodological advice on how to study country differences and interaction effects. Again, the responses were reasonable and pointed in the right direction. It suggested alternative model specifications and other variables that could be used after I asked to improve the model.
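
One common way to study country differences and interaction effects is to compare a model with main effects only to a model in which an effect may differ between countries. The following sketch assumes simulated data for two hypothetical countries, "A" and "B", and illustrative variable names; it is not the specification from the session.

```r
set.seed(1)

# Simulated data for two countries with different age effects (not real ESS data)
n     <- 2000
cntry <- factor(sample(c("A", "B"), n, replace = TRUE))
age10 <- sample(18:90, n, replace = TRUE) / 10   # age in decades
slope <- ifelse(cntry == "A", 0.2, 0.6)
voted <- rbinom(n, 1, plogis(-2 + slope * age10))
d     <- data.frame(voted, age10, cntry)

# Model 1: countries differ only in the level of turnout
m_main <- glm(voted ~ age10 + cntry, data = d, family = binomial)

# Model 2: the age effect may also differ between countries
m_int <- glm(voted ~ age10 * cntry, data = d, family = binomial)

# Likelihood-ratio test comparing the two nested models
anova(m_main, m_int, test = "Chisq")
```

The coefficient `age10:cntryB` in the second model describes how much the age effect in country B differs from the one in country A on the logit scale.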

However, sometimes it just made things up. For example, a particular method was not part of the R package it suggested, although ChatGPT insisted that it was. It eventually suggested a correct alternative after I kept pointing out the mistake.

Finally, I briefly explored a more complex analysis technique, multilevel modeling. Again, it provided the correct R code. It also gave me a simple textual interpretation of the results after I provided the model results from the analysis in R. I asked more specific questions about the model, but ChatGPT kept its answers rather general.
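
A multilevel logistic regression with a random intercept per country can be sketched with the lme4 package, which is assumed to be installed. The data is simulated for 20 hypothetical countries, and the model is a minimal illustration rather than the one estimated in the session.

```r
library(lme4)  # assumed to be installed

set.seed(7)

# Simulated data: respondents nested in 20 hypothetical countries
n_cntry <- 20
n_per   <- 150
cntry   <- factor(rep(seq_len(n_cntry), each = n_per))
u       <- rnorm(n_cntry, mean = 0, sd = 0.5)   # country-level deviations
age10   <- sample(18:90, n_cntry * n_per, replace = TRUE) / 10
voted   <- rbinom(n_cntry * n_per, 1,
                  plogis(-1 + 0.3 * age10 + u[as.integer(cntry)]))
d <- data.frame(voted, age10, cntry)

# Random-intercept logistic regression: turnout level varies between countries
m_ml <- glmer(voted ~ age10 + (1 | cntry), data = d, family = binomial)

fixef(m_ml)    # average effects across countries
VarCorr(m_ml)  # estimated spread of the country intercepts
```

The fixed effects are read like ordinary logistic regression coefficients, while the random-intercept standard deviation indicates how strongly turnout levels differ between countries after accounting for age.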

ChatGPT, a reasonable tutor?

ChatGPT was helpful for me in the data analysis of the ESS with R. It was particularly helpful in writing code snippets, which I used, adjusted, and reused for the analysis. With regard to R, it mainly pointed in the right direction and provided good suggestions. Some of the suggestions were a little outdated and I had to encourage ChatGPT to use more recent approaches. Some of the code did not work, though this was often due to misspecified parameters or ChatGPT not knowing the structure of the data. Overall, ChatGPT was a good coding companion in R.

For the statistical analysis and the ESS data, the ChatGPT responses were reasonable. Sometimes they were helpful, sometimes they were misleading, and sometimes they were plainly wrong. It had good basic knowledge and may have passed a simple test in a social science data analysis class.

An AI-based tutor like ChatGPT can be a valuable aid for social science data analysis, especially when crafting analysis scripts. For me, ChatGPT was particularly helpful when I wanted to recall approaches or programming techniques, so it helped me to refresh my memory. However, the more detailed my questions were, the less I could make use of ChatGPT’s answers.

For those new to data analysis in R, Stata, or Python, ChatGPT may be less helpful. It does not provide a controlled environment but points in different directions. It can be used as a reasonable starting point. However, a good textbook, a structured online tutorial, or a course may be needed to get the necessary context information.

Using ChatGPT in data analysis feels a little like switching from a bicycle to a pedelec. You can ride more effortlessly and potentially faster. However, you still need to know how to ride a bike if you want to move forward and don’t want to fall off.

Check out the code for the ESS analysis with ChatGPT: https://hdigital.github.io/data-snippets/notebooks/chatgpt-ess.nb.html


References

  1. ESS Round 9: European Social Survey Round 9 Data. 2018. Data file edition 3.1. Sikt – Norwegian Agency for Shared Services in Education and Research, Norway – Data Archive and distributor of ESS data for ESS ERIC. https://doi.org/10.21338/NSD-ESS9-2018
  2. Lüdecke, Daniel. 2018. “ggeffects: Tidy Data Frames of Marginal Effects from Regression Models.” Journal of Open Source Software 3(26): 772. http://dx.doi.org/10.21105/joss.00772
  3. OpenAI. 2023. “GPT-4 Technical Report.” http://arxiv.org/abs/2303.08774 (May 6, 2023).
  4. Smets, Kaat, and Carolien van Ham. 2013. “The Embarrassment of Riches? A Meta-Analysis of Individual-Level Research on Voter Turnout.” Electoral Studies 32(2): 344-59. https://doi.org/10.1016/j.electstud.2012.12.006
  5. Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd edition. Sebastopol, CA: O’Reilly Media. https://r4ds.hadley.nz (Accessed May 6, 2023).
