Is ChatGPT a reliable research assistant? Building a books database
Last updated: Jan 5, 2025
I used ChatGPT to turn a critic’s list of 1300 books into a database, adding columns of contextual information to each book. Then I tested how accurate a portion of that generated information was.
LLMs are good at processing unstructured data, like internet pages and research papers, and at generating structured output, like code and JSON files. The devil is in the generated details. If they aren’t correct, what’s the use?
Testing the correctness of generated data is the goal of my informal experiment, which uses ChatGPT to build a database of information about classic books. The input is a long list of books that Harold Bloom considered to comprise the Western Canon. The goal output is a table of these books with additional columns of information that wasn’t present in the initial list.
Let’s ignore the validity of the list itself and consider its worth as source material to test an LLM’s capabilities as a research assistant:
- It would be boring to manually add contextual information for every single one of these 1300+ works
- Almost all the books in the list have extensive discussion in the wider body of written literature, meaning the model’s training corpus likely includes some basic contextual information about each one.
So I used ChatGPT to build me a books database.
Results summary
Though validation errors arose, ChatGPT did reasonably well at following my prompt to create a CSV file using the input values and specified schema.
The categorical values, for the enum Form field and the freeform Genre field, seem reasonable at a glance, but all I did was glance at them.
To evaluate the quality of the reported Year, I tested a sample of 100 randomly selected rows.
The sample accuracy rate was 78%, and I estimate the population accuracy-rate parameter to be between 70% and 87%.
For this specific case, I’d say the LLM worked well enough as a research assistant for my unserious work. With so many potential errors, however, it is unreliable for work with meaningful stakes. Nevertheless, I see ways to improve the setup. And, as an assistant, the bot may be a better fit for helping researchers categorize and filter, rather than delivering high-precision values and facts.
Method
To generate the data, I used the default version of ChatGPT 4o Mini that existed on November 24th, 2024. I did not log in.
To inspect and clean up the data, I used DuckDB to create a table from the generated CSV files.
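For example, DuckDB can read a generated CSV directly and report the schema it infers; the file name here is a hypothetical placeholder, not one of my actual files:

-- Peek at a generated file and the column types DuckDB infers for it.
DESCRIBE SELECT * FROM read_csv_auto('chaotic_age.csv');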
Goal
The main goal was to test how well ChatGPT could date a list of literary works for me. I also wanted to see how well it classified them by Genre and Form.
Prompt
I refined the prompt in iterations. The added instructions tell the story of all the issues that arose in the prior attempts.
First prompt:
Turn this into a single CSV file in the format
Author, Title, Form (enum: novel, poetry, drama, philosophy/essay), Genre, Year, Country
These first prompts returned many poor-quality and invalid fields. After a few iterations, my prompt was something like this:
Turn this into a single CSV file.
Each book needs its own row
Only enclose fields in " if they contain a comma.
The schema is as follows. Let year be an integer.
If year is indeterminate, leave empty.
All other fields must have values.
Author, Title, Form (enum: novel, poetry, drama, philosophy/essay, story collection), Genre, Year, Country
This worked well enough that I generated all the rows and started checking the accuracy of the Year field from a sample.
This exposed a fatal flaw in my experimental design: many of these works, in particular the collections of stories and poetry and the works from antiquity, are impossible to date precisely.
So I came up with my final attempt:
Turn this into a single CSV file.
Each book needs its own row
Only enclose fields in " if they contain a comma.
The schema is as follows:
Author, Title, Form, Genre, Year, Country
Form is an enum with values ['Drama','Novel','Story collection','Philosophy/essay','Poetry','Memoire/Autobiography']
Let year be an integer.
If form is drama, use year it was first performed.
Otherwise, use year first published.
If multiple volumes, use year the first volume was published.
if you cannot determine year, leave blank.
I expect many collections and ancient works to have empty year fields.
At the end of each prompt, I appended an unedited snippet of Bloom’s list. In the first attempts, I pasted too many items and ChatGPT occasionally hallucinated additional rows. In the final attempt, I prompted with about 100 rows at a time.
I made four CSV files, one for each “Age” as classified by Bloom.
Exploration and clean up
As I went through attempts, I used DuckDB to explore the data. Sometimes obvious structural issues surfaced: all titles for a particular author were in one semi-colon-delimited row; a cluster of rows had large swaths of empty fields. When stuff like this happened, I rewrote my prompt and tried again.
For the final table, I added a column to record the literary “age” before joining the CSV files into one table. I updated the final fields to fix minor validation issues and remove whitespace. The most common nuisances were CSV validation errors for fields containing commas, such as The Mandrake, A Comedy. I also converted all BC years to negative integers.
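The shape of that step, in a rough sketch with made-up file names for the four “Age” CSVs, was something like this:

-- One CSV per Bloom "Age"; tag each with its age before stacking them into one table.
CREATE TABLE books AS
SELECT *, 'Theocratic' AS age FROM read_csv_auto('theocratic.csv')
UNION ALL
SELECT *, 'Aristocratic' AS age FROM read_csv_auto('aristocratic.csv')
UNION ALL
SELECT *, 'Democratic' AS age FROM read_csv_auto('democratic.csv')
UNION ALL
SELECT *, 'Chaotic' AS age FROM read_csv_auto('chaotic.csv');

-- Strip stray whitespace from the text fields.
UPDATE books SET Title = trim(Title), Genre = trim(Genre);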
Results
I assess how well ChatGPT did by a few criteria:
- Did it structure the information as asked?
- Did it add information that was accurate?
For the “added information” part, ChatGPT added three new data points for each book¹:
- Year
- Form
- Genre
In the following sections, I’ll discuss how well ChatGPT did at each task.
Semi-structured prose to CSV
ChatGPT ingested a large semi-structured prose list and turned it into a CSV file. This definitely saved some labor. However, as the list already had a regular structure, this task also might have been done with some old-fashioned regex.
The structure of the data mostly conformed to my expectations, though it took some trial and error to validate the CSV output.
A future experiment might try to extract data from completely unstructured prose into a structured table of facts.
Quality of output for the Form field
Most of the time, ChatGPT reported the Form according to one of the values specified in the prompt.
However, it also repeatedly invented new values for some works, which went against instructions and created invalid table inserts.
In some cases, I might have preferred ChatGPT’s suggestion (I especially wish I had included the value history), but this editorialization went outside the prompt.
From 1300 books, 228 genres
Unlike the Form field, which was explicitly an enum, the Genre field could take any value.
ChatGPT created 228 genres. Only 118 genres were used for more than one title, and only 26 were used for more than 10 titles. This distribution leaves some thin categories, but I do like the specificity of some of the singleton genres, such as “Andalusian folk poetry.”
SELECT genre, count(genre)
FROM books
GROUP BY ALL
ORDER BY count(genre) DESC;
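The distribution figures above can be checked with a follow-up query along these lines:

-- How many genres exist in total, how many are attached to more than one title,
-- and how many to more than ten.
SELECT
    count(*) AS total_genres,
    count(*) FILTER (WHERE n > 1)  AS used_more_than_once,
    count(*) FILTER (WHERE n > 10) AS used_more_than_ten_times
FROM (
    SELECT genre, count(*) AS n
    FROM books
    GROUP BY genre
);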
Evaluating the quality of the generated genres is a topic for another post. I didn’t see anything obviously absurd or wrong but I didn’t look hard.
Accuracy of reported Year
When I started this experiment, the Year field seemed the most objective and the easiest to fact check. So Year is the only field I validated for this first experiment.
As checking all 1300 fields for errors would defeat the purpose of having a research assistant, I used DuckDB to give me a 100-field sample:
CREATE TABLE sample AS
SELECT Author, Title, "Year"
FROM books
USING SAMPLE 100;
My fact-checking technique was irresponsibly informal: first I looked for the work on English Wikipedia. If the work and its date didn’t appear in a page or in the “Works” section of the author, I checked Goodreads, Google Books, and the Internet Archive.
Of the 100-work sample, 91 had dates that I could verify. If you’re interested in the 9 I couldn’t verify, you can read the comments:
SELECT GPT_correct, count(GPT_correct)
FROM sample
GROUP BY ALL;
┌─────────────┬────────────────────┐
│ GPT_correct │ count(GPT_correct) │
│   boolean   │       int64        │
├─────────────┼────────────────────┤
│ false       │                 20 │
│ true        │                 71 │
└─────────────┴────────────────────┘
This left a sample of 91 works. Of these, ChatGPT incorrectly reported the year of 20 rows. “Incorrect” in this case means either factually wrong (e.g. reported 2003 when the first publication was 2001) or against the prompt’s instructions to leave ambiguous fields blank.
Here are the results:
SELECT * FROM sample
WHERE GPT_correct IS False;
Estimating accuracy rate
The sample of Year values has the following statistics:
| Statistic | Value |
|---|---|
| Total fields | 91 |
| Errors | 20 |
| Sample accuracy rate | 0.78 |
| Standard error | 0.04 |
| Sample distribution interval | [0.70, 0.87] |
In other words, using this method to build a confidence interval for a proportion, we can say with 95% confidence that the range [0.70, 0.87] covers the true accuracy rate.
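For what it’s worth, the same numbers can be computed inside DuckDB from the sample table, assuming the 9 unverifiable rows were left NULL in GPT_correct:

-- Sample accuracy, its standard error, and a 95% normal-approximation interval.
SELECT
    n,
    round(p_hat, 2) AS accuracy,
    round(sqrt(p_hat * (1 - p_hat) / n), 2) AS std_err,
    round(p_hat - 1.96 * sqrt(p_hat * (1 - p_hat) / n), 2) AS ci_low,
    round(p_hat + 1.96 * sqrt(p_hat * (1 - p_hat) / n), 2) AS ci_high
FROM (
    SELECT count(*) AS n,
           avg(CASE WHEN GPT_correct THEN 1 ELSE 0 END) AS p_hat
    FROM sample
    WHERE GPT_correct IS NOT NULL
);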
Discussion
Even with all the time spent rewriting prompts and fixing validation issues, ChatGPT built the books CSV database in far less time than it would have taken manually.
The results for the added fields are mixed.
From my cursory checks, the Genre and Form fields are at least reasonable. The error rate of the Year field, though, is too high to be reliable.
In this case, I judge my ChatGPT-built database to be good enough for unserious work. If you were using this database just to discover some books that a specific critic recommended, and you wanted to filter by some crude categorizations and fuzzy dates, the data quality is acceptable. If, however, you needed to do precise research, all the generated values are too suspect to be acceptable.
Ideas for extended research
I believe there is potential in using LLMs to create bespoke databases; I’m just not sure how easy it is to generate a very good one. Part of the quality issues in my DB arose from my haphazard design, but another part stems from the inherent instability of ChatGPT’s answers. For this reason, the build-a-research-database use case might work better for exploration and rough filtering (say, filtering by Genre) than for retrieving precise values.
If anything, this generated table can serve as a baseline to try to improve. Here are some ideas to improve this experiment:
- Have a clear hypothesis and criteria for how to test quality
- Think carefully about the specific values to generate and whether they really are objectively verifiable
- Test this quality on a small sample first to discover issues with the prompt
- If creating a large database, build some automation tools to check for basic issues and reprompt (things like putting all works for an author on one row happened multiple times); see the sketch of such a check after this list
- Refine the interaction with the LLM: better prompting, better use and understanding of context windows, changing temperature, and so on.
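As a sketch of that kind of automated check, a query against the schema and enum used in this experiment could flag rows worth a reprompt (assuming the cleaned-up column names above):

-- Flag rows that violate the schema: enum values outside the prompt's list,
-- blank required fields, or a title field that packs several works together.
SELECT *
FROM books
WHERE Form NOT IN ('Drama', 'Novel', 'Story collection',
                   'Philosophy/essay', 'Poetry', 'Memoire/Autobiography')
   OR trim(Genre) = ''
   OR trim(Country) = ''
   OR Title LIKE '%;%';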
I also have some ideas to extend this use case:
- Normalize the facts in different tables. For the input, create a table with a primary key. Then make that a foreign key in the tables where information needs to be generated. Pass the LLM the schema and have it build the database (see the schema sketch after this list).
- Require a source for each reported fact
- If the database is public, offer a way for users to fact check and edit values, similar to Wikipedia.
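A rough sketch of that normalized layout (the table and column names here are mine, not anything ChatGPT produced):

-- Input rows get a stable key.
CREATE TABLE works (
    work_id INTEGER PRIMARY KEY,
    author  VARCHAR NOT NULL,
    title   VARCHAR NOT NULL
);

-- Every generated fact references an input row, with room to record a source,
-- so individual values can be checked and corrected without touching the input.
CREATE TABLE generated_facts (
    work_id INTEGER REFERENCES works (work_id),
    form    VARCHAR,
    genre   VARCHAR,
    year    INTEGER,
    country VARCHAR,
    source  VARCHAR
);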
The journey was more fun than the result
My books DB may be a dud, but the process of making it was fun:
- I got to use DuckDB for the first time.
- I researched a ton of books I had never heard of.
- I now have a database for future learning projects, e.g. to build a simple web app or practice normalizing a data model.
- I discovered tons of issues in how I thought about creating an LLM test for a certain use case.
- I gained a new appreciation for how difficult it is to manage information precisely. All of my table’s values are subject to interpretation.
¹ It also added “Country” for each author. This information was usually in the prompt, but sometimes it was regionally or politically ambiguous. In these cases, ChatGPT picked a country. I saw no errors there.