Is ChatGPT a reliable research assistant? Building a books database
Last updated: Jan 5, 2025
I used ChatGPT to turn a critic’s list of 1300 books into a database, adding columns of contextual information to each book. Then I tested how accurate a portion of that generated information was.
LLMs are good at processing unstructured data, like internet pages and research papers, and at generating structured output, like code and JSON files. The devil is in the generated details. If they aren’t correct, what’s the use?
Testing the correctness of generated data is the goal of my informal experiment, which uses ChatGPT to build a database of information about classic books. The input is a long list of books that Harold Bloom considered to comprise the Western Canon. The goal output is a table of these books with additional columns of information that wasn’t present in the initial list.
Let’s ignore the validity of the list itself and consider its worth as source material to test an LLM’s capabilities as a research assistant:
- It would be boring to manually add contextual information for every single one of these 1300+ works
- Almost all the books in the list have extensive discussion in the wider body of written literature, meaning the model’s training corpus likely includes some basic contextual information about each one.
So I used ChatGPT to build me a books database.
Results summary
Though validation errors arose, ChatGPT did reasonably well at following my prompt to create a CSV file using the input values and specified schema.
The categorical values, for the enum Form field and the freeform Genre field, seem reasonable at a glance, but all I did was glance at them.
To evaluate the quality of the reported Year, I tested a sample of 100 randomly selected rows.
The sample accuracy rate was 78%, and I estimate the population accuracy-rate parameter to be between 70% and 87%.
For this specific case, I’d say the LLM worked well enough as a research assistant for my unserious work. With so many potential errors, however, it is unreliable for work with meaningful stakes. Nevertheless, I see ways to improve the setup. And, as an assistant, the bot may be a better fit for helping researchers categorize and filter, rather than delivering high-precision values and facts.
Method
To generate the data, I used the default version of ChatGPT 4o Mini that existed on November 24th, 2024. I did not log in.
To inspect and clean up the data, I used DuckDB to create a table from the generated CSV files.
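For example, DuckDB can read a generated CSV directly and report the schema it infers; the file name here is a hypothetical placeholder, not one of my actual files:

-- Peek at a generated file and the column types DuckDB infers for it.
DESCRIBE SELECT * FROM read_csv_auto('chaotic_age.csv');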
Goal
The main goal was to test how well ChatGPT could date a list of literary works for me. I also wanted to see how well it classified them by Genre and Form.
Prompt
I refined the prompt in iterations. The added instructions tell the story of all the issues that arose in the prior attempts.
First prompt:
Turn this into a single CSV file in the format
Author, Title, Form (enum: novel, poetry, drama, philosophy/essay), Genre, Year, Country
These first prompts returned many poor-quality and invalid fields. After a few iterations, my prompt was something like this:
Turn this into a single CSV file.
Each book needs its own row
Only enclose fields in " if they contain a comma.
The schema is as follows. Let year be an integer.
If year is indeterminate, leave empty.
All other fields must have values.
Author, Title, Form (enum: novel, poetry, drama, philosophy/essay, story collection), Genre, Year, Country
This worked well enough that I generated all the rows and started checking the accuracy of the Year field from a sample.
This exposed a fatal flaw in my experimental design: many of these works, in particular the collections of stories and poetry and the works from antiquity, are impossible to date precisely.
So I came up with my final attempt:
Turn this into a single CSV file.
Each book needs its own row
Only enclose fields in " if they contain a comma.
The schema is as follows:
Author, Title, Form, Genre, Year, Country
Form is an enum with values ['Drama','Novel','Story collection','Philosophy/essay','Poetry','Memoire/Autobiography']
Let year be an integer.
If form is drama, use year it was first performed.
Otherwise, use year first published.
If multiple volumes, use year the first volume was published.
if you cannot determine year, leave blank.
I expect many collections and ancient works to have empty year fields.
At the end of each prompt, I appended an unedited snippet of Bloom’s list. In the first attempts, I pasted too many items and ChatGPT occasionally hallucinated additional rows. In the final attempt, I prompted with about 100 rows at a time.
I made four CSV files, one for each “Age” as classified by Bloom.
Exploration and clean up
As I went through attempts, I used DuckDB to explore the data. Sometimes obvious structural issues surfaced: all titles for a particular author were in one semi-colon-delimited row; a cluster of rows had large swaths of empty fields. When stuff like this happened, I rewrote my prompt and tried again.
For the final table, I added a column to record the literary “age” before joining the CSV files into one table. I updated the final fields to fix minor validation issues and remove whitespace. The most common nuisances were CSV validation errors for fields containing commas, such as The Mandrake, A Comedy. I also converted all BC years to negative integers.
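The shape of that step, in a rough sketch with made-up file names for the four “Age” CSVs, was something like this:

-- One CSV per Bloom "Age"; tag each with its age before stacking them into one table.
CREATE TABLE books AS
SELECT *, 'Theocratic' AS age FROM read_csv_auto('theocratic.csv')
UNION ALL
SELECT *, 'Aristocratic' AS age FROM read_csv_auto('aristocratic.csv')
UNION ALL
SELECT *, 'Democratic' AS age FROM read_csv_auto('democratic.csv')
UNION ALL
SELECT *, 'Chaotic' AS age FROM read_csv_auto('chaotic.csv');

-- Strip stray whitespace from the text fields.
UPDATE books SET Title = trim(Title), Genre = trim(Genre);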
Results
I assess how well ChatGPT did by a few criteria:
- Did it structure the information as asked?
- Did it add information that was accurate?
For the “added information” part, ChatGPT added three new data points for each book¹:
- Year
- Form
- Genre
In the following sections, I’ll discuss how well ChatGPT did at each task.
Semi-structured prose to CSV
ChatGPT ingested a large semi-structured prose list and turned it into a CSV file. This definitely saved some labor. However, as the list already had a regular structure, this task also might have been done with some old-fashioned regex.
The structure of the data mostly conformed to my expectations, though it took some trial and error to validate the CSV output.
A future experiment might try to extract data from completely unstructured prose into a structured table of facts.
Quality of output for the Form field
Most of the time, ChatGPT reported the Form according to one of the values specified in the prompt.
However, it also repeatedly invented new values for some works, which went against instructions and created invalid table inserts.
In some cases, I might have preferred ChatGPT’s suggestion (I especially wish I had included the value history), but this editorialization went outside the prompt.
From 1300 books, 228 genres
Unlike the Form field, which was explicitly an enum, the Genre field could take any value.
ChatGPT created 228 genres. Only 118 genres were used for more than one title, and only 26 were used for more than 10 titles. This distribution leaves some thin categories, but I do like the specificity of some of the singleton genres, such as “Andalusian folk poetry.”
SELECT genre, count(genre)
FROM books
GROUP BY ALL
ORDER BY count(genre) DESC;
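The distribution figures above can be checked with a follow-up query along these lines:

-- How many genres exist in total, how many are attached to more than one title,
-- and how many to more than ten.
SELECT
    count(*) AS total_genres,
    count(*) FILTER (WHERE n > 1)  AS used_more_than_once,
    count(*) FILTER (WHERE n > 10) AS used_more_than_ten_times
FROM (
    SELECT genre, count(*) AS n
    FROM books
    GROUP BY genre
);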
Evaluating the quality of the generated genres is a topic for another post. I didn’t see anything obviously absurd or wrong but I didn’t look hard.
Accuracy of reported Year
When I started this experiment, the Year field seemed the most objective and the easiest to fact check. So Year is the only field I validated for this first experiment.
As checking all 1300 fields for errors would defeat the purpose of having a research assistant, I used DuckDB to give me a 100-field sample:
CREATE TABLE sample AS
SELECT Author, Title, "Year"
FROM books
USING SAMPLE 100;
My fact-checking technique was irresponsibly informal: first I looked for the work on English Wikipedia. If the work and its date didn’t appear in a page or in the “Works” section of the author, I checked Goodreads, Google Books, and the Internet Archive.
Of the 100-work sample, 91 had dates that I could verify. If you’re interested in the 9 I couldn’t verify, you can read the comments:
SELECT GPT_correct, count(GPT_correct)
FROM sample
GROUP BY ALL;
┌─────────────┬────────────────────┐
│ GPT_correct │ count(GPT_correct) │
│   boolean   │       int64        │
├─────────────┼────────────────────┤
│ false       │                 20 │
│ true        │                 71 │
└─────────────┴────────────────────┘
This left a sample of 91 works. Of these, ChatGPT incorrectly reported the year of 20 rows. “Incorrect” in this case means either factually wrong (e.g. reported 2003 when the first publication was 2001) or against the prompt’s instructions to leave ambiguous fields blank.
Here are the results:
SELECT * FROM sample
WHERE GPT_correct IS False;
Estimating accuracy rate
The sample of Year values has the following statistics:
| Statistic | Value |
|---|---|
| Total fields | 91 |
| Errors | 20 |
| Sample accuracy rate | 0.78 |
| Standard error | 0.04 |
| Sample distribution interval | [0.70, 0.87] |
In other words, using this method to build a confidence interval for a proportion, we can say with 95% confidence that the range [0.70, 0.87] covers the true accuracy rate.
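For what it’s worth, the same numbers can be computed inside DuckDB from the sample table, assuming the 9 unverifiable rows were left NULL in GPT_correct:

-- Sample accuracy, its standard error, and a 95% normal-approximation interval.
SELECT
    n,
    round(p_hat, 2) AS accuracy,
    round(sqrt(p_hat * (1 - p_hat) / n), 2) AS std_err,
    round(p_hat - 1.96 * sqrt(p_hat * (1 - p_hat) / n), 2) AS ci_low,
    round(p_hat + 1.96 * sqrt(p_hat * (1 - p_hat) / n), 2) AS ci_high
FROM (
    SELECT count(*) AS n,
           avg(CASE WHEN GPT_correct THEN 1 ELSE 0 END) AS p_hat
    FROM sample
    WHERE GPT_correct IS NOT NULL
);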
Discussion
Even with all the time spent rewriting prompts and fixing validation issues, ChatGPT built the books CSV database in far less time than it would have taken manually.
The results for the added fields are mixed.
From my cursory checks, the Genre and Form fields are at least reasonable. The error rate of the Year field, though, is too high to be reliable.
In this case, I judge my ChatGPT-built database to be good enough for unserious work. If you were using this database just to discover some books that a specific critic recommended, and you wanted to filter by some crude categorizations and fuzzy dates, the data quality is acceptable. If, however, you needed to do precise research, all the generated values are too suspect to be acceptable.
Ideas for extended research
I believe there is potential in using LLMs to create bespoke databases; I’m just not sure how easy it is to generate a very good one. Part of the quality issues in my DB arose from my haphazard design, but another part stems from the inherent instability of ChatGPT’s answers. For this reason, the build-a-research-database use case might work better for exploration and rough filtering (say, filtering by Genre) than for retrieving precise values.
If anything, this generated table can serve as a baseline to try to improve. Here are some ideas to improve this experiment:
- Have a clear hypothesis and criteria for how to test quality
- Think carefully about the specific values to generate and whether they really are objectively verifiable
- Test this quality on a small sample first to discover issues with the prompt
- If creating a large database, build some automation tools to check for basic issues and reprompt (things like putting all works for an author on one row happened multiple times); see the sketch of such a check after this list
- Refine the interaction with the LLM: better prompting, better use and understanding of context windows, changing temperature, and so on.
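As a sketch of that kind of automated check, a query against the schema and enum used in this experiment could flag rows worth a reprompt (assuming the cleaned-up column names above):

-- Flag rows that violate the schema: enum values outside the prompt's list,
-- blank required fields, or a title field that packs several works together.
SELECT *
FROM books
WHERE Form NOT IN ('Drama', 'Novel', 'Story collection',
                   'Philosophy/essay', 'Poetry', 'Memoire/Autobiography')
   OR trim(Genre) = ''
   OR trim(Country) = ''
   OR Title LIKE '%;%';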
I also have some ideas to extend this use case:
- Normalize the facts in different tables. For the input, create a table with a primary key. Then make that a foreign key in the tables where information needs to be generated. Pass the LLM the schema and have it build the database (see the schema sketch after this list).
- Require a source for each reported fact
- If the database is public, offer a way for users to fact check and edit values, similar to Wikipedia.
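A rough sketch of that normalized layout (the table and column names here are mine, not anything ChatGPT produced):

-- Input rows get a stable key.
CREATE TABLE works (
    work_id INTEGER PRIMARY KEY,
    author  VARCHAR NOT NULL,
    title   VARCHAR NOT NULL
);

-- Every generated fact references an input row, with room to record a source,
-- so individual values can be checked and corrected without touching the input.
CREATE TABLE generated_facts (
    work_id INTEGER REFERENCES works (work_id),
    form    VARCHAR,
    genre   VARCHAR,
    year    INTEGER,
    country VARCHAR,
    source  VARCHAR
);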
The journey was more fun than the result
My books DB may be a dud, but the process of making it was fun:
- I got to use DuckDB for the first time.
- I researched a ton of books I had never heard of.
- I now have a database for future learning projects, e.g. to build a simple web app or practice normalizing a data model.
- I discovered tons of issues in how I thought about creating an LLM test for a certain use case.
- I gained a new appreciation for how difficult it is to manage information precisely. All of my table’s values are subject to interpretation.
¹ It also added “Country” for each author. This information was usually in the prompt, but sometimes it was regionally or politically ambiguous. In these cases, ChatGPT picked a country. I saw no errors there.