NLP and ML Datasets


Verbatim, Baby! specializes in the creation of high-quality, human-curated, ML-ready datasets. Our sets contain detailed annotations and manifests, and can be licensed per audio minute or per audio hour. We have a rich variety of South African speakers who contribute unscripted, conversational speech data with natural code-switching, covering many topics, including accounts of deeply personal experiences. Some of our speakers submit their own recordings on topics of their choice; others are recorded ethically and with full consent by the project leader in a conversational setting. The overwhelming majority of our speakers tie their identities to their premium speech contributions and are open to being contacted for further contributions to ethical NLP research. For more information or inquiries, please submit a query to the project leader through the on-site contact form.


Meet Our Speakers


[SPEAKER J]: This speaker is a 19-year-old neurodivergent native English speaker who is teaching himself Afrikaans by practicing his conversational skills with Gemini. He said that his goal is to learn new words and their contextual use, but these self-recorded clips show how often this LLM misunderstands him and how quickly it reverts to potentially harmful responses while flipping accents and switching to an unrelated language, sometimes mid-sentence. Speaker J has a large, growing speech corpus, as he contributes new recordings weekly. He is not open to being contacted by outside research teams but is willing to accept topic requests for future conversations. This human-curated corpus is available in custom-length clips and ML-ready packages.


[SPEAKER E.L.]: Understanding Gemini becomes a challenge for this 22-year-old Afrikaans-speaking car parts salesman from the Eastern Cape. Although the system delivers fairly accurate engine diagnostics, it unpredictably changes its accent to Dutch or switches its response language to Dutch or Korean when addressed in Afrikaans. Speaker E.L. has tied his identity to his audio and is open to being contacted by outside research teams for further contributions. This contribution is available as a full, human-curated package and in custom ML-ready segment sizes.


[SPEAKER L.F.B.]: This three-to-five-hour corpus captures an unfiltered spectrum of emotion: laughter, weeping, traumatic memories, and self-deprecating jokes about deadly moments. It offers a rare glimpse into the emotional contradictions of surviving addiction. The setting is conversational, and the additional voices present are those of the project leader and L.F.B.'s partner. L.F.B.'s hours-long corpus is available in several tiers and formats, ranging from heavily redacted verbatim to entirely unredacted verbatim at Premium, as well as a full ML-ready package. She is open to being contacted for further related linguistic or socio-economic research involving NLP. This human-curated corpus is available in full or in ML-ready packages of various sizes.


[SPEAKER T.H.]: This Afrikaans account is a powerful survivor's testimony in which the speaker traces a journey from hope to heartbreak to healing. Delivered conversationally to the project leader, her story offers deeply personal insight into the emotional complexities of domestic violence within LGBTQIA relationships. Speaker T.H. ties her identity to her full corpus package and is open to being contacted by outside linguistic and socio-economic research teams for further NLP contributions. This human-curated corpus is available in full or as fully ML-ready packages of various sizes.


[SPEAKER L.V.A.]: This is a female speaker who takes the initiative by recording her own contributions. This autonomy suggests a high level of engagement and ownership over her narrative, adding authenticity to her growing corpus. As a native Afrikaans speaker, L.V.A. enriches the dataset with naturally paced, regionally grounded language. Her calm, pleasant delivery style makes her corpus especially valuable for training or evaluating speech recognition models in under-resourced languages. Speaker L.V.A. ties her identity to her full but growing corpus and is open to being contacted for further NLP-related research. The corpus is available as a complete, human-curated package or as ML-ready packages.


[SPEAKER M.M.]: M.M. offers an accessible, relatable perspective on contemporary Afrikaans-speaking student life. With her clear voice and consistent speaking style, M.M.'s self-recordings are ideal for language modeling, pronunciation training, or phonetic research within the context of Afrikaans as spoken by young people in the Western Cape. This corpus is available only as human-curated ML-ready packages of various custom sizes.


More speaker packages are currently being curated, and this list will be updated as they become available.