General transcription convention & transcription resources
Contents
- Quick tips/things to keep in mind while transcribing
- General guide to transcribing corpus data
- BLIP curated transcription resources
Quick tips
1. Punctuation and capitalisation in annotations are not necessary
Utterances do not need capitalisation and punctuation when transcribed. There are a few exceptions that we allow. These include:
- Censoring personal information (e.g., BABYNAME, CONFIDENTIAL). See Privacy Matters for more information.
- Question marks (?) e.g., what is that?
- Ellipsis to indicate stuttering/interrupted speech (…) e.g. She told me that she t… took the taxi.
  - Ellipsis can also be used to indicate a parent prompting children on words, e.g.,
    Mum: a pink um...
    Child: brella
- Tildes (~) to indicate long languageless words (e.g., oh~)
  - In our recordings, words may be dragged out longer than usual as the recordings are usually child-directed. There is no need to label such instances with tildes. Thus, tildes should only be used for languageless words.
- Colons/equal/plus signs used within the special coding conventions e.g., :v:laughing (see List of Special Codes section of the wiki)
- Three pound signs (###) to indicate unintelligible/inaudible speech.
  - Do note that ### is a standalone token and should not be combined with anything else (e.g., not :m:### or :si:###).
- Apostrophes (’) in contractions e.g., can’t, don’t, shouldn’t
- Hyphens (-) for word connecting forms e.g., jalan-jalan
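The punctuation exceptions above can be checked mechanically. The following is a minimal sketch in Python, assuming annotations are available as plain strings. The function name `disallowed_punctuation` and the `DISALLOWED` character set are illustrative assumptions, not part of the BLIP conventions; adjust the set to match the wiki.

```python
import re

# Assumption: punctuation that is NOT among the listed exceptions.
# This set is illustrative only and is not defined by the wiki.
DISALLOWED = set('.,!;"()')


def disallowed_punctuation(annotation):
    """Return the disallowed punctuation characters found in an annotation.

    Ellipses (either "…" or "...") are removed first, because they are an
    allowed exception for stuttering, interruptions, and prompting.
    """
    without_ellipses = re.sub(r"\.{3}|…", " ", annotation)
    return sorted({ch for ch in without_ellipses if ch in DISALLOWED})


print(disallowed_punctuation("what is that?"))
print(disallowed_punctuation("she told me that she t… took the taxi."))
```

The first call returns an empty list because a question mark is allowed; the second flags the trailing full stop.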
2. Don’t delete unused tiers in the template.
Right click on the tier panel and hide them if they are getting in your way.
3. Adjust the horizontal and vertical zooms for more accurate transcription
Adjust the vertical zoom (right click on the waveform > Vertical Zoom > adjust the %) and horizontal zoom (adjust the slider at the bottom right of the ELAN screen) to help you see the waveform clearly.
4. Edit the tier attributes after you import the template
Edit the tier attributes to replace placeholder participant codes with accurate participant ID codes for all the participants present in the recording. This should be done on the primary and secondary tiers of every participant present. For example, suppose the participant ID is P1234TT and the mother, baby, and researcher (R099) are present in the recording. The Mother code MPXXXXTT should be edited to MP1234TT on all the Mother primary and secondary tiers (Utterance, Language, Chunk, Target EL/CL/ML/TL, Translation, Matrix, Sensitive_Masking). The Baby code PXXXXTT should be edited to P1234TT on all the Baby primary and secondary tiers. The Researcher code R00XPxxxxTT should be edited to R099P1234TT on all the Researcher primary and secondary tiers.
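The substitution pattern in this example can be written out as a small lookup. This is only a sketch of the naming pattern shown above, assuming every template uses exactly these three placeholder strings; the function name `participant_code_map` is hypothetical, and the codes still have to be entered by hand in ELAN's tier attributes.

```python
from typing import Optional


def participant_code_map(participant_id: str, researcher_id: Optional[str] = None) -> dict:
    """Map each template placeholder to the code it should be replaced with.

    participant_id: the baby's participant ID, e.g. "P1234TT".
    researcher_id:  the researcher's code, e.g. "R099", or None if no
                    researcher is present in the recording.
    """
    mapping = {
        "MPXXXXTT": "M" + participant_id,  # Mother tiers
        "PXXXXTT": participant_id,         # Baby tiers
    }
    if researcher_id is not None:
        # Researcher tiers: researcher code followed by the participant ID
        mapping["R00XPxxxxTT"] = researcher_id + participant_id
    return mapping


print(participant_code_map("P1234TT", "R099"))
# {'MPXXXXTT': 'MP1234TT', 'PXXXXTT': 'P1234TT', 'R00XPxxxxTT': 'R099P1234TT'}
```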
5. Ensure the Utterance, Chunk, and Language tiers are annotated. The Translation tier and Sensitive_Masking tier should also be annotated when applicable.
Unless otherwise instructed, the Utterance, Chunk, and Language tiers should always have annotated cells. You will also have to translate the utterance on the Translation tier if you are transcribing non-English utterances. The translation can be a loose translation, i.e., enough information for an outside English reader to understand what is going on. Isolate instances (e.g., mention of the baby’s name) that need redacting on the Sensitive_Masking tier. Live demo on how to use the Sensitive_Masking tier here.
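A saved .eaf file is XML, so a rough completeness check is possible outside ELAN. The sketch below uses Python's standard library and the standard EAF elements (TIER, TIER_ID, ANNOTATION_VALUE). It assumes, without confirmation from the template, that each tier's ID contains the tier name (e.g. "Utterance"); the function name and the sample document are hypothetical, and the check does not distinguish between participants.

```python
import xml.etree.ElementTree as ET

REQUIRED = ("Utterance", "Chunk", "Language")

# Hypothetical, heavily trimmed EAF content used only for illustration.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="Utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>what is that?</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="Chunk">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE></ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def unannotated_required_tiers(eaf_xml, required=REQUIRED):
    """Return the required tier names that have no non-empty annotation."""
    root = ET.fromstring(eaf_xml)
    missing = []
    for name in required:
        has_text = any(
            (value.text or "").strip()
            for tier in root.iter("TIER")
            if name in tier.get("TIER_ID", "")  # assumption: ID contains the tier name
            for value in tier.iter("ANNOTATION_VALUE")
        )
        if not has_text:
            missing.append(name)
    return missing


print(unannotated_required_tiers(SAMPLE_EAF))  # ['Chunk', 'Language']
```

In the sample, the Chunk tier exists but its only cell is empty, and there is no Language tier at all, so both are reported.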
6. Mark inaudible segments with three pound signs (###)
Transcribe inaudible/unintelligible segments as ### in the Utterance and Chunk tiers, then tag them as “Vocal Sounds” in the Language tier.
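Because ### must stand alone (see tip 1), fused forms such as :m:### are easy to catch with a whitespace split. A minimal sketch in Python, assuming annotations are plain strings; the function name `fused_unintelligible_tokens` is hypothetical.

```python
UNINTELLIGIBLE = "###"


def fused_unintelligible_tokens(annotation):
    """Return tokens that contain ### but are not exactly ###."""
    return [
        token
        for token in annotation.split()
        if UNINTELLIGIBLE in token and token != UNINTELLIGIBLE
    ]


print(fused_unintelligible_tokens("she said ### then left"))  # []
print(fused_unintelligible_tokens(":m:### or :si:###"))       # [':m:###', ':si:###']
```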
7. Transcribing a Chinese recording? Ensure you put spaces between Chinese words in the Chunk tier
Refer to What is a word? by Dr. Shamala Sundaray. Please put spaces in between Chinese words in the Chunk tier but not the Utterance tier. Questions regarding this have been answered in the Frequently Asked Questions by our Research Associate Woon Fei Ting.
However, in the Utterance tier, spaces should separate Chinese characters from text that is not in Chinese characters. For example, an utterance that translates as “She eats that uh huge hamburger :v:laughter” should be written in the Utterance tier as she eats 那个 uh 大大的汉堡包 :v:laughter and NOT as she eats那个uh大大的汉堡包:v:laughter. The exception to this rule is question marks, e.g., orangutan is sad 对不对? (Translation: orangutan is sad, isn’t it?) is correct.
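The Utterance-tier spacing rule can be sketched with two regular expressions. This is an illustration only: it assumes "Chinese characters" means the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and the function name `space_mixed_script` is hypothetical. It does not segment Chinese words, so it is not suitable for the Chunk tier.

```python
import re

# Assumption: basic CJK Unified Ideographs block only.
CJK = r"\u4e00-\u9fff"


def space_mixed_script(utterance):
    """Insert a space wherever Chinese characters touch non-Chinese text.

    Question marks directly after a Chinese character are left attached,
    following the exception in the convention.
    """
    # Chinese character followed by non-Chinese, non-space, non-"?" text
    spaced = re.sub(rf"([{CJK}])(?=[^{CJK}\s?？])", r"\1 ", utterance)
    # Non-Chinese, non-space text followed by a Chinese character
    spaced = re.sub(rf"([^{CJK}\s])(?=[{CJK}])", r"\1 ", spaced)
    return spaced


print(space_mixed_script("she eats那个uh大大的汉堡包:v:laughter"))
# she eats 那个 uh 大大的汉堡包 :v:laughter
print(space_mixed_script("orangutan is sad 对不对?"))
# orangutan is sad 对不对?
```

Text that is already spaced correctly passes through unchanged, so the function can also be used to compare an annotation against its normalised form.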
8. Keyboard shortcuts are your friend!
ELAN keyboard shortcuts can help streamline your transcription process. Here are a few:
Description | Shortcut Key and Notes |
---|---|
Create a new annotation cell | alt + N or ctrl + alt + N (Windows) or option + N (Mac) |
Modify annotation cell time | With a selected cell (it’ll be blue if it’s selected) > click and drag to desired timepoints so that the area is highlighted > ctrl + enter (Windows) or command + enter (Mac) |
Play/pause media file | shift + space |
Play a selected cell | shift + space |
Move to previous cell | alt + left arrow key |
Move to next cell | alt + right arrow key |
9. Remember to visit the Frequently Asked Questions page.
If any questions have not been answered in the FAQ, please do not hesitate to ask a full-time staff member.
A General Step by Step Guide to Text Transcription for Corpus Data (Standard Version)
- Launch ELAN.
- Import the media file(s) and the template file (if available). Refer to the “Transcription Template” section of the wiki for instructions on how to handle the template after importing it.
- Go to “File” and click “Save” or “Save As”, as you would for a normal document. The file should be named according to the file naming conventions outlined on the wiki. You can also set an automatic backup of the file by going to “File” > “Automatic Backup”. Choose a time interval between 1 and 30 minutes that suits your preference. Note: the backup file will be saved with the extension *.eaf.001.
- Listen to the media file in full before you begin transcribing. Familiarize yourself with the participants’ voices. Note how many participants there are, what languages are being spoken, and any muffled parts or parts where the speech becomes indecipherable. Information on how to deal with these parts can be found under Frequently Asked Questions (FAQ).
- To see the hierarchical relationship between the dependent tiers, right click on the tier sidebar > Sort Tiers > Sort by Hierarchy.
- Edit the tier attributes to replace placeholder participant codes with accurate participant ID codes for all participants present in the recording. Go to Tier > Change Tier Attributes > Edit the “Participant” line. Ensure you also edit the Researcher’s participant line and tier name to reflect their actual researcher code (if they are present in the recording).
- Generate any new participant tiers if necessary (e.g., if there’s an Unknown individual or a sibling). See “Transcribing Dependent Tiers” in the wiki for instructions on how to create new dependent tiers. See “A Guide to All Your Tiers” in the wiki for the tier type stereotypes and hierarchical relationships needed for each tier. Consult your supervisor if you are having issues.
- Adjust the vertical zoom (right click on the waveform > Vertical Zoom > adjust the %) and horizontal zoom (adjust the slider at the bottom right of the ELAN screen) to help you see the waveform clearly.
- Put on your headphones and begin transcribing! Refer to the wiki for what you need to know regarding our transcription conventions.
  - Transcribe utterances on the Utterance, Chunk, and Language tiers. Use the Translation tier should you need to translate any non-English utterances into English.
  - Isolate instances (e.g., mention of the baby’s name) that need redacting on the Sensitive_Masking tier. Live demo on how to use the Sensitive_Masking tier here.
  - What counts as an utterance? Read “A Problem Named Utterance” by Dr. Shamala Sundaray, which can be found here.
- Update the transcription log to note your transcription progress on a given file. Alert a full-time staff member/your supervisor once you have finished transcribing a file.
Below are our curated resources/video guides that you should refer to:
- Our YouTube playlist of introductory ELAN transcription tutorials
- Our YouTube playlist of ELAN transcription tutorials for data from the Talk Together Study. You can also find live demo transcription videos within that playlist. These include: how to subdivide cells on the Chunk tier, how to transcribe picture cards recordings, how to transcribe talk prompts recordings, and how to transcribe Green Grass Park recordings from the Voicemail Game.
- What counts as an utterance? Read “A Problem Named Utterance” by Dr. Shamala Sundaray, which can be found here.
- What is a word? by Dr. Shamala Sundaray. Side note to Chinese transcribers: please put spaces in between Chinese words in the Chunk tier but not the Utterance tier. Some questions regarding this have been answered in the FAQ.
- BLIP Dictionary for transcribers, which was compiled by the transcription team at BLIP Lab, NTU. It is a non-exhaustive list of words and their respective language tags.
- Our video on how to make a comment in ELAN. Useful for marking overlapping speech, inaudible/muffled speech, or speech in a language you don’t speak.
If you are having any issues, please refer to the FAQ section of this wiki to see if your question has already been answered. Otherwise, please feel free to reach out to our team.