Saturday, May 16, 2020

Examples of AWS CLI calls to Amazon Transcribe (speech to text) for Latin, using Italian (it-IT) as the base language and extending it with a custom vocabulary of Latin words:

I create the vocabulary:
A aws transcribe create-vocabulary --vocabulary-name latinSupplement --language-code it-IT --phrases "misericordiis" "aiternis" "aeternis"
B aws transcribe get-vocabulary --vocabulary-name latinSupplement
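
A new vocabulary is not usable straight away: get-vocabulary reports a VocabularyState of PENDING until the build finishes, then READY or FAILED. A minimal sketch of a wait loop, assuming the AWS CLI is already configured. The function name wait_for_vocabulary and the interval and attempt defaults are my own illustrative choices, not part of the CLI:

```shell
#!/usr/bin/env bash
# Sketch only: wait_for_vocabulary and its defaults are illustrative, not AWS-provided.

# Poll get-vocabulary until the vocabulary is READY (returns 0),
# or until it is FAILED or the attempts run out (returns 1).
# Usage: wait_for_vocabulary NAME [INTERVAL_SECONDS] [MAX_ATTEMPTS]
wait_for_vocabulary() {
  local name interval attempts state i
  name=$1
  interval=${2:-5}
  attempts=${3:-60}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # --query picks the VocabularyState field out of the JSON response
    state=$(aws transcribe get-vocabulary \
      --vocabulary-name "$name" \
      --query 'VocabularyState' --output text) || return 1
    case $state in
      READY) return 0 ;;
      FAILED) echo "vocabulary $name failed to build" >&2; return 1 ;;
    esac
    sleep "$interval"
    i=$((i + 1))
  done
  echo "gave up waiting for vocabulary $name (last state: $state)" >&2
  return 1
}
```

Calling wait_for_vocabulary latinSupplement before step II avoids starting a job while the vocabulary is still PENDING.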

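II create the transcription job:
A delete any earlier job with the same name, since job names must be unique:
aws transcribe delete-transcription-job --transcription-job-name latinNounPhrase
B create the JSON input file (test-start-command.json):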
{
  "TranscriptionJobName": "latinNounPhrase",
  "LanguageCode": "it-IT",
  "MediaFormat": "wav",
  "Media": {
    "MediaFileUri": "https://s3.us-west-2.amazonaws.com/www1.cloviscorp.com/collegium/grammar/resources/latin/sounds/Ali_GfNpCb_misericordia_aeternus.wav"
  }
}
C aws transcribe start-transcription-job --cli-input-json file://test-start-command.json --settings VocabularyName=latinSupplement

D check the status of the asynchronous job (as of 16 May 2020, even a simple job can take more than 30 seconds):
1 aws transcribe list-transcription-jobs --status COMPLETED
2 aws transcribe list-transcription-jobs --status IN_PROGRESS

E aws transcribe get-transcription-job --transcription-job-name latinNounPhrase

F download the output JSON containing the transcript by GETting the TranscriptFileUri returned in step E (for example in a browser)
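
Steps D to F can also be done from the shell. A rough sketch, assuming the AWS CLI is configured, curl is available, and the job wrote to the default service-managed output location (so that TranscriptFileUri is a URL curl can fetch directly). The function names are hypothetical, and the sed extraction is a shortcut that assumes the transcript text contains no escaped double quotes; a JSON parser such as jq would be more robust:

```shell
#!/usr/bin/env bash
# Sketch only: these function names and the default output file name are illustrative.

# Print the job status: QUEUED, IN_PROGRESS, COMPLETED or FAILED.
job_status() {
  aws transcribe get-transcription-job \
    --transcription-job-name "$1" \
    --query 'TranscriptionJob.TranscriptionJobStatus' --output text
}

# Print the URL of the transcript JSON (only meaningful once the job is COMPLETED).
transcript_uri() {
  aws transcribe get-transcription-job \
    --transcription-job-name "$1" \
    --query 'TranscriptionJob.Transcript.TranscriptFileUri' --output text
}

# Print the plain transcript text found in a downloaded output file.
# Matches the "transcript" key inside results.transcripts.
transcript_text() {
  sed -n -E 's/.*"transcript"[[:space:]]*:[[:space:]]*"([^"]*)".*/\1/p' "$1"
}

# Download the output of a completed job and print its transcript text.
# Usage: fetch_transcript JOB_NAME [OUTPUT_FILE]
fetch_transcript() {
  local name out
  name=$1
  out=${2:-"$name.json"}
  if [ "$(job_status "$name")" != "COMPLETED" ]; then
    echo "job $name is not COMPLETED yet" >&2
    return 1
  fi
  curl -sS -o "$out" "$(transcript_uri "$name")" || return 1
  transcript_text "$out"
}
```

With these defined, fetch_transcript latinNounPhrase saves the output as latinNounPhrase.json and prints the recognized text.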

More reading: https://docs.aws.amazon.com/transcribe/latest/dg/getting-started-cli.html

Tuesday, January 7, 2020

A complete (reasonably reliable) hyphenator for Greek in just eleven lines of sed regex:
#!/usr/bin/env bash
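# Hyphenates a list of transliterated Greek words, one word per line.
# Works by wrapping each vowel or diphthong in hyphens, splitting consonant
# pairs, and then removing the hyphens that do not mark a syllable boundary.
INPUT=$1
cat "$INPUT" \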
| sed -E 's#([AEOaeo]\^i[/\\]*|[AEIOUaeou][iu][/\\]*|[AEIOUaeiou][\^]*[/\\]*)#-\1-#g' \
| sed -E 's#-+-#-#g' \
| sed -E 's#([BDGPTKLMNRSbdgptklmnrs])([BDGPTKLMNRSbdgptklmnrs])#-\1-\2-#g' \
| sed -E 's#([PTKtpk])[-]+([Ss])#-\1\2-#g' \
| sed -E 's#(-[PTKtpk][Hh])#-\1-#g' \
| sed -E 's#[-]+#-#g' \
| sed -E 's#-([SRNsrn][-]*)*$#\1#' \
| sed -E 's#([^aeiouAEIOU/\\^])-([aeiouAEIOU])#\1\2#g' \
| sed -E 's#-([BDGPTKbdgptk][Hh]*)-([rl])#-\1\2#g' \
| sed -E 's#-([BDGPTKLMNRSbdgptklmnrs][Hh]*)-#\1-#g' \
| sed 's#^-##g'
The transliteration is done with My Transliterator tool
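
One cheap sanity check for a pipeline like this: a hyphenator should only add hyphens, so deleting the hyphens from each output line should give back the input word. A small sketch of that check, assuming the input list and the hyphenated output have been saved to two files with matching line order. The function name check_hyphenation and both file arguments are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch only: check_hyphenation is an illustrative helper, not part of the hyphenator.

# Compare a word list with its hyphenated version line by line.
# Prints "original<TAB>hyphenated" for every line where removing the hyphens
# does not reproduce the original word; returns 0 only if every line matches.
# Usage: check_hyphenation ORIGINAL_FILE HYPHENATED_FILE
check_hyphenation() {
  paste "$1" "$2" | awk -F'\t' '
    {
      stripped = $2
      gsub(/-/, "", stripped)      # drop every hyphen from the output word
      if (stripped != $1) {        # a letter was lost, added or changed
        print
        bad = 1
      }
    }
    END { exit bad + 0 }
  '
}
```

Any line it prints points at a substitution that changed more than the hyphens, which is worth checking on words with unusual final consonant clusters.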