SpeakerPro is an AI-driven text-to-speech synthesizer module that delivers high-quality audio with natural-sounding voices. Its main purpose is to engineer speech samples from short phrases.

SpeakerPro is available in the VCV Rack Library. The first time the module starts after installation, it downloads its AI models, which may take a while.
Once the module is up and running, it presents a pipeline with four sections: Text, Phonemes, Prosody, and Audio.

The Text section at the top provides a text editor for entering the text to be spoken. For testing, a random button inserts an English proverb. Clicking Generate runs the whole pipeline.
For handling different languages, see the section "Language Support".

The Phonemes section provides a special phoneme editor in which previously generated or new phonemes can be edited (see also the section "Phoneme Editor"). Note: the phonemes are always overwritten when new ones are generated from the text above, so after changing phonemes, use the Generate button on the right side of the phoneme window.
At this stage, voice styles can be selected. There are currently 54 voice styles available.
For accent-free results, the language of the text should match the language of the voice style (indicated by its prefix). Two voice styles can be selected and mixed. Selecting the voice style 'random' generates a random mix of the predefined voice styles.
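Mixing two voice styles can be pictured as a weighted blend. The following is a minimal sketch only, assuming that each voice style is a numeric vector and that a mix is a linear interpolation. The names `mix_styles` and `random_mix`, the `weight` parameter, and the vector representation are illustrative assumptions, not the module's actual implementation.

```python
import random

def mix_styles(style_a, style_b, weight):
    """Blend two style vectors (assumed representation).

    weight=0.0 returns style_a, weight=1.0 returns style_b.
    """
    if len(style_a) != len(style_b):
        raise ValueError("style vectors must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return [(1.0 - weight) * a + weight * b for a, b in zip(style_a, style_b)]

def random_mix(styles, rng=random):
    """Hypothetical 'random' style: a normalized random blend of all vectors."""
    weights = [rng.random() for _ in styles]
    total = sum(weights) or 1.0  # avoid division by zero in the degenerate case
    size = len(styles[0])
    return [
        sum(w * style[i] for w, style in zip(weights, styles)) / total
        for i in range(size)
    ]
```

With `weight=0.5` both styles contribute equally; values closer to 0 or 1 favor one of the two voices.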
The phonemes are converted into prosody, which comprises pitch, energy, and duration information that gives the audio a natural-sounding quality. The result can be influenced with the module's parameter knobs.
After changing, correcting, or editing the phonemes, the Generate button on the right of the phoneme window must be used.

The generated pitch (orange) and energy (green) curves and the durations can be edited. Durations are changed by dragging the borders of the phoneme boxes, while pitch and energy curves are edited by dragging the points on the curves. After a change, the Generate button on the right side of the prosody window must be used.

The Audio section is a simple sample player with a trigger input. The audio sample can be trimmed with the start and end knobs, and the playback speed can be adjusted using the V/O knob or input. An EOC trigger output can be used to replay a sequence of samples one after the other by increasing the bank on each trigger (see the section "The bank"). For further processing or for loading into another sample player, the audio sample can be saved to a WAV file by right-clicking in the audio sample window. When a patch containing a SpeakerPro module is saved, all generated audio samples are saved as well. After loading the patch, they can be found in the Rack user folder under autosave/modules/<module_id> (the module_id is shown under Info in the module's right-click menu).
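The effect of the start, end, and speed controls can be illustrated with a small sketch. It assumes that the start and end knobs are fractions of the sample length between 0 and 1, and that the V/O control follows the common 1 V per octave convention. Both are assumptions, as are the function names `trim_sample` and `playback_rate`.

```python
def trim_sample(samples, start=0.0, end=1.0):
    """Return the part of `samples` between two fractional positions.

    start and end are assumed to be knob values in the range 0..1.
    """
    if not 0.0 <= start <= end <= 1.0:
        raise ValueError("expected 0 <= start <= end <= 1")
    n = len(samples)
    return samples[int(start * n):int(end * n)]

def playback_rate(volts):
    """Speed factor under the assumed 1 V/octave convention.

    0 V plays at original speed, +1 V doubles it, -1 V halves it.
    """
    return 2.0 ** volts
```

Trimming the end of a sample in this way is also how a trailing artifact could be cut off without regenerating the audio.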

Multiple pipelines can be organized in a bank, which provides copy, paste, insert, and delete functions. If the input is connected, it takes over setting the current bank. The input responds in steps of 0.1 V; for example, 0.3 V selects pipeline 3. In this case the bank cannot be changed manually. However, while the edit button is turned on, the input is blocked and the bank can be changed manually until the edit button is released.
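The mapping from input voltage to pipeline number can be sketched as follows. Only the 0.1 V step size and the example "0.3 V selects pipeline 3" come from the description above. Rounding to the nearest step, clamping to the valid range, and the `num_pipelines` parameter are assumptions made for illustration.

```python
def voltage_to_pipeline(volts, num_pipelines, volts_per_step=0.1):
    """Map a control voltage to a pipeline index (0.1 V per step).

    Rounding to the nearest step and clamping are assumed behaviors.
    """
    index = round(volts / volts_per_step)
    return max(0, min(num_pipelines - 1, index))
```

Rounding rather than truncating matters here because 0.3 / 0.1 is slightly below 3 in floating-point arithmetic.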
To enter text in a language other than 'en-US', precede the text with the language code in this form, e.g. "<fr-FR>: Bonjour, comment ça va?". Note that the space after the colon is required. For convenience, a right-click menu allows a language to be selected.
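A language prefix of this kind could be separated from the text as in the sketch below. Because the example above does not make clear whether the angle brackets are typed literally, this sketch accepts the code with or without them. The function name `split_language_prefix` and the regular expression are assumptions, not the module's own parser.

```python
import re

DEFAULT_LANGUAGE = "en-US"

# Optional angle brackets, a code such as fr-FR or yue-CN, then ": " (space required).
_PREFIX = re.compile(r"^<?([a-z]{2,3}-[A-Z]{2})>?: (.*)$", re.DOTALL)

def split_language_prefix(text):
    """Return (language_code, text_without_prefix)."""
    match = _PREFIX.match(text)
    if match is None:
        # No recognizable prefix: fall back to the default language.
        return DEFAULT_LANGUAGE, text
    return match.group(1), match.group(2)
```

A prefix without the space after the colon is not recognized, so the whole input is then treated as 'en-US' text.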
Languages are supported at different levels. The Kokoro model itself was trained on 9 languages:
| code | voices |
|---|---|
| en-US | af_alloy, af_aoede, af_bella, af_heart, af_jessica, af_kore, af_nicole, af_nova, af_river, af_sarah, af_sky, am_adam, am_echo, am_eric, am_fenrir, am_liam, am_michael, am_onyx, am_puck, am_santa |
| en-GB | bf_alice, bf_emma, bf_isabella, bf_lily, bm_daniel, bm_fable, bm_george, bm_lewis |
| es-ES | ef_dora, em_alex, em_santa |
| fr-FR | ff_siwis |
| hi-IN | hf_alpha, hf_beta, hm_omega, hm_psi |
| it-IT | if_sara, im_nicola |
| ja-JP | jf_gongitsune, jf_nezumi, jf_tebukuro, jm_kumo |
| pt-BR | pf_dora, pm_alex, pm_santa |
| zh-CN | zf_xiaobei, zf_xiaoni, zf_xiaoxiao, zf_xiaoyi, zm_yunjian, zm_yunxia, zm_yunxi, zm_yunyang |
For these languages, more accurate text-to-phoneme converters are available, and the output is accent-free if the language of the text matches that of the voice. One exception is Hindi: no text-to-phoneme converter is available for it, and there is no text editor capable of rendering it. Nevertheless, Hindi text can be converted to IPA phonemes elsewhere and the result pasted into the phoneme window.
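The relationship between voice names and languages in the table above can be expressed as a small lookup. The mapping is read off the first letter of the voice names in the table. The function names `voice_language` and `is_accent_free` are hypothetical helpers for illustration.

```python
# First letter of the voice name -> language code, taken from the table above.
VOICE_PREFIX_TO_LANGUAGE = {
    "a": "en-US", "b": "en-GB", "e": "es-ES", "f": "fr-FR", "h": "hi-IN",
    "i": "it-IT", "j": "ja-JP", "p": "pt-BR", "z": "zh-CN",
}

def voice_language(voice):
    """Return the language code of a voice such as 'ff_siwis', or None."""
    return VOICE_PREFIX_TO_LANGUAGE.get(voice[:1])

def is_accent_free(text_language, voice):
    """True if the text language matches the language of the voice."""
    return voice_language(voice) == text_language
```

For example, French text spoken with `af_heart` would be flagged as a mismatch, while `ff_siwis` would match.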
For the eight natively supported languages, the phonemes are generated word by word by a chain of converters. The user dictionary has the following format:

    # User Dictionary SpeakerPro
    # Format: word <tab> phonemes
    blood blʌd
    give giv
    resume ɹɪzum
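Reading a user dictionary in this format could look like the sketch below. It assumes that lines starting with `#` are comments and that the word is separated from its phonemes by a tab (any run of whitespace is accepted here as well). The function name `parse_user_dictionary` is hypothetical.

```python
def parse_user_dictionary(text):
    """Parse 'word <tab> phonemes' lines into a dict, skipping comments."""
    entries = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or (assumed) comment line
        parts = line.split(None, 1)  # split on the first tab or space run
        if len(parts) != 2:
            continue  # no phonemes given; ignore the line
        word, phonemes = parts
        entries[word] = phonemes.strip()
    return entries
```

A lookup in the resulting dict would then take precedence over automatically generated phonemes for the listed words, if the dictionary is consulted first.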
The following additional languages can be converted into phonemes and also spoken, but with the (foreign) accent of the available voices:

| code | language |
|---|---|
| ca-ES | Catalan |
| cy-GB | Welsh |
| da-DK | Danish |
| de-DE | German |
| et-EE | Estonian |
| eu-ES | Basque |
| fa-IR | Farsi/Persian |
| ga-IE | Irish |
| hr-HR | Croatian |
| hu-HU | Hungarian |
| id-ID | Indonesian |
| is-IS | Icelandic |
| ko-KR | Korean |
| nb-NO | Norwegian |
| nl-NL | Dutch |
| pl-PL | Polish |
| pt-BR | Portuguese (Brazil) |
| pt-PT | Portuguese |
| qu-PE | Quechua |
| ro-RO | Romanian |
| sr-RS | Serbian |
| sv-SE | Swedish |
| tr-TR | Turkish |
| yue-CN | Cantonese |

In this case the fdemelo g2p-mbyt5-12l-ipa-childes-espeak AI model is used to convert the text to phonemes.
Further native languages will be added as soon as a version of the Kokoro model supports them.
The Kokoro TTS model uses the following IPA alphabet:
;:,.!?—…"()“” ̃ ʣʥʦʨᵝꭧAIOQSTWYabcdefɡhijklmnopqrstuvwxyzɑɐɒæβɔɕçɖðʤəɚɛɜɟɥɨɪʝɯɰŋɳɲɴøɸθœɹɾɻʁɽʂʃʈʧʊʋʌɣɤχʎʒʔˈˌːʰʲ↓→↗↘ᵻ
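Phonemes pasted from elsewhere may contain characters outside this alphabet. The following sketch checks a phoneme string against the alphabet listed above. The function name `unsupported_chars` is hypothetical, and how the module itself treats unknown characters is not stated here.

```python
# Alphabet copied from the line above (includes the space and the combining tilde).
KOKORO_ALPHABET = set(
    ';:,.!?—…"()“” ̃ ʣʥʦʨᵝꭧAIOQSTWYabcdefɡhijklmnopqrstuvwxyz'
    'ɑɐɒæβɔɕçɖðʤəɚɛɜɟɥɨɪʝɯɰŋɳɲɴøɸθœɹɾɻʁɽʂʃʈʧʊʋʌɣɤχʎʒʔˈˌːʰʲ↓→↗↘ᵻ'
)

def unsupported_chars(phonemes):
    """Return characters not in the alphabet, in order of first appearance."""
    found = []
    for ch in phonemes:
        if ch not in KOKORO_ALPHABET and ch not in found:
            found.append(ch)
    return found
```

An empty result means that every character of the string is part of the listed alphabet.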
The phonemes can be edited with the following keyboard layout.
| Keys | IPA char | Keys | IPA char | Keys | IPA char | Keys | IPA char |
|---|---|---|---|---|---|---|---|
| / | / | \| | _ | ; | ; | : | : |
| , | , | . | . | ! | ! | ? | ? |
| - | — | _ | … | " | " | ( | ( |
| ) | ) | [ | “ | ] | ” | | |
| ~ | ̃ | /Z | ʣ | J | ʥ | /z | ʦ |
| Q | ʨ | V | ᵝ | 1 | ꭧ | A | A |
| I | I | O | O | Q | Q | S | S |
| T | T | W | W | Y | Y | a | a |
| b | b | c | c | d | d | e | e |
| f | f | g | ɡ | h | h | i | i |
| j | j | k | k | l | l | m | m |
| n | n | o | o | p | p | q | q |
| r | r | s | s | t | t | u | u |
| v | v | w | w | x | x | y | y |
| z | z | 8 | ɑ | 5 | ɐ | 9 | ɒ |
| & | æ | B | β | > | ɔ | 6 | ɕ |
| C | ç | D | ɖ | /d | ð | J | ʤ |
| @ | ə | /z | ɚ | E | ɛ | 3 | ɜ |
| /f | ɟ | H | ɥ | /i | ɨ | /I | ɪ |
| /j | ʝ | M | ɯ | /w | ɰ | N | ŋ |
| /n | ɳ | /p | ɲ | /N | ɴ | 0 | ø |
| F | ɸ | /0 | θ | /& | œ | R | ɹ |
| /r | ɾ | /l | ɻ | /R | ʁ | /c | ɽ |
| /s | ʂ | X | ʃ | /t | ʈ | /T | ʧ |
| U | ʊ | /v | ʋ | ^ | ʌ | G | ɣ |
| 4 | ɤ | K | χ | L | ʎ | Z | ʒ |
| 7 | ʔ | ' | ˈ | /, | ˌ | /: | ː |
| /h | ʰ | /y | ʲ | /\ | ↓ | /> | → |
| /^ | ↗ | \ | ↘ | + | ᵻ | | |
A popup menu, opened by right-clicking in the editor, lists all supported IPA characters for lookup and selection.
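How typed keys translate into IPA characters can be illustrated with a small sketch that uses a few entries from the layout table. It assumes that entries such as `/T` are typed as two consecutive keys and that keys not listed in the excerpt are passed through unchanged. The function name `keys_to_ipa` and this matching strategy are assumptions, not the editor's actual logic.

```python
# Excerpt of the layout table above; not the complete layout.
KEY_TO_IPA = {
    "@": "ə", "E": "ɛ", "N": "ŋ", "X": "ʃ", "U": "ʊ", "^": "ʌ", "R": "ɹ",
    "'": "ˈ", "/d": "ð", "/I": "ɪ", "/T": "ʧ", "/:": "ː",
    "g": "\u0261",  # the IPA letter ɡ, not the ASCII g
}

def keys_to_ipa(keys):
    """Translate a typed key sequence into IPA, preferring two-key entries."""
    out = []
    i = 0
    while i < len(keys):
        pair = keys[i:i + 2]
        if len(pair) == 2 and pair in KEY_TO_IPA:
            out.append(KEY_TO_IPA[pair])
            i += 2
        else:
            # Single key: look it up, or pass it through unchanged.
            out.append(KEY_TO_IPA.get(keys[i], keys[i]))
            i += 1
    return "".join(out)
```

For instance, typing `bl^d` under these assumptions yields the phonemes for "blood" used in the user dictionary example.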
AI models are never 100% correct, and models that predict a sequence (such as g2p-mbyt5-12l-ipa-childes-espeak and the Kokoro decoder) can have problems at the beginning and at the end. Such so-called hallucinations, e.g. at the end, can be removed either in the phoneme editor or in the audio playback via the start/end knobs.
Sometimes the AI model does not produce the expected results because it was not trained for them. The model was trained on a public-domain audiobook library, so the reading style dominates.
Hence it is recommended to use this module in a generative way. Imagining a result and then trying to achieve exactly that result may turn out to be frustrating. It may be better to learn how the model behaves and how the voices sound, and then cherry-pick the good results.
When longer texts are needed, it is generally better to split them into short sentences and use the bank feature to store them in a sequence. This reduces the generation time significantly and also makes it more comfortable to change phonemes and prosody.
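Splitting a longer text into short sentences, one per bank slot, can be prepared outside the module. The sketch below is a simple approach that splits after sentence-ending punctuation. The function name `split_into_sentences` is hypothetical, and abbreviations such as "e.g." would be split incorrectly by this rule.

```python
import re

def split_into_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]
```

Each resulting sentence can then be entered into its own pipeline of the bank and generated separately.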