Korean TTS Workbench — (2) Korean-Specific Tricks and Running on 4GB VRAM
From phone-number normalization to seed determinism gotchas — the small decisions behind running the workbench. Part 2 of 5
Hi. This is part 2 of a 5-part series on building an in-house Korean voice-cloning TTS workbench. Part 1 covered why we landed on Qwen3-TTS on a single 4GB GPU and why a workbench shape was the right form. This post is about how the workbench was actually built — the two themes were "make the model read Korean naturally" and "keep it alive on 4GB VRAM."
About this series
Part 1 — Starting with Qwen3-TTS on a 4GB GPU
Part 2 (this post) — Korean-specific tricks and running on 4GB VRAM
Part 3 — The last syllable getting cut off — diagnostic logging and two real bugs (upcoming)
Part 4 — From a tail-anchor failure to EOS suppression and a one-line library patch (upcoming)
Part 5 — Qwen3-TTS → VoxCPM2, swapping the model out (upcoming)
One-line summary — "Give the model good input, bundle the results well"
The workbench code itself isn't fancy. The big-picture flow is this.
User input text
↓ (Korean-friendly preprocessing)
Normalized text
↓ (branch if repetition pattern)
"Repetition" path: generate unit once → concatenate N times with silence
"Normal" path: split into sentences → generate per sentence → join with 0.2s silence
↓
Final post-processing (append trailing silence)
↓
Return WAV
For the non-technical reader: the model is something like a "voice actor that reads text aloud," but a strangely literal one. It tries to read characters faithfully, which means the "reading conversion" we do automatically in our heads — for example, turning 010-1234-5678 into "zero-one-zero, one-two-three-four, five-six-seven-eight" — doesn't happen. So before handing text to the actor, we need a step that prepares the text the way a person would naturally read it. We also need to chunk long passages into sentences and stitch the audio back together with short pauses, because reading too long in one breath breaks the rhythm.
There are a lot of small decisions inside this single flow. I'll go through them one by one.
1. Korean-friendly preprocessing — the model is dumber than you'd think
I started by handing text directly to the model. Got results like these.
"Phone: 010-1234-5678" → the model would read it as "zero-one-zero minus one-two-three-four minus..." or read the digits in English style ("one-zero-one-oh"), or worst case run "공일공 일이삼사오육칠팔" together as one breathless block.
"Next event 4/14" → it would read "four slash fourteen" word-for-word.
"January 5, 2023" → sometimes fine, sometimes "이천이십삼" came out garbled.
For the non-technical reader: results like these wreck about half the output for announcement-style work. A clip that says "zero-one-zero minus one-two-three-four minus five-six-seven-eight, please call us" is hard to write down on the first listen and undermines the credibility of the message. So the parts a human handles automatically when reading aloud have to be done in code beforehand.
So I added a preprocessing step that, just before generation, rewrites only the input text into natural Korean pronunciation. The original text stays intact — only what goes into the model changes. This separation matters more than it seems — keeping the user's original intact is what lets you re-listen, re-generate, or audit later without confusion.
Phone number normalization
Korean phone-number patterns are surprisingly varied.
11 digits without spaces (01012345678)
Hyphens (010-1234-5678)
Dots (010.1234.5678), which are actually common
Area codes (02-1234-5678, 031-123-4567)
VoIP (070-...)
050-series, 15xx/16xx/18xx representative numbers
Trying to catch all of these with regex gets messy fast. I tried to do it with a single pattern; once inputs started mixing dots, hyphens, spaces, and parentheses, the patterns multiplied to six, seven, eight. Maintaining eight regexes was clearly the wrong path.
So I went two-pronged.
Use the phonenumbers library (the Python port of Google's libphonenumber) for standard validation. It handles Korean number format checks and decomposition reliably, and it knows Korean numbering rules well.
Hand-rolled prefix/length rules for non-standard inputs the library doesn't catch (dot separators, slash separators, atypical lengths).
Matched numbers get rewritten one digit at a time into hangul: 010-1234-5678 → (공일공 일이삼사 오육칠팔). The parentheses are there to nudge the model into reading the contents as a single unit. Without them, the model would sometimes say "공일공" and then insert an awkward pause; with parentheses, it flows more cleanly.
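The digit-by-digit rewrite can be sketched as follows. This is a minimal illustration, not the workbench's actual code: the two regexes and the function names are assumptions, the phonenumbers validation step is left out, and only the separated form and the unseparated 11-digit mobile form are covered.

```python
import re

# Sino-Korean digit readings used when reading phone numbers aloud.
HANGUL_DIGITS = dict(zip("0123456789", "공일이삼사오육칠팔구"))

# Separated form: 010-1234-5678, 010.1234.5678, 02-1234-5678, 031-123-4567
SEPARATED = re.compile(r"(?<![0-9])(0[0-9]{1,2})[-.]([0-9]{3,4})[-.]([0-9]{4})(?![0-9])")
# Unseparated 11-digit mobile form: 01012345678
MOBILE_11 = re.compile(r"(?<![0-9])(01[0-9])([0-9]{4})([0-9]{4})(?![0-9])")


def _spell(match):
    """Spell each digit group in hangul and wrap the whole number in parentheses."""
    groups = ("".join(HANGUL_DIGITS[d] for d in group) for group in match.groups())
    return "(" + " ".join(groups) + ")"


def normalize_phone_numbers(text):
    """Rewrite phone numbers in text into one parenthesized hangul breath group."""
    text = SEPARATED.sub(_spell, text)
    return MOBILE_11.sub(_spell, text)


print(normalize_phone_numbers("Phone: 010-1234-5678"))
```

Unseparated landline numbers are deliberately not handled here, because without separators the boundary between area code and prefix is ambiguous. That is the kind of case where a validation library is a better fit than one more regex.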
For the non-technical reader: when a person reads a phone number out loud, they breathe in three groups — "zero-one-zero," "one-two-three-four," "five-six-seven-eight." Wrapping the digits in parentheses tells the model "this is one breath group."
Slash-form month/day
4/14 일 → 4월 14일, 04 / 05 일 → 4월 5일. A simple rewrite, with one trap — don't touch fractions or URLs. A regex that converts every slash would butcher inputs like 1/2 (a half) or https://.... So the matching pattern is narrowed to digit/digit일 — only triggers when a 일 character follows. That single character acts as a semantic anchor.
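A sketch of that anchored rewrite, under a few assumptions: the pattern and function name are illustrative, and the month/day range check is an extra guard that the text above does not mention.

```python
import re

# digit/digit followed by 일. The lookbehind skips longer paths such as 2023/4/14.
SLASH_DATE = re.compile(r"(?<![0-9/.])([0-9]{1,2})\s*/\s*([0-9]{1,2})\s*일")


def normalize_slash_dates(text):
    """Rewrite 'M/D일' into 'M월 D일', leaving fractions and URLs untouched."""
    def rewrite(match):
        month, day = int(match.group(1)), int(match.group(2))
        if 1 <= month <= 12 and 1 <= day <= 31:
            return f"{month}월 {day}일"
        return match.group(0)  # not a plausible date, keep the original text

    return SLASH_DATE.sub(rewrite, text)


print(normalize_slash_dates("Next event 4/14 일"))
```

Because the trailing 일 is required, inputs like 1/2 or https://... never match in the first place. The range check only matters for the rarer case of a non-date number pair that happens to be followed by 일.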
This kind of small preprocessing isn't glamorous, but it makes a surprising difference in the "is this listenable" axis. A one-line preprocessing rule often shifts perceived quality more than improving the model itself. This is a well-known pattern in NLP work, and TTS turns out to be no exception.
2. Auto-detecting repetition patterns
This is a domain-specific optimization. In announcement-style audio, the same short line getting repeated multiple times is common — "안녕하세요. 안녕하세요. 안녕하세요." or "Option 1. Option 2. Option 3. Option 4." style.
For the non-technical reader: think of it like a repeat sign in sheet music. To play the same measure three times, you don't write the notes three times; you mark "repeat three times." I did the analogous thing — instead of asking the model to generate the same text three separate times, generate it once and concatenate the result three times.
The naive path runs the model three times. With a model that takes seconds per generation, that's just wasted time. Worse, each generation might come out slightly different, breaking the "exactly three times" intent.
So at the input-text stage, repetition patterns are auto-detected.
Whitespace-separated repetition of the same unit (안녕 안녕 안녕)
Punctuation-separated repetition of the same sentence (안녕하세요. 안녕하세요. 안녕하세요.)
When detected, generate the unit exactly once, and stitch the resulting WAV N times with 0.5 seconds of silence between repeats. The 0.5s value approximates a person's natural breath when repeating the same line. 0.3s sounded too mechanical, 1s dragged. 0.5s landed right by ear.
This optimization cuts a 5-repeat clip's generation time to roughly 1/5 (the model runs once, and concatenation is cheap by comparison), and all 5 reps come out identical. The latter is honestly the bigger win: the same line repeated five times has to sound the same five times for the announcement to read as consistent.
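The two detection cases could look roughly like this. It is a minimal sketch: the function name, the min_repeats parameter, and the exact-match comparison are assumptions, and a real detector may want to normalize whitespace or punctuation before comparing.

```python
import re


def detect_repetition(text, min_repeats=2):
    """Return (unit, count) when text is one unit repeated, otherwise None."""
    stripped = text.strip()
    candidates = (
        re.split(r"(?<=[.?!])\s+", stripped),  # punctuation-separated sentences
        stripped.split(),                      # whitespace-separated units
    )
    for units in candidates:
        if len(units) >= min_repeats and len(set(units)) == 1:
            return units[0], len(units)
    return None


print(detect_repetition("안녕하세요. 안녕하세요. 안녕하세요."))
print(detect_repetition("안녕 안녕 안녕"))
```

The sentence-level split is tried first so that a multi-word line repeated several times is treated as one unit rather than compared word by word.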
3. Sentence-level split generation — for length stability
Two problems show up if you feed long text to the model in one shot.
Memory usage scales non-linearly (lethal on a 4GB GPU)
Higher chance of mid-output noise or rhythm drift
For the non-technical reader: it's similar to how a person can only read so much in one breath. Try to read a long paragraph in one go and you'll either fall apart in the middle or run out of breath before the end. The model has a similar limit, and chunking input into manageable pieces is more stable.
So inputs over a length threshold get split into sentences, generated separately, and joined with 0.2 seconds of silence.
Sentence splitting looks simple but has many traps. I started with a basic text.split('.'). It broke within thirty minutes.
These cases all need to be excluded from splitting.
Decimal points: 1.2.3 version (the dots aren't sentence ends)
Domain dots: example.com
English abbreviations: Dr., Mr., Mrs., Jr., e.g., i.e., etc., vs.
Ellipses: dots inside ..., …
Version numbers: Python 3.11, v1.2.3
Decimal numbers: 3.14, 0.5초
So the splitter isn't a single regex. It protects the patterns above and only treats ., ?, !, 。, ！, ？, … as actual sentence terminators. Order: protect-pattern matching → masking → terminator-based split → mask restore. I once got the order wrong and a domain name got cut in half: example and com were generated separately with a 0.2-second pause between them. Listening to that one made me laugh for a minute.
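The protect → mask → split → restore order can be sketched as below. The protected patterns are a partial, illustrative list, and the placeholder scheme (a NUL-delimited index) is an assumption, not the workbench's actual implementation.

```python
import re

# Spans whose dots must not end a sentence (partial list, for illustration).
PROTECTED = re.compile(
    r"""
    \.{3}                                     # ASCII ellipsis
  | \b(?:Dr|Mr|Mrs|Jr|etc|vs)\.               # English abbreviations
  | \b(?:e\.g|i\.e)\.
  | \bv?[0-9]+(?:\.[0-9]+)+                   # decimals and versions: 3.14, v1.2.3
  | \b[A-Za-z0-9-]+(?:\.[A-Za-z]{2,})+\b      # domains: example.com
    """,
    re.VERBOSE,
)
SENTENCE = re.compile(r"[^.?!。！？…]+[.?!。！？…]*")
PLACEHOLDER = re.compile(r"\x00([0-9]+)\x00")


def split_sentences(text):
    """Split on sentence terminators while keeping protected spans intact."""
    saved = []

    def mask(match):
        saved.append(match.group(0))
        return f"\x00{len(saved) - 1}\x00"  # contains no terminator characters

    masked = PROTECTED.sub(mask, text)       # steps 1 and 2: match and mask
    pieces = SENTENCE.findall(masked)        # step 3: split on terminators

    def restore(piece):                      # step 4: put the originals back
        return PLACEHOLDER.sub(lambda m: saved[int(m.group(1))], piece)

    return [restore(p).strip() for p in pieces if p.strip()]


print(split_sentences("Visit example.com today. Python 3.11 is out!"))
```

One known limitation of this sketch: an abbreviation such as etc. at the very end of a sentence is masked too, so that sentence merges with the next one.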
The 0.2s inter-sentence silence is shorter than the 0.5s repetition silence. They mean different things — repetition silence is "the breath of starting the same line again," inter-sentence silence is "a short pause inside the same paragraph." The same gap of silence reads completely differently to a listener depending on context.
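Both kinds of silence come down to the same join with a different gap. A minimal sketch, assuming the clips are raw mono 16-bit PCM byte strings that share one sample rate (the function names and that representation are assumptions):

```python
REPEAT_GAP_SECONDS = 0.5    # breath before starting the same line again
SENTENCE_GAP_SECONDS = 0.2  # short pause inside the same paragraph


def silence(seconds, sample_rate, sample_width=2, channels=1):
    """Return zero-valued PCM bytes covering the given duration."""
    frames = round(seconds * sample_rate)
    return b"\x00" * (frames * sample_width * channels)


def join_clips(clips, gap_seconds, sample_rate):
    """Concatenate PCM clips, inserting silence between them but not after the last."""
    return silence(gap_seconds, sample_rate).join(clips)


def stitch_repeats(unit_clip, count, sample_rate):
    """Repetition path: one generated clip, repeated count times."""
    return join_clips([unit_clip] * count, REPEAT_GAP_SECONDS, sample_rate)


def stitch_sentences(clips, sample_rate):
    """Normal path: per-sentence clips joined in order."""
    return join_clips(clips, SENTENCE_GAP_SECONDS, sample_rate)
```

Keeping the two gaps as named constants makes the difference in meaning visible in the code and not only in the listening test.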
Priority — repetition first, sentence split second
There's a small but real decision here. An input like 안녕. 안녕. 안녕. — is it a repetition pattern, or three sentences to split? Both interpretations work technically, but treating it as repetition is the better operational choice. That guarantees the same audio plays exactly three times.
So the pipeline is "preprocess → check the full text for repetition → if repetition, take that path; otherwise, sentence split." This priority matters more than it sounds. Reverse it and the same line ends up subtly different across reps. The second "안녕" tones differently from the first — when you hear the same line repeated and the breath is off, you notice immediately.
4. Multi-generation — full N×M combinations of references and lines
Operationally, this is where the workbench earned its keep. Tasks like "generate this week's 5 candidate announcement lines across all 4 registered reference voices" got handled in a single request.
For the non-technical reader: the same line spoken in different voices feels different. Some announcements suit a calm voice, others suit a friendlier one. The fastest way to decide which fits is to generate them all and listen. So a feature that generates "5 candidate lines × 4 candidate voices = 20 results" in one call was needed.
The request is simple.
ref_ids: ["voice_a", "voice_b", "voice_c", "voice_d"]
texts: [
  "Hello. This is the in-house announcement system.",
  "Here are this week's main schedule items.",
  "Details are available on the internal portal.",
  ...
]
seed: 529443   # same seed for all combinations
The server processes these sequentially. Parallel generation on 4GB VRAM is too OOM-prone. I tried running two at a time once and it OOM'd nearly every time. Stability won, sequential was the call. Each combination is followed by the CUDA cache cleanup routine I described in Part 1.
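The sequential loop can be sketched like this. The generate, set_seed, and cleanup callables are hypothetical stand-ins for the model call, the seeding function, and the CUDA cache cleanup; re-seeding before every combination is an assumption about how "same seed for all combinations" is applied.

```python
import base64
from itertools import product


def generate_all(ref_ids, texts, seed, generate, set_seed, cleanup):
    """Generate every reference x text combination, strictly one at a time."""
    results = []
    for (ref_index, ref_id), (text_index, text) in product(
        enumerate(ref_ids), enumerate(texts)
    ):
        set_seed(seed)  # assumption: each combination starts from the same seed
        try:
            wav_bytes = generate(ref_id, text)
        finally:
            cleanup()   # release cached GPU memory even if generation failed
        results.append({
            "ref_index": ref_index,
            "text_index": text_index,
            "ref_id": ref_id,
            "text": text,
            "wav_base64": base64.b64encode(wav_bytes).decode("ascii"),
        })
    return results
```

Running cleanup in a finally block means one failed combination does not leave memory pinned for the ones that follow.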
Results come back as a base64-encoded WAV array, each playable and downloadable from the frontend. And one more thing — everything bundled into a single ZIP for batch download.
The ZIP gets built on the browser side, deliberately. I considered server-side ZIP and rejected it for two reasons.
Server has to hold temporary file state. Until the user clicks download, the ZIP has to live somewhere; that accumulates and fills the disk. You'd also need a separate cleanup policy.
The browser already has the data. The base64-encoded results were already shipped to it — re-bundling on the server is redundant work.
I didn't even pull in an external ZIP library (JSZip or similar). Just wrote the ZIP store format, CRC32, and central directory record by hand. No compression (WAV doesn't compress well, and skipping compression makes the code much simpler). About 100 lines of vanilla JavaScript, done. A CDN dependency would have been more burden than help.
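For readers curious what "by hand" involves, here is the same store-only layout sketched in Python rather than the browser-side JavaScript the workbench uses. It is an illustration of the format, not a port of the original code: the function names are assumptions, the timestamp is fixed, and ZIP64 (archives over 4 GiB or 65,535 entries) is not handled.

```python
import io
import struct
import zipfile
import zlib

DOS_TIME, DOS_DATE = 0, (1 << 5) | 1  # 1980-01-01 00:00:00, fixed for simplicity
UTF8_FLAG = 0x0800                    # general-purpose bit 11: names are UTF-8


def build_stored_zip(files):
    """Build an uncompressed ZIP from a list of (name, data_bytes) pairs."""
    body, central = bytearray(), bytearray()
    for name, data in files:
        name_bytes = name.encode("utf-8")
        crc = zlib.crc32(data) & 0xFFFFFFFF
        offset = len(body)
        # Local file header (30 bytes), then the name and the raw data.
        body += struct.pack("<IHHHHHIIIHH", 0x04034B50, 20, UTF8_FLAG, 0,
                            DOS_TIME, DOS_DATE, crc, len(data), len(data),
                            len(name_bytes), 0)
        body += name_bytes + data
        # Central directory record (46 bytes), then the name again.
        central += struct.pack("<IHHHHHHIIIHHHHHII", 0x02014B50, 20, 20,
                               UTF8_FLAG, 0, DOS_TIME, DOS_DATE, crc,
                               len(data), len(data), len(name_bytes),
                               0, 0, 0, 0, 0, offset)
        central += name_bytes
    # End-of-central-directory record (22 bytes).
    eocd = struct.pack("<IHHHHIIH", 0x06054B50, 0, 0, len(files), len(files),
                       len(central), len(body), 0)
    return bytes(body + central + eocd)


def read_back(archive):
    """Sanity check: let the standard zipfile module parse the result."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        assert zf.testzip() is None  # every stored CRC32 matches its data
        return {name: zf.read(name) for name in zf.namelist()}
```

With compression method 0 the compressed and uncompressed sizes are the same number, which is a large part of why skipping compression keeps the code short.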
Filenames are auto-assigned in {reference_index}_{text_index}_{slug}.wav format with no collisions. That's so anyone opening the ZIP can tell which line came from which reference at a glance. Started with plain serial numbers; quickly noticed that opening the ZIP gave no way to trace which file was which voice, so I switched.
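A naming helper along these lines would produce that format. The slug rules (separator character, 24-character cap, "untitled" fallback) are assumptions for illustration; uniqueness comes from the index pair, not from the slug.

```python
import re


def slugify(text, max_length=24):
    """Collapse anything that is not a letter or digit into single hyphens."""
    slug = re.sub(r"[\W_]+", "-", text).strip("-").lower()
    return slug[:max_length].rstrip("-") or "untitled"


def result_filename(reference_index, text_index, text):
    """Build {reference_index}_{text_index}_{slug}.wav for one result."""
    return f"{reference_index}_{text_index}_{slugify(text)}.wav"


print(result_filename(1, 0, "안녕하세요. 반갑습니다!"))
```

Underscores are stripped from the slug so that the two underscores separating the indices stay unambiguous when someone parses the filename later.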
5. Seed locking and presets — reproducibility
I argued in Part 1 that determinism is one of the core advantages of running an open-source model. Reflecting that properly inside the workbench means seed-based reproduction has to actually work.
For the non-technical reader: a "seed" is essentially the starting number that drives random behavior in a computer. There's some randomness inside the model when it generates audio, and giving the same seed reproduces the same randomness, so the result comes out identical. That's how "regenerate the same audio that came out well last time" becomes possible. Note 521 once, give 521 again next week, and you get the same clip.
The set_seed(seed) function is short, but it pins all of the following.
Python's random module
NumPy's RNG
PyTorch (both CPU and CUDA)
Miss any one of the three and the same seed produces different output. I missed the NumPy seed for a while at the start, and chased ghosts. Same seed, two calls, slightly different results. I assumed it was non-determinism in the model itself and dug through model options for a long time. Turned out the model was using NumPy calls like np.random.choice internally, and those were running on an unseeded RNG. There's no way to predict that ahead of time; you only know after one such episode.
This kind of trap is what makes guaranteeing determinism hard. Some library somewhere has a hidden RNG, and you have to corner all of them before the same input really produces the same output. Once cornered, it's stable; the first time, it's a trap.
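A function pinning all three typically looks like the sketch below. It is not the workbench's exact code; the guarded imports are only there so the snippet stays importable on a machine without NumPy or PyTorch.

```python
import random

try:
    import numpy as np
except ImportError:  # keeps this sketch runnable without NumPy installed
    np = None

try:
    import torch
except ImportError:  # keeps this sketch runnable without PyTorch installed
    torch = None


def set_seed(seed):
    """Seed every RNG the generation path might touch."""
    random.seed(seed)                         # Python's random module
    if np is not None:
        np.random.seed(seed)                  # NumPy's global RNG
    if torch is not None:
        torch.manual_seed(seed)               # PyTorch CPU generator
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)  # every CUDA device


set_seed(529443)
first = random.random()
set_seed(529443)
print(first == random.random())
```

Note that np.random.seed only covers NumPy's legacy global RNG. A library that builds its own np.random.default_rng() generator would need to be seeded separately.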
A preset system sits on top. Save the full context of a result that came out well — reference id, seed, language, sampling options — and a later request with the same preset on new text produces the same tone.
This made a bigger operational difference than I expected — the ability to reproduce "that good tone from last week" exactly. When the same person's same voice has to keep delivering similar lines across quarterly updates, holding the tone constant across quarters matters from the listener's side. It was one of the features that defined what this workbench actually was.
The preset data is just a JSON file. No database — for the workbench's scale (single-digit concurrent users, dozens of cumulative presets), the filesystem is the simplest and fastest thing. Backup is a directory copy.
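A preset stored this way could be as small as the following sketch. The field names, the contents of the sampling dictionary, and the one-file-per-preset layout are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Preset:
    ref_id: str
    seed: int
    language: str
    sampling: dict = field(default_factory=dict)  # hypothetical sampling options


def preset_to_json(preset):
    """Serialize a preset, keeping non-ASCII text readable in the file."""
    return json.dumps(asdict(preset), ensure_ascii=False, indent=2)


def preset_from_json(payload):
    return Preset(**json.loads(payload))


def save_preset(directory, name, preset):
    """Write one preset as <directory>/<name>.json and return the path."""
    path = Path(directory) / f"{name}.json"
    path.write_text(preset_to_json(preset), encoding="utf-8")
    return path


def load_preset(directory, name):
    path = Path(directory) / f"{name}.json"
    return preset_from_json(path.read_text(encoding="utf-8"))
```

Usage is a round trip: save_preset("presets", "calm_announcer", preset) after a good result, then load_preset("presets", "calm_announcer") when the same tone is needed on new text.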
6. Operations — keeping the model library inside the repo
There's one structurally unusual decision: the upstream model library (qwen_tts) lives inside the repo as part of our source tree. Normally you'd list it in requirements.txt and pip-install, which is cleaner. Instead, the upstream source sits inside clone_folder/Qwen3-TTS/ and the docker build uses that.
For the non-technical reader: the usual approach with external libraries is "list what you need; download at runtime." Like keeping a "borrow this book" memo on your shelf instead of the book itself. We kept the book on our shelf. The reason: we ended up needing to write notes inside that book, and those notes had to be visible on every other machine.
Two reasons.
First, the model library was changing often at the time, and there was a behavior we wanted pinned. The most reliable way to freeze a particular point-in-time source is just to keep it. Pinning a version in requirements.txt is one way, but the chance of a maintainer pushing different code under the same version tag isn't zero (rare, but it happens).
Second — this reason got reinforced retroactively from the troubleshooting in Part 3 onward — we ended up needing to patch a single line in the upstream source. For that patch to apply automatically on other machines, keeping upstream inside the repo was the safest option. With an external dependency, you'd reapply the patch by hand each time, or maintain a fork separately. Both create operational overhead.
The detail of that decision is in Part 4. Brief preview — there's a min_new_tokens=2 hardcoded inside the library that the model was respecting regardless of what we passed in, and we needed a one-line patch to release it. That one-line patch had to apply consistently on every machine, so we carry the library source.
To compensate, build artifacts (__pycache__/, *.egg-info/) are explicitly ignored, and the parts we modified are tracked in a separate patch note (LOCAL_PATCHES.md). The note records:
Which file, which line was changed
What the original code was, what it was changed to
Why (which problem was it for)
How to roll it back
That makes the points where our patch needs to be reapplied during an upstream update obvious. And anyone looking at this repo for the first time has a document answering "why is the library source bundled in here?"
7. Post-processing — a small slice of trailing silence
Last step: post-processing. A 0.15-second silence is appended to the end of every generated WAV. Simple, effective.
The model occasionally returns a result that's very tightly cut. The natural fade-out of the last syllable doesn't fit; the audio just stops at the syllable boundary. Played as-is, listeners hesitate — "wait, did it end?" — for a beat before moving on, which feels off.
For the non-technical reader: when a person finishes speaking, there's a natural 0.1–0.2 second tail after the last syllable. The breath ending, the voice fading. Without it, the listener feels "cut off." The model sometimes ends exactly at the syllable boundary with no tail, so we manufacture and append the tail.
The 0.15s silence smooths that abruptness. I considered 0.3s, but it dragged; 0.15 was where it landed. Decided after a few listening tests.
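Appending the tail is a few lines with the standard wave module. A minimal sketch, assuming the model output is integer PCM WAV with 16-bit or wider samples (8-bit WAV uses 0x80 as its zero level, and float WAV is not readable by wave):

```python
import io
import wave


def append_trailing_silence(wav_bytes, seconds=0.15):
    """Return a copy of the WAV with zero-valued frames appended at the end."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    pad_frames = round(seconds * params.framerate)
    frames += b"\x00" * (pad_frames * params.sampwidth * params.nchannels)

    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)    # same channels, sample width, and rate
        dst.writeframes(frames)  # header frame count is fixed up on close
    return out.getvalue()
```

The pad is computed from the file's own frame rate, so the same 0.15 seconds holds regardless of the sample rate the model emits.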
— Here's the thing, though: this silence-append turns out to be the starting point of the troubleshooting in the next post. Realized much later, but there was a different kind of "the end feels off" that 0.15s couldn't fix. That feeling persisted as a "the last word sounds like it gets cut off" pattern that kept catching my ear, and it became the start of a two-week trace.
Recap
This workbench is the combination of:
Korean-friendly preprocessing: phone numbers and slash-form month/day rewritten into hangul pronunciation. phonenumbers library + custom rules.
Repetition pattern auto-detection: same line N times generated once, stitched N times with 0.5s silence between.
Sentence-level split generation: long inputs split per sentence, joined with 0.2s silence. Decimals/domains/abbreviations/ellipses are protected during splitting.
Multi-generation and ZIP download: N×M combinations in one pass, ZIP built in the browser (no server temporary files).
Seed locking and presets: Python random + NumPy + PyTorch, all three pinned. Good results saved as full-context JSON.
Upstream model library bundled in repo: pins behavior, applies patches automatically across machines. Patch notes (LOCAL_PATCHES.md) tracked separately.
4GB VRAM operational practices: BF16/FP16 auto-selection, conservative memory ceiling, post-request CUDA cache cleanup, 0.15s trailing silence.
Up to here it was uneventful work. This workbench wasn't a deployed service — it was a personal test bed for producing audio files that go into the actual service — and the only times multiple in-house people came in were review meetings or quick sample showcases, where everyone gathered around the same page and listened together. That flow ran fine for a few weeks: the solo loop of generating a reference × line combination, listening, moving to the next; multi-generation ZIPs; Korean preprocessing; all clean.
Then, once I'd listened through enough generations to compare them, the same pattern kept catching my ear.
The last word ending too quickly — sounding like it got cut off.
I assumed at first it was occasional, a specific result. But the same pattern kept showing up across different texts, different references, different seeds. That's a systemic-problem signal, and it persisted long after that.
Part 3 walks through the dozen-plus hypotheses and experiments behind that one line. The biggest payoff in that chase wasn't fixing anything — it was laying down a way to see before fixing. The pattern: when guessing stops working, stop guessing more and lay down more visibility tools instead. Faster in the end.
Thanks for reading.

