Korean TTS Workbench — (4) When the Workarounds Failed, a Hardcoded Line Inside the Library

Hi. This is part 4 of a 5-part series on building an in-house Korean voice-cloning TTS workbench. Part 3 ended on the conclusion that "the official model just drops Korean final-syllable prosody this way." If the model itself can't be fixed, can post-processing on the application layer compensate? — that's where this part starts. The result, stated up front: both workarounds collapsed, and at the end of that collapse a single line buried inside the library surfaced. The most expensive lesson of the series comes from that line.

About this series

Part 1 — Starting with Qwen3-TTS on a 4GB GPU
Part 2 — Korean-specific tricks and running on 4GB VRAM
Part 3 — The last syllable sounds cut off — and the cause was the model itself
Part 4 (this post) — When the workarounds failed, a hardcoded line inside the library
Part 5 — Qwen3-TTS → VoxCPM2, swapping the model out (upcoming)

Starting from Part 3's conclusion — can post-processing compensate?

The conclusion at the end of Part 3 was:

All the safeguards in our code had no effect
Switching prompt mode, language, and model size all had no effect
The official checkpoint just drops Korean final-syllable prosody this way

So the starting point for this part becomes one line — if the model itself can't be fixed, can the model's output be compensated through post-processing?

Two ideas came out of that question.

Tail anchor: append a compensation anchor sentence after the original text so the model doesn't treat the original ending as the actual ending. Then trim the anchor portion from the audio.
EOS suppression: force the moment when the model "is allowed to end" further out, so the final syllable has guaranteed time to be pronounced.

Both got tried in order. Both ended in failure, but the shape of the failures was different. Tail anchor collapsed under stacks of safeguards; EOS suppression turned out to have been silently no-op'd from the start, only confirmed after the fact.

Workaround 1 — Tail anchor: a plausible-looking idea

The idea — make the last syllable not be the last

This idea accepts the conclusion that "the model handles the final syllable awkwardly," then asks a simple question — "so what if the final syllable isn't the final syllable?"

Take an input like "Today's in-house lunch menu announcement." where the final "다" is getting clipped. What if the input becomes:

Today's in-house lunch menu announcement. Next announcement.

Now the "다" of "...announcement." is no longer a sentence ending; it's a syllable in the middle. From the model's view it's not a sentence-final syllable, so it's likely to be carried through naturally. Then trim the "Next announcement." portion afterward, and the desired "...announcement." ends up ending naturally. Like a singer who clips final syllables: have them sing "I love you forever" instead of "I love you," then cut "forever" from the recording so "I love you" ends naturally.

This was implemented exactly as described. The default behavior stayed; the new behavior was opt-in (a UI checkbox), an experimental mode that only fires when explicitly toggled. If the effect is null or there's a side-effect, it just gets turned off. That was the safety net.

First collapse — fixed-length trimming doesn't fit

The first attempt was simple. "An anchor sentence is usually about 1 second, so trim 1 second from the end." Five lines of code, the simplest possible implementation.

The problem — the same anchor sentence varies in actual audio length depending on seed and prosody. On one seed the anchor is 0.8s, on another it's 1.2s. With a fixed 1s trim, some outputs leave 0.2s of anchor remaining, others trim into the original final syllable that should have been preserved. Like the same singer singing "forever" at slightly different durations every take — fix the trim at 1s and the first take leaves "for-" trailing, the second take cuts into "I love y-".

This was an expected limit, but the assumption was that it would mostly work. The real data showed bigger variance than expected. Even on the same seed, the anchor length shifted depending on the preceding sentence's length, because the model's pacing changed. Fixed trim didn't work.
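A minimal sketch of why a fixed trim cannot fit both cases, assuming mono audio held as a plain Python sequence of samples. The names trim_tail and leftover_after_fixed_trim are illustrative and not taken from the workbench code.

```python
def trim_tail(samples, sample_rate, trim_sec):
    """Drop the last trim_sec seconds from a mono sample sequence."""
    cut = int(round(trim_sec * sample_rate))
    return samples[:max(0, len(samples) - cut)]


def leftover_after_fixed_trim(anchor_sec, trim_sec=1.0):
    """Positive: seconds of anchor still left in the output.
    Negative: seconds of the original speech that got cut away."""
    return anchor_sec - trim_sec


# The two anchor lengths mentioned above, both trimmed by a fixed 1 second.
for anchor_sec in (0.8, 1.2):
    print(anchor_sec, round(leftover_after_fixed_trim(anchor_sec), 2))
```

With a 1.2s anchor the result is positive (anchor audio remains); with a 0.8s anchor it is negative (the original final syllable is cut). No single trim_sec makes both zero.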

Second collapse — three simultaneous failures of the auto-cut heuristic

If fixed trim doesn't work, find the cut point dynamically. Pulling in a full ASR (Automatic Speech Recognition — a model that converts audio back to text) felt too heavy, so a lightweight signal-processing heuristic came first. ASR could pinpoint exactly where "Next announcement" begins, but it's a separate model with memory and time costs. So a lighter direction — "find the boundary between speech and silence and cut there" — got tried first.

A small RMS (average energy) window scans the last 2.4 seconds, and the "energy valley" (the brief pause between words) between the last two speech segments gets located automatically. Treat that valley as the boundary between original-text-end and anchor-start, and cut there.

There was a fundamental limit to the idea. RMS valleys identify "speech/silence transitions" but not "semantic ending points." It assumes there's a small pause between "I love you" and "forever." If they're sung joined together (Iloveyouforever), there's no valley to find. But the experiment ran anyway.
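The valley search can be sketched roughly as follows. This is an illustration under assumptions, not the workbench's actual code: only the 2.4-second search window comes from the description above, while the 20ms window size, the silence_rms threshold, and the function names are hypothetical.

```python
import math


def window_rms(samples, win):
    """RMS energy of consecutive non-overlapping windows of `win` samples."""
    out = []
    for start in range(0, len(samples) - win + 1, win):
        chunk = samples[start:start + win]
        out.append(math.sqrt(sum(x * x for x in chunk) / win))
    return out


def find_anchor_cut(samples, sample_rate, search_sec=2.4,
                    win_sec=0.02, silence_rms=0.01):
    """Return the sample index in the middle of the last energy valley
    inside the final search_sec seconds, or None if no valley is found."""
    win = max(1, int(round(win_sec * sample_rate)))
    start = max(0, len(samples) - int(round(search_sec * sample_rate)))
    voiced = [r > silence_rms for r in window_rms(samples[start:], win)]

    i = len(voiced) - 1
    while i >= 0 and not voiced[i]:   # skip trailing silence
        i -= 1
    while i >= 0 and voiced[i]:       # skip the last speech segment (the anchor)
        i -= 1
    valley_end = i                    # last silent window before the anchor
    while i >= 0 and not voiced[i]:   # walk back through the valley
        i -= 1
    if valley_end < 0 or i < 0:
        return None                   # only one speech segment is visible
    valley_start = i + 1
    mid = (valley_start + valley_end + 1) // 2
    return start + mid * win
```

When the two phrases are joined with no pause, every window counts as voiced and the function returns None, which is the "only one speech segment exists" outcome. The sketch also has no way to tell an anchor boundary from a natural pause inside the original sentence.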

A batch run on 44 references × 1 announcement showed three different outcomes, two of them distinct failure modes, occurring within a single input:

Sentence 1: the model didn't produce enough silence at any point, so the auto-cut heuristic decided "only one speech segment exists" → auto-cut failed → fallback to fixed 1s trim → this turned out to be a case where the anchor was actually short, so the end of the original got trimmed too
Sentences 3 and 4: auto-cut returned 1.26s and 1.38s respectively → the actual anchors were shorter → part of the original got cut along with the anchor
Sentence 5: auto-cut at 0.97s, normal trim → behaved as intended

Two failure modes and one normal result co-occurring in one input, and each failure demands a different treatment.

Failure mode	Cause	Possible treatment
Auto-cut search itself failed	Not enough silence generated	Widen search window, relax threshold
Auto-cut value too large	Wrong RMS valley picked	More conservative threshold
Normal	—	—

The first treatment and the second conflict. Relaxing the threshold catches the first case but makes the second case more frequent. Conservative threshold is the reverse. Satisfying both cases with a single parameter was fundamentally impossible.

At this point one thing became clear — RMS heuristics know speech/silence transitions, not semantic boundaries. The same 0.5s of silence might be the boundary between original-text-end and anchor-start in one case, and a natural pause inside the original sentence in another. Signal processing alone can't distinguish them. Switching to VAD (Voice Activity Detection — more precise speech-region detection) doesn't fix this. VAD judges "is there speech here or not" more accurately, but doesn't judge "what kind of semantic boundary is this silence." That's ASR territory.

Third collapse — runaway in long inputs

On top of that, one more thing. Turning on tail anchor for a long input produced this in the processing log:

🧩 Voice Clone sentence count: 1
🔊 Sentence 1/1 raw_sec=163.59s

163 seconds. The model was emitting nearly up to its max token budget. Almost three minutes of audio for what should have been about 30 seconds of text.

The cause was simple — the tail anchor code was forcing "single-shot generation mode" to make the anchor effect work, which meant long inputs were being thrown to the model whole, without sentence splitting. The Part 2 safeguard of "split long inputs by sentence, generate separately, then concatenate" was being bypassed in tail anchor mode. The wrong assumption was "the anchor must attach exactly to the last sentence's end, so the whole input has to go in together." Confronted with a too-long input, the model hallucinated audio and ran almost up to max.

Fixed it twice. First applied the anchor only to the final sentence of long inputs, which sent the middle sentences back to the original problem. Then changed it again to apply the anchor individually to each split sentence's end.
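The second fix, attaching the anchor to every split sentence, might look like the sketch below. The splitter is a deliberately naive stand-in (the Part 2 splitting logic is not shown in this post), and both function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def with_tail_anchor(text, anchor):
    """Append the anchor sentence to each split sentence, so every sentence
    is generated (and later trimmed) on its own."""
    return [f"{sentence} {anchor}" for sentence in split_sentences(text)]


for item in with_tail_anchor("First notice. Second notice.", "Next announcement."):
    print(item)
```

Each element is then generated separately and trimmed separately, which keeps long inputs out of single-shot mode but multiplies the number of trims that can go wrong.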

By this point one thing had become clear:

The tail anchor path leaks somewhere new every time a leak gets patched. The fundamental limit — that RMS/VAD-based boundary detection doesn't understand semantic boundaries — can't be covered up with workarounds. And extending the same workaround to long inputs produced runaway. Safeguards stacking on safeguards.

The code accumulated for tail anchor by this point — anchor text input, auto-cut enable option, fallback trim length on auto-cut failure, auto-cut search window length, split-first vs whole-first for long inputs, last-sentence-only vs all-sentences application, the same option wired into single generation / multi-generation / tuning / preset regeneration. All safeguards for a single workaround. And after all this, by ear it still sounded clipped. The premise needed to change.

Workaround 2 — EOS suppression: make the model unable to end

The idea

Instead of trying to trim, make the model unable to end in the first place. Pass a min_new_tokens argument large enough to force the model to keep generating until at least that many tokens have been produced; the model can't emit EOS until then. So the final syllable gets guaranteed time to be pronounced. Like telling a singer "you must sing for at least 3 minutes 30 seconds": until that length, the singer can't signal "end," so the last syllable naturally gets carried through too.

Korean uses roughly 8–13 codec tokens per character based on internal observation. So multiply input character count by an appropriate factor to compute min_new_tokens automatically and inject it. A factor around 6 would give enough room for the model to fully pronounce the input.

The option got added with three test scenarios:

A: baseline (factor=0, only the standard estimate)
B: conservative default (factor=6.0, lower bound of observation)
C: aggressive (factor=10.0)

Same seed 529443, same reference, same 5-sentence input — all three scenarios run and compared.
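The factor-to-token computation for scenarios B and C can be sketched as below. The function name is hypothetical, and returning None for factor=0 is an assumption standing in for "fall back to the standard estimate", whose formula is not given here.

```python
def forced_min_new_tokens(char_count, factor):
    """Token floor derived from input length: character count times factor.
    Returns None when the factor is 0, meaning no forced floor is applied."""
    if factor <= 0:
        return None
    return int(round(char_count * factor))


# An 11-character sentence under the conservative and aggressive factors.
print(forced_min_new_tokens(11, 6.0))
print(forced_min_new_tokens(11, 10.0))
```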

Result — all three scenarios were bit-identical

Sentence	Chars	A's min_new_tokens	B's min_new_tokens	C's min_new_tokens	raw length (A=B=C)
1	11	16	66	110	1.68s
2	9	14	54	90	1.36s
3	44	64	264	440	9.01s
4	20	30	120	200	3.83s
5	18	27	108	180	2.55s

min_new_tokens got pushed from 16 to 110, roughly a 7× increase, and the output waveform was completely identical. Not just raw length; the tail energy measurements were bit-identical too.
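One simple way to check for bit-identical output is to hash the raw audio bytes of each scenario. This is a sketch of that kind of check, not the comparison script actually used; the function names are illustrative.

```python
import hashlib


def digest(audio_bytes):
    """SHA-256 of the raw audio bytes (for example, the contents of a WAV file)."""
    return hashlib.sha256(audio_bytes).hexdigest()


def bit_identical(outputs):
    """outputs maps a scenario label to raw audio bytes.
    True only if every scenario produced byte-for-byte identical audio."""
    return len({digest(data) for data in outputs.values()}) == 1
```

If three scenarios with very different min_new_tokens values hash to the same digest, the argument is almost certainly not influencing generation at all.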

Like telling a singer "sing for at least 3 minutes 30 seconds" and they sang to 3 minutes. "Then sing to 4 minutes" — still 3 minutes. "Sing to 5 minutes" — still 3 minutes. The instruction isn't reaching the singer. Someone in the middle is intercepting the message.

What this means is — the min_new_tokens value passed in at the library call was not reaching the actual model.

The service log printed it correctly. From the call site, 110 was clearly being passed. But the model was ignoring it. Determinism that produces bit-identical output across calls means the model was, in fact, running with the same arguments every time.

The last suspicion from Part 3 — "the args visible in the service log can differ from the args the model actually receives" — came back. Last time it was the library wrapper silently injecting defaults that weren't in the log. This time it was the reverse: values printed in the log weren't getting into the actual model call.

This time it had to be confirmed by data, not hypothesis.

Reading the library source

From here it's not hypothesis but code tracing. The model library (qwen_tts) source got opened directly. Libraries are usually treated as black boxes — input in, output out, no peering inside. But when input and output don't match up, you have to look inside to make progress.

Clue 1 — the wrapper layer was clean

The library had two layers of abstraction:

The wrapper entry point generate_voice_clone()
The model body Qwen3TTSForConditionalGeneration.generate() invoked inside it

Wrapper first. The wrapper's _merge_generate_kwargs() function preserved arguments correctly:

merged = dict(kwargs)
merged.update(do_sample=..., top_k=..., ..., max_new_tokens=...)
return merged

So the min_new_tokens=110 passed at call time stays alive all the way to self.model.generate(**merged). OK so far. The wrapper preserves arguments and forwards them to the next layer.

So suspicion shifted to the next layer — the model body's generate() method.

Clue 2 — the next layer dropped them silently

Looking inside Qwen3TTSForConditionalGeneration.generate(), arguments aren't passed straight through to the model internals; instead, only an explicitly named set of keys is picked into a new dict for the call.

def generate(self, input_ids=None, ..., **kwargs):
    talker_kwargs = {
        "max_new_tokens": max_new_tokens,
        "min_new_tokens": 2,             # ⚠️ hardcoded
        "do_sample": do_sample,
        ...
        "suppress_tokens": [...],         # also hardcoded
        ...
    }
    ...
    talker_result = self.talker.generate(..., **talker_kwargs)
    # The **kwargs received from outside is NOT included in talker.generate

Found it.

The talker_kwargs dict has "min_new_tokens": 2 literally embedded. The **kwargs received from the caller doesn't make it into this dict — it just disappears. Whether 110 or 1000 was passed, the model was always running with min_new_tokens=2.

The same fate applied to other HuggingFace standard extension arguments — pad_token_id, suppress_tokens, logits_processor. Received, but not propagated to the model.

The library's official docs said "accepts HuggingFace generate kwargs." It accepts them, but they don't reach the model interior. Closer to "received and then discarded."

A message sent to the singer reached the manager fine. But when the manager passed it to the singer, the manager picked only a few preset items and discarded the message. "Sing for at least 5 minutes" got sent, and the manager passed only their own memo of "default 3 minutes" to the singer. The exact point inside the library where arguments were getting dropped was right there.

From the caller's view, the library was breaking its promise that "passed arguments are applied." And not at the wrapper layer, but one layer further in. So the wrapper layer alone gave no signal of the issue — only the bit-identical A/B/C scenarios as quantitative evidence pulled the search far enough inside.

Decision — patch the library source with one line

The path was set. Three options:

Fork the library and maintain it directly — full freedom but every library update has to be tracked. High maintenance cost.
Patch one line inside this repo — minimal impact. Reapply one line on each library update.
Give up and find another workaround — other workarounds were already exhausted.

Option 1 is too heavy. Carrying the whole library and resolving conflicts on every update doesn't fit this operation's scale. Option 3 is a dead end. Option 2 was the practical choice.

The one library line changed like this:

- "min_new_tokens": 2,
+ "min_new_tokens": kwargs.pop("min_new_tokens", 2),

If the caller passes a value explicitly, use that; otherwise use the default 2. Behavior compatibility doesn't break. Existing code paths that didn't pass min_new_tokens get exactly the same behavior as before.
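The backward-compatible behavior rests on how dict.pop with a default works. The toy function below only imitates the shape of the patched dict; build_talker_kwargs is a hypothetical name, not a function in qwen_tts.

```python
def build_talker_kwargs(max_new_tokens, **kwargs):
    """Toy stand-in for the patched dict construction."""
    return {
        "max_new_tokens": max_new_tokens,
        # Caller's value if present, otherwise the previous hardcoded default.
        "min_new_tokens": kwargs.pop("min_new_tokens", 2),
    }


print(build_talker_kwargs(2048))                      # no override: default 2
print(build_talker_kwargs(2048, min_new_tokens=110))  # explicit override
```

Because pop also removes the key from the local kwargs dict, the value cannot be forwarded a second time further down.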

Two things were done alongside this patch to make it operationally safe.

First, the Part 2 decision to "keep the library source inside this repo" earned its keep here. The patched library lives inside the repo, so on a different machine git clone followed by docker compose build reproduces the same behavior automatically. If the library had been an external dependency, the patch would have needed manual reapplication every time. The decision wasn't made with this strong a justification at the time, but it turned out to be the right call in retrospect.

Second, what was changed got recorded in a separate patch note (LOCAL_PATCHES.md) — the exact file path, line number, reason, and rollback procedure. A # LOCAL PATCH 2026-04-23 comment also went directly into the library code, so that whoever updates the library to a new version six months from now will see "ah, there's a local patch here" at a glance.

After the patch — "it works, but..."

The same scenario got rerun. This time the waveforms actually changed. min_new_tokens was reaching the model.

But the result table revealed a new problem.

Sentence	factor=1.5	factor=2.0	factor=2.1	factor=2.5	factor=6.0
1	1.68s	1.76s	163.59s	163.59s	163.59s
2	1.36s	2.80s	1.52s	3.43s	6.87s
3	9.01s	8.15s	8.15s	9.11s	163.59s
4	3.83s	3.67s	3.83s	3.99s	163.59s
5	2.55s	2.88s	163.59s	3.82s	163.59s

At factor=2.1 and above, some sentences suddenly ran away to 163 seconds, pushed nearly to the max token ceiling. An 11-character sentence stretching to almost 3 minutes is not normal.

The cause was short sentences. With min_new_tokens computed as character count × factor, a factor of 2.1 on the 11-character sentence 1 gives a floor of about 23 tokens, which is already too large for a sentence that short. The model ends up in a "wants to end but can't" state, and to fill the required tokens it hallucinates audio and runs to max. Like giving a singer a one-line lyric and saying "sing for at least 3 minutes 30 seconds": the singer starts repeating the same lyric or filling with meaningless tones.

So trying to solve one problem (final-syllable clipping) created a worse problem (runaway). The single line got found, and patching it surfaced a bigger problem. Exactly the troubleshooting retrospective pattern of "made it worse after fixing it."

The operational decision changed. The feature got defaulted off and isolated as an experimental option, with these guardrails:

Default factor lowered conservatively (6.0 → 1.5)
API/UI cap on factor at 2.0
Additional smaller cap for short sentences
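These guardrails can be expressed as a small clamp function. The default of 1.5 and the cap of 2.0 come from the list above; the short-sentence threshold and its cap are hypothetical placeholder values, since the actual numbers are not stated here.

```python
DEFAULT_FACTOR = 1.5             # from the guardrails above
MAX_FACTOR = 2.0                 # from the guardrails above
SHORT_SENTENCE_CHARS = 12        # hypothetical threshold
SHORT_SENTENCE_MAX_FACTOR = 1.5  # hypothetical cap for short sentences


def effective_factor(enabled, requested, char_count):
    """Factor actually applied after all guardrails; 0.0 means no forced floor."""
    if not enabled:
        return 0.0  # the feature is off by default
    if requested is None:
        requested = DEFAULT_FACTOR
    factor = min(max(requested, 0.0), MAX_FACTOR)
    if char_count <= SHORT_SENTENCE_CHARS:
        factor = min(factor, SHORT_SENTENCE_MAX_FACTOR)
    return factor
```

Keeping the clamp in one function means the API and the UI cannot drift apart on what the limits are.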

Building a feature and immediately wrapping it in guardrails isn't the most flattering pattern, but it's the most honest state. "This was found, but it can't be exposed in operation as-is" — that message is encoded in the code itself. When the next person looks at this code and asks "why so many constraints?", the patch note and troubleshooting record have the answer.

Last candidate — tail stretch post-processing

The generation loop wasn't worth touching anymore. Only easy-to-roll-back post-processing experiments were left.

Idea: take the final voiced segment of the final WAV and stretch it slightly in time to reinforce the tail decay. Light signal processing that keeps pitch but extends time. Like a music editor's "stretch the last note slightly" function — pitch unchanged, duration extended.

factor	Post-processed result	Increase
1.15	19.22s → 19.25s	+0.03s
1.25	19.22s → 19.27s	+0.05s
1.50	19.22s → 19.31s	+0.09s

The code worked correctly. But the stretched region was capped at 0.18s, so even at factor 1.5 the total increase was only 90ms.
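The arithmetic behind those numbers: if only the last 0.18s is stretched, the added duration is the region length times (factor − 1). A small sketch, with a hypothetical function name:

```python
def stretch_gain_sec(factor, region_sec=0.18):
    """Extra duration gained by time-stretching only the final region_sec."""
    return region_sec * (factor - 1.0)


for factor in (1.15, 1.25, 1.50):
    print(factor, f"{stretch_gain_sec(factor) * 1000:.0f} ms")
```

The results, about 27ms, 45ms, and 90ms, line up with the rounded increases in the table. Raising the factor alone cannot help much while the stretched region stays capped at 0.18s.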

By ear it still sounded clipped. A 90ms stretch wasn't perceptible; at under 100ms the difference was essentially inaudible in these listening checks.

The follow-on option was raising the stretch cap to 0.25–0.30s. But how far the range has to grow before the effect becomes audible is unknown, and going too far risks sounding unnatural. And this workaround, too, is "artificially lengthening the final syllable," not "making the model pronounce the final syllable naturally."

The workaround candidates standing at this point:

Workaround	Status	Cost	Effect
Tail anchor + auto-cut	Tried, failed	Code complexity exploded	Marginal effect, large side-effects
EOS suppression (server option)	Tried, failed	Code added, then confirmed no-op	Zero effect
EOS suppression (library patch)	Tried, partial success	Library patch + guardrails	Works, with runaway side-effect
Tail stretch post-processing	Tried, failed	Simple code	Marginal effect
ASR-based trim	Untried	Dependency added + code	Unknown

The remaining candidate was ASR-based trim. But a different decision was needed at that point. Part 5 covers it.

What this part comes down to

Tail anchor: a plausible idea, but the limit that RMS/VAD heuristics don't understand semantic boundaries couldn't be covered with workarounds. Patching one leak made another leak. Code complexity exploded for a single workaround.
EOS suppression (server option only): arguments passed in were getting silently dropped between the library wrapper and the model body. Only confirmed after the bit-identical A/B/C scenarios.
The hypothesis itself wasn't even testable until the library source got opened and the hardcoded line surfaced. The EOS suppression option was no-op'd from the start, only known after the line was found.
Patching that line worked, but a new side-effect (runaway) appeared. Short sentences with large forced lengths push the model into hallucinating audio up to max. Guardrails added immediately.
Tail stretch post-processing: worked, but the effect was too small (90ms). Going further risks unnaturalness.

The single most expensive lesson: a promise documented two abstraction layers up should be doubted. The library docs said "accepts HuggingFace generate arguments," but the arguments were only accepted, never forwarded to the model. If the trace had gone all the way down to the model body invocation from the start, the EOS suppression feature would never have been built. The problem was only found after guardrails, UI, and batch integration had been stacked on top of it.

How could this pattern have been caught earlier? By verifying in code, before building on the hypothesis, whether the argument actually reaches the model body invocation. When the EOS suppression hypothesis came up, 5 minutes spent reading the library source before designing the verification experiment (A/B/C scenarios) would have saved days of work.

This experience changed the way of thinking on subsequent projects. When a new hypothesis depends on an external library, the very first step is now "verify in code how far the hypothesis's core argument actually reaches inside the library."
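Besides reading the source, that first step can also be automated with a spy on the innermost call. The sketch below uses unittest.mock from the standard library against toy classes; InnerModel, DroppingOuter, and ForwardingOuter are hypothetical stand-ins, not the real qwen_tts classes.

```python
from unittest import mock


class InnerModel:
    def generate(self, **kwargs):
        return "audio"


class DroppingOuter:
    """Imitates a layer that builds its own dict and ignores caller kwargs."""
    def __init__(self):
        self.inner = InnerModel()

    def generate(self, **kwargs):
        picked = {"min_new_tokens": 2}  # hardcoded, caller value is lost
        return self.inner.generate(**picked)


class ForwardingOuter(DroppingOuter):
    """Same layer, but honoring the caller's value when given."""
    def generate(self, **kwargs):
        picked = {"min_new_tokens": kwargs.pop("min_new_tokens", 2)}
        return self.inner.generate(**picked)


def arg_reaches_inner(outer, name, value):
    """Call outer.generate with name=value and report whether the innermost
    generate() actually received that value."""
    with mock.patch.object(outer.inner, "generate", return_value="audio") as spy:
        outer.generate(**{name: value})
    return spy.call_args.kwargs.get(name) == value


print(arg_reaches_inner(DroppingOuter(), "min_new_tokens", 110))
print(arg_reaches_inner(ForwardingOuter(), "min_new_tokens", 110))
```

A check like this takes seconds to run and fails loudly, where the service log kept printing the value that was passed in rather than the value the model received.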

What's next

Part 5 covers the consequence of all these attempts — the decision to swap the model out, and the broader retrospective from this whole stretch of work. "Don't get attached to the workaround code I wrote" turned out to be the most expensive item in that retrospective.

Thanks for reading.
