Korean TTS Workbench — (3) The Last Syllable Sounds Cut Off — and the Cause Was the Model Itself

Tracing a one-line symptom — the last syllable getting cut off — and why diagnostic logging paid back more than any single fix. Part 3 of 5

I build data and AI systems that have to survive real constraints: time, cost, memory, and messy integration boundaries.

Hi. This is part 3 of a 5-part series on building an in-house Korean voice-cloning TTS workbench. Part 1 covered why Qwen3-TTS landed on a single 4GB GPU. Part 2 walked through how the workbench was built. This post is where the real troubleshooting starts. The single takeaway boils down to one line — "lay down a way to see before you start fixing."

About this series

  • Part 1 — Starting with Qwen3-TTS on a 4GB GPU
  • Part 2 — Korean-specific tricks and running on 4GB VRAM
  • Part 3 (this post) — Chasing the cut-off syllable, landing on a model limit
  • Part 4 — From a tail-anchor failure to EOS suppression and a one-line library patch (upcoming)
  • Part 5 — Qwen3-TTS → VoxCPM2, swapping the model out (upcoming)

Where it started — the same pattern kept catching my ear

Once the workbench had been running smoothly for a while and I was listening through its outputs, the same pattern kept catching my ear.

The last word ending too quickly — sounding like it got cut off.

At first I assumed it was occasional. No model is 100%. Any speech synthesis model has its bad-day moments and a single off-sounding result is in the normal range. But across different texts, different references, different seeds — varying inputs, voices, and options — the same kind of awkwardness kept recurring. That's not coincidence. That's a patterned problem somewhere in the system.

A clipped final syllable in announcement audio is a real defect. If a clip like "Today's in-house lunch menu announcement." (a sentence that ends in "다" in Korean) has that final "다" end awkwardly, the listener hesitates ("wait, was there more?") before doing whatever comes next. That's not a naturalness problem; it's a functional one. When a listener can't tell where the announcement ends, they don't know whether to wait for more information or move on.

The first impression was "the last word is getting cut off," but the same awkwardness showed up in other final syllables too — the "요" of "...해주세요," the "다" of "...있습니다." The more accurate framing was "the last one or two syllables ending awkwardly."


First instinct — "just append a longer trailing silence"

There wasn't enough data to form a real hypothesis at this point. When the cause is unclear, the first move is usually to lay down a few low-risk "this might help" treatments at once — like a doctor recommending rest and hydration without a confirmed diagnosis. If it helps, great; if it doesn't, no harm done. So I started with safeguards — small in scope, easy to roll back. Three of them.

First, the trailing silence at the end of the final WAV got slightly longer. The existing 0.15s of trailing silence stayed in place, and extra silence was added when the model output ended tight against the last syllable. (The 0.15s was already there from Part 2: it's the same silence that smooths over outputs that "end exactly at a syllable boundary with no fade.")

Second, the timing of the end token. When the model generates audio, it internally produces small "audio tokens" one at a time and stops when it emits a special "this is the end" token (EOS, End-Of-Sequence). Suspecting that this EOS might come too early on short sentences, I made min_new_tokens (a hard rule of "you must produce at least N tokens before ending") a mandatory argument on every call: 18 for short inputs, more conservative values for longer ones.

Third, one of the model library's defaults was to take a "streaming text simulation" path. The workbench only ever returns finished WAVs anyway, so that path was disabled and a one-shot generation mode (non_streaming_mode=True) was forced. Streaming assumes text arriving one character at a time, but the workbench sends the full text as a single block, so the streaming path was suspect simply because it didn't match the usage pattern.
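The first safeguard is the easiest one to picture in code. Below is a minimal sketch, assuming the final audio is a mono NumPy array; the function name and the 24 kHz sample rate are illustrative assumptions, not the workbench's actual code.

```python
import numpy as np

def append_trailing_silence(wav: np.ndarray, sample_rate: int,
                            silence_sec: float = 0.15) -> np.ndarray:
    """Return a copy of `wav` with `silence_sec` seconds of zeros appended."""
    n_pad = int(round(silence_sec * sample_rate))
    silence = np.zeros(n_pad, dtype=wav.dtype)
    return np.concatenate([wav, silence])

sr = 24_000                            # assumed sample rate, for illustration only
audio = np.ones(sr, dtype=np.float32)  # stand-in for one second of model output
padded = append_trailing_silence(audio, sr)
print(len(audio), len(padded))
```

Padding like this only changes what comes after the last sample the model produced. It cannot change how the last syllable itself was rendered, which is consistent with it making no audible difference here.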

All three landed at once. And — almost no perceptible difference. Same pattern, still audible.

This is where one decision changed the entire trajectory of the troubleshooting.

"Stop guessing. Look at what the model is actually emitting."


Lay down diagnostic logs — convert guesses into data

The workbench runs inside a container, so details like which tokens the model emits at each call and exactly when generation ends aren't visible from the outside. Just listening to result WAVs isn't enough to make progress. Visible: the result WAV. Invisible: everything that happened inside the model to produce it.

So at every generation, the following values get printed to stdout:

  • raw_sec — the raw audio length the model emitted (seconds)
  • tail50_rms, tail50_peak — average and peak energy in the last 50ms
  • tail200_rms — average energy in the last 200ms
  • last_abs — absolute value of the very last sample

An audio file is really a record of fine vibrations — bigger vibration is louder sound, vibration of zero is silence. "Average energy in the last 50ms" is the numerical measure of how much the file was vibrating during its final 50 milliseconds; close to zero means it ended in near-silence, high means sound was still going strong when the file cut off. So these numbers describe how the audio ended.
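As a rough illustration of how these values can be computed, here is a minimal sketch assuming the raw output is a mono float waveform in a NumPy array. The function name and the synthetic test tones are assumptions for the example; only the metric names come from the log itself.

```python
import numpy as np

def tail_stats(wav, sample_rate):
    """Summarize how a mono float waveform ends."""
    wav = np.asarray(wav, dtype=np.float64)
    if wav.size == 0:
        raise ValueError("empty waveform")

    def tail(ms):
        # Last `ms` milliseconds (or the whole clip if it is shorter).
        n = max(1, int(sample_rate * ms / 1000))
        return wav[-n:]

    def rms(x):
        return float(np.sqrt(np.mean(x ** 2)))

    t50, t200 = tail(50), tail(200)
    return {
        "raw_sec": len(wav) / sample_rate,
        "tail50_rms": rms(t50),
        "tail50_peak": float(np.max(np.abs(t50))),
        "tail200_rms": rms(t200),
        "last_abs": float(abs(wav[-1])),
    }

# Two synthetic clips: one cut off mid-tone, one faded to zero over 300 ms.
sr = 24_000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
faded = tone.copy()
faded[-7200:] *= np.linspace(1.0, 0.0, 7200)

hard_cut = tail_stats(tone, sr)
natural = tail_stats(faded, sr)
print({k: round(v, 4) for k, v in hard_cut.items()})
print({k: round(v, 4) for k, v in natural.items()})
```

The hard-cut clip keeps substantial energy in its last 50ms, while the faded clip trends toward zero. That contrast is what the metrics are meant to expose.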

The reason for these specifically: to separate two possibilities.

Hypothesis (A) — the model is ending too early

The model is emitting EOS too early, and the generated audio itself is shorter than it should be. In this case there's likely energy left in the final 50ms — sound was still flowing when generation cut off. Like singing through a song and someone yanking the microphone power mid-chorus.

Treatment: increase min_new_tokens so the model can't end early. The second safeguard above is the (A) treatment.

Hypothesis (B) — the model ends naturally, but the decoder cuts the tail short

The model rode out to its natural ending point, but the codec decoder (which converts audio tokens into actual sound waves) or post-processing didn't handle the final window cleanly. In this case, the last 50ms decays to near-zero on its own, but something just before it might be unnaturally compressed. Like singing all the way to the end but the song cutting off before the natural fade of the last syllable.

Treatment: stretch the last syllable slightly in post-processing (time-stretch), or refine the fade-out.

These two hypotheses demand different treatments, and neither treatment helps the other case: treatment (B) on (A) does nothing, and treatment (A) on (B) does nothing either. Without separating them, choosing a treatment is a guess. So distinguishing which is which is the starting point for any treatment.
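The separation can be written down as a tiny decision rule. This is a sketch under an assumed threshold: the 0.01 cutoff and the labels are illustrative choices, not values the workbench used.

```python
def classify_ending(tail50_rms: float, silence_threshold: float = 0.01) -> str:
    """Rough triage of how a clip ended, based on energy in its last 50 ms.

    `silence_threshold` is an assumed cutoff on a [-1, 1] float waveform.
    """
    if tail50_rms > silence_threshold:
        # Sound was still going when the file ended: consistent with (A), early EOS.
        return "energy_at_cut"
    # The tail had already settled: (A) is unlikely, look at (B) or elsewhere.
    return "silent_tail"

print(classify_ending(0.35))    # a clip cut off mid-sound
print(classify_ending(0.0002))  # a clip that decayed before ending
```

A rule this crude cannot confirm (B). It can only say whether (A) is still on the table, which is enough to stop spending effort on the wrong treatment.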

The diagnostic logs were laid down precisely for this separation. And that decision turned out to be the highest-ROI move in the entire chase.


When the first data came in

After one run of the workbench, the accumulated logs contained this line:

🔊 Sentence 1/1 raw_sec=2.40s tail50_rms=0.0000 tail50_peak=0.0001
                tail200_rms=0.0008 last_abs=0.0000

This single line says a surprising amount.

  • raw_sec=2.40s — the model produced 2.4 seconds of audio
  • tail50_peak=0.0001 — peak amplitude in the final 50ms is essentially zero
  • tail200_rms=0.0008 — the final 200ms is also near-zero on average
  • last_abs=0.0000 — the very last sample is essentially zero

That is, the waveform is decaying cleanly to zero at the end. It's not getting suddenly cut off — it's ending in a natural fade. If the model had ended too early, sound should have still been happening when the file cut off; instead, the final portion was already settled to near-silence. Hypothesis (A) didn't fit the data, and hypothesis (B) — the decoder didn't finish cleanly — looked more likely.
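Once lines like this accumulate, it helps to turn them back into numbers instead of reading them by eye. The sketch below parses the key=value pairs from the log line shown above; the function name and the regular expression are assumptions for illustration.

```python
import re

LOG_LINE = (
    "🔊 Sentence 1/1 raw_sec=2.40s tail50_rms=0.0000 tail50_peak=0.0001 "
    "tail200_rms=0.0008 last_abs=0.0000"
)

def parse_tail_log(line: str) -> dict:
    """Extract every `name=number` pair from one diagnostic log line."""
    pairs = re.findall(r"(\w+)=(\d+(?:\.\d+)?)", line)
    return {name: float(value) for name, value in pairs}

stats = parse_tail_log(LOG_LINE)
print(stats)
```

With the values in a dict, comparing many runs (different texts, seeds, references) becomes a matter of collecting rows rather than scrolling through container output.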

And yet by ear, it still sounded cut off.

This contradiction was the seed of the next several days of work. When data and perception disagree, treat both as truth and write a new hypothesis that explains the gap. Data rarely lies (measurement can be wrong, but the data itself doesn't). Perception, with day-to-day variance accounted for, is also a consistent signal. So somewhere there's a hypothesis where both are right at once.

Candidate hypotheses surfaced:

  • (C) The waveform ends cleanly, but the prosody of the syllable just before is the part that feels cut off — an articulation problem, not a waveform-shape problem
  • (D) The "last 50ms" being measured is too short to capture the actual region where the awkwardness lives
  • (E) The model misallocates time when pronouncing the final syllable — the time given to the last syllable is genuinely too short
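Hypothesis (D) in particular suggests a wider view than a single 50ms number. One way to get it, sketched here with assumed window sizes (500ms tail, 25ms frames), is to compute an RMS value per frame across the tail and look at the shape of the decay.

```python
import numpy as np

def tail_rms_envelope(wav, sample_rate, tail_ms=500, frame_ms=25):
    """RMS per non-overlapping frame across the last `tail_ms` of audio."""
    wav = np.asarray(wav, dtype=np.float64)
    frame = max(1, int(sample_rate * frame_ms / 1000))
    n_tail = min(len(wav), int(sample_rate * tail_ms / 1000))
    n_frames = n_tail // frame
    if n_frames == 0:
        return np.array([])
    # Take exactly n_frames whole frames, counted back from the end.
    tail = wav[len(wav) - n_frames * frame:]
    frames = tail.reshape(n_frames, frame)
    return np.sqrt(np.mean(frames ** 2, axis=1))

# Synthetic clip: one second of constant level, then 100 ms of silence.
sr = 16_000
clip = np.concatenate([0.5 * np.ones(sr), np.zeros(sr // 10)])
env = tail_rms_envelope(clip, sr)
print(np.round(env, 3))
```

On real output, an envelope that drops to silence within one or two frames would point at an abrupt ending even when the last 50ms reads as clean zero.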

(C) seemed most likely, but verifying it would require yet another diagnostic tool. Before getting there, two suspicious findings surfaced while reading the diagnostic logs and had to be addressed.


"Is the code somewhere forcing a strange default?"

Looking through the diagnostic logs surfaced two suspicious things. Neither was guaranteed to relate to "cut-off," but they were odd enough to chase. As it turned out, one was a real bug, and the other became a direct lead-in to the core finding in Part 4.

Suspicion 1 — defaults the library quietly refills

This message appeared in the container logs:

RuntimeWarning: 'temperature' is not a valid generation argument

Strange. temperature is the option controlling how varied the model's output is — closer to 0 means always the safest result, above 1 means more creative variation — and it wasn't being passed in from anywhere in the call site. Yet the library was warning "this option isn't accepted." That meant something somewhere off the hand-coded path was passing it.

Digging into the code revealed this — the model library's wrapper (generate_voice_clone) wasn't forwarding the passed-in arguments verbatim. Right before the actual model call, it was silently merging in defaults from its own generate_config.json (temperature=0.9, repetition_penalty=1.05, and so on).

So even though the call site believed no defaults were in play, the library was injecting them on the back end. The workbench service log showed no temperature, but the actual model call had one. From the library's standpoint it's a friendly "if the caller didn't pass it, here's our default." From the caller's standpoint, it's unintended behavior leaking in unseen.

Treatment: explicitly inject neutral values (temperature=1.0, repetition_penalty=1.0) so the wrapper merge can't override them. Pre-fill the slot the library was going to fill.
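The mechanics of "pre-fill the slot" are easy to show with plain dicts. The sketch below is a stand-in, not the library's code: `wrapper_merge` is a hypothetical function that mimics the described behavior of filling in any key the caller omitted.

```python
# Defaults named in the text above; the merge logic itself is an assumption.
LIBRARY_DEFAULTS = {"temperature": 0.9, "repetition_penalty": 1.05}

def wrapper_merge(caller_kwargs: dict) -> dict:
    """Hypothetical stand-in: library defaults fill only the missing keys."""
    return {**LIBRARY_DEFAULTS, **caller_kwargs}

# What the service log shows versus what the model would actually receive.
logged = {"min_new_tokens": 18}
implicit = wrapper_merge(logged)
injected = {k: v for k, v in implicit.items() if k not in logged}

# Pre-filling the same keys leaves nothing for the merge to inject.
explicit = wrapper_merge(
    {"min_new_tokens": 18, "temperature": 1.0, "repetition_penalty": 1.0}
)

print("silently injected:", injected)
print("explicit call:", explicit)
```

Logging the merged dict rather than the caller's dict is the habit that matters here, because it records what reaches the model instead of what was intended.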

This finding wasn't a direct cause of the cut-off issue, but "the arguments visible in the service log can differ from the arguments the model actually receives" was a critical realization on its own. From here on, every hypothesis came with a check: "does this argument actually reach the model?" That habit is what makes the Part 4 finding possible — there, it isn't the wrapper losing arguments, it's deeper inside the model itself.

Suspicion 2 — the wrong end-of-sequence token was being picked

Another log line:

pad_token_id=151671, eos_token_id=151673

Big numbers like 151671 and 151673. But for voice cloning, the codec-side (audio token) pad and end-of-sequence token IDs are 2148 and 2150. 151671 and 151673 are the text-TTS side tokens. The model marks "end" two different ways, one for text processing and one for audio processing, and the workbench is doing audio processing, so the latter is the right one. But somehow the code was passing the text-side end token. From the model's side: "is this an end token? I don't recognize it." In other words, end-of-sequence detection wasn't working as it should.

Tracing it back, the cause was a search-logic issue. While walking through model objects to find end tokens, the code accepted the first object with a tts_pad_token_id attribute. Since outer objects hold text tokens and inner configs hold codec tokens, the wrong one got picked first. A priority bug.

The search order got reversed. Attribute-name first: every candidate is scanned, and codec_* is preferred when present, falling back to tts_* only if no codec one exists.
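A minimal sketch of the reversed search, assuming the candidates are plain objects with attributes. Only tts_pad_token_id is named in the original code path; the other attribute names (tts_eos_token_id, codec_pad_token_id, codec_eos_token_id) are assumed here to make the example concrete.

```python
from types import SimpleNamespace

def find_token_ids(candidates):
    """Scan every candidate; prefer codec_* ids, fall back to tts_* ids."""
    for prefix in ("codec", "tts"):        # attribute name decides priority
        for obj in candidates:             # object order no longer matters
            pad = getattr(obj, f"{prefix}_pad_token_id", None)
            eos = getattr(obj, f"{prefix}_eos_token_id", None)
            if pad is not None and eos is not None:
                return {"pad_token_id": pad, "eos_token_id": eos, "source": prefix}
    raise LookupError("no pad/eos token ids found on any candidate")

# Outer object carries text-side ids, inner config carries codec-side ids.
outer = SimpleNamespace(tts_pad_token_id=151671, tts_eos_token_id=151673)
inner = SimpleNamespace(codec_pad_token_id=2148, codec_eos_token_id=2150)

ids = find_token_ids([outer, inner])
fallback = find_token_ids([outer])
print(ids)
print(fallback)
```

The buggy version corresponds to swapping the two loops: iterating objects first means the outer object wins simply because it is visited first.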

Re-measure. This time pad_token_id=2148, eos_token_id=2150 came out cleanly. This was a real bug. With the wrong end token, the model can misidentify "when to end" — generation might run abnormally long, or stop at an awkward point. It might not surface for typical inputs, but on certain inputs it could plausibly affect behavior.

And the result — the same pattern was still audible.

A real bug fixed, perception unchanged. This is the most frustrating moment in any troubleshooting. A definitively correct fix, and the perceived symptom doesn't change. When this happens, the answer is one of two:

  1. The bug fixed wasn't the actual cause (a different real cause is still there)
  2. The bug was part of the cause, but not enough on its own (other causes are also present, and the partial fix isn't perceptually noticeable)

Without knowing which, both possibilities have to be carried forward. Keep the fix in place, and stack a new hypothesis on top.


Stepping back — "did I touch too many things?"

This is where the direction of the hypotheses flips.

The thinking up to this point had been: "the model is misbehaving, so filling in more constraints will help." Sampling options pinned conservatively, end tokens explicitly specified, minimum length forced. A "lay down safeguards" mindset.

But re-read the diagnostic logs:

  • The raw waveform ends cleanly at zero (natural ending)
  • The end token is now the correct codec token
  • And it still sounds cut off

Flip the hypothesis.

"Are the conservative defaults I laid down actually breaking the model's natural prosody (intonation and rhythm)?"

Specifically do_sample=False (sampling off) and temperature=1.0 (neutralizing). With sampling off, the model decodes greedily, always picking the highest-probability next token. But the official model is tuned for do_sample=true, temperature=0.9, and those settings were being overridden. Like a singer tuned to sound best at a particular key and tempo, forcing "always sing in this exact key, this exact tempo" for safety might be the very thing breaking the conditions where the singer performs best. What happens if generation is rolled back closer to training conditions?
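The difference between the two policies is small in code and large in behavior. This is a generic sketch of greedy versus temperature sampling over a toy three-token distribution; it illustrates the decoding rule only and is not the model library's implementation.

```python
import numpy as np

def next_token(logits, do_sample, temperature=1.0, rng=None):
    """Pick the next token id from raw logits."""
    logits = np.asarray(logits, dtype=np.float64)
    if not do_sample:
        # Greedy: always the single most likely token, temperature is irrelevant.
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled = scaled - scaled.max()   # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))

toy_logits = [2.0, 1.5, 0.5]
greedy = [next_token(toy_logits, do_sample=False) for _ in range(5)]
rng = np.random.default_rng(0)
sampled = [next_token(toy_logits, True, 0.9, rng) for _ in range(200)]
print(greedy)
print(sorted(set(sampled)))
```

Greedy decoding repeats the same choice every time, while sampling spreads across plausible tokens. If the model's prosody depends on that spread, pinning it to greedy could plausibly flatten it, which is exactly the hypothesis being tested.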

So the policy was rolled back.

  • Removed all forced sampling-option injections
  • Used the model checkpoint's default generation policy as-is
  • Reinforcement minimized — only one-shot generation mode (non_streaming), min/max token counts, and the correct codec end token

By ear: same. The final "다" cut-off pattern persisted.

The flipped hypothesis didn't help either. Which clarified one thing: "the cause isn't at the generation-options level." However the model's generation options are configured (whether on the safeguard side or the model-default side), the result is the same. The next layer to suspect was the prompt.


Suspecting the prompt — actual A/B comparisons begin

Voice cloning isn't just "speak in this voice." There are two ways to ask a model to "read this in someone's voice." One is to extract just the voice characteristics (pitch, tone, vocal style) as a numeric vector and pass that alone. The other is to feed an actual audio clip of the person reading some sentence, paired with that sentence, and let the model continue the same pattern with new text. The latter is in-context learning (ICL), and voice cloning's internal flow follows ICL.

  1. A speaker embedding (acoustic feature vector) gets extracted from the reference audio
  2. The reference audio and its text (reference text) are bundled and fed to the model as a prompt
  3. The text to synthesize is appended after that prompt
  4. The model generates the new audio following the prompt's pattern

ICL usually produces more natural results, but the reference's tone and length can also influence the output. A suspicion forms — could the presence/absence of reference text affect how the final syllable gets handled? If the reference sentences always end abruptly, the model might mimic that and end new sentences abruptly too.

The check was simple. A checkbox got added to the UI — "Speaker embedding only mode." With it on, ICL using reference text is off, and only the speaker embedding (the voice-feature vector) drives the synthesis. Same reference, same seed, same input — the two modes can be A/B compared.

Result — both modes had the same final-syllable cut. The ICL prompt itself wasn't the cause.

Next suspicion — language conditioning. The input language was being forced to "Korean." If the model auto-detects instead, the result might differ. The hypothesis: forcing Korean activates some internal model path that processes final syllables awkwardly.

Added an Auto option to the UI for comparison.

Result — same.

Next suspicion: model size. Everything so far had run on the 0.6B model; the 1.7B might behave differently. Larger models often handle subtle prosody better. Switched and compared.

Result — same.

Summarizing the three A/B rounds:

Dimension changed                      | Hypothesis                              | Result        | Conclusion
---------------------------------------|-----------------------------------------|---------------|--------------------------------
Prompt mode (ICL vs speaker-only)      | Reference text affects final syllable   | No difference | Not a prompt-level issue
Language conditioning (Korean vs Auto) | Forced Korean activates an awkward path | No difference | Not a language-conditioning issue
Model size (0.6B vs 1.7B)              | Smaller-model limit                     | No difference | Not a model-size issue
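The discipline behind these rounds is changing exactly one dimension per comparison while reference, seed, and input stay fixed. A minimal sketch of that setup follows; the RunConfig fields, the reference id, the seed, and the sample sentence are all hypothetical names and values chosen for illustration.

```python
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class RunConfig:
    reference_id: str
    text: str
    seed: int
    speaker_only: bool = False
    language: str = "Korean"
    model_size: str = "0.6B"

def one_factor_variants(base, alternatives):
    """One variant per dimension, each differing from `base` in a single field."""
    return [(name, replace(base, **{name: value}))
            for name, value in alternatives.items()]

def changed_fields(a, b):
    """Names of the fields whose values differ between two configs."""
    da, db = asdict(a), asdict(b)
    return [k for k in da if da[k] != db[k]]

base = RunConfig(reference_id="ref-01", text="안내 말씀 드립니다.", seed=1234)
variants = one_factor_variants(
    base, {"speaker_only": True, "language": "Auto", "model_size": "1.7B"}
)
for name, cfg in variants:
    print(name, changed_fields(base, cfg))
```

Because every variant shares the base seed and reference, a difference in the output can be attributed to the one field that changed, and "no difference" is just as attributable.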

What the three "no difference" results add up to is — the model itself processes Korean final syllables this way. Regardless of how it's called, what prompt is fed, or what size is used.

At this point a conclusion solidifies.

It isn't the code, it isn't the prompt mode, it isn't the model size. The public Base checkpoint just drops Korean final-syllable prosody this way.


The frustration phase — "so how does this get fixed?"

For a few days after this conclusion, the frustration was real. Working through nearly every hypothesis worth trying only to land on "this is just a model limitation" makes everything before it feel like wasted effort. And from the standpoint of audio that actually has to ship into a service, "we can't fix it because of model limits" is too thin a closing line.

Two realistic next moves.

1. Switch the model. Evaluate other Korean TTS models and migrate. The cleanest resolution, but migration takes time, and there's no guarantee a new model is better. Plus a working workbench is on the line — whether the model alone can be swapped without disrupting the rest needs evaluation too.

2. Work around it. If the model itself can't be fixed, post-process the output to mask the issue. "If the final syllable ends awkwardly, what can hide that awkwardness from the listener?"

A model swap can't happen overnight (option 1 isn't a few-days job), so workarounds get tried first. Maybe they pan out unexpectedly well; even if they don't, the process of trying makes the case for "yes, model swap is actually necessary" much stronger.

That's Part 4. How the seemingly clever workaround "tail anchor" collapsed in three different ways, and what surfaced when the search shifted to reading the model library's source directly — a single hardcoded line.


What this part comes down to

Pulling everything together:

  • It started with a recurring pattern in outputs: "the last syllable sounds cut off"
  • Three instinctive safeguards (trailing silence, min tokens, one-shot generation mode) had little effect
  • The decision to lay down diagnostic logs was the highest-ROI move — it separated hypothesis (A) early EOS from hypothesis (B) decoder cut-off
  • Data pointed to (B) — the waveform was already clean-zero at the end
  • Along the way, two issues surfaced: silently re-injected library defaults, and a real bug in end-token selection
  • Tried flipping the hypothesis — "did I touch too many things?" → policy rolled back to defaults → no change
  • A/B'd across prompt mode, language, and model size → no difference
  • Conclusion: a Korean final-syllable prosody limitation in the official checkpoint

The single most important pattern was this:

When guessing stops working, stop guessing more — lay down more visibility instead.

Without diagnostic logs, hypotheses (A) and (B) couldn't have been separated and the chase would have lasted longer. The end-token bug surfaced almost incidentally, while reading those logs. Spending time on "see" before "fix" was faster in the end.

That pattern looks like a troubleshooting heuristic, but it's really a difference in mindset. "Fix" is action; "see" feels like delaying action. The decision to suppress the urge to write a fix and lay down measurement first takes patience. It took several days of useless safeguards before that turn.

One more — how to handle the frustrating moment when the right fix doesn't move the symptom. The approach here was "keep the fix; stack a new hypothesis on top." There's no reason to roll back a real fix (it's a worthwhile change on its own). The new hypothesis stacks above it. Later this might turn out to be "I caught part of the cause, but not enough to feel" or "wrong hypothesis." Either way, while the answer is unknown, preservation is safer.


What's next

Part 4 covers the workarounds — how the seemingly obvious "tail anchor" idea collapsed three different ways; the shock when EOS-suppression scenarios A/B/C came out bit-identical; and the single line of hardcoded behavior found by going into the model library's source directly. It's the part of the series with the most expensive lesson learned.

Thanks for reading.