Korean TTS Workbench — (1) Starting With Qwen3-TTS on a 4GB GPU

Hi. This is part 1 of a 5-part series on building an in-house Korean voice-cloning TTS workbench, and eventually retiring the model under it. The whole thing was a solo side project end-to-end: planning, model selection, web UI, tuning, voice data collection, cloning, and troubleshooting, all on my own. The workbench worked. The model under it didn't survive: one issue I caught with my own ear outlasted every fix I tried. I'm writing the whole arc as it actually happened.

About this series

Part 1 (this post) — Starting with Qwen3-TTS on a 4GB GPU
Part 2 — Workbench internals and Korean-specific tricks (upcoming)
Part 3 — The last syllable getting cut off — diagnostic logging and two real bugs (upcoming)
Part 4 — From a tail-anchor failure to EOS suppression and a one-line library patch (upcoming)
Part 5 — Qwen3-TTS → VoxCPM2, swapping the model out (upcoming)

An in-house TTS service, on a single 4GB GPU

This workbench started as an in-house TTS service task. There was no grand "build it ourselves instead of calling an external API" decision. We just needed a TTS service for internal use and had to pick a model to put under it.

A short note on the service shape — it had two axes that mattered: generating announcing-style voice output, and deterministic output (same input + same seed → same result every time). A clip that came out well had to reproduce identically when the same inputs were submitted again. If the tone shifts subtly each time, an "announcing" voice loses its credibility.

One more thing worth flagging up front — as already noted, this was a solo side project end-to-end. The person listening to outputs the most was me, and the person catching what felt off was also me. None of the "issues found" in this post came from external user feedback — they came from my own ear during testing.

The environment was, honestly, rough. GTX 1650 SUPER 4GB VRAM, 64GB RAM. A single 2019 entry-level card was the entire compute budget, and there was no separate GPU infrastructure to lean on. That single line shaped everything that followed.

Smallest first — OpenVoice, then Qwen3-TTS

Step one was simply: it has to run on this machine. So the candidate evaluation went smallest-first. Starting with a large model, discovering it doesn't run, and then climbing back down is a waste of cycles.

The first candidate was OpenVoice. About 350MB of weights, runs comfortably on a 4GB card. The cloning quality wasn't what I needed, though — OpenVoice's cloning works by layering a reference's tone color onto a base TTS model's output, which means the prosody and intonation stay tied to the base model. In practice it copies the timbre of the reference waveform and not much more. The "this person is actually speaking" naturalness wasn't there, and for an announcing voice that fell below the quality floor.

Next candidate was Qwen3-TTS. The deciding factor was that the 0.6B and 1.7B Base variants are both available behind the same interface:

On a 4GB GPU, load 0.6B first and confirm inference runs end-to-end
When a stronger machine is briefly available, run the same input through 1.7B for comparison
Switching sizes requires no code changes
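
To make "no code changes" concrete, here is a minimal sketch of picking the model size from configuration. The environment variable name TTS_MODEL_SIZE and both paths are hypothetical placeholders, not the workbench's actual settings; only the two sizes come from the list above.

```python
import os

# Hypothetical local paths: the real locations depend on how the weights
# volume is mounted. Only the two sizes are taken from the post.
MODEL_PATHS = {
    "0.6B": "/models/qwen3-tts-0.6b-base",
    "1.7B": "/models/qwen3-tts-1.7b-base",
}


def resolve_model_path(env=None):
    """Pick the weights directory from TTS_MODEL_SIZE (assumed variable name)."""
    env = os.environ if env is None else env
    size = env.get("TTS_MODEL_SIZE", "0.6B")  # default to the size that fits 4GB
    if size not in MODEL_PATHS:
        raise ValueError(
            f"unknown model size {size!r}; expected one of {sorted(MODEL_PATHS)}"
        )
    return MODEL_PATHS[size]
```

With something like this, moving to the stronger machine means changing one environment variable, and the loading code stays identical.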

Listening to the output, the cloning carried not just timbre but tone and intonation close to the reference. Small enough to fit on 4GB, but cloning that follows tone — that combination wasn't easy to find elsewhere. So we settled on Qwen3-TTS for the workbench, at least for a while.

(For reference, the eventual destination of this series is the slightly larger 2B VoxCPM2. That story's in Part 5.)

Voice cloning, not fine-tuning

There was a brief temptation — "what if we fine-tune on voice data, would that be better?"

Two reasons closed that path simultaneously.

First, fine-tuning is essentially impossible on a 4GB GPU. When inference alone is tight on memory, there's no room left for the training graph and optimizer state. Even with lightweight approaches like LoRA, the data prep / validation / retraining cycle is operational overhead, and we didn't have headcount to absorb that overhead.

Second, more fundamentally — we needed multiple speakers. A fine-tuned model is bound to the speakers it saw during training. A model fine-tuned on one person speaks in that one voice; adding a new speaker means another data collection + retrain cycle, which is an operational event every time. Even with multi-speaker conditioning during training, only training-time speakers are usable. Zero-shot voice cloning, by contrast, adds a new speaker with one reference audio file — as the speaker count grows, the operational cost gap between the two approaches widens.

Either reason on its own would have pushed us toward cloning. With both at once, the training track was closed from day one.

So we focused on voice cloning — take one or two reference audio clips, synthesize arbitrary text in that voice, no training, inference only.

The same logic pruned one more branch. Voice design models came off the candidate list. Voice design generates a new voice from a textual description ("a calm female voice in her 30s"), but for announcing-style output we needed a quality and consistency floor. "A plausible new voice" lost to "a clear, identifiable, reference-backed voice" on both consistency and evaluability.

Then the problem showed up

Up to this point it was a smooth ride. 0.6B loaded onto the 4GB card, ran a couple of references through it, listened to the cloned output. The first few days were "huh, this actually works" levels of reaction.

The trouble surfaced once I'd listened through enough generations to compare them. The same pattern kept catching my ear.

The last word ending too quickly — sounding like it got cut off.

A one-line symptom, and at first I treated it as one: "just append a bit of trailing silence?" That one line ended up outlasting every fix attempt and eventually pushed the whole project into swapping the model out. The rest of this post covers the state of the workbench up to the moment that symptom landed; the actual chase begins in Part 3.

The five criteria behind the model pick

Once the workbench was the plan, a model had to live under it. The OpenVoice → Qwen3-TTS narrative above is the trajectory. The five criteria underneath that trajectory are worth writing out, in case anyone is starting a similar build in a similar environment.

Open-source weights — required for self-hosting. No per-call cloud cost, runs offline, and at least in principle, deterministic output for a fixed seed becomes possible.
Voice cloning as a first-class feature — synthesis from a reference, no fine-tuning required. On a 4GB GPU, fine-tuning is closed off anyway.
Korean support — Korean explicitly in the training data, not as a side effect of being multilingual. (Spoiler: this turns out to be the center of the troubleshooting later. Part 3 onward.)
Deterministic inference — same seed produces the same output.
Fits on this GPU — the largest constraint of all.
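
For the determinism criterion, here is a minimal sketch of what "fixing the seed" usually involves in a PyTorch process. This is a generic pattern, not the workbench's code, and the function name is illustrative. Seeding alone does not promise bit-identical audio across different GPUs or library versions.

```python
import random


def seed_everything(seed):
    """Seed the random number generators a generation request may touch."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)  # legacy global NumPy generator
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)           # default generator
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is absent
    except ImportError:
        pass
```

Calling this immediately before each generation, with the seed stored alongside the input, is what makes "same input + same seed" a meaningful request.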

Qwen3-TTS passed all five. Two extra positives stacked on top: 12Hz codec-based 0.6B / 1.7B variants behind one interface, and an Apache 2.0 license that put no operational handcuffs on us.

A short note on the candidates that didn't make it — some had better cloning quality but weak Korean support; some had strong Korean but cloning that depended on fine-tuning, which closed them off on 4GB; some were just too large to load at all. Qwen3-TTS was the one that cleared all five.

The real constraint — GTX 1650 SUPER 4GB

Theory was nice. The actual operating machine was a GTX 1650 SUPER. 4GB VRAM. A 2019 card that's no longer even called entry-level.

The reason that machine, and not something better — this workbench runs on a single in-house workstation. The workload doesn't justify standing up a dedicated server, and the usage pattern doesn't warrant a 24/7 cloud GPU instance. Running on the machine that's already there was the most realistic operational model. A small GPU that's always on, in a workstation that's always on — that was the starting point of the operating environment.

Why is that a constraint? Voice cloning models put an LLM backbone, an audio codec, and a decoder into one inference path. Even a small model easily takes 1.5GB or more. Add PyTorch's CUDA context (typically 600MB–1GB), the caching allocator's overhead, and inference-time scratch buffers, and 4GB gets uncomfortably tight.

The first thing that broke was OOM (Out Of Memory). The model would load. The first generation would crash. The second would go through. Then another OOM. The pattern was strange.

Tracing it surfaced something worse. After an OOM, the next request would hit an internal assert in PyTorch's CUDA allocator, and the container would effectively die. The log looked like:

RuntimeError: CUDA error: an illegal memory access was encountered
!handles_.at(i) INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp"

This shows up when PyTorch's internal memory manager hasn't recovered cleanly from an OOM and gets the next request anyway. Once it happens, only restarting the container brings it back.

The fix went two ways.

First, the memory ceiling was lowered so that OOM rarely fires in the first place. GPU_MEMORY_LIMIT=2.8GiB on a 4GB card tells PyTorch it only gets to use 2.8GiB of the card. We started at 3.0GiB, ate one OOM, then dropped further. There's an operational thought behind this: using only about 70% of 4GB makes the model slightly slower, but from an ops perspective, "slow and stable" always beats "fast and OOM-prone."
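
How the limit string gets applied isn't shown here, so the following is a sketch of one way to enforce a ceiling like this in PyTorch, not necessarily what the workbench does. The helper names are hypothetical; torch.cuda.set_per_process_memory_fraction is the real PyTorch call, and it caps the caching allocator rather than the CUDA context itself.

```python
import re

_UNITS = {"MiB": 1024**2, "GiB": 1024**3}


def parse_memory_limit(text):
    """Convert a string such as '2.8GiB' into a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(MiB|GiB)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse memory limit: {text!r}")
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit])


def limit_fraction(limit_bytes, total_bytes):
    """Share of the device the allocator may use, never more than 1.0."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return min(limit_bytes / total_bytes, 1.0)


def apply_gpu_memory_limit(limit_text, device=0):
    """Cap PyTorch's caching allocator on one device (needs torch and a GPU)."""
    import torch  # imported lazily so the helpers above work without torch

    total = torch.cuda.get_device_properties(device).total_memory
    fraction = limit_fraction(parse_memory_limit(limit_text), total)
    torch.cuda.set_per_process_memory_fraction(fraction, device)
    return fraction
```

On a card with exactly 4GiB, "2.8GiB" works out to a fraction of about 0.70, which is where the 70% figure comes from.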

Second, an explicit CUDA cache cleanup routine runs at the end of every request. The order is gc.collect() → torch.cuda.synchronize() → torch.cuda.empty_cache() → torch.cuda.ipc_collect(), with memory state logged before and after. The order matters more than it looks: gc.collect() has to run first to release Python references holding GPU tensors, then synchronize() waits for in-flight GPU work to finish, and only then can empty_cache() hand the allocator's cached, now-unused blocks back to the driver. ipc_collect() runs last to release memory that was shared through CUDA IPC.
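
A minimal sketch of that routine, assuming it is wrapped around each generation call. The function and logger names are illustrative, and the real routine may log more than the reserved-memory figure used here.

```python
import gc
import logging

logger = logging.getLogger("workbench.gpu")  # logger name is illustrative


def cleanup_gpu_memory():
    """Run the four cleanup steps in order.

    Returns (reserved_before, reserved_after) in bytes, or None when torch
    or a CUDA device is not available.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    before = torch.cuda.memory_reserved()
    gc.collect()              # 1. drop unreachable Python objects holding GPU tensors
    torch.cuda.synchronize()  # 2. wait for in-flight GPU work to finish
    torch.cuda.empty_cache()  # 3. return cached, unused blocks to the driver
    torch.cuda.ipc_collect()  # 4. release memory that was shared via CUDA IPC
    after = torch.cuda.memory_reserved()
    logger.info("gpu reserved: %d -> %d bytes", before, after)
    return before, after


def run_with_cleanup(generate, *args, **kwargs):
    """Call generate(...) and always run the cleanup, even if it raises."""
    try:
        return generate(*args, **kwargs)
    finally:
        cleanup_gpu_memory()
```

Putting the cleanup in a finally block means a failed generation still gets cleaned up before the next request arrives.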

That took the "OOM leaves the allocator broken and the next request kills the container" failure almost to zero. Not all the way (once it does fire, it can still cascade), but the frequency dropped to operationally negligible.

One more thing: automatic dtype selection. Ampere-and-later GPUs handle bfloat16 (BF16) well, but Turing-generation cards like the GTX 1650 SUPER have no native BF16 support. Running the BF16 code path verbatim loads the model fine but produces garbage mid-inference, or just dies.

So at startup, torch.cuda.is_bf16_supported() decides — BF16 if supported, otherwise fall back to float16 (FP16). A small thing, but a concrete safety net for "running this workbench on someone else's GPU without editing code each time." The same docker compose command runs cleanly on both an RTX 4090 and a GTX 1650.
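
A sketch of that startup check. The compute-capability guard is my addition rather than something the workbench is confirmed to do: as far as I know, newer PyTorch releases can report BF16 as supported when it is only emulated, so the sketch also requires Ampere (compute capability 8.0) or later. The float32 CPU fallback is likewise an assumption.

```python
def choose_dtype_name(cuda_available, bf16_supported, capability):
    """Return 'bfloat16', 'float16', or 'float32' for the given device facts."""
    if not cuda_available:
        return "float32"  # CPU fallback (assumption; the post only covers GPUs)
    # Ampere starts at compute capability 8.0; the GTX 1650 SUPER (Turing) is 7.5.
    if bf16_supported and capability[0] >= 8:
        return "bfloat16"
    return "float16"


def detect_dtype():
    """Resolve the torch dtype for the current machine (needs torch installed)."""
    import torch  # imported lazily so choose_dtype_name works without torch

    if not torch.cuda.is_available():
        return torch.float32
    name = choose_dtype_name(
        True,
        torch.cuda.is_bf16_supported(),
        torch.cuda.get_device_capability(),
    )
    return getattr(torch, name)
```

Keeping the decision in a pure function makes it easy to check against cards that aren't physically on hand.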

Why a workbench, not a CLI or a plain API

Another decision point: once we were running the model ourselves, how do we expose it? The exposure shape decides the entire user experience.

There were roughly three options.

1. CLI tool: a script that takes a text file and a reference audio as arguments and drops a result WAV. Quick to build, automation-friendly.

2. Plain API server: one /generate endpoint that returns audio for text in. Easy to integrate with other systems.

3. Web UI workbench: reference registration through result comparison, all on one page. Good for human-in-the-loop work.

CLI is fast to build but painful for comparison. Picking one of seven generated results from a terminal isn't realistic. Replaying with mpv and noting which was best, one at a time? Doable once or twice. At thirty results it becomes torture.

A plain API server is useful for automation, but our workflow is one where the human listening step decides the result quality.

So we went with the workbench: FastAPI plus a static HTML web UI. FastAPI handles routing on the backend, and the frontend is a handful of standalone HTML files with no build tooling.

There was one variable in this decision — could a side project actually afford to build a full UI itself? Multiple in-house people had to come in, compare, and pick results — CLI or plain API was clearly insufficient. The question was whether the next step up was within a side project's effort budget.

What pulled that effort cost down was AI pair coding. With a stack as simple as vanilla HTML + FastAPI, the time to add one page drops by roughly an order of magnitude versus the traditional cost. That's what put "a workbench where humans listen and choose" inside the side project's budget. Without that variable, this would have likely been a CLI plus automation scripts.

Skipping React/Vue lives in the same context. A clearly bounded feature set (reference management / single generation / batch generation / tuning / history) with a single-digit number of pages — AI pair coding plus vanilla JS is plenty fast at that scale. Not enough complexity to justify dragging in a build pipeline.

The whole thing has been running for about two months as a side project, so "we avoided dependency hell" isn't the kind of long-term claim I can make yet — it's more honest to say we never opened that door in the first place.

The screens ended up looking like this.

Single generation page: one reference + one text → one result clip. Quick iteration on seed and options.
Batch generation page: every combination of multiple references × multiple lines, generated in one pass, downloaded as a ZIP.
Tuning page: same input, swept across seeds and parameters, listen and compare.
History page: replay past generations by timestamp.
Presets: snapshot every parameter (reference, seed, options) of a good generation for exact reproduction later.

This bundle is what we called the workbench. Each screen is a simple unit on its own; together, the five connect into one continuous workflow.
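
The batch page's "every combination" step can be sketched as a plain cross product. The Job fields and the plan_batch name are illustrative, not the workbench's actual data model.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Job:
    """One clip to generate: a reference voice applied to one line of text."""
    reference_id: str
    line_index: int
    text: str
    seed: int


def plan_batch(reference_ids, lines, seed):
    """Expand references x lines into one job per combination, in a stable order."""
    return [
        Job(ref, index, text, seed)
        for ref, (index, text) in product(reference_ids, enumerate(lines))
    ]
```

Two references and seven lines yield fourteen jobs. Carrying the seed in every job is what lets a single clip be regenerated later without rerunning the whole batch.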

Sample flow:

Sketch the tone with one or two attempts on the single-generation page
Save the good one as a preset
Use the same preset on batch generation for seven lines at once
Download the ZIP and ship the update

This flowing inside one window is what the workbench is actually for. With CLI or a plain API, the user would carry context by hand between every step.

Wrapping it in Docker — same behavior on any machine

Last decision: the operating environment. The whole thing lives behind a single docker-compose file. Model weights mount as host volumes, and the GPU limit and env vars are declared in the compose file. So in practice the workbench comes up with one line:

docker compose up -d

Reasons Docker was the right fit:

CUDA version dependency. The CUDA runtime PyTorch wants has to align with the host's NVIDIA driver. Pinning that inside the image means the host only needs the right driver.
Python environment isolation. Libraries like qwen_tts, transformers, torch, phonenumbers don't collide with whatever Python the host has.
Restart policy. restart: unless-stopped brings the workbench back automatically after a host reboot.
Portability to a different machine. If the operating machine ever has to change, only the repo and model weights move.

There was one hidden cost in this setup, which Part 4 covers. Briefly — we kept the model library (qwen_tts) source inside our own repo, which violates the usual best practice (declare the library in requirements.txt, install via pip). The reason was operational: patches we applied locally had to apply identically on the remote machine. In the context of this workbench, the non-standard choice was the right one. Details later.

Recap

What we needed wasn't one-shot synthesis — it was an environment that produces in-house announcing voice in a reproducible, deterministic way.
So voice cloning plus seed-based reproduction were the first-order requirements.
The model was evaluated on five criteria (open weights / first-class cloning / Korean / determinism / GPU size). After OpenVoice, Qwen3-TTS was the pick. Two-size variants and an Apache 2.0 license were extras.
The operating machine being a GTX 1650 SUPER 4GB makes memory management (allocator cleanup, BF16/FP16 auto-selection, conservative memory ceiling, OOM recovery routine) a real, separate area to think about.
Of CLI / plain API / full workbench, the workbench won. AI pair coding put it inside the side project's effort budget. FastAPI + static HTML, no build tooling.
Single / batch / tuning / history / preset — five screens that connect into one workflow.
The whole thing comes up with one docker compose command.

Up to here it was uneventful, and it ran. The first few weeks were "huh, this actually works" levels of feedback.

What's next

Part 2 walks through the workbench's internals and the small Korean-specific tricks needed to get usable output. How phone numbers and date notation get mapped to natural pronunciation, how repeated phrases inside one input get generated once and stitched in, how long inputs get split into safe sentence-sized chunks, why the batch ZIP gets built in the browser, and the one thing people consistently miss when "fixing the seed."

Part 3 is where the actual troubleshooting starts. A one-line symptom — the last syllable getting cut off — that turned into an extended trace, and how the biggest payoff in that chase wasn't fixing anything but laying down a way to see before fixing.

Thanks for reading.
