Skip to content

SiliconFlow and Gemini API Research

Date: 2026-05-24

This note records the current API surface and local connectivity tests for adding SiliconFlow and Gemini text/TTS providers to VideoCaptioner. API keys used during testing are intentionally omitted.

Summary

ProviderText model testedTTS model testedResult
SiliconFlowdeepseek-ai/DeepSeek-V4-FlashFunAudioLLM/CosyVoice2-0.5BText, TTS, and reference-audio voice cloning all succeeded
Gemini APIgemini-3.5-flashgemini-3.1-flash-tts-previewText and single-speaker TTS succeeded

Generated local test files:

FileFormatDurationSize
work-dir/api-research/siliconflow_cosyvoice2_alex.mp3MP3, mono, 32 kHz4.932 s80,191 bytes
work-dir/api-research/siliconflow_cosyvoice2_cloned_uri.mp3MP3, mono, 32 kHz4.320 s70,399 bytes
work-dir/api-research/gemini_3_1_flash_tts_kore.wavWAV PCM, mono, 24 kHz5.760 s276,524 bytes

SiliconFlow

Base API

Use the OpenAI-compatible API base:

text
https://api.siliconflow.cn/v1

The public docs also show:

text
https://api.siliconflow.com/v1

The .cn endpoint was used successfully in local tests.

DeepSeek-V4-Flash

Model ID:

text
deepseek-ai/DeepSeek-V4-Flash

Endpoint:

http
POST /v1/chat/completions

Key documented capabilities:

CapabilityStatus
Context window1049K tokens in the SiliconFlow model page
Max tokens393K in the SiliconFlow model page
JSON modeSupported
Function/tool callingSupported
Image inputNot supported
Embeddings/rerank/fine-tuningNot supported for this model
ServerlessSupported

Common request parameters supported by SiliconFlow chat completions:

ParameterNotes
modelRequired
messagesRequired, OpenAI-style chat messages
streamSSE streaming
max_tokensOutput token cap
temperatureSampling randomness
top_p, top_k, min_pSampling controls; min_p is model-limited
frequency_penaltyRepetition control
stopUp to 4 stop sequences
response_formatJSON mode object
toolsFunction calling
enable_thinking, thinking_budgetDocumented for selected thinking models; the model page says V4-Flash has switchable reasoning modes, but the chat API reference list does not currently include V4-Flash under enable_thinking. Treat this as needing runtime validation before exposing in UI.

Local connectivity test:

json
{
  "model": "deepseek-ai/DeepSeek-V4-Flash",
  "ok": true,
  "usage": {
    "prompt_tokens": 17,
    "completion_tokens": 17,
    "total_tokens": 34
  }
}

CosyVoice2 TTS

Model ID:

text
FunAudioLLM/CosyVoice2-0.5B

Endpoint:

http
POST /v1/audio/speech

Request parameters:

ParameterTypeNotes
modelstringRequired
inputstringRequired, 1-128000 chars in API reference
voicestringRequired in API reference; can be system voice or speech:... cloned voice URI
response_formatenummp3, opus, wav, pcm
sample_ratenumbermp3: 32000/44100; wav/pcm: 8000/16000/24000/32000/44100; opus: 48000
streambooleanDefault true in docs
speedfloat0.25-4.0
gainfloat-10 to 10 dB

System voices, using FunAudioLLM/CosyVoice2-0.5B:<voice>:

VoiceDescription
alexCalm male
benjaminDeep male
charlesMagnetic male
davidCheerful male
annaCalm female
bellaPassionate female
claireGentle female
dianaCheerful female

CosyVoice2-specific features from SiliconFlow docs:

FeatureNotes
Cross-lingual synthesisChinese, English, Japanese, Korean, and Chinese dialects including Cantonese, Sichuanese, Shanghainese, Zhengzhou dialect, Changsha dialect, and Tianjin dialect
Emotion controlHappy, excited, sad, angry, etc.
Fine-grained prosody/emotion controlVia rich text or natural-language prompt
Prompt separatorExamples use instruction text plus `<
Reference audioMust be under 30 seconds; recommended 8-10 seconds
Reference qualitySingle speaker, clear articulation, stable volume/pitch/emotion, low noise/reverb
Reference formatsmp3, wav, pcm, opus; recommended MP3 >= 192 kbps

Example input style:

text
你能用高兴的情感说吗?<|endofprompt|>今天真是太开心了,马上要放假了!

Local TTS test:

json
{
  "model": "FunAudioLLM/CosyVoice2-0.5B",
  "voice": "FunAudioLLM/CosyVoice2-0.5B:alex",
  "ok": true,
  "content_type": "audio/mpeg"
}

SiliconFlow Voice Cloning

SiliconFlow supports two clone/reference flows.

  1. Upload reference audio and reuse returned URI:
http
POST /v1/uploads/audio/voice

Parameters:

ParameterNotes
modelFunAudioLLM/CosyVoice2-0.5B
customNameUser-defined voice name
textExact transcript corresponding to the reference audio
fileMultipart file upload
audioAlternative JSON/base64 field in data:audio/mpeg;base64,... form

Response:

json
{
  "uri": "speech:your-voice-name:xxx:xxx"
}

Then pass the returned uri as voice to /audio/speech.

  1. Dynamic reference audio in one TTS call:

The SiliconFlow guide shows OpenAI SDK usage with extra_body.references, where each reference has audio and text. This is useful when the app should avoid storing a cloned voice URI.

Local clone-chain test:

json
{
  "upload_reference": true,
  "tts_with_speech_uri": true
}

VideoCaptioner already has a partial implementation in videocaptioner/core/tts/siliconflow.py:

Existing behaviorStatus
/audio/speech binary outputImplemented
System voice selectionImplemented through segment.voice / config.voice
Upload reference audioImplemented through VoiceCloneManager.upload_voice
Cache uploaded URIImplemented
Dynamic references in a single TTS callNot implemented
Exposing preset voices in UI/configNeeds integration work

Gemini API

Latest text model

Official Google material says Gemini 3.5 Flash is available through the Gemini API. The DeepMind model page lists it as Preview and describes:

CapabilityGemini 3.5 Flash
InputText, image, video, audio, PDF
OutputText
Input tokens1M
Output tokens64K
Tool useFunction calling, structured output, Search as a tool, code execution
Best forEveryday tasks, agentic coding, advanced reasoning, multimodal understanding, long-context understanding

Model ID used successfully:

text
gemini-3.5-flash

Endpoint:

http
POST https://generativelanguage.googleapis.com/v1beta/models/gemini-3.5-flash:generateContent

Local text test succeeded. The response included thoughtsTokenCount, so integrations should account for reasoning tokens in usage/cost reporting.

Latest Gemini TTS model

The current Gemini TTS docs list Gemini 3.1 Flash TTS Preview as the newest TTS model, with this model ID:

text
gemini-3.1-flash-tts-preview

Endpoint:

http
POST https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent

Request structure:

json
{
  "contents": [
    {
      "parts": [
        {
          "text": "Say cheerfully: Have a wonderful day!"
        }
      ]
    }
  ],
  "generationConfig": {
    "responseModalities": ["AUDIO"],
    "speechConfig": {
      "voiceConfig": {
        "prebuiltVoiceConfig": {
          "voiceName": "Kore"
        }
      }
    }
  }
}

The REST response returns base64 PCM audio at 24 kHz mono. The app needs to wrap it as WAV or convert it to the configured output format.

Supported Gemini TTS models:

ModelSingle speakerMulti-speaker
gemini-3.1-flash-tts-previewYesYes
gemini-2.5-flash-preview-ttsYesYes
gemini-2.5-pro-preview-ttsYesYes

Gemini TTS voice options:

VoiceStyle
ZephyrBright
PuckUpbeat
CharonInformative
KoreFirm
FenrirExcitable
LedaYouthful
OrusFirm
AoedeBreezy
CallirrhoeEasy-going
AutonoeBright
EnceladusBreathy
IapetusClear
UmbrielEasy-going
AlgiebaSmooth
DespinaSmooth
ErinomeClear
AlgenibGravelly
RasalgethiInformative
LaomedeiaUpbeat
AchernarSoft
AlnilamFirm
SchedarEven
GacruxMature
PulcherrimaForward
AchirdFriendly
ZubenelgenubiCasual
VindemiatrixGentle
SadachbiaLively
SadaltagerKnowledgeable
SulafatWarm

Gemini TTS style control:

ControlNotes
Natural language promptCan guide style, accent, pace, and tone
Inline audio tagsExamples include [excited], [whispers], [shouting], [laughs], [sighs], [tired], [sarcastic]
Advanced promptRecommended sections: audio profile, scene, director's notes, transcript
Multi-speakerUp to 2 speakers, each mapped to a prebuilt voice
LanguagesAuto-detects input language; docs include Mandarin Chinese and many other languages

Gemini TTS limitations:

LimitationImpact
Text input only, audio output onlyNo reference audio input for TTS
32K-token TTS context windowLong transcripts must be chunked
No streamingUI should show task progress, not stream playback
Longer output driftSplit transcripts into smaller chunks
Occasional audio failure / text tokensAdd retry logic
Prompt classifier false rejectsUse clear preamble and label the transcript

Voice cloning status:

Gemini TTS does not expose a SiliconFlow-style upload/reference-audio voice cloning API in the current Gemini TTS docs. It supports expressive controllability and fixed prebuilt voices, but not custom voice cloning through this API.

Local TTS test:

json
{
  "model": "gemini-3.1-flash-tts-preview",
  "voice": "Kore",
  "ok": true,
  "output": "24 kHz mono PCM wrapped as WAV"
}

Integration Notes

SiliconFlow in VideoCaptioner

SiliconFlow text models can already fit the existing OpenAI-compatible LLM client by setting:

bash
OPENAI_BASE_URL=https://api.siliconflow.cn/v1
OPENAI_API_KEY=<key>

For GUI/config integration, prefer adding a SiliconFlow preset:

FieldValue
API basehttps://api.siliconflow.cn/v1
Text modeldeepseek-ai/DeepSeek-V4-Flash
TTS modelFunAudioLLM/CosyVoice2-0.5B
Default voiceFunAudioLLM/CosyVoice2-0.5B:alex or user-selected preset

The existing SiliconFlowTTS implementation should be kept, with follow-up work to expose:

UI/config itemWhy
Preset voice dropdownThe model requires/benefits from explicit voice
Emotion/style prompt fieldCosyVoice2 uses natural language + `<
Reference audio file + transcriptRequired for upload-based voice clone
Dynamic reference modeUseful for one-off clone without saving URI
Speed/gain/sample rate controlsAlready supported by API and TTSConfig

Gemini in VideoCaptioner

Gemini is not directly compatible with the current OpenAI client path used by videocaptioner/core/llm/client.py. It needs either:

  1. a Gemini-native LLM client using generateContent, or
  2. a provider adapter that maps VideoCaptioner messages/config into Gemini REST calls.

Gemini TTS needs a new TTS implementation because it returns base64 PCM inside JSON, not raw audio bytes from an OpenAI-compatible /audio/speech endpoint.

Recommended Gemini defaults:

Use caseModel
Text / subtitle optimization / translationgemini-3.5-flash
TTSgemini-3.1-flash-tts-preview
Default TTS voiceKore for firm/neutral, Puck for upbeat, Achird for friendly, Sulafat for warm

Implementation considerations:

AreaRequirement
Audio writingDecode base64 PCM and wrap as WAV at 24 kHz, 16-bit, mono
Output conversionUse ffmpeg/pydub if MP3/other formats are required
RetryRetry transient 500s and occasional failed audio generations
ChunkingSplit long TTS text to avoid drift after a few minutes
Multi-speakerAdd only if subtitle dubbing needs two-speaker dialogue; max 2 speakers
Voice cloneNot supported by Gemini TTS; use SiliconFlow CosyVoice2 for clone workflows

Sources

基于 MIT 许可发布