LightX2V
Calls the LightX2V cloud API (base URL default https://x2v.light-ai.top, same token for all). Covers: (1) Image/Video — t2i, i2i, t2v, i2v, s2v, animate; (2) TTS — text-to-speech with preset voices; (3) Voice clone — clone voice from audio, then synthesize with cloned voice. Supported models include the Wan2.2 series (video), Qwen-2511 (image editing / i2i), and Qwen-2512 (image generation / t2i). Requires LIGHTX2V_CLOUD_TOKEN (and optionally LIGHTX2V_CLOUD_URL) to be set.
Setup
- Get an access token from the LightX2V platform (e.g. https://x2v.light-ai.top).
- Set environment variables: either in skills.entries.lightx2v.env in openclaw.json (recommended), or export in the shell before running:
# Optional: only if not using openclaw.json
export LIGHTX2V_CLOUD_TOKEN="your-token"
export LIGHTX2V_CLOUD_URL="https://x2v.light-ai.top"
Auto-load: The bundled scripts (e.g. lightx2v_submit_and_poll.sh, tts_generate.sh) automatically read LIGHTX2V_CLOUD_TOKEN and LIGHTX2V_CLOUD_URL from ~/.openclaw/openclaw.json when they are not set in the environment. Configure once in openclaw.json and run the scripts directly; no need to export every time.
In OpenClaw config:
"skills": {
"entries": {
"lightx2v": {
"enabled": true,
"env": {
"LIGHTX2V_CLOUD_TOKEN": "your-cloud-token",
"LIGHTX2V_CLOUD_URL": "https://x2v.light-ai.top"
}
}
}
}
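The auto-load behaviour described above can be approximated in a few lines of shell. This is a minimal sketch, not the bundled scripts' actual implementation: the function names config_value and load_lightx2v_env are hypothetical, and the sed-based lookup assumes each key appears once in the file with a plain string value (a real JSON parser such as jq would be more robust).

```shell
# Hypothetical helper: print the string value of KEY ($1) found in FILE ($2).
# Assumes the key occurs once and its value is a simple quoted string.
config_value() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$2" | head -n 1
}

# Hypothetical helper: fill LIGHTX2V_CLOUD_TOKEN / LIGHTX2V_CLOUD_URL from the
# config file when they are not already set. Returns non-zero if no token is found.
load_lightx2v_env() {
  _lx_cfg=${1:-$HOME/.openclaw/openclaw.json}
  if [ -z "${LIGHTX2V_CLOUD_TOKEN:-}" ] && [ -r "$_lx_cfg" ]; then
    LIGHTX2V_CLOUD_TOKEN=$(config_value LIGHTX2V_CLOUD_TOKEN "$_lx_cfg")
  fi
  if [ -z "${LIGHTX2V_CLOUD_URL:-}" ] && [ -r "$_lx_cfg" ]; then
    LIGHTX2V_CLOUD_URL=$(config_value LIGHTX2V_CLOUD_URL "$_lx_cfg")
  fi
  # Fall back to the documented default base URL.
  : "${LIGHTX2V_CLOUD_URL:=https://x2v.light-ai.top}"
  export LIGHTX2V_CLOUD_TOKEN LIGHTX2V_CLOUD_URL
  [ -n "${LIGHTX2V_CLOUD_TOKEN:-}" ]
}
```

Calling load_lightx2v_env with no argument reads ~/.openclaw/openclaw.json; values already exported in the shell take precedence over the file.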
API Overview
All calls use the same base URL and Bearer token.
⚠️ CRITICAL: Authentication header required. All LightX2V API endpoints MUST include the Authorization header:
-H "Authorization: Bearer $LIGHTX2V_CLOUD_TOKEN"
Do NOT assume any endpoint works without authentication.
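One way to avoid forgetting the header is to route every request through a small wrapper. The sketch below is illustrative only: lightx2v_curl is a hypothetical name, not one of the bundled scripts. It refuses to run without a token and prepends the base URL to the path.

```shell
# Hypothetical wrapper: lightx2v_curl PATH [extra curl args...]
# Adds the Bearer header to every call and fails fast when the token is missing.
lightx2v_curl() {
  if [ -z "${LIGHTX2V_CLOUD_TOKEN:-}" ]; then
    echo "LIGHTX2V_CLOUD_TOKEN is not set" >&2
    return 1
  fi
  _lx_path=$1
  shift
  curl -s -H "Authorization: Bearer $LIGHTX2V_CLOUD_TOKEN" "$@" \
    "${LIGHTX2V_CLOUD_URL:-https://x2v.light-ai.top}$_lx_path"
}
```

For example, lightx2v_curl /api/v1/model/list lists the available models.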
| Action | Endpoint | Purpose |
|---|---|---|
| List models | GET /api/v1/model/list | Get available task + model_cls combinations |
| Submit task | POST /api/v1/task/submit | Submit a job; returns task_id |
| Query status | GET /api/v1/task/query?task_id=<id> | Poll until status is SUCCEED / FAILED / CANCELLED |
| Get result | GET /api/v1/task/result_url?task_id=<id>&name=<name> | Get download URL; name is output_image (t2i/i2i) or output_video (t2v/i2v/s2v/flf2v/animate) |
| TTS voices | GET /api/v1/voices/list | List preset voices (optional ?version=) |
| TTS generate | POST /api/v1/tts/generate | JSON: text, voice_type, context_texts?, emotion?, emotion_scale?, speech_rate?, pitch?, loudness_rate?, resource_id; returns audio/mpeg (mp3) |
| Voice clone | POST /api/v1/voice/clone | multipart: file (audio), optional text; returns speaker_id |
| Voice clone TTS | POST /api/v1/voice/clone/tts | JSON: text, speaker_id; returns audio/wav |
| Voice clone list | GET /api/v1/voice/clone/list | List user's cloned voices |
| Voice clone save | POST /api/v1/voice/clone/save | JSON: speaker_id, name — save clone to collection |
| Voice clone delete | DELETE /api/v1/voice/clone/{speaker_id} | Remove a cloned voice |
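The name parameter of the result endpoint depends on the task type, as listed in the table above. A minimal sketch of that mapping follows; the function names are hypothetical, and only the task names and endpoint path come from the table.

```shell
# Hypothetical helper: map a task type to the result name used by result_url.
result_name_for_task() {
  case "$1" in
    t2i|i2i) echo output_image ;;
    t2v|i2v|s2v|flf2v|animate) echo output_video ;;
    *) echo "unknown task: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: build the result_url request URL for TASK_ID ($1) and TASK ($2).
result_url_endpoint() {
  _lx_name=$(result_name_for_task "$2") || return 1
  printf '%s/api/v1/task/result_url?task_id=%s&name=%s\n' \
    "${LIGHTX2V_CLOUD_URL:-https://x2v.light-ai.top}" "$1" "$_lx_name"
}
```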
Task Types and Inputs
| task | Description | Required | Optional |
|---|---|---|---|
| t2i | Text-to-image | prompt | aspect_ratio |
| i2i | Image edit | prompt, input_image | aspect_ratio |
| t2v | Text-to-video | prompt | aspect_ratio |
| i2v | Image-to-video | prompt, input_image | aspect_ratio |
| s2v | Digital human (image + audio) | prompt, input_image, input_audio | — |
| flf2v | First–last frame video | --input-image (first frame) + --input-last-frame (last frame); see examples/first-last-frame-video.md | — |
| animate | Animate with ref video | prompt, input_image, input_video | — |
- input_image: one image (file path, URL, or base64 data URL). For multi-image i2i, the API accepts an array of images.
- input_audio: one audio file (WAV preferred; base64 or URL).
- input_video: one video file (for animate).
- prompt: non-empty text; use a single space (" ") if the task does not need a text prompt.
- aspect_ratio: e.g. 16:9, 1:1, 9:16, 3:4, 4:3.
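Checking the required inputs before submitting saves a round trip to the API. The sketch below encodes the Required column of the table for the tasks whose field names are given there; flf2v is left out because the table only lists its script flags. Both function names are hypothetical.

```shell
# Hypothetical helper: print the required payload fields for a task.
required_inputs() {
  case "$1" in
    t2i|t2v) echo "prompt" ;;
    i2i|i2v) echo "prompt input_image" ;;
    s2v)     echo "prompt input_image input_audio" ;;
    animate) echo "prompt input_image input_video" ;;
    *)       return 1 ;;
  esac
}

# Hypothetical helper: missing_inputs TASK FIELD...
# Prints the required fields that were not supplied; succeeds only if none are missing.
missing_inputs() (
  task=$1
  shift
  req=$(required_inputs "$task") || { echo "unsupported task: $task" >&2; exit 2; }
  missing=""
  for field in $req; do
    case " $* " in
      *" $field "*) ;;
      *) missing="$missing $field" ;;
    esac
  done
  printf '%s\n' "${missing# }"
  [ -z "$missing" ]
)
```

For example, missing_inputs s2v prompt input_image prints input_audio and returns a non-zero status.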
Submit payload shape (JSON):
{
"task": "t2i",
"model_cls": "<from model list>",
"stage": "single_stage",
"seed": <random 0-999999>,
"prompt": "user prompt text",
"aspect_ratio": "16:9",
"input_image": { "type": "base64", "data": "<base64>" },
"input_audio": { "type": "base64", "data": "<base64>" }
}
For local files: read file, base64-encode, and set { "type": "base64", "data": "<base64>" }. For HTTP(S) URLs use { "type": "url", "data": "<url>" }.
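The local-file versus URL rule can be sketched as a single function that emits the matching JSON object. The name input_ref is hypothetical, and the sketch does no JSON escaping, so it assumes the URL contains no double quotes or backslashes.

```shell
# Hypothetical helper: print the input object for a URL or a local file.
# HTTP(S) URLs are passed through as type=url; anything else is read and base64-encoded.
input_ref() {
  case "$1" in
    http://*|https://*)
      printf '{"type":"url","data":"%s"}\n' "$1"
      ;;
    *)
      [ -r "$1" ] || { echo "cannot read file: $1" >&2; return 1; }
      # tr strips the line wrapping that some base64 implementations add.
      printf '{"type":"base64","data":"%s"}\n' "$(base64 < "$1" | tr -d '\n')"
      ;;
  esac
}
```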
Workflow
- List models: GET $LIGHTX2V_CLOUD_URL/api/v1/model/list with header Authorization: Bearer $LIGHTX2V_CLOUD_TOKEN. Choose a model_cls whose task matches the user request (t2i, i2i, t2v, i2v, s2v, flf2v, animate).
- Submit: POST /api/v1/task/submit with JSON body as above. Remember task_id from the response.
- Poll: Call GET /api/v1/task/query?task_id=<task_id> every 5–10 seconds until status is SUCCEED, FAILED, or CANCELLED. Video tasks (t2v, i2v, s2v, flf2v, animate) can take several minutes; do not time out too early. On FAILED, report error from the response.
- Result URL: On success, call GET /api/v1/task/result_url?task_id=<task_id>&name=output_image (t2i/i2i) or &name=output_video (t2v/i2v/s2v/flf2v/animate). Return the url from the JSON to the user.
Task progress checklist (image/video tasks):
- [ ] List models and choose model_cls for the requested task (t2i, i2i, t2v, i2v, s2v, flf2v, animate)
- [ ] Submit task with correct payload; record task_id from response
- [ ] Poll task/query every 5–10 s until SUCCEED, FAILED, or CANCELLED
- [ ] On success, get result_url (name=output_image or output_video) and return URL to user
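The polling step can be sketched as a bounded loop. This is not the bundled script's implementation. It assumes the query response is JSON with a top-level string field named status. All function names are hypothetical, and the default limit of 120 tries at 5 seconds (about 10 minutes) is an arbitrary choice.

```shell
# Pull the value of the "status" field out of a JSON response on stdin.
extract_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([A-Za-z_]*\)".*/\1/p' | head -n 1
}

# Succeeds for the three terminal states named in this document.
is_terminal() {
  case "$1" in
    SUCCEED|FAILED|CANCELLED) return 0 ;;
    *) return 1 ;;
  esac
}

# Query the task once and print the raw JSON response.
fetch_task_json() {
  curl -s -H "Authorization: Bearer $LIGHTX2V_CLOUD_TOKEN" \
    "${LIGHTX2V_CLOUD_URL:-https://x2v.light-ai.top}/api/v1/task/query?task_id=$1"
}

# poll_task TASK_ID [INTERVAL_SECONDS] [MAX_TRIES]
# Prints the terminal status; returns 0 only for SUCCEED, 1 otherwise or on timeout.
poll_task() (
  task_id=$1
  interval=${2:-5}
  max_tries=${3:-120}
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    status=$(fetch_task_json "$task_id" | extract_status)
    if is_terminal "$status"; then
      printf '%s\n' "$status"
      [ "$status" = SUCCEED ]
      exit
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "timed out waiting for task $task_id" >&2
  exit 1
)
```

Keeping fetch_task_json separate from the loop makes it easy to substitute a stub when testing the loop without network access.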
Helper Script
Use the bundled script to submit, poll, and print the result URL:
# List models (no script; use curl)
curl -s -H "Authorization: Bearer $LIGHTX2V_CLOUD_TOKEN" \
"${LIGHTX2V_CLOUD_URL:-https://x2v.light-ai.top}/api/v1/model/list"
# T2I
{baseDir}/scripts/lightx2v_submit_and_poll.sh t2i "<model_cls>" "a cat on the beach" --aspect-ratio 1:1
# T2V (text to video, no image)
{baseDir}/scripts/lightx2v_submit_and_poll.sh t2v "<model_cls>" "a cat walking on the beach at sunset" --aspect-ratio 16:9
# I2I single image (one image + edit prompt)
{baseDir}/scripts/lightx2v_submit_and_poll.sh i2i "<model_cls>" "make it sunset style" --input-image /path/to/image.png --aspect-ratio 1:1
# I2I multi-image: pass --input-image multiple times; URLs use type=url, local paths are auto base64
{baseDir}/scripts/lightx2v_submit_and_poll.sh i2i "<model_cls>" "merge and stylize" --input-image /path/to/1.png --input-image /path/to/2.png --aspect-ratio 1:1
# I2V (image + prompt)
{baseDir}/scripts/lightx2v_submit_and_poll.sh i2v "<model_cls>" "camera pans slowly" --input-image /path/to/image.png
# S2V digital human (image + audio)
{baseDir}/scripts/lightx2v_submit_and_poll.sh s2v "<model_cls>" " " --input-image /path/to/face.png --input-audio /path/to/audio.wav
Script usage:
lightx2v_submit_and_poll.sh <task> <model_cls> <prompt> [--aspect-ratio RATIO] [--input-image PATH|URL [...]] [--input-last-frame PATH|URL] [--input-audio PATH] [--input-video PATH]
Output: prints the result URL (image or video) to stdout on success.
Complete end-to-end examples
Example 1: Text-to-image (t2i)
Request: generate an image from the prompt “a cat on the beach at sunset”.
- Export token: export LIGHTX2V_CLOUD_TOKEN="<token>" (and optionally LIGHTX2V_CLOUD_URL).
- List models: curl -s -H "Authorization: Bearer $LIGHTX2V_CLOUD_TOKEN" "${LIGHTX2V_CLOUD_URL:-https://x2v.light-ai.top}/api/v1/model/list" → pick a model_cls whose task is t2i (e.g. Qwen-2512).
- Run script: {baseDir}/scripts/lightx2v_submit_and_poll.sh t2i "<model_cls>" "a cat on the beach at sunset" --aspect-ratio 1:1.
- Outcome: script prints the result image URL to stdout; that URL is returned to the user.
Example 2: TTS (preset voice)
Request: synthesize “Hello, welcome.” with a female preset voice.
- Export token (as above).
- List voices: {baseDir}/scripts/tts_voice_list.sh or curl -s -H "Authorization: Bearer $LIGHTX2V_CLOUD_TOKEN" "${LIGHTX2V_CLOUD_URL:-https://x2v.light-ai.top}/api/v1/voices/list" → choose a voice_type (e.g. zh_female_vv_uranus_bigtts).
- Run script: {baseDir}/scripts/tts_generate.sh "Hello, welcome." "zh_female_vv_uranus_bigtts" --output files/audio/hello.mp3.
- Outcome: script writes MP3 to files/audio/hello.mp3; that file path (or the saved file) is returned to the user.
Example 3: Digital human (s2v) with scene image + TTS
Request: talking-head video with a given face image and the line “Good morning.”
- Export token. Get model list and pick model_cls for s2v (e.g. SekoTalk).
- Scene image: use i2i (or t2i if no image) to get a scene/portrait image URL; e.g. lightx2v_submit_and_poll.sh i2i "<model_cls>" "keep character consistent, portrait in warm lighting" --input-image /path/to/face.png --aspect-ratio 9:16.
- TTS: tts_generate.sh "Good morning." "<voice_type>" --output files/audio/greeting.mp3 (optionally with --context-texts "warm, friendly" for v2.0).
- S2V: lightx2v_submit_and_poll.sh s2v "<s2v_model_cls>" " " --input-image <scene_image_url_or_path> --input-audio files/audio/greeting.mp3.
- Outcome: script prints the result video URL; that URL is returned to the user (or the video is downloaded to files/video/ and the path is returned).
TTS (Text-to-Speech)
Preset voices; same base URL and token.
List voices: GET /api/v1/voices/list (optional ?version=...). Response: { voices: [...], emotions?: [...], languages?: [...] }. Use voice_type from list for generate.
Generate: POST /api/v1/tts/generate with JSON:
{
"text": "Text to synthesize",
"voice_type": "zh_female_1",
"context_texts": "",
"emotion": "",
"emotion_scale": 3,
"speech_rate": 0,
"pitch": 0,
"loudness_rate": 0,
"resource_id": "seed-tts-2.0"
}
| Field | Description | Default |
|---|---|---|
| text | Text to synthesize | required |
| voice_type | From GET /api/v1/voices/list | required |
| resource_id | From voice list (script resolves automatically) | required |
| context_texts | Scene/tone/voice instructions (v2.0 preset voices only) | "" |
| emotion | Emotion label | "" |
| emotion_scale | Emotion intensity 1–5 | 3 |
| speech_rate | Speech rate offset (e.g. -2 to 2) | 0 |
| loudness_rate | Loudness offset | 0 |
| pitch | Pitch offset | 0 |
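For calling the endpoint without tts_generate.sh, the request body can be assembled from the three required fields plus the defaults in the table. The sketch below is hypothetical (json_escape, tts_payload, and tts_to_file are not bundled scripts), and its escaping only covers backslashes and double quotes, so text with newlines or control characters needs a real JSON encoder.

```shell
# Escape backslashes and double quotes so a value can sit inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# tts_payload TEXT VOICE_TYPE RESOURCE_ID [CONTEXT_TEXTS]
# Prints a request body using the default values from the field table.
tts_payload() (
  text=$(json_escape "$1")
  voice=$(json_escape "$2")
  rid=$(json_escape "$3")
  ctx=$(json_escape "${4:-}")
  printf '{"text":"%s","voice_type":"%s","resource_id":"%s","context_texts":"%s","emotion":"","emotion_scale":3,"speech_rate":0,"pitch":0,"loudness_rate":0}\n' \
    "$text" "$voice" "$rid" "$ctx"
)

# tts_to_file TEXT VOICE_TYPE RESOURCE_ID OUTPUT_PATH
# Sends the request and writes the binary response straight to a file.
tts_to_file() {
  curl -s -X POST \
    -H "Authorization: Bearer $LIGHTX2V_CLOUD_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(tts_payload "$1" "$2" "$3")" \
    "${LIGHTX2V_CLOUD_URL:-https://x2v.light-ai.top}/api/v1/tts/generate" > "$4"
}
```

The resource_id still has to come from the voice list entry for the chosen voice_type.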
Writing guidelines
- text vs context_texts:
  - text: only the script to be read aloud. Do not put scene descriptions, tone, or voice-direction text here.
  - context_texts: scene, tone, and voice direction (e.g. “gently”, “lowered voice”, “with breath”, “as if whispering”) must go here. This keeps synthesis stable and lets the model separate “what to read” from “how to read”.
- Pauses and rhythm: ellipses may be added where pauses are desired so the delivery has breathing room. A common approach is to add an ellipsis at commas, full stops, and question marks (e.g. "...,", "....", "...?"), or to use “…” / “……” after words where a pause is needed. See the my_boyfriend skill for dialogue and the --add-pauses behaviour.
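As an illustration of the pause convention, the sketch below inserts an ASCII ellipsis before each comma, full stop, and question mark. It is a hypothetical helper, not the --add-pauses implementation from the my_boyfriend skill: it handles ASCII punctuation only and does not detect ellipses that are already present.

```shell
# Hypothetical helper: put "..." before every ',', '.', and '?' in the text.
# Run it only on text that does not already contain ellipses.
add_pauses() {
  printf '%s\n' "$1" | sed 's/\([,.?]\)/...\1/g'
}
```

For example, add_pauses 'Hello, world. Ok?' prints Hello..., world.... Ok...?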
Notes
- resource_id: Each voice has a matching resource_id from the voice list. The script tts_generate.sh fetches the list and fills it automatically.
- Voice instructions scope: Only v2.0 preset voices support context_texts and in-text markers like 【gently】. v1.0 presets and cloned voices do not; leave context_texts empty and do not rely on 【】 markers for those.
- Binary response: The generate endpoint returns audio/mpeg (mp3) binary. Do not capture it in a shell variable (e.g. RESP=$(curl ...)): the shell drops null bytes, which corrupts the audio and triggers the “ignored null byte in input” warning. Redirect directly to a file (curl ... > out.mp3), then check that the file is non-empty and has an ID3 header (e.g. head -c 10 out.mp3 | grep -q ID3).
Response is binary audio/mpeg (mp3). Save to a file (e.g. --output out.mp3) and return that path to the user.
Helper script:
# List voices
{baseDir}/scripts/tts_voice_list.sh
# or: curl -s -H "Authorization: Bearer $LIGHTX2V_CLOUD_TOKEN" "${LIGHTX2V_CLOUD_URL:-https://x2v.light-ai.top}/api/v1/voices/list"
# Generate TTS (script auto-resolves resource_id from voice list)
{baseDir}/scripts/tts_generate.sh "Text to synthesize" "voice_type_from_list" [--output out.mp3] \
[--context-texts "Scene/tone instructions"] [--emotion "emotion"] [--emotion-scale N] \
[--speech-rate N] [--loudness-rate N] [--pitch N]
Script uses env: LIGHTX2V_CLOUD_TOKEN (required; read from ~/.openclaw/openclaw.json when not set in the environment) and LIGHTX2V_CLOUD_URL (optional; defaults to https://x2v.light-ai.top). Optional args: --context-texts (scene/tone), --emotion, --emotion-scale (default 3), --speech-rate, --loudness-rate, --pitch (default 0).
Recommended voices for digital human / companion: Male zh_male_ruyayichen_saturn_bigtts, Female zh_female_vv_uranus_bigtts (Vivi 2.0); both v2.0 and support context_texts (voice instructions). These are suggestions only — more voices are available via tts_voice_list.sh or GET /api/v1/voices/list; do not hardcode, choose from the list when needed.
Voice Clone
Clone a voice from a short audio sample, then synthesize speech with that voice. Same base URL and token; requires auth.
- Clone: POST /api/v1/voice/clone with multipart/form-data: file (audio file, WAV preferred), optional text (transcript; if omitted, server uses ASR). Response: { speaker_id, text, message }. Use speaker_id for clone TTS and save.
- Synthesize with cloned voice: POST /api/v1/voice/clone/tts with JSON: { text, speaker_id, style?: "normal", speed?: 1.0, volume?: 0, pitch?: 0, language?: "ZH_CN" }. Response: binary audio/wav. Note: Voice cloning does NOT support voice instructions like 【gently】.
- Save to collection (optional): POST /api/v1/voice/clone/save with JSON: { speaker_id, name }.
- List clones: GET /api/v1/voice/clone/list returns { voice_clones: [...] }.
- Delete clone: DELETE /api/v1/voice/clone/{speaker_id}.
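The clone and synthesize calls above can also be made with raw curl. The following is a sketch under stated assumptions: the function names are hypothetical, speaker_id is assumed to be returned as a JSON string, and the text passed to clone_tts is assumed to contain no double quotes or backslashes (no JSON escaping is done).

```shell
# Pull the "speaker_id" string out of a JSON response on stdin.
extract_speaker_id() {
  sed -n 's/.*"speaker_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# clone_voice AUDIO_FILE [TRANSCRIPT]
# Uploads the sample as multipart/form-data and prints the returned speaker_id.
clone_voice() (
  audio=$1
  transcript=${2:-}
  set -- -F "file=@$audio"
  if [ -n "$transcript" ]; then
    # --form-string sends the value literally, even if it starts with @ or <.
    set -- "$@" --form-string "text=$transcript"
  fi
  curl -s -X POST -H "Authorization: Bearer $LIGHTX2V_CLOUD_TOKEN" "$@" \
    "${LIGHTX2V_CLOUD_URL:-https://x2v.light-ai.top}/api/v1/voice/clone" |
    extract_speaker_id
)

# clone_tts SPEAKER_ID TEXT OUTPUT_WAV
# Writes the binary WAV response straight to a file, never into a shell variable.
clone_tts() {
  curl -s -X POST \
    -H "Authorization: Bearer $LIGHTX2V_CLOUD_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"text\":\"$2\",\"speaker_id\":\"$1\"}" \
    "${LIGHTX2V_CLOUD_URL:-https://x2v.light-ai.top}/api/v1/voice/clone/tts" > "$3"
}
```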
Helper scripts:
# Clone from audio file; prints speaker_id
{baseDir}/scripts/voice_clone.sh /path/to/audio.wav [--text "Transcript of the audio"]
# TTS with cloned voice; saves WAV to --output (default voice_clone_tts.wav)
{baseDir}/scripts/voice_clone_tts.sh <speaker_id> "Text to speak" [--output out.wav]
# List user's cloned voices
curl -s -H "Authorization: Bearer $LIGHTX2V_CLOUD_TOKEN" \
"${LIGHTX2V_CLOUD_URL:-https://x2v.light-ai.top}/api/v1/voice/clone/list"
When to Use This Skill
Use lightx2v (cloud) only when the user’s request does not involve adult or restricted (NSFW) content. If it does, use lightx2v-local instead.
- User asks to generate an image from text (non-adult) → task t2i, get model from list.
- User asks to edit or modify an image (non-adult) → task i2i, require one input image + prompt.
- User asks to generate a video from text (non-adult) → task t2v.
- User asks to animate an image or image to video (non-adult) → task i2v, require input image + prompt.
- User asks for digital human / talking head video (face + voice, non-adult) → task s2v, require input_image + input_audio; prompt can be a space.
- User asks to convert text to speech or TTS with a preset voice → use GET /api/v1/voices/list to pick voice_type, then POST /api/v1/tts/generate; return the saved MP3 path or URL.
- User asks to clone a voice from an audio sample → POST /api/v1/voice/clone with the audio file; then use speaker_id for POST /api/v1/voice/clone/tts to synthesize with that voice. Optionally save with /api/v1/voice/clone/save.
LIGHTX2V_CLOUD_TOKEN (and optionally LIGHTX2V_CLOUD_URL) must be set before calling the API.
Troubleshooting
"Could not validate credentials" (401)
All LightX2V endpoints require the Authorization: Bearer $TOKEN header. This error usually means:
- Header missing: Every curl/request must include -H "Authorization: Bearer $LIGHTX2V_CLOUD_TOKEN".
- Token not exported: Values in openclaw.json are not injected into your shell. The bundled scripts read them from ~/.openclaw/openclaw.json, but a raw curl command does not; before running curl, run export LIGHTX2V_CLOUD_TOKEN="..." (or read the value from the config and export it).
- Token format: No extra quotes or spaces; the token may also be expired or invalid.
Quick check:
echo "Token length: $(echo -n "$LIGHTX2V_CLOUD_TOKEN" | wc -c)"
curl -s -H "Authorization: Bearer $LIGHTX2V_CLOUD_TOKEN" \
"${LIGHTX2V_CLOUD_URL:-https://x2v.light-ai.top}/api/v1/voices/list" | head -3
TTS / "ignored null byte in input" or corrupted MP3
TTS returns binary data. Do not use command substitution:
# ❌ Wrong — corrupts audio
TTS_RESPONSE=$(curl ... /api/v1/tts/generate)
echo "$TTS_RESPONSE" > audio.mp3
# ✅ Correct — direct to file
curl -s -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
-d '{"text":"...","voice_type":"...","resource_id":"..."}' \
"$BASE_URL/api/v1/tts/generate" > audio.mp3
Validate: [ -s audio.mp3 ] && head -c 10 audio.mp3 | grep -q ID3.
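The ID3 check above rejects MP3 data that carries no ID3 tag, which is valid MP3 and begins directly with an MPEG frame sync (eleven set bits, i.e. byte 0xFF followed by a byte in the range 0xE0 to 0xFF). A slightly more tolerant sketch, with the hypothetical name looks_like_mp3, accepts either form; whether the API ever returns untagged MP3 is not stated in this document.

```shell
# Hypothetical helper: succeed if FILE is non-empty and starts with either an
# ID3 tag or an MPEG audio frame sync. Inspects only the first three bytes.
looks_like_mp3() (
  [ -s "$1" ] || exit 1
  hex=$(od -An -tx1 -N3 "$1" | tr -d '[:space:]')
  case "$hex" in
    494433*) exit 0 ;;   # ASCII "ID3"
    ff[ef]*) exit 0 ;;   # frame sync: 0xFF then top three bits of next byte set
    *)       exit 1 ;;   # e.g. a JSON error body saved by mistake
  esac
)
```

A JSON error response written to audio.mp3 begins with "{" and fails this check, which is the common failure mode when the token is missing.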
Other
- Model not found: Call GET /api/v1/model/list and pick a model_cls that supports the task (t2i, i2i, s2v, etc.).
- Voice / resource_id: Each voice in voices/list has its own resource_id; use the one from the list for that voice_type.
- Large result files: Prefer returning the result URL from result_url (or local path) rather than embedding large files in messages; some clients have size limits.
Additional resources
For multi-step workflows (first–last frame video, digital human video, one-sentence to video, virtual companion), see the examples in the skill directory:
- examples/first-last-frame-video.md — flf2v: first frame → i2i last frame → flf2v
- examples/digital-human-video.md — s2v: scene image + TTS → talking-head video
- examples/one-sentence-to-video.md — t2i/i2i → i2v: one key image → short video
- examples/virtual-boyfriend-video.md — virtual boyfriend/girlfriend: persona (SOUL.md); prompt user to upload character image, save to avatar/, prefer existing avatar for i2i; then scene + TTS/s2v or i2v