From 20-year-old ink drawings to an AI short film: training custom LoRAs for Z-Image and LTX-2.3

About 90 artists recently made short films for the Arca Gidan Prize using exclusively open source models. The range of styles, textures, and approaches that came back is hard to argue with. This post covers our entry, INNOCENCE, and the full technical pipeline behind it.

๐Ÿ† What is Arca Gidan

Arca Gidan is an open source AI film contest with a clear mission: "We want to help individuals discover what they are capable of, while showing the world the power and beauty of open source AI." This edition ran in partnership with Banodoco, ComfyUI, and Lightricks, with the theme of Time.

Around 90 submissions came in, every one of them made with open source models only. The full selection is free to watch on the Arca Gidan website, and most entries also share their ComfyUI workflows directly. Voting is open until April 6th.

🎬 The film

INNOCENCE is a 2D short film, roughly two minutes long. A dark Canadian rock face in winter. An icicle forms over days, reaches its full height, and is immediately snapped off and eaten by a child who wanders into frame.

The visual language is Chinese ink, monochromatic, with the icicle rendered in cold blue and a child's red hat as the only warm note. The medium made sense both aesthetically and technically: ink is the medium of water, and the subject is ice. AI lets that medium move, something traditional static ink work cannot do.

๐Ÿ—‚๏ธ The dataset

The starting point was 73 hand-drawn Chinese ink drawings: existing work done about 20 years ago. No videos, no generated images. Both LoRAs (Z-Image and LTX) were trained on the same static image dataset.

Captions were generated with Qwen3-VL. The captioning strategy follows the concept-bleed principle: don't caption the core style elements you want baked into the trigger word. Brushstrokes, ink wash, monochromatic treatment, negative space: none of that was captioned; it all gets baked silently into the trigger word Suiboku. What was captioned: subject, composition, and spatial relationships.

Captions were bilingual (English + Chinese). Z-Image was trained on multilingual data, and the visual concepts for this style have stronger associations in Chinese text than in English.

Caption format matters for Z-Image specifically: wrapping the trigger word in double quotes changes how it activates, and a natural sentence position worked best:

"In the style of Suiboku, a kneeling figure, head bowed, holds a long object in the right hand, left arm resting on the knee, body slightly twisted, soft shading on the right side, simple white background. ไปฅๅขจ็”ป้ฃŽๆ ผ๏ผŒไธ€ไธช่ทชๅงฟ็š„ไบบ็‰ฉ๏ผŒๅคดไฝŽไธ‹๏ผŒๅณๆ‰‹ๆŒ้•ฟ็‰ฉ๏ผŒๅทฆ่‡‚ๆญๅœจ่†็›–ไธŠ๏ผŒ่บซไฝ“็•ฅๅพฎๆ‰ญ่ฝฌ๏ผŒๅณไพงๆœ‰ๆŸ”ๅ’Œ้˜ดๅฝฑ๏ผŒ็ฎ€ๅ•็™ฝ่‰ฒ่ƒŒๆ™ฏใ€‚"

The captioning script is available to download on the Arca Gidan submission page.
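
As a rough sketch of what that captioning pass can look like (this is not the downloadable script; the instruction text and helper below are illustrative only):

```python
from typing import Callable

TRIGGER = "Suiboku"

# Style vocabulary is deliberately kept OUT of the instruction so it binds to the trigger word.
INSTRUCTION = (
    "Describe this drawing for image-model training. Cover only the subject, the "
    "composition, and spatial relationships (pose, position, framing). Do not mention "
    "brushstrokes, ink wash, monochrome palette, negative space, or any art-style terms. "
    "Answer with one English sentence, then the same description in Chinese."
)

def build_caption(ask_vlm: Callable[[str, str], tuple[str, str]], image_path: str) -> str:
    """ask_vlm is whatever client wraps Qwen3-VL; assumed to return (english, chinese)."""
    english, chinese = ask_vlm(image_path, INSTRUCTION)
    # Trigger word in a natural sentence position, unquoted (see the example caption above).
    return f"In the style of {TRIGGER}, {english} {chinese}"
```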

๐Ÿ–ผ๏ธ Z-Image LoRA: training and results

Tool: Musubi-tuner (main branch), trained on a RunPod H100 SXM pod.

Settings: rank 32, alpha 16, optimi.AdamW optimizer, logsnr timestep sampling, learning rate 1e-4, batch size 4.

  • 73 images at batch size 4 = ~19 steps/epoch
  • Trained for 200 epochs (~3,800 steps total)
  • Used the 80-epoch checkpoint (~1,520 steps)
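
The step counts above follow directly from the dataset size; a quick sanity check:

```python
import math

images, batch_size = 73, 4
steps_per_epoch = math.ceil(images / batch_size)   # 19 steps per epoch
print(steps_per_epoch * 200)                       # 3800 -> total steps after 200 epochs
print(steps_per_epoch * 80)                        # 1520 -> steps at the 80-epoch checkpoint
```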

The later checkpoints didn't hold up. By epoch 140 and beyond, generating a generic subject without the trigger word still produced ink-wash aesthetics, a clear overtraining signal. The loss curve itself didn't look clean (persistent spikes, and it never plateaued as low as expected), but inference told a different story: the 80-epoch checkpoint produced solid, consistent results across varied prompts.

Sweeping the LoRA weight at inference was useful for calibrating the strength of the effect. The final generations all used weight 1.0.
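
For intuition, the inference-time weight simply scales the trained low-rank update before it is added to each base weight matrix. A generic LoRA sketch (not Z-Image-specific code), using the rank and alpha from the training settings above:

```python
import torch

def merge_lora(W_base: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               rank: int = 32, alpha: int = 16, lora_weight: float = 1.0) -> torch.Tensor:
    # Effective weight = base + lora_weight * (alpha / rank) * (B @ A).
    # lora_weight = 1.0 applies the LoRA as trained; lower values soften the style.
    return W_base + lora_weight * (alpha / rank) * (B @ A)
```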

The full Z-Image LoRA training documentation is available to download on the Arca Gidan submission page.

🎥 LTX-2.3 LoRA: training and results

Tool: Musubi-tuner (ltx-2-dev branch on AkaneTendo25's fork), same RunPod H100 SXM pod, same 73-image dataset.

Settings: rank 64, alpha 64, AdamW8bit, shifted_logit_normal timestep sampling with shifted_logit_uniform_prob 0.30, learning rate 6e-5, gradient accumulation 4, FP8 quantized base checkpoint.

The gradient accumulation changes the optimization step math:

  • 73 forward passes per epoch ÷ 4 accumulation steps = ~18 optimization steps/epoch
  • Trained for 140 epochs (~2,520 optimization steps)
  • Used the 80-epoch checkpoint (~1,440 optimization steps)
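
Same sanity check with gradient accumulation folded in (assuming one forward pass per image, as the 73 passes per epoch imply):

```python
forward_passes_per_epoch = 73
grad_accum = 4
opt_steps_per_epoch = forward_passes_per_epoch // grad_accum   # 18 optimization steps per epoch
print(opt_steps_per_epoch * 140)                               # 2520 -> steps after 140 epochs
print(opt_steps_per_epoch * 80)                                # 1440 -> steps at the 80-epoch checkpoint
```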

The same overtraining pattern appeared at later checkpoints. The 80-epoch checkpoint generalized well and responded cleanly to the trigger word without leaking into unrelated prompts.

shifted_logit_uniform_prob 0.30 forces 30% of training steps to focus on the low-noise timesteps where fine textural detail lives: brushstroke character, wash texture, ink edge quality. For a style this detail-dependent, it made a measurable difference versus the default of 10%.
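
A rough illustration of the idea, not musubi-tuner's actual implementation (the shift value and distribution parameters below are assumptions): with probability shifted_logit_uniform_prob the timestep is drawn uniformly, which guarantees regular coverage of the ends of the schedule, including the low-noise region; otherwise it comes from a logit-normal, which concentrates samples around mid-noise levels.

```python
import numpy as np

def sample_timestep(shift: float = 3.0, uniform_prob: float = 0.30,
                    rng: np.random.Generator | None = None) -> float:
    """Illustrative mixed timestep sampling on (0, 1)."""
    rng = rng or np.random.default_rng()
    if rng.random() < uniform_prob:
        t = rng.random()                          # uniform draw: even coverage of the whole range
    else:
        t = 1.0 / (1.0 + np.exp(-rng.normal()))   # logit-normal: sigmoid of a standard Gaussian
    return shift * t / (1.0 + (shift - 1.0) * t)  # flow-matching timestep shift
```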

The full LTX-2.3 LoRA training documentation is available to download on the Arca Gidan submission page.

🎨 Keyframe generation: Z-Image + QwenImageEdit

Each shot in the film was built around a Z-Image keyframe. The style LoRA handled the ink aesthetic; QwenImageEdit 2511 handled iterative art direction: reframing compositions, adjusting spatial relationships, and putting the child into the icicle's environment.

The icicle's blue tint and the child's red hat were the only colour departures from the monochromatic ink world; both needed to be held consistently across multiple generations.

โฑ๏ธ Image-to-video and ink-wash transitions: LTX-2.3

Every shot went through LTX-2.3 I2V, running at half resolution (960x544) before upscaling. Two generation passes per shot:

1. The shot itself: subtle animation of a static composition. Example prompt:

"In the style of Suiboku with hand-drawn ink brush strokes and painterly textures, a static cinematic close-up captures the child's face tilted upward in a state of absolute wonder. Small, subtle movements bring the artwork to life. The child's eyes, rendered in fine black ink, remain wide and fixed. The scene is defined by thin drying ink washes and a total suspension of motion, capturing a held breath moment in the hand-drawn environment."

2. The transition: ink-wash brush reveals between shots. Example prompt:

"In the style of Suiboku with visible ink brush strokes and painterly reveal effects, the scene transitions from the child's face to an intimate shot of their reaching hand. A series of watery, grey ink brush reveals sweep across the frame, washing away the facial details to uncover the child's arm in a dark winter coat rising toward the icicle. The reveal follows the small hand in a thick, grey-patterned winter mitten as it moves very close to the crystalline blue shaft, hovering just around the icicle. This transition uses layered brushwork and drying paint washes to shift the focus."

The transitions required iteration. Common failure modes: an actual paintbrush appearing in frame when the prompt called for a paint effect, or a straight dissolve that read as a fade rather than an ink wash. Prompt refinement and seed variation eventually produced workable results for each transition, but it wasn't a reliable one-pass process.

The I2V + transition workflow is available to download on the Arca Gidan submission page.

📈 Upscaling with SeedVR2.5

All shots were upscaled to 1080p HD using SeedVR2 v2.5 with the 7B model. Downscaling the input beforehand removed most of the noise visible in the raw I2V generations.

The Z-Image title card workflow (stylized credits using Z-Image image-to-image) is also available on the Arca Gidan submission page.

For a complete breakdown of SeedVR2 v2.5, see the dedicated post here.

๐ŸŽž๏ธ Editing

Final assembly was done in Kdenlive: shot ordering, sound design layering, and the credit cards.

๐Ÿ” What we'd do differently

  • Consumer GPU optimization: the entire pipeline was built under time pressure for this contest, which meant running everything on an H100 and prioritizing output speed over efficiency. Given more time, we'd rework the workflows to achieve similar output without relying on expensive GPUs.
  • Animation control: better tools for directing motion. Everything from subtle animation to transition dynamics was driven by text prompting alone, which works but is slow to iterate.

📥 Downloads

All assets are freely available on the Arca Gidan submission page:

  • Dataset image captioning script (Qwen3-VL): caption generation for LoRA training
  • Z-Image LoRA training guide: full Musubi-tuner process on RunPod
  • LTX-2.3 LoRA training guide: full Musubi-tuner process on RunPod
  • LTX-2.3 I2V + SeedVR2.5 upscale workflow: ComfyUI workflow for shots, transitions, and upscaling
  • Z-Image title card workflow: ComfyUI workflow for stylized credits

👀 Go watch the other submissions

Our entry is one of about 90. The range across the full selection (styles, subjects, technical approaches) makes a stronger argument for open source AI than any single project could. Watch a few, leave a score. Voting is open until April 6th.

arcagidan.com/submissions
