Author: Changan I Biteye Content Team
Can someone who has never edited a video create an AI-generated short video with a plot, dialogue, and scene transitions?
Yes, and the entire process takes less than half a day.
This article teaches you how to go from: coming up with a story → breaking it down into storyboards → generating video → editing it into a final piece.
No prior experience needed—just follow along once, and you’ll have a complete AI-generated short video.
I. From Idea to Story: AI Videos Are Not Generated by a Single Prompt
Many people start creating AI videos by opening Jimeng and staring at the input box, unsure of what to write. After typing a few words, the generated result is far from what they imagined, and they start to doubt whether the tool is any good or whether they simply don't know how to write effective prompts.
For example, “I want to create a story about a Biteye junior sister reincarnated in the crypto world as a big shot”—this is an idea, not a story.
An idea is a direction—it tells you generally what to do. A story is a structure—it tells you exactly what to show in each scene. Between idea and story lies a necessary process: script planning.
The simplest way is to open any LLM and tell it the vague idea in your mind directly, letting it help you expand the story. You don’t need to figure out all the details yourself—just provide a direction, and the rest can be worked out together with it.
Once the storyline is finalized, don't immediately break it down into shots; instead, divide it into several major sections based on the narrative rhythm, clearly identifying the core event in each section. This step ensures overall pacing control, preventing any section from being too slow or too rushed.
Jimeng video clips are up to 15 seconds long; in practice, clips under 12 seconds are the most stable and have the lowest probability of visual glitches. For a 1-minute final video at roughly 12 seconds per clip, you need about 5 clips (60 ÷ 12 = 5).
We’ve divided our story into five sections:
Paragraph one: Begin by establishing the setting and characters.
Paragraph two: The core task is to establish the timeline.
Paragraph three: Show the character's transformation from confusion to clarity.
Paragraph four: Tally up the wealth and push the emotion to a climax.
Paragraph five: Complete the reversal to form a closed loop with the opening.

After the paragraphs are finalized, break each paragraph down into specific shot descriptions. For each shot, include four elements: subject, location, action, and camera angle. Do not include motion in the shot descriptions—only describe static moments.
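For example, a hypothetical shot description containing all four elements might read: "Biteye's junior sister (subject) sits at a desk in an office (location), staring at a candlestick chart on her monitor (action), medium shot (camera angle)." Note that it freezes a single moment; nothing in it is moving.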
Copy the script from Scene One into the AI chat box, enter "Help me generate shot descriptions based on the script from Scene One," and you'll get the following result👇

II. From Story to Visuals: First, Lock in the Characters, Settings, and Storyboards
This chapter is the most critical part of the entire process—the quality of the images you generate here directly determines the upper limit of the final video's quality.
Start with the three-view diagram and lock in your main character.
Before generating any storyboards, the first step is to create the front, side, and back views of the main character.
The three-view diagram consists of three images of the same character (front, side, and back) that fix the character's appearance once and for all; every subsequent scene stays consistent by referencing these three images.
If you skip this step and generate the storyboard directly, you'll find the character looks different every time: her hairstyle changes, her face shape changes, and the video becomes impossible to finish.
Open ChatGPT/Seedream and enter in the chat box:
Help me generate a three-view diagram of Biteye's junior sister.
AI will generate an image featuring the same person from three angles. If the generated person differs significantly from your desired result, you can upload a reference image.
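If the plain prompt keeps giving generic results, a more detailed version often locks the look better. The details below are illustrative, not from the original story:
Help me generate a three-view character sheet of Biteye's junior sister: front, side, and back views, early 20s, shoulder-length black hair, grey hoodie, plain white background, identical outfit and proportions in all three views.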
Once you're satisfied with the three-view diagram, download it. You'll need to upload it again as a reference each time you generate a video.

Create a scene reference image and lock in your background.
After locking in the character, use the same logic to generate a separate reference image for your scene by entering in the chat box: "Help me generate an image of an office."

Before generating storyboards, it's essential to understand a fundamental concept: a shot is the smallest unit of expression in video.
The camera also speaks—different shot sizes convey different information. Common shot sizes include the following:
Wide shot: Provides context, allowing the audience to understand where the scene is taking place and which characters are present.
Medium shot: Used to advance the plot, clearly showing actions and expressions; it is the most commonly used shot type in storytelling.
Close-up: Focuses solely on the face, hands, or a key prop to amplify detail and deliver a strong emotional impact.
After understanding a single shot, you need to go one step further: a video is not just one shot, but a combination of multiple shots arranged in rhythm.
In actual production, we typically use a "4-grid" or "9-grid" layout to structure the shot composition of a video—arranging four or nine shots to convey a complete message.
The choice between a 4-grid and a 9-grid essentially comes down to controlling the rhythm:
Slow-paced segments: For example, when establishing the setting at the beginning or closing on an emotional note, a 4-grid is enough; four shots give each frame room to breathe.
Fast-paced segments: For example, during a fight climax, rapid camera cuts are needed to build tension. Using a 9-grid layout, with nine shots compressed into a single video segment, creates a completely different editing feel.
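To make the rhythm concrete: in a 12-second clip, a 4-grid means a cut roughly every 3 seconds, while a 9-grid means one roughly every 1.3 seconds. That difference in cut frequency is what produces the "completely different editing feel."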
Once you understand the shot composition and pacing, you can begin the actual production: turning abstract stories into concrete visuals.
Once the character's three-view diagrams and scene reference images are ready, the next step is to convert each of the storyboard descriptions written earlier into visual frames. The reason is simple: AI performs better on well-defined single frames than on continuous processes, and this also greatly reduces the rate of failed generations.
The specific approach is:
Generate one shot at a time: first upload the character's three-view diagrams and corresponding scene reference images to the ChatGPT conversation, then input the prompt for generating the storyboard.
Help me generate a four-panel storyboard based on the story outline and scene descriptions (including the previous AI-generated scene prompts), along with scene images and character images.
The model will split this shot into four frames based on the storyboard information you provide, ensuring consistency in characters and scenery, as shown below:

💡 Pro Tip: There are several common pitfalls in text-to-image generation—knowing them in advance can save you many attempts:
Ask for a shot of a person holding a phone and gaming, and the generated phone screen will often face the viewer instead of the character: the AI's logic prioritizes "readable content," so it turns the screen toward the camera. The correct approach: "holds the phone horizontally with both hands, screen facing the character's face, back of the phone facing the camera."
Professional titles cause AI to associate entire scenes: saying "nurse" makes the AI think of a hospital; saying "chef" makes it think of a kitchen. The correct approach is to describe only the clothing you actually want, without mentioning professional titles.
Text-to-image can only generate still frames; an action like "turning the head" has no single visual state. The correct approach is to describe only what exists in this one frame.
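Putting these pitfalls together, a hypothetical before-and-after: instead of "a nurse turns her head while checking her phone," write "a young woman in a white uniform, head turned to the left, holding a phone horizontally with both hands, screen facing her face, back of the phone facing the camera."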

III. From Image to Video: Write Prompts for Actions, Not Re-Descriptions of the Scene
The storyboards are ready; now we’re turning them into an animated video.
🌟 Register for Jimeng
Open your browser, search for "Jimeng AI," and visit the official website. Click Login in the top-right corner and register with your Douyin account or phone number; the site is directly accessible within China.
New users can generate a free 15-second video. If you need a subscription, Biteye's junior sister has compared Seedance 2.0 prices across multiple platforms; see "The Ultimate Guide to Subscribing to Seedance 2.0 at the Lowest Cost Online" for details.
🌟 How to write video prompts?
This is the most critical part of this step and the most common mistake beginners make.
First, upload all reference images at once. Jimeng supports multiple reference images at the same time; simply drag them into the chat box. Drag in everything you prepared in the previous chapter (the character's three-view diagrams, the scene reference images, and the 4-grid or 9-grid storyboard frames) in one go. Jimeng will combine the information from all of these images to generate the video.
Many beginners make this mistake: they simply re-describe what's in the image. Jimeng can already see the image you uploaded, so there's no need to tell it what's in the picture.
The prompt should describe what is moving in the scene, how it is moving, whether the camera itself is moving, and what happens during each time interval.
Follow this template, with each line corresponding to a time segment in the video:
Please use the above storyboard as a reference to generate a video.
[Start second to end second], [shot type], [camera movement], [character or subject] + [specific action], sound effect: [sound description].
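A hypothetical filled-in version of the template (timings, actions, and sounds are all illustrative):
[0s–4s], medium shot, slow push-in, the junior sister looks up from her phone and freezes, sound effect: keyboard clatter fading out.
[4s–9s], close-up, static camera, her eyes widen as the chart on the screen turns green, sound effect: a single sharp notification ping.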

🌟 Audio descriptions are the part beginners overlook most often. If there is dialogue in the video, simply writing "voice" is not enough; the model will generate a random voice. To keep a character's voice consistent across multiple clips, there are two methods:
1️⃣ Use the audio from the first segment as a reference
First, generate the first video segment. Once you're satisfied with the result, export the audio from this segment separately. For each subsequent segment, upload this audio as a voice reference so the system uses the same voice for the later narration, keeping the character's voice consistent.
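If you prefer to extract the audio outside the editor, one common option is the ffmpeg command line (the file names here are placeholders):
ffmpeg -i segment1.mp4 -vn -c:a copy reference_voice.m4a
The -vn flag drops the video track, and -c:a copy writes the audio stream into a standalone file without re-encoding.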
2️⃣ Use Fish Audio to find reference tones
Open Fish Audio, search for voices that match the character’s personality, listen to samples, and download one as a reference audio. Use this same reference audio for every video segment to ensure consistent audio throughout the entire production.
🌟 Use punctuation to control AI voice tone
Writing lines for an AI voice model isn’t just about inputting text—it’s about how punctuation changes the tone and delivery of the same sentence.
The core logic is: punctuation controls pauses, and pauses determine emotion.
…… An ellipsis breaks the sound while keeping the breath; suitable for thinking, hesitating, or trailing off mid-sentence.
……! Used in combination, it reads as a sudden burst after suppression.
( ) Content inside parentheses is automatically lowered in volume and delivered as a whisper; ideal for inner thoughts and self-talk.
* * Words wrapped in asterisks come out lower, slower, and heavier, emphasizing key information.
[ ] Instructions written in square brackets, such as [take a deep breath] or [pause for 1 second], are performed by the model rather than spoken.
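A hypothetical line for the junior sister that combines several of these marks: (can this chart be real……) I…… I actually made it back……! [take a deep breath] *This time, the coins stay.*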
💡 Quick Tips:
AI lacks spatial awareness and often cannot tell left from right, so a positional-relationship reference image is needed to show how the character should move. An even simpler method: draw arrows on the image to mark the movement trajectory, then add "remove the arrows" at the end of the prompt.
Describe slow actions, not fast ones. The model handles slow motion far more stably than fast motion. For fast-paced segments, speed the clip up in editing rather than asking the model to generate fast movement.
Upload reference images for each video segment, not just once. The model has no memory across segments; without uploading reference images, the character’s appearance may drift.

IV. From Clips to Final Cut: Editing Determines the Final Quality of the Video
Editing and post-production are the crucial final steps that tie everything together. Each clip generated earlier is independent, potentially differing in color tone, lacking rhythmic continuity, and featuring disconnected audio. The role of editing is to unify these fragments into a cohesive story.
Adding music to the video engages the audience's emotions, and subtitles make the dialogue clearer. The same footage, edited well versus edited poorly, can differ by an order of magnitude in final quality.
The process consists of four steps: arrange the footage → unify the color tone → add audio → add subtitles, then export.
Step 1: Arrange the assets
Open CapCut and drag all clips onto the timeline in scene order. Ignore color grading and audio for now—just confirm the sequence, review the overall pacing, and trim any overly long clips at this stage.
Step 2: Unify the color tone
Segments generated at different times may vary slightly in color temperature and brightness, making them look disjointed when placed together. The solution: select the clips of each scene together and apply one filter to them in the "Adjust" panel, for example a cool blue tone for Scene One and a warm yellow tone from Scene Two onward; what matters is consistent color within each scene.
Step 3: Add background music and sound effects
The dialogue audio has already been processed during video generation; this step primarily adds two types of sound: background music and ambient effects.
Background music sets the overall emotional tone; keep the volume below 30% of the dialogue to ensure it doesn't overpower the vocals.
Step 4: Add subtitles
Use CapCut’s “Smart Subtitles” to automatically recognize dialogue, then review for typos and standardize the font and positioning. For narration or inner monologue, use a different style to distinguish it from regular dialogue, such as italics or a different color.
V. From Tool to Expression: What AI Video Has Truly Changed
In our previous article, "GPT Image 2.0 Empowers Seedance 2.0: Everyone Can Shoot a Hollywood Blockbuster," we argued that in the AI era, the barrier to creating videos has been lowered, and everyone will soon be able to produce Hollywood-quality blockbusters.
But a low barrier to entry doesn't mean you can succeed.
The tools are all public, and tutorials are everywhere, but most people get stuck at the same point: they’ve never successfully completed a full run-through.
In this article, Biteye has guided you from a vague idea to a fully edited video.
In the past, this process required a full team of specialized roles: screenwriters, storyboard artists, animators, cinematographers, and editors—each step presenting its own barrier.
And now, these steps haven't disappeared—they've simply been compressed into a single process.
This signifies a more fundamental shift: video is no longer a product of productive capacity, but is increasingly becoming a product of expressive capacity.
