Xiaohongshu AI Skill Bypasses AI Labeling Rules Using HTML Rendering

In February 2026, Xiaohongshu announced that AI-generated synthetic content must be proactively labeled and that unlabeled content would be restricted from distribution. More than three months later, an open-source project named guizang-social-card-skill appeared on GitHub, designed specifically to generate Xiaohongshu 3:4 image-text posts and WeChat Official Account covers. Its technical approach makes an unusual choice: it uses no AI model to generate image pixels. Instead, the entire layout is rendered with HTML+CSS, and images come from real-photo libraries such as Unsplash. The output is not an “AI-generated image” but a rasterized screenshot of a webpage rendered by a browser engine.

This choice responds to how the platform detects AI content. Since 2026, Xiaohongshu has deployed an audio-visual recognition model that identifies AIGC content by analyzing pixel distribution patterns in images and audio characteristics. During the same period, over 800,000 AI-managed accounts and nearly 150,000 AI-generated posts have been removed. For content creators who need to produce image-text content frequently, images generated by Midjourney or Canva AI are increasingly likely to be detected and flagged. Master Zang chose a different path: letting AI handle layout decisions while leaving the final pixels to a rendering engine and real-photo libraries.

This is a deliberate technical workaround. How far it can go, however, depends on how broadly the platform chooses to define "AI-generated synthetic content."

28 layout skeletons; AI is responsible for layout logic, not illustration.

The developer known as Master Zang, whose real name is Guizang, previously released guizang-ppt-skill, another AI tool for graphic layout scenarios. The new social-card-skill is more focused: it targets Xiaohongshu 3:4 graphics and WeChat Official Account 1:1 and 21:9 covers, with output resolutions of 1080×1440, 1080×1080, and 2100×900 respectively.
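The three resolutions can be checked against their stated aspect ratios with a few lines of arithmetic. This is only an illustrative sketch: the dictionary keys are labels invented here, not names used by the project.

```python
from fractions import Fraction

# Output sizes quoted in the article; the keys are labels chosen for this sketch.
OUTPUT_SIZES = {
    "xiaohongshu_3x4": (1080, 1440),
    "wechat_1x1": (1080, 1080),
    "wechat_21x9": (2100, 900),
}


def aspect_ratio(width: int, height: int) -> str:
    """Reduce width:height to lowest terms, e.g. 1080x1440 -> '3:4'."""
    ratio = Fraction(width, height)
    return f"{ratio.numerator}:{ratio.denominator}"


RATIOS = {name: aspect_ratio(w, h) for name, (w, h) in OUTPUT_SIZES.items()}
for name, ratio in RATIOS.items():
    print(name, ratio)
```

Note that 2100×900 reduces to 7:3, which is the same proportion as 21:9.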

Technically, this Skill comes with 28 layout skeletons divided into two visual systems: Editorial (magazine style, 16 layouts) and Swiss (Swiss International Style, 12 layouts), along with 10 preset theme color schemes. After the user inputs a destination, itinerary, or note topic, the AI selects an appropriate layout skeleton, determines text placement, processes map annotation parameters, and converts all design decisions into HTML+CSS. Playwright then drives a browser to render the result and capture PNG screenshots page by page.
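The skeleton-to-screenshot flow described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the project's code: the project itself runs on Node.js, the `SKELETON` template, `compose_html`, and `render_png` are hypothetical names, and the colors and font sizes are placeholders. The optional `render_png` step uses Playwright's Python bindings and is only needed when a PNG is actually wanted.

```python
from html import escape
from string import Template

# Hypothetical, heavily simplified layout skeleton. A real skeleton would
# carry far more CSS; the values below are placeholders.
SKELETON = Template("""<!doctype html>
<html><head><meta charset="utf-8"><style>
  body { margin: 0; width: ${width}px; height: ${height}px;
         background: ${bg}; color: ${fg}; font-family: sans-serif; }
  h1 { margin: 96px 72px 0; font-size: 88px; line-height: 1.1; }
  p  { margin: 32px 72px; font-size: 40px; }
</style></head>
<body><h1>${title}</h1><p>${subtitle}</p></body></html>""")


def compose_html(title, subtitle, width=1080, height=1440,
                 bg="#f4efe6", fg="#1a1a1a"):
    """Fill the skeleton with escaped text and the chosen canvas size."""
    return SKELETON.substitute(
        title=escape(title), subtitle=escape(subtitle),
        width=width, height=height, bg=bg, fg=fg,
    )


def render_png(html, out_path, width=1080, height=1440):
    """Rasterize the page in a headless browser (needs the playwright package)."""
    from playwright.sync_api import sync_playwright  # imported lazily on purpose

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": width, "height": height})
        page.set_content(html)
        page.screenshot(path=out_path)
        browser.close()


html = compose_html("Kyoto in 3 Days", "Temples, tea & trains")
# render_png(html, "card-01.png")  # uncomment once Playwright is installed
```

The point of the split is that the model only ever produces text (HTML and CSS); every pixel in the final PNG comes from the browser's own rasterizer.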

A component particularly useful for travel bloggers is the map module. It loads real tiles from OpenStreetMap using MapLibre and supports multiple location markers and connecting lines. Users simply provide city or attraction names, and the AI automatically generates a labeled base map and embeds it into the layout. The accompanying image sourcing workflow has a clear priority: user-uploaded photos take precedence; in the absence of user images, it automatically retrieves visuals in this order: Unsplash → Pexels → Flickr CC → Wallhaven.
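The sourcing priority above amounts to a simple fallback chain. The sketch below illustrates that ordering only; the provider callables are stand-ins defined for this example (no real Unsplash, Pexels, Flickr, or Wallhaven API is called), and `pick_image` is a hypothetical name.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# A provider takes a search query and returns an image URL, or None when it
# finds nothing suitable. Real providers would wrap each service's API.
Provider = Callable[[str], Optional[str]]

# Library order as described in the article.
SOURCE_ORDER = ("unsplash", "pexels", "flickr_cc", "wallhaven")


def pick_image(query: str, user_photos: Sequence[str],
               providers: Dict[str, Provider]) -> Tuple[str, str]:
    """Return (source_name, image) using user photos first, then the libraries."""
    if user_photos:
        return ("user", user_photos[0])
    for name in SOURCE_ORDER:
        search = providers.get(name)
        if search is None:
            continue  # provider not configured in this run
        url = search(query)
        if url:
            return (name, url)
    raise LookupError(f"no image found for {query!r}")


# Stub providers for demonstration: Unsplash finds nothing, Pexels succeeds.
stubs: Dict[str, Provider] = {
    "unsplash": lambda q: None,
    "pexels": lambda q: f"https://example.invalid/pexels/{q}.jpg",
    "wallhaven": lambda q: f"https://example.invalid/wallhaven/{q}.jpg",
}

from_library = pick_image("kyoto", [], stubs)
from_user = pick_image("kyoto", ["my_photo.jpg"], stubs)
```

Keeping the order in one tuple makes the priority explicit and easy to change without touching the lookup logic.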

The process consists of seven steps: Intake → Style & Theme → Layout Selection → Asset Prep → Compose & Render → Deliver & Review → Iterate. Each step is recorded in the .poster files within the task directory. To generate images in bulk, run node render.mjs, where Playwright renders each one sequentially. Additionally, a validation script validate-social-deck.mjs measures DOM elements in a real browser environment to detect layout issues such as text overflow, font sizes exceeding limits, and footer element collisions.
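The three classes of layout issue named above (overflow, oversized fonts, footer collisions) reduce to geometry checks on measured element boxes. The sketch below assumes the boxes have already been measured in a browser and handed over as plain numbers; the `Box` type, the `validate` function, and the 120px font limit are all assumptions made for illustration, not the behavior of validate-social-deck.mjs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """A measured element rectangle in CSS pixels."""
    name: str
    x: float
    y: float
    width: float
    height: float
    font_size: float = 0.0

    @property
    def right(self) -> float:
        return self.x + self.width

    @property
    def bottom(self) -> float:
        return self.y + self.height


def overlaps(a: Box, b: Box) -> bool:
    """True when the two rectangles share any interior area."""
    return a.x < b.right and b.x < a.right and a.y < b.bottom and b.y < a.bottom


def validate(boxes: List[Box], footer: Box, canvas_w: float = 1080,
             canvas_h: float = 1440, max_font: float = 120.0) -> List[str]:
    """Collect human-readable layout issues; an empty list means the page passes."""
    issues = []
    for box in boxes:
        if box.x < 0 or box.y < 0 or box.right > canvas_w or box.bottom > canvas_h:
            issues.append(f"{box.name}: overflows canvas")
        if box.font_size > max_font:
            issues.append(f"{box.name}: font size {box.font_size:g}px exceeds {max_font:g}px")
        if overlaps(box, footer):
            issues.append(f"{box.name}: collides with footer")
    return issues


footer = Box("footer", 72, 1340, 936, 60)
boxes = [
    Box("title", 72, 96, 936, 200, font_size=88),
    Box("body", 72, 340, 1040, 1020, font_size=40),  # too wide and too tall
]
issues = validate(boxes, footer)
```

Measuring in a real browser matters because text wrapping and font metrics cannot be predicted reliably from the HTML source alone.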

The design goal of this system is clear: to be precise and controllable like desktop publishing software, rather than free but unpredictable like a diffusion model. The cost is that creative freedom is confined to 28 layout skeletons. For creators who rely on personal photography styles, hand-drawn elements, or irregular collages, these skeletons bring design constraints rather than efficiency gains.

In terms of ease of use, the CLI version requires Playwright and a Node.js environment, as well as access to the Claude Code or Codex API. There is also a web-based entry point at xiaohongshu.guizang.ai designed for non-developers, but no public comparison shows whether its feature set matches the CLI version. The developer’s multiple posts on X and the continuously updated README indicate that the project is still under rapid development.

The pixels do not come from a generative model, but compliance does not equal long-term security.

Based on publicly available information and technical documentation, Xiaohongshu's AI content detection logic relies primarily on audio-visual recognition models. These models analyze pixel distribution patterns to determine whether content was generated by AI. Diffusion models and GANs leave specific statistical signatures at the pixel level when generating images, and these differ from the natural lighting, lens distortion, and noise patterns captured by camera sensors. The training objective of the audio-visual recognition model is to detect these statistical inconsistencies.

The avoidance logic of Master Zang's Skill is based on a key distinction: the pixels in its output images do not come from any generative model. The HTML rendering engine rasterizes CSS styles, producing pixel distributions that more closely resemble browser interface screenshots or desktop publishing software outputs. The photographic elements are sourced from real-world photos from libraries like Unsplash, captured by cameras and manually post-processed, without any traces of diffusion models.

But this distinction holds only if the platform’s definition of “AI-generated synthetic content” precisely stops at the line of “pixels generated by AI models.” Xiaohongshu’s official announcement uses the term “AI-generated synthetic content,” which has a broad scope. Once the platform expands its definition to include “programmatic renderings assisted by AI” or incorporates browser rendering characteristics of HTML-rasterized images into its recognition model’s training set, the current technical advantages of this approach will vanish.

The platform has both the technical foundation and the governance motivation to expand its scope. The audio-visual recognition model itself is continuously evolving. If the training data includes a large number of comparative samples of HTML-rendered images and AI-generated images, the model can learn to distinguish between "subpixel anti-aliasing characteristics of browser font rendering" and "irregular pixel blocks produced by GANs during text generation." There is currently no public information indicating that Xiaohongshu has started training in this direction; in terms of the model's capability boundaries, however, such an extension is technically feasible.

More importantly, the compliance requirements related to mini-program hosting deserve attention. So far, no official documentation has been found indicating that this Skill has obtained a model filing number or completed the relevant compliance registration. If the platform adds traceability requirements for image generation toolchains to its content review process, the absence of filing information could become a new ground for rejection.

API template engine, platform customization tools, and HTML rendering are branching into three separate paths.

Observing tools on the market that generate images for social media, one can see they are diverging into three distinct technological pathways, each facing different structures of moderation risks.

AI models generate images directly. This path is represented by Canva AI’s Magic Design feature, released in April 2026, which generates design drafts containing AI visual elements directly from text prompts. Images produced by models such as Midjourney and DALL·E fall into the same category. The problem is clear: these images are the primary targets of audio-visual recognition models. Canva’s approach is to encourage transparent labeling rather than to evade detection. There is no publicly available data confirming whether labeled AI-generated posts on Xiaohongshu receive reduced recommendation weight, but the platform’s stated policy explicitly restricts the distribution of unlabeled AI content. With each diffusion model update, pixel-level statistical features may change, prompting corresponding iterations of the detection models, so creators face a constantly moving target.

API template engine rendering. Bannerbear is a typical example of this approach. Users create templates in a designer interface, then pass JSON data via a REST API to modify layer variables, and the server renders the output as PNG or JPG. Its core is still “programmatic rendering,” not “model-generated pixels,” and the output contains no traces of diffusion models. The difference from Master Zang's Skill is that Bannerbear’s templates rely on manual design, with no AI involvement in layout decisions, whereas Master Zang's Skill lets Claude read and write HTML directly, delegating layout choices to AI. The risk of the Bannerbear approach lies on another dimension: when many accounts use identical templates, colors, and fonts to produce graphics, platforms may still trigger their “programmatic bulk production” pattern recognition even though no image is AI-generated. The trigger conditions for anti-spam rules are not the same as those for AI detection, but for creators running accounts in bulk the result is the same: restricted distribution.

Platform-customized generation. Pin Generator is designed specifically for Pinterest and automatically creates Pins that align with the platform’s algorithmic preferences. The core of this approach is not evasion but full adaptation: dimensions, visual style, and posting frequency all conform to platform guidelines. The advantage is the lowest risk of review rejection. The downside is equally clear: the tool’s capabilities are locked to Pinterest’s rules, and it becomes ineffective as soon as Pinterest updates its algorithm or restricts third-party API access. Pin Generator is a platform-specific tool, while Master Zang's Skill is a cross-platform solution. Platform-specific tools are safer but more fragile; cross-platform solutions are more flexible but more complex. This is a recurring trade-off in the field of AI tools.

The risk profiles of the three approaches differ. AI-generated images offer the greatest freedom, but every model update has to contend with newly updated detection models. Template engines are the most stable but may be flagged by anti-spam rules. HTML rendering lies between the two: layout is flexibly controlled by AI, while pixels are handled by browsers and real-world assets. This avoids detection at the level of “AI-generated pixels,” but it cannot counter an expansion of the platform's semantic rules.

The limit of the layout system lies not in the code, but in the type of content.

28 layout skeletons cover the two mainstream visual systems: magazine style and Swiss style. This system is highly suitable for travel bloggers who need to display map routes, timelines, or multi-day itineraries. Map annotations and itinerary connections are the core information in these notes, and the layout skeletons structure this information while maintaining a professional typographic aesthetic.
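Structured itinerary data of this kind maps naturally onto GeoJSON, the format MapLibre accepts for marker and line sources. The sketch below shows one way ordered stops could be turned into point markers plus a connecting line; the function name, the property names, and the approximate Kyoto coordinates are assumptions for illustration, and the Skill's real map parameter format is not documented here.

```python
import json


def itinerary_geojson(stops):
    """Build a GeoJSON FeatureCollection from (name, longitude, latitude) stops.

    Each stop becomes a Point feature; two or more stops also produce one
    LineString connecting them in visiting order. GeoJSON positions are
    written as [longitude, latitude].
    """
    features = [
        {
            "type": "Feature",
            "properties": {"name": name, "order": index + 1},
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
        }
        for index, (name, lon, lat) in enumerate(stops)
    ]
    if len(stops) >= 2:
        features.append({
            "type": "Feature",
            "properties": {"kind": "route"},
            "geometry": {
                "type": "LineString",
                "coordinates": [[lon, lat] for _, lon, lat in stops],
            },
        })
    return {"type": "FeatureCollection", "features": features}


# Approximate coordinates, for demonstration only.
stops = [
    ("Kyoto Station", 135.7588, 34.9858),
    ("Fushimi Inari", 135.7727, 34.9671),
    ("Kiyomizu-dera", 135.7850, 34.9949),
]
collection = itinerary_geojson(stops)
print(json.dumps(collection, indent=2)[:200])
```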

But Xiaohongshu’s content ecosystem is far richer than travel guides. Fashion posts rely on personal photography styles and color tones, beauty reviews require high-resolution macro photos and product comparison images, and lifestyle content extensively uses collage layouts and handwritten annotations. The “layout” of these content types is not a structured presentation of information, but an expression of personal aesthetics and emotion. In this context, 28 layout skeletons are not tools—they are constraints.

Technical limitations are equally real. Currently, three sizes are supported: 1080×1440 (Xiaohongshu 3:4), 2100×900 (WeChat Official Account 21:9), and 1080×1080 (WeChat Official Account 1:1). TikTok’s 9:16 vertical cover and Bilibili’s 16:9 horizontal cover are not supported. The image library relies on Unsplash and Pexels, whose assets lean toward high-quality photography and are well suited to travel, landscape, and urban architecture imagery. However, these libraries offer limited coverage of high-frequency niche categories such as food close-ups, staged cosmetic product shots, or single-item outfit imagery. The user-image-first strategy can partially alleviate this issue, provided creators have accumulated enough photos of their own.

The validation mechanism is a double-edged sword. validate-social-deck.mjs can intercept layout errors before image generation, ensuring zero errors in 100 batch renders—a critical efficiency safeguard for operations requiring dozens of daily images. However, it also means any design that deviates from predefined layout rules will be rejected by the script. Creators wishing to add an angled text decoration or custom margins within the standard layout cannot simply drag and adjust as they would in Canva; they must directly edit the HTML and CSS source code.

The barrier to local deployment is another dividing line. Creators who can run Playwright and Node scripts can customize the layout skeletons and rendering scripts in depth. Most Xiaohongshu creators, however, can access only the subset of features available through the web interface. The actual value these two types of users derive from the Skill differs significantly. The core user base of the open-source project consists of creators and developers with technical backgrounds who are willing to tinker, not ordinary content creators looking for “one-click image generation.”

There is no one-size-fits-all answer, but the divergence in technical pathways itself speaks volumes.

A Xiaohongshu travel blogger faces three choices: using Midjourney to generate illustration-style itinerary graphics, at the risk of being flagged and downranked; setting up templates in Bannerbear and batch-importing data daily, at the risk of anti-spam penalties due to template homogenization; or using Master Zang's Skill to let AI select layouts and render images via HTML, at the risk of the platform expanding its definition of “synthetic content.” There is no safe option, only different combinations of risk.

This dynamic itself conveys a message: the iterative arms race between platforms and AI tools has already begun. Each time the platform updates its detection model, a batch of tools loses its technical advantage. Each time a new tool finds a way around detection, the platform adjusts its strategy again. This is not a process that will converge to a stable state. The effectiveness of HTML rendering solutions depends on whether Xiaohongshu’s audio-visual recognition model continues to focus on “diffusion model pixel features” or expands to cover “all non-native photographic pixels.”

For content creators, the distinction between “AI-assisted” and “AI-replaced” has become practically meaningful. The platform’s stance is clear: it encourages AI as a creative amplifier while opposing the use of AI to replace humans in low-quality, bulk production. In Master Zang's Skill, AI handles layout decisions rather than content generation: the photos are real photographs, and the layouts are pre-designed skeletons created by human designers. This falls squarely within the “AI-assisted” range. Tools that generate both text and images entirely with generative models are exactly what the platform aims to crack down on.

It is still uncertain whether this distinction will become an operational standard for platform review. However, tool developers are already responding to this definition with technical choices.