Why Graphic Design Is Hard for Large Language Models
And How to Make Them Better
AI models have come a long way. Just a few years ago, GPT-3 was a toy — good at generating text, but far from reliable. Today, frontier models can write blogs, generate code, and even act as autonomous agents. Yet, despite their advancements, they still struggle with something fundamental: graphic design.
The failures are easy to spot. SVG (Scalable Vector Graphics) generation is riddled with bizarre, infantile illustrations, odd stylistic choices, and awkward layouts. Users have been sharing their frustrations online:
So why does this happen? And more importantly, how can we fix it?
Why This Matters
If AI is going to be useful for designers, it needs to move beyond just generating code or text — it has to understand structure, composition, and visual relationships. That’s why we built the Presto Design Public Benchmark, a dataset that measures how well LLMs can reconstruct valid, visually accurate SVGs.
Through our experiments, we’ve found that design is uniquely challenging for language models, in ways that go far beyond just “understanding code.” Here’s why.
Why LLMs Struggle with SVGs and Graphic Design
Graphic design, at its heart, is the task of selecting and composing assets: text, stock photos, icons, and illustrations, combined with colors, fills, and effects to create a visual design that is both beautiful and effective at communicating.
This task is a sort of “perfect storm” for LLMs, combining many of their weakest areas:
- Underlying a design is a huge quantity of numbers: positions, sizes, transformations, opacities, colors (usually RGB/HSL scalars), and Bézier paths (hundreds of numbers just to draw a line). LLMs are famously weak with numbers, never mind the subtle interrelationships between hundreds or thousands of them.
- Creating a design (e.g. as SVG code) involves visual, artistic reasoning: which assets to select (e.g. stock photos) and how to combine them. Today’s training datasets (e.g. the Common Crawl of the internet) simply don’t build any sort of intelligence for this kind of reasoning.
- Furthermore, graphic design is usually a long, iterative process: create some design ideas, reflect on them, render them, adjust them, get user feedback, make changes, render, adjust, ad infinitum. This combines a skill that is already weak in one-shot scenarios with the need for a long-context agentic flow built on that same skill!
In this article I’ll explain these reasons (and several more) in detail, and show how to address them.
Aside — why LLMs and not Stable Diffusion?
Stable Diffusion has had incredible success generating coherent, good looking (albeit often cheesy) images. So why is that not the solution for graphic design?
In its current form, it has some major drawbacks that prohibit it from being a serious solution for professional design:
- It struggles to render text accurately
- It doesn’t use brand-specified fonts
- It doesn’t use approved stock photography
- It doesn’t reliably handle company logos and wordmark assets without distorting/corrupting them
- It doesn’t deliver scalable/high resolution designs for print
- It doesn’t provide a design that can be easily tweaked in subsequent feedback rounds (instead it re-generates the entire design and usually changes many parts of it)
By contrast, using an LLM to take a user’s brand, fonts, assets, and instructions and produce a design (e.g. in SVG format) fulfills all of these requirements; it just needs a much better sense of visual comprehension.
Numbers Have Long Been Difficult for Models
While modern LLMs have improved at basic arithmetic, design requires a lot of hard numerical reasoning.
Coordinates, dimensions, and transformations in SVGs demand spatial awareness that language models aren’t naturally trained for. Furthermore, almost every number (font size, text position, text color) exists in relation to almost every other number (illustration position, background color).
For example, consider a model generating a simple rectangle:
<rect x="10" y="20" width="200" height="100" opacity="0.8" />
A small mistake in x or y can completely misplace the rectangle. Unlike text, where meaning often remains intact despite minor errors, small coordinate miscalculations can break an entire design.
A related challenge is that SVGs mix multiple relative and absolute coordinate systems: an element’s final position depends on whether it sits inside a pattern, under a transform, or within a nested viewBox. This adds many layers of abstraction that models must learn.
For instance, consider this code:
<svg width="400" height="400" viewBox="0 0 200 200" xmlns="http://www.w3.org/2000/svg">
<!-- Absolute Positioning -->
<rect x="10" y="10" width="40" height="40" fill="blue" />
<!-- Relative positioning inside a group -->
<g transform="translate(60,10)">
<rect width="40" height="40" fill="green" />
</g>
<!-- Pattern with its own coordinate system -->
<defs>
<pattern id="dots" x="0" y="0" width="10" height="10" patternUnits="userSpaceOnUse">
<circle cx="5" cy="5" r="3" fill="red" />
</pattern>
</defs>
<rect x="10" y="60" width="40" height="40" fill="url(#dots)" />
<!-- Transformed element (scaled and rotated) -->
<g transform="translate(120,60) rotate(30) scale(1.5)">
<rect width="40" height="40" fill="purple" />
</g>
<!-- ViewBox affecting scaling -->
<g transform="translate(10,120)">
<svg width="80" height="80" viewBox="0 0 40 40">
<rect width="40" height="40" fill="orange" />
</svg>
</g>
</svg>
In this example,
- The blue rectangle is placed with absolute positioning.
- The green rectangle is inside a <g> group, translated relatively.
- The red-dotted rectangle uses a pattern, defining a new coordinate system.
- The purple rectangle is transformed (translated, rotated, and scaled), so its coordinates are interpreted through that combined transform.
- The orange rectangle is inside another <svg> with a viewBox, affecting scaling.
This demonstrates how multiple coordinate systems overlap, making it challenging for models to learn precise positioning.
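To make the arithmetic concrete, here is a minimal sketch (plain Python with numpy, not tied to any SVG library) of what a model implicitly has to get right just to know where the purple rectangle’s far corner lands on the canvas:

import numpy as np

def translate(tx, ty):
    return np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]], dtype=float)

def rotate(deg):
    r = np.radians(deg)
    return np.array([[np.cos(r), -np.sin(r), 0],
                     [np.sin(r),  np.cos(r), 0],
                     [0, 0, 1]], dtype=float)

def scale(s):
    return np.array([[s, 0, 0], [0, s, 0], [0, 0, 1]], dtype=float)

# transform="translate(120,60) rotate(30) scale(1.5)" composes right-to-left
# onto the rectangle's local coordinates.
ctm = translate(120, 60) @ rotate(30) @ scale(1.5)

# The far corner of the 40x40 purple rect in its local coordinate system.
corner_local = np.array([40, 40, 1], dtype=float)
corner_user = ctm @ corner_local  # position in viewBox user units

# viewBox="0 0 200 200" is mapped onto a 400x400 pixel canvas: a 2x scale.
corner_pixels = corner_user[:2] * (400 / 200)

print(corner_user[:2])   # ~[141.96, 141.96] in user units
print(corner_pixels)     # ~[283.92, 283.92] in device pixels

One wrong sine, or matrices composed in the wrong order, silently shifts the element, and a real design contains hundreds of such chains.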
How to address this
This problem is surmountable with pure supervised training. The catch is that it takes a huge amount of compute to solve even simple, limited domains (e.g. two lines of text on a fixed-size page), and exponentially more for real-world designs (since the space of possible element and attribute combinations is massive).
We’ve observed that models can successively learn skills in isolation and then combine them. For example, this model has been trained with interleaved datasets: one to set text, one to place images inside masks, and a third to create “pill buttons”. Here are examples of each separate skill:
And in evaluation, the model is able to combine these skills (albeit with mistakes, as this is a relatively small dataset of 2M records) when asked to replicate designs from the wild:
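The interleaving itself can be as simple as mixing the per-skill corpora at the example level before fine-tuning. The snippet below is only an illustrative sketch; the tiny placeholder datasets stand in for the real text, image-mask, and pill-button corpora:

import random

# Hypothetical per-skill corpora of (prompt, target_svg) training pairs.
text_examples = [("Set the headline 'SALE' in Gotham Bold", "<svg>...</svg>")]
mask_examples = [("Place the photo inside a circular mask", "<svg>...</svg>")]
pill_examples = [("Add a 'Shop now' pill button", "<svg>...</svg>")]

def interleave(datasets, weights, seed=0):
    """Yield training pairs, sampling each skill dataset in proportion to its
    weight so that no single skill dominates the fine-tuning mix."""
    rng = random.Random(seed)
    iters = [iter(d) for d in datasets]
    while iters:
        i = rng.choices(range(len(iters)), weights=weights)[0]
        try:
            yield next(iters[i])
        except StopIteration:
            del iters[i], weights[i]

mixed = list(interleave([text_examples, mask_examples, pill_examples],
                        weights=[1.0, 1.0, 0.5]))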
SVGs Contain a Lot of Cruft
SVGs can be messy. They often contain redundant attributes, unnecessary metadata, and multiple valid ways to describe the same shape. For example, a filled circle can be represented with <circle> or a <path> with an arc command — both visually identical but structurally different.
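As a concrete illustration, the two SVG documents embedded below produce the same filled red disc; rasterizing them (here with cairosvg and Pillow, assuming both are installed) shows the renderings differ only by edge anti-aliasing, even though the markup is completely different:

import io
import cairosvg
import numpy as np
from PIL import Image

CIRCLE = b"""<svg xmlns="http://www.w3.org/2000/svg" width="120" height="120">
  <circle cx="60" cy="60" r="50" fill="red"/>
</svg>"""

# The same disc expressed as a path built from two elliptical arcs.
PATH = b"""<svg xmlns="http://www.w3.org/2000/svg" width="120" height="120">
  <path d="M 10 60 A 50 50 0 1 0 110 60 A 50 50 0 1 0 10 60 Z" fill="red"/>
</svg>"""

def raster(svg_bytes):
    png = cairosvg.svg2png(bytestring=svg_bytes)
    return np.asarray(Image.open(io.BytesIO(png)).convert("RGBA"), dtype=float)

diff = np.abs(raster(CIRCLE) - raster(PATH))
print(diff.max())  # only a little anti-aliasing at the edge differs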
Here’s the start of a typical Inkscape design; the SVG file is 5MB of text despite containing only three lines of rendered text and one image:
Just a small part of the 5MB of SVG code:
<svg
version="1.1"
id="svg1"
width="672"
height="672"
viewBox="0 0 672 672"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns="http://www.w3.org/2000/svg"
xmlns:svg="http://www.w3.org/2000/svg">
<defs
id="defs1">
<color-profile
name="U.S.-Web-Coated-SWOP-v2"
xlink:href="data:application/vnd.iccprofile;base64,AAq5LGxjbXM
...
<g
id="layer-MC0">
<path
id="path1"
d="M 531,27 H 27 v 504 h 504 z"
style="fill:#f8f7f3;fill-opacity:1;fill-rule:nonzero;stroke:none"
transform="matrix(1.3333333,0,0,-1.3333333,-36,708)" />
<text
id="text1"
xml:space="preserve"
transform="matrix(1.3333333,0,0,1.3869535,197.3464,76.123733)"><tspan
style="font-variant:normal;font-weight:700;font-stretch:normal;font-size:25.2865px;font-family:Gotham;writing-mode:lr-tb;fill:#3f1e1e;fill-opacity:1;fill-rule:nonzero;stroke:none"
x="0 19.976336 36.918289 65.289742 72.875694 89.767075 108.04922 129.54274 149.31679 168.43338 187.09482"
y="0"
id="tspan1">NEW PRODUCT</tspan></text>
LLMs trained on vast, unstructured datasets understandably struggle to extract the handful of elements that matter from the huge bloat of SVG boilerplate (in this example, roughly 300 characters matter out of 5,000,000).
How to address this
A necessary step in creating a design AI is to engineer large-scale data cleaning. This is much harder than it sounds, due to the huge number of ways SVGs complicate themselves. For instance:
- A vast array of redundant elements and attributes get added by design programs
- Elements are often in deeply nested many level matrix transformations
- Text is often outlined into paths
- CSS styling needs to be normalized into SVG presentation attributes
- SVG supports many different ways to represent coordinate spaces
- SVGs can embed all sorts of other content
- Different renderers treat various features differently
- … and much more…
This is just good old-fashioned engineering, requiring plenty of tests, sweat, and manpower; a small taste of it is sketched below.
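Here is a deliberately minimal cleanup pass using lxml, to give a flavour of that engineering. A production pipeline does far more than this sketch (flattening nested matrix transforms, normalizing CSS into attributes, handling outlined text, and so on):

from lxml import etree

KEEP_NS = {"http://www.w3.org/2000/svg", "http://www.w3.org/1999/xlink"}

def strip_editor_cruft(svg_bytes: bytes) -> bytes:
    """Remove comments, metadata, editor-private elements (Inkscape,
    Sodipodi, ...) and editor-private attributes from an SVG document."""
    root = etree.fromstring(svg_bytes)
    for elem in list(root.iter()):
        if not isinstance(elem.tag, str):  # comments and processing instructions
            elem.getparent().remove(elem)
            continue
        qname = etree.QName(elem)
        if qname.namespace not in KEEP_NS or qname.localname == "metadata":
            elem.getparent().remove(elem)
            continue
        for attr in list(elem.attrib):
            # Attributes in Clark notation "{namespace}name" that belong to a
            # non-SVG namespace are almost always editor bookkeeping.
            if attr.startswith("{") and etree.QName(attr).namespace not in KEEP_NS:
                del elem.attrib[attr]
    etree.cleanup_namespaces(root)
    return etree.tostring(root, pretty_print=True)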
A Scarcity of Training Data
AI models are data-hungry, but high-quality SVG datasets are scarce. The Common Crawl, the backbone of most LLM training, contains minimal examples of structured design. Most design-related data is locked behind private repositories or inaccessible in rendered image form (e.g. you see a graphic design flyer on Instagram, but never the Illustrator or Figma file it came from).
Without sufficient exposure, models don’t develop strong priors for generating correct SVGs. Instead, they develop overly simplistic priors such as “Align everything to the left margin or the center” as commonly seen in frontier models.
How to address this
Data is an absolute must-have for building a successful designer LLM. Given the billions of parameters in a model, many times that volume of data is needed to teach and constrain it.
There are a few ways to confront this problem:
- Use private datasets (e.g. by licensing from a graphic design agency, or with permission from your customers if you are a design tool)
- Manually generate large scale vector design datasets, an expensive endeavour
- A more esoteric approach we’ve taken is to train a model to reconstruct an SVG design from an image, thereby unlocking the vast trove of design images on the internet as training material. The model can then, under RL, attempt to replicate images as SVGs and get scored on visual similarity (a sketch of such a reward follows below).
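A rough sketch of the kind of visual-similarity reward that makes this possible, assuming cairosvg and Pillow for rasterization; it uses plain pixel distance, where a real setup would likely layer perceptual and structural metrics on top:

import io
import cairosvg
import numpy as np
from PIL import Image

def visual_similarity_reward(generated_svg: str, target_image: Image.Image,
                             size: int = 256) -> float:
    """Render the model's SVG and score it against the target design image.
    Returns 1.0 for a pixel-perfect reconstruction, falling toward 0.0 as the
    rendering diverges (or if the SVG fails to render at all)."""
    try:
        png = cairosvg.svg2png(bytestring=generated_svg.encode("utf-8"),
                               output_width=size, output_height=size)
    except Exception:
        return 0.0  # invalid SVG: no partial credit
    rendered = np.asarray(Image.open(io.BytesIO(png)).convert("RGB"), dtype=float)
    target = np.asarray(target_image.convert("RGB").resize((size, size)), dtype=float)
    # Mean absolute pixel error, normalized to [0, 1].
    return 1.0 - float(np.abs(rendered - target).mean() / 255.0)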
Finally, to make the model good at following instructions, it’s very important to run a Reinforcement Learning from Human Feedback (RLHF) procedure, based on real human preference data, to hone the model’s ability to both do what is asked and to design it beautifully.
To do this:
- First train up the model using SFT to be a reasonable (if unreliable) one-shot graphic designer, taking a prompt to a design
- Then from a large prompt dataset, generate millions of completions (say 8 per prompt) and render them into graphics
- Then ask humans to rank these responses against each other
- Train the model on these rankings (e.g. under GRPO; a sketch of the group-relative scoring step appears after this list)
- And if these rankings can train an effective reward-scoring model, continue the process as closed-loop RL (i.e. no human rankers, just the trained reward model)
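As a minimal sketch of the group-relative scoring at the heart of this (the policy-gradient update itself is omitted), each completion’s reward is normalized against the other completions generated for the same prompt:

import numpy as np

def group_advantages(rewards_per_prompt: list[list[float]]) -> list[np.ndarray]:
    """For each prompt's group of completions (e.g. the 8 rendered designs),
    convert raw rewards -- human rankings or a reward model's scores -- into
    group-normalized advantages used to weight the policy update."""
    advantages = []
    for rewards in rewards_per_prompt:
        r = np.asarray(rewards, dtype=float)
        advantages.append((r - r.mean()) / (r.std() + 1e-6))
    return advantages

# e.g. one prompt with 8 completions ranked by a human (higher = preferred)
print(group_advantages([[8, 5, 7, 1, 3, 2, 6, 4]]))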
Multi-Step Agentic Reasoning Is Necessary
Human designers rarely create graphics in a single step. They look for reference designs, seek out assets (e.g. stock photos, fonts), iterate, adjust, get feedback, make variations and refine. Effective design tools must generate, evaluate, and improve their outputs over multiple passes.
For current models, simply creating a single good-looking design is a big challenge. A design system that really excels needs to be able to find assets, create variations, render them, present them to the user, incorporate feedback, make small adjustments, and try out different styles and tones.
How to address this
Reinforcement learning is necessary to make a great artificial graphic designer. For this to be effective, the model first needs to be trained with supervised fine-tuning to reach adequate competence at:
- One shot generation (e.g. turn a prompt into an SVG design)
- Using tools to search for image assets and illustrations
- Receiving reference designs (e.g. from user or from web searches) and using those to guide the design
- Making specific updates from user instructions (e.g. change a given piece of text, or try a different layout)
Also required is a reward function. Ideally this:
- Rewards following instructions (e.g. content is present)
- Rewards aesthetics
To achieve this, an effective combination (sketched after this list) is:
- An LLM-as-judge (with a prompt containing a checklist of technical factors, e.g. is all the content present? Were the requested changes applied? Are the images the ones asked for?)
- A visual reward model (which turns an image into a score, fine-tuned on a large corpus of user preference data where humans rank multiple versions of the same design by aesthetic appeal)
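One simple way to blend the two signals into a single scalar reward is a weighted sum; the weights below are illustrative, not tuned values:

def combined_reward(checklist_score: float, aesthetic_score: float,
                    rendered_ok: bool, w_instructions: float = 0.6,
                    w_aesthetics: float = 0.4) -> float:
    """Blend the LLM-judge checklist score (did the design follow the brief?)
    with the visual reward model's aesthetic score, both assumed in [0, 1].
    A design that fails to render gets no reward at all."""
    if not rendered_ok:
        return 0.0
    return w_instructions * checklist_score + w_aesthetics * aesthetic_score

# e.g. a design that satisfied 7/8 checklist items and scored 0.55 aesthetically
print(combined_reward(checklist_score=7 / 8, aesthetic_score=0.55, rendered_ok=True))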
Putting all this together is a large and expensive endeavour, but a very worthwhile one, as it develops all of the needed skills together in a single learning environment. Once it is complete, an extremely competent design model is produced.
The path from here
AI has come a long way, but when it comes to graphic design, there’s still a big gap. Large language models (LLMs) struggle with positioning, layout, and the complex relationships between elements in a design. Unlike writing code or generating text, design requires spatial awareness, numerical precision, and an iterative creative process — things that don’t come naturally to today’s AI models.
With the right training and improvements, we can make AI much better at design. Here’s what needs to happen:
- Better Training Data — LLMs don’t see enough high-quality design examples. We need clean, structured SVG datasets and ways to teach models how real designers work.
- Stronger Numerical & Spatial Reasoning — AI struggles with precise positioning and transformations. Teaching it how different coordinate systems work in SVGs is key.
- Iterative & Agentic Design Thinking — Real designers don’t get it right in one shot; they iterate. AI models need reinforcement learning (RLHF) to refine their designs based on user feedback.
- Aesthetic & Instruction Following Rewards — AI should be trained to both follow instructions and produce visually appealing results, using ranking systems and visual reward models.
Whilst there is a huge amount of work ahead of us, the global appetite for synthetic intelligence is sufficient that I feel confident all these pieces will fall into place. The impact of this on designers will probably be similar to the impact of AI coding on engineers today: The human becomes much more of a manager instead of an individual contributor, overseeing a rapid flow of projects instead of working on a single one.
This transition (for all knowledge work disciplines) does give me pause and anxiety. I believe that we can get to a new, more prosperous place, but I worry about the transition in the middle, with its potential for rapid job upheaval and recession. I hope that our lawmakers and business community can thoughtfully navigate the rough seas ahead.