Stability AI Outs Stable Diffusion 3 Medium, Company's Most Advanced Image Generation Model

AleksandarK · Jun 13, 2024

Stability AI, a maker of various generative AI models and the company behind the text-to-image Stable Diffusion models, has released its latest model, Stable Diffusion 3 (SD3) Medium. With two billion dense parameters, SD3 Medium is the company's most advanced text-to-image model to date. It generates highly realistic and detailed images across a wide range of styles and compositions, and it handles intricate prompts that involve spatial reasoning, actions, and diverse artistic directions. The model's architecture, including the 16-channel variational autoencoder (VAE), allows it to overcome common challenges faced by other models, such as accurately rendering realistic human faces and hands.

Additionally, it achieves exceptional text quality, with precise letter formation, kerning, and spacing, thanks to the Diffusion Transformer architecture. Notably, the model is resource-efficient: its low VRAM footprint lets it run smoothly on consumer-grade GPUs without compromising performance. Furthermore, it exhibits impressive fine-tuning abilities, allowing it to absorb and replicate nuanced details from small datasets, which makes it highly customizable for specific use cases. Being an open-weight model, it is available for download on Hugging Face, and it has libraries optimized for both NVIDIA's TensorRT (all modern NVIDIA GPUs) and AMD Radeon/Instinct GPUs.


Firedrops · Jun 13, 2024

Y'all should go see what real outputs (twitter, reddit) are like from this model, it's so bad it's funny.

Chrispy_ · Jun 13, 2024

woohoo, a focus on VRAM footprint.

Nvidia's paltry VRAM allocations are a big hurdle for creators just wanting to test out AI and VRAM optimisation is welcome.

evernessince · Jun 13, 2024

Probably the worst Stable Diffusion release to date. It's god awful at doing anatomy, particularly females. Worse than SD1.5 bad. The licensing changes are killing the fine-tuning community to boot.

It might be good if you only do concept art, graphics, etc., but given that there won't be nearly as many fine-tunes, there won't be as many creative options either.

Chrispy_ said:
woohoo, a focus on VRAM footprint.

Nvidia's paltry VRAM allocations are a big hurdle for creators just wanting to test out AI and VRAM optimisation is welcome.

They mean with regard to the full 8B model. SD3 2B (this release) uses 16.3GB of VRAM without any LoRA or IP-Adapter, which means you are going to want at least a 4080 for best performance. The vast majority of Nvidia users, including those of their upcoming cards if rumors are correct, will be SOL VRAM-wise.

mashie · Jun 13, 2024

Can it draw human hands with say all 5 fingers in the right locations?

evernessince · Jun 13, 2024

mashie said:
Can it draw human hands with say all 5 fingers in the right locations?

Nope, it has big issues with that. Both SD1.5 and SDXL are better in that regard (and both of those are far from perfect at hands as well).

remixedcat · Jun 14, 2024

I use SD thru a mobile app and it won't render anime catgirls wearing leather pants but it allows rendering of very nasty stuff that is too hot for this site (in a bad way)

Used to but this update made it not be able to

JWNoctis · Jun 14, 2024

evernessince said:
Probably the worst Stable Diffusion release to date. It's god awful at doing anatomy, particularly females. Worse than SD1.5 bad. The licensing changes are killing the fine-tuning community to boot.

It might be good if you only do concept art, graphics, etc., but given that there won't be nearly as many fine-tunes, there won't be as many creative options either.

They mean with regard to the full 8B model. SD3 2B (this release) uses 16.3GB of VRAM without any LoRA or IP-Adapter, which means you are going to want at least a 4080 for best performance. The vast majority of Nvidia users, including those of their upcoming cards if rumors are correct, will be SOL VRAM-wise.

To be fair, the model does show composition and spelling capabilities previously unseen in any open-weight model. That rather overwrought safety engineering could theoretically be moderated or entirely reversed, even given the licensing restrictions, though that leads to other, far more complicated and much worse issues in a model capable of advanced composition, beyond the obviously worsened harm to artists' communities.

Licensing restrictions are not unlike choking hazard warnings on a pack of jelly beans: they won't actually stop people from either choking on them by accident or, for that matter, stuffing them up their noses, even though either is probably less legally actionable.

Point being, the Pandora's Box is open as soon as the model weights are released, and it's not a good look.

There are also possible ways to reduce VRAM usage, often enabled by default on various local deployment platforms. The encoders could be unloaded, and possible sub-INT8 quantization could potentially reduce the model weight itself to a reasonable size, if they decide to release the 8B version as well.
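As a rough back-of-the-envelope sketch of the quantization point above: weight storage scales with parameter count times bits per weight. The parameter counts come from the thread (2B for this release, 8B for the larger model); the function name and the 1 GB = 10^9 bytes convention are assumptions for illustration, and the figures cover the diffusion model weights only, not the text encoders, VAE, or activations.

```python
def weight_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Parameter counts as quoted in the thread; everything else is illustrative.
MODELS = {"SD3 2B": 2e9, "SD3 8B": 8e9}

for name, params in MODELS.items():
    for bits in (16, 8, 4):
        size = weight_size_gb(params, bits)
        print(f"{name} at {bits}-bit: {size:.1f} GB of weights")
```

Under these assumptions, the 8B weights at 4-bit would take about as much space as the 2B weights at 16-bit. Actual VRAM use will be higher once the encoders and intermediate activations are counted.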

Minus Infinity · Jun 14, 2024

OK, please explain how, after training on exabytes of data, Stable Diffusion is still producing humans with 4 hands and 12 fingers, or a 6-legged dog. I just watched the latest Stable Diffusion AI-generated video, and nearly all the humans had glaring issues with arms/hands/legs depending on the pose.

Wirko · Jun 14, 2024

Minus Infinity said:
Ok please explain how after training with Exabytes of data, stable diffusion is still producing humans with 4 hands and 12 fingers, or a 6 legged dog.

How could SD possibly know that these aren't average humans and animals?

Minus Infinity said:
Just watched a video of latest Stable diffusion AI generated video and nearly all humans had glaring issues with arms/hands/legs depending on the pose.

Wow. Is the number of fingers at least constant through the videos?

JWNoctis · Jun 14, 2024

Wirko said:
How could SD possibly know that these aren't average humans and animals?

"photography, many-faced many-armed statue of the Hindu pantheon", "majolica, triskelion, winged angelic head", and "sable a six-legged canine regardant, breathing flame gules, rotated 7 degrees counterclockwise from horizontal" would probably be actually quite distinctive in the dataset, if present at all. It's all in the labels.

Now I'm not an expert at this, but IMHO what happened here is likely some or all of these things, and certainly something else too:

First, they filtered their dataset based on labels, and excluded even relatively tame keywords of remotely possible misuse like "lying", among the more obviously questionable ones. This also filtered out the majority of the data associated with difficult but usually safe anatomical details, like hands.

Furthermore, they (EDIT: could have) included a poisoned dataset with labels associated with those concepts, made up of intentionally scrambled images, perhaps to impede finetuning to undo those limits. Though I wonder whether they actually went that far.

Lastly, perhaps they did not train on a large dataset to begin with, due to copyright, quality, and/or budgetary concerns, and overfitted.
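The first guess above, that keyword filtering also removes harmless training data, can be illustrated with a toy sketch. The captions and the blocklist here are made up for illustration (the word "lying" is the example from the post), and nothing in it reflects Stability AI's actual pipeline.

```python
# Hypothetical blocklist; "lying" is the example keyword from the post.
BLOCKLIST = {"lying"}

# Made-up captions standing in for a labelled image dataset.
captions = [
    "a woman lying on the grass, hands behind her head",
    "a cat lying on a sofa",
    "close-up of two hands holding a cup",
    "a man standing on a beach",
]


def passes_filter(caption: str, blocklist: set[str]) -> bool:
    """Return True if no word in the caption appears in the blocklist."""
    words = caption.lower().replace(",", " ").split()
    return not any(word in blocklist for word in words)


kept = [c for c in captions if passes_filter(c, BLOCKLIST)]
hands_before = sum("hands" in c for c in captions)
hands_after = sum("hands" in c for c in kept)

print(f"kept {len(kept)} of {len(captions)} captions")
print(f"captions mentioning hands: {hands_before} before, {hands_after} after")
```

In this toy set, a single blocked word removes a harmless cat picture and half of the captions that mention hands, which is the kind of collateral loss the post speculates about.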

InVasMani · Jun 14, 2024

Minus Infinity said:
OK, please explain how, after training on exabytes of data, Stable Diffusion is still producing humans with 4 hands and 12 fingers, or a 6-legged dog. I just watched the latest Stable Diffusion AI-generated video, and nearly all the humans had glaring issues with arms/hands/legs depending on the pose.

I'd say much of it has to do with it being trash. No, it's probably attributable in no small part to negative prompt modifiers like bad anatomy, disfigured, blurred, butt on face, etc.

If you want to train it right with consistency, I think you'd need to start with hand, fist, grip, and claw, get consistency out of each, and then use the appropriate one for whatever your intended scene is, with left and right variants of each. I'm sure a LoRA on each of those would help. You can still train a collection without a LoRA, but it's more difficult and gives you less tuning, since you can't weight the results to vary them or to blend and fuse things together the way you intend.

In the end, "draw me a perfect X" isn't really going to work at this point, especially when talking about something unconventional that doesn't even have a dataset in the first place, like animals that are human-like or hybridized in various ways. DALL-E 3 does a pretty decent job overall, though, and you can usually work with it to get something close to what you intended. You know, like groundhogs with sledgehammers: they be sledging hard at work in the underground, they got the Mjölnir might, they ain't no sucka whack-a-moles...

Minus Infinity · Jun 15, 2024

Wirko said:
How could SD possibly know that these aren't average humans and animals?

Wow. Is the number of fingers at least constant through the videos?

Not sure I paid close enough attention, but I don't think so. As they shifted their arms, new arms appeared or disappeared and when their hands got close together the fingers fused!
