Thursday, June 13th 2024
Stability AI Outs Stable Diffusion 3 Medium, Company's Most Advanced Image Generation Model
Stability AI, maker of various generative AI models and the company behind the text-to-image Stable Diffusion family, has released its latest model, Stable Diffusion 3 (SD3) Medium. At two billion dense parameters, SD3 Medium is the company's most advanced text-to-image model to date. It generates highly realistic and detailed images across a wide range of styles and compositions, and it handles intricate prompts involving spatial reasoning, actions, and diverse artistic directions. The model's innovative architecture, including a 16-channel variational autoencoder (VAE), lets it overcome common challenges faced by other models, such as accurately rendering realistic human faces and hands.
Additionally, it achieves exceptional text quality, with precise letter formation, kerning, and spacing, thanks to its Diffusion Transformer architecture. Notably, the model is resource-efficient: its low VRAM footprint lets it run smoothly on consumer-grade GPUs without compromising performance. It also fine-tunes well, absorbing and replicating nuanced details from small datasets, which makes it highly customizable for specific use cases. Being an open-weight model, it is available for download on Hugging Face, with libraries optimized for both NVIDIA's TensorRT (all modern NVIDIA GPUs) and AMD Radeon/Instinct GPUs.
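For readers who want to try it, local use through Hugging Face's diffusers library looks roughly like the sketch below; the repository ID and sampler settings shown are typical diffusers usage, not official Stability AI guidance, so treat them as assumptions:

```python
# Minimal sketch of running SD3 Medium locally via diffusers
# (requires a Hugging Face token with access to the gated repo).
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,  # half precision keeps the VRAM footprint low
)
pipe.to("cuda")

image = pipe(
    prompt="a photo of a hand holding a sign that says 'SD3'",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("sd3_medium_sample.png")
```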
Source:
Stability AI
Comments
Nvidia's paltry VRAM allocations are a big hurdle for creators just wanting to test out AI, so VRAM optimisation is welcome.
It might be good if you only do concept art, graphics, etc., but given that there won't be nearly as many fine-tunes, there won't be as many creative options either. They mean that in regard to the full 8B model. SD3 2B (this release) uses 16.3 GB of VRAM without any LoRA or IPAdapter, which means you're going to want at least a 4080 for best performance. The vast majority of Nvidia users, including buyers of their upcoming cards if rumors are correct, will be SOL VRAM-wise.
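If you want to sanity-check that 16.3 GB number on your own card, PyTorch's allocator counters make it easy; a minimal sketch, with the pipeline setup omitted:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a single image generation with your SD3 pipeline here ...
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak VRAM allocated: {peak_gib:.1f} GiB")
```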
It used to, but this update made it unable to.
Licensing restrictions are not unlike choking-hazard warnings on a pack of jelly beans: they won't actually stop people from choking on them by accident, or for that matter stuffing them up their noses, even though either of those is probably less legally actionable. :p
Point being, Pandora's box is open as soon as the model weights are released, and it's not a good look.
There are also ways to reduce VRAM usage, often enabled by default on various local deployment platforms. The text encoders can be unloaded, and sub-INT8 quantization could potentially shrink the model weights themselves to a reasonable size, if they decide to release the 8B version as well.
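As a concrete sketch of the first option, assuming the diffusers StableDiffusion3Pipeline: skipping the large T5 text encoder and offloading idle submodules to system RAM are both documented diffusers features, though the exact savings depend on your workload:

```python
# Sketch of two VRAM-saving options for SD3 Medium in diffusers.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=None,  # skip the memory-hungry T5 encoder entirely
    tokenizer_3=None,
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()  # keep only the active submodule on the GPU
```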
Wow. Is the number of fingers at least constant throughout the videos?
Now, I'm not an expert at this, but IMHO what happened here is likely some or all of these things, and possibly something else too:
First, they filtered their dataset based on labels and excluded even relatively tame keywords with a remote possibility of misuse, like "lying", alongside the more obviously questionable ones (see the toy sketch after this list). That also filtered out the majority of the data associated with difficult but usually safe anatomical details, like hands.
Furthermore, they (EDIT: could have) included a poisoned dataset with labels associated with those concepts, made up of intentionally scrambled images, perhaps to impede fine-tuning that would undo those limits. Though I wonder whether they actually went that far.
Lastly, perhaps they did not train on a large dataset to begin with, due to copyright, quality, and/or budgetary concerns, and overfitted.
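As a toy illustration of the keyword-filtering point above; the blocked list and record format here are entirely made up for the sake of the example:

```python
# Hypothetical caption-keyword filter of the kind speculated about above.
BLOCKED_KEYWORDS = {"lying", "nude"}  # invented filter list

def keep_example(record: dict) -> bool:
    """Drop any image whose caption contains a blocked keyword."""
    words = record["caption"].lower().split()
    return not any(word in BLOCKED_KEYWORDS for word in words)

dataset = [
    {"caption": "a man lying on a beach"},            # harmless pose data
    {"caption": "two hands clasped in a handshake"},
]
filtered = [r for r in dataset if keep_example(r)]
# The first record gets dropped despite being benign, which is exactly
# how pose and hand data could vanish from the training set.
```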
If you want to train it right, with consistency, I think you'd need to start with hand, fist, grip, and claw, get consistent results out of each, and then use the appropriate one for whatever your intended scene is, plus left and right variants of each. I'm sure a LoRA for each of those would help, though you can still train a collection without a LoRA; it's just more difficult, with less control over tuning, since you can't weight the results to blend and fuse things together the way you intend.
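For what it's worth, a hedged sketch of what one such per-concept LoRA adapter could look like with the peft library; the rank, alpha, and target module names below are common diffusion LoRA choices, not anything SD3-specific:

```python
# Sketch of a per-concept LoRA adapter config using peft.
from peft import LoraConfig

hand_lora = LoraConfig(
    r=16,                 # low-rank dimension
    lora_alpha=16,        # scaling factor
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
)
# One adapter per concept (hand, fist, grip, claw) could then be trained
# on its own small curated set and weighted or combined at inference time.
```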
In the end, "draw me a perfect whatever" isn't really going to work at this point, especially for something unconventional that doesn't even have a dataset in the first place, like human-like animals and various kinds of hybrids. DALL-E 3 does a pretty decent job overall, though, and you can usually work with it toward something close to what you intended. You know, like groundhogs with sledgehammers, sledging hard at work in the underground; they've got the Mjölnir might, they ain't no sucka whack-a-moles...