This 2023 research article from OpenAI on how they improved DALL-E is packed with incredibly interesting information.
In summary, the researchers trained DALL-E 3 to be better at image generation using a huge set of images, each paired with a caption drawn from a blend of highly descriptive AI-generated captions and real-world captions. One billion images were used, each with a matching caption.
But here’s the kicker: they didn’t just focus on quantity—they also carefully managed the type of data being used.
Through A/B testing, they determined that the blend producing the best image generation output was 95% synthetic captions and 5% real-world captions.
Why? Because this blend helps mitigate overfitting and leads to more accurate image outputs. (Overfitting can happen when the model gets too used to the long, polished style of synthetic captions and then performs worse on human-written prompts, which are often short and irregular.)
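The paper doesn't spell out the exact data-pipeline mechanics, but here's a minimal Python sketch of one way such a 95/5 blend could be wired up. The record fields (`image_path`, `synthetic_caption`, `ground_truth_caption`) and the per-example sampling are my own illustrative assumptions, not details from OpenAI:

```python
import random

SYNTHETIC_RATIO = 0.95  # blend ratio reported in the DALL-E 3 paper

def sample_caption(record: dict, synthetic_ratio: float = SYNTHETIC_RATIO) -> str:
    """Pick a caption for one training example.

    With probability `synthetic_ratio`, use the descriptive AI-generated
    caption; otherwise fall back to the original human-written one so the
    model still sees irregular, real-world phrasing.
    """
    if random.random() < synthetic_ratio:
        return record["synthetic_caption"]
    return record["ground_truth_caption"]

def training_pairs(records: list[dict], synthetic_ratio: float = SYNTHETIC_RATIO):
    """Yield (image_path, caption) pairs with the blended caption mix."""
    for record in records:
        yield record["image_path"], sample_caption(record, synthetic_ratio)

# Example usage with toy data (captions are made up for illustration):
records = [
    {
        "image_path": "img_001.jpg",
        "synthetic_caption": "A golden retriever lying on a sunlit wooden porch, "
                             "head resting on its paws, garden blurred behind.",
        "ground_truth_caption": "my dog chilling",
    },
]
for image_path, caption in training_pairs(records):
    print(image_path, caption)
```

One nice property of sampling per example rather than splitting the dataset up front: across training epochs the model eventually sees both caption styles for most images, so even a 5% rate keeps it exposed to messy human phrasing.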
Takeaway for product managers: If you're building any AI system, pay special attention to your training data. A large dataset matters, but data quality and diversity are what really make the difference. Balancing real-world and synthetic data can help you avoid common pitfalls like bias and overfitting.
What are some strategies you’ve seen to improve data diversity in AI?