If you don’t recognize the term “synthetic data,” don’t be too worried. It hasn’t entered everyday language just yet. A growing number of companies are producing synthetic data, though.
What is it, and what is it for? Synthetic data is data that is generated rather than captured. And what does that mean? It means the old way of acquiring data, gathering and labeling it from the real world, now has a finite lifespan, while synthetic data has no end in sight.
Where synthetic data is deployed today, it is most often used in AI systems, although even there it is nowhere near common. Here is an example that helps explain how and why it is used to teach AI systems:
Think of robots loading a truck. The developers of the system’s software need a way to account for every size (in three dimensions) and type of box, not just to fill the truck, but to build each “wall” in the most efficient way possible (think of the truckload as a succession of walls that start at the back of the trailer). Photos of each possible wall would train the system.
What seems like a simple process (stacking boxes; most of us probably had a high school job that used this skill) morphs and grows like kudzu when you realize there is an effectively infinite number of box shapes and sizes, even within this important criterion: each box has to fit in the truck. So if you want to teach the robots how to stack efficiently and quickly, you would need a warehouse full of empty boxes of all kinds, and a camera to record the exciting results as you swap your 90,000th box into a single wall. Not fun, and not cost-effective.
Synthetic data lets you skip most of this: you could image relatively few boxes and have the system vary their dimensions, either in an image generator or by automatically shrinking or enlarging the photos within set criteria, to give you a full complement of boxes. You start to realize just how much it costs to acquire real data, and even in this most prosaic example, how much money you would save by using synthetic data: thousands, maybe tens of thousands of dollars on the front end.
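To make that concrete, here is a minimal sketch of the shrink-or-enlarge approach in Python. It assumes the Pillow imaging library, and the placeholder image, scale limits, and function names are illustrative choices rather than anything from a real warehouse system:

```python
import random
from PIL import Image

# Stand-in for one of the relatively few real box photos you actually captured.
base_photo = Image.new("RGB", (640, 480), color=(160, 120, 80))

def synthetic_variants(photo, n=100, min_scale=0.5, max_scale=1.5):
    """Shrink or enlarge a photo by random factors within set criteria,
    turning one captured image into many 'different' box images."""
    variants = []
    for _ in range(n):
        scale = random.uniform(min_scale, max_scale)
        new_size = (int(photo.width * scale), int(photo.height * scale))
        variants.append(photo.resize(new_size))
    return variants

augmented = synthetic_variants(base_photo)
print(len(augmented), "synthetic box images generated from a single photo")
```

The point is that one photograph plus a few lines of code stands in for thousands of trips around the warehouse with a camera.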
As I noted earlier, synthetic data is not exactly running rampant in the market right now, because AI itself is not universally deployed. Anyone who doubts its growth, though, will be surprised within a couple of years. Research stalwart Gartner says that by 2030, more than half of all data used to train AI systems will be synthetic, not real. I can see why: it’s much cheaper and much faster on the front end.
While synthetic data doesn’t do much to save money after the initial generation, you could envision a time when a synthetic data system’s job is on-the-fly simulation. It would work from criteria instead of photos, randomly generating a box every millisecond or some other tiny interval. If that interval shrank enough, it could do this in video form.
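A rough sketch of what that criteria-driven stream might look like, again in Python; the trailer dimensions and side limits here are made-up assumptions, not specifications from any real system:

```python
import itertools
import random

# Hypothetical interior dimensions of a trailer cross-section, in inches.
TRAILER_WIDTH, TRAILER_HEIGHT = 98, 108

def box_stream(min_side=6, max_side=48):
    """Endlessly yield box dimensions (width, height, depth) that satisfy
    the criteria, one per iteration, with no photos involved."""
    while True:
        width = random.randint(min_side, min(max_side, TRAILER_WIDTH))
        height = random.randint(min_side, min(max_side, TRAILER_HEIGHT))
        depth = random.randint(min_side, max_side)
        yield (width, height, depth)

# Pull a thousand boxes from the stream; at this speed, generating one
# every millisecond is not a demanding target.
sample = list(itertools.islice(box_stream(), 1000))
print(sample[:3])
```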
You will see this term used more often in the next few years. Until then, you will see more coverage of it here.