We have a data problem
From gaps in our products to content for models, missing data remains a concern that should ground our expectations and caution us against over-reliance.
There’s more data available now than ever before. We create more data daily thanks to easy access to recording devices and software of all varieties. But even with this awesome surge of information, we still don’t have enough data available to us. Not enough data to quickly train AI models and not enough data for our own content browsing needs.
Through the creation of the internet, we’ve become more connected than ever before. The ability for us to search for answers to our questions has helped us to sort and organize this information to be useful. Now many questions can be easily answered but it’s becoming more difficult to know if the answers we’re looking for actually exist.
Think about content services like Netflix. Do they make it easy to see that they don’t have a series of movies available? No, they show content they do have that might fit your tastes and only when searching for a specific movie will they say that they don’t have that title. Netflix then tries to immediately distract you from your displeasure that they don’t have the movie by showing a list of other eye-catching movies.
Netflix gives us the impression that they have every movie ever made but they can’t. They don’t have the rights to some movies across different countries. Some movies aren’t watched enough to be worth keeping available for ease of viewing. There are many reasons Netflix doesn’t have every movie although as a user you might forget that until you look for enough movies that aren’t included in their lineup.
Many digital products can easily present users with an abundance of content. Amazon has created one of the largest and most active digital stores. Navigating between categories, you can basically teleport around the real-world equivalent of a store. By jumping directly to a product, customers feel powerful and encouraged to purchase the product that’s right for them.
Still, Amazon doesn’t have every product. But it’s hard to remember that! It feels like they might have everything.
Recently I viewed an excellent visualization of Wikipedia’s content (shown above). Wikipedia is another site that could believably convince us that it has almost everything in its database. But even so, its content has some distinct areas of focus. With fans of football and K-pop focusing their energies on Wikipedia at a larger scale than other interest groups, the visualization highlights the possibility that many niche topics are missing Wikipedia entries.
Important figures in many industries may be missing pages entirely. Key events that have shaped thousands of people might not have been notable enough to get their own Wikipedia page. These impactful people and events might never get entries without clear and credible sourcing.
If you could look at all of Amazon’s products from high above and sort their categories you might start to notice empty shelves or entire sections of the store that are vacant. Somewhat troubling in a traditional store, it’s all but invisible through the customer interface. How many other digital products are like this?
Y Combinator, a leading startup incubator in Silicon Valley, mentioned that discovery apps helping people find things are one of the least successful categories of startups, largely because people underestimate how difficult it is to collect the content, present it to people, and have consumers adopt it without large-scale marketing to bolster its adoption. The effort is larger than it looks and is a “tarpit” that will trap you if you get in close enough.
Furthermore, they go on to mention that there is a finite number of things to discover. Whether that’s videos or products, eventually you run out of new things to add to the site. That might seem to imply that you’ll eventually have everything on your site. The actual implication is that finding new things to add will eventually take too much effort to guarantee complete coverage. Besides, as we’ve discussed, customers might not even be able to tell if you have everything.
AI products are no different and are perhaps even more difficult to map or understand. We have tests and leaderboards to understand the capabilities of AI models today, but I’ve found those tests don’t map clearly to real-world scenarios. As companies push to achieve better and more generally useful AI, do expert use cases drop off and do any but a few people notice or care? After all, so many people bought Alexa devices even though those devices had only a narrow set of useful features.
Making things more complex, AI is now helping fill some gaps when you go looking. Using AI search providers like Arc Search, Perplexity, or Kagi, you can generate a summary of any search you’d normally perform on a search engine like Google. These tools take traditional search results and look to answer your query with the power of LLMs. It just feels so much better than the traditional browsing experience that I can’t help but use these features.
However, I worry about AI's difficulty saying no. Ask it a question that doesn’t make sense or give it a reason you believe something and it will only weakly, if at all, push back against you. More worrying, if you’re looking for an answer from an AI it may just make one up if an answer doesn’t actually exist.
So, we have a data problem. We have the issue that many sites and sources we use every day can’t clearly indicate gaps in their content space. New companies are unlikely to compete in the same space due to the difficulty and constraints of aggregation. And AI summarizers could look to fill these gaps with their own generations to keep us browsing and buying.
I think this is important to acknowledge and think about. We’re entering a weird bubble-like period of AI investment and being cautious is a good idea. Evangelizing too early, as seen with the Rabbit R1 and AI Pin, can result in customers bearing the costs of the optimism around the possibilities.
But we still need to invest in AI. There is the potential to do incredible, groundbreaking things. We just need a healthy dose of caution and awareness as to what to look out for. Data gaps and experience gaps are a major concern, one that I think will sneak up on you if you’re not looking out for them.
Digital products are, almost by definition, marketed based on a kind of magical thinking that the software could provide for our every need. We’ll see what this next wave of AI services and features brings, but I’m not optimistic this gap is going away anytime soon.
Yes. Missing data is a huge problem with everything from genealogies and opinion polling to astronomical observations and medical treatments. We can only know what's already known and can only see what our tools allow us to see; most of reality lies beyond our reach.