LUCKY YOU: You found the long article where I researched and rambled on about my personal connection to this topic. But you may be looking for the summarized version, where I debunk peak data theory.
Summary
Having achieved “peak data” is a myth. This article chronicles my journey through the evolving landscape of search, from early SEO tactics to the current AI-driven revolution. I delve into the technical intricacies of web crawling, data processing, and the challenges posed by big tech monopolies, while also critiquing the “peak data” assertions of industry leaders. Ultimately, I outline my vision for a future where open-source tools and innovative workflows empower individuals to navigate and leverage the vast, untapped potential of the internet.
Key Points
- Research suggests “peak data theory” (the idea we’ve exhausted data for AI training) is likely overstated, given vast untapped data in the invisible web and continuous new data generation.
- It seems likely that with proper curation and web scraping, high-quality data for AI training remains abundant, challenging claims of data exhaustion.
- The evidence leans toward synthetic data as a supplement, not a replacement, offering privacy benefits but potentially lacking real-world complexity.
Introduction
The “peak data theory,” suggesting we’ve hit the limit of useful data for AI training, has been debated by industry leaders like Elon Musk and Ilya Sutskever. However, this idea seems flawed when considering the vast amount of untapped data in the invisible web, the continuous generation of new data, and the potential of synthetic data. This article explains why “peak data” is likely a myth.
The Invisible Web: A Hidden Resource
The invisible web, or deep web, includes data not indexed by standard search engines, such as dynamic content and databases. Estimates suggest it’s several times larger than the visible web, offering a treasure trove for AI training. Web scraping, especially with tools that execute JavaScript, can access this data, providing a rich source for models. For example, Ryte Wiki notes that the invisible web is vast and often more focused, while Wikipedia highlights its unindexed nature, including paywalled services and private databases.
Continuous Data Generation and Curation
New data is generated at an unprecedented rate, with Statista predicting 394 zettabytes by 2028, up from 2 zettabytes in 2010. While much may be low quality, data curation—collecting, cleaning, and organizing data—can extract high-quality datasets. Microsoft Learn emphasizes that curated data enhances AI model outcomes, ensuring accessibility and compliance, challenging the notion of data exhaustion.
Synthetic Data: A Complementary Approach
Synthetic data, generated to mimic real-world patterns, addresses privacy and scarcity issues. It’s predicted to make up 60% of AI training data by 2024, per Keymakr. However, MIT News suggests it performs better in specific scenarios, like fewer background objects, but may lack real-world complexity. This makes it a supplement, not a replacement, for real data.
Peak Data Theory Debunked
Detailed Analysis
Survey Note: Detailed Analysis on Peak Data Theory and AI Training Data
This section provides a comprehensive exploration of why the “peak data theory” is likely overstated, integrating research and detailed insights into the availability and utilization of data for AI training. The analysis begins with an overview of the theory, followed by an examination of the invisible web, continuous data generation, data curation, and synthetic data, ensuring a thorough understanding of the topic.
Overview of Peak Data Theory
The “peak data theory” suggests that we’ve reached the limit of useful human-generated data for training AI models, a claim notably made by Elon Musk and Ilya Sutskever in recent discussions. For instance, a Glass Almanac article from March 10, 2025, highlights Musk’s assertion that we’ve hit “peak data,” implying a plateau in real-world data availability. Similarly, Sutskever’s comments at the NeurIPS conference on December 13, 2024, as noted in ALPHA AVENUE, suggest we’ve achieved “peak data,” with no more to come. However, these claims are controversial, as they overlook several factors, which we’ll explore below.
The Invisible Web: Untapped Data Reservoir
The invisible web, also known as the deep web, encompasses data not indexed by standard search engines, such as dynamic content, databases, and pages requiring authentication. Ryte Wiki explains that this includes content missed by crawlers, often due to technical barriers like JavaScript execution, which most crawlers don’t handle. Wikipedia further details that it includes web mail, online banking, and paywalled services, accessible via direct URLs or IPs but not through general searches. Estimates from The Guardian suggest the invisible web is 400-550 times larger than the visible web, containing over half a trillion documents, offering a vast, untapped resource for AI training. This challenges the “peak data” notion, as advanced web scraping techniques, like those using headless browsers, can access this data, as discussed in research.aimultiple.com, published January 22, 2025, which highlights web scraping for AI training data collection.
Continuous Data Generation: Exponential Growth
New data generation is occurring at an exponential rate, with Statista projecting 394 zettabytes by 2028, up from 2 zettabytes in 2010. Exploding Topics, updated February 22, 2025, estimates 402.74 million terabytes daily, equating to 147 zettabytes annually. While much of this data may be low quality, rivery.io, published December 11, 2024, notes that 90% of the world’s data was generated in the last two years, suggesting a continuous stream of potential AI training material. The challenge lies in curation, not scarcity, as Microsoft Learn underscores, with curated data improving model outcomes and scalability, published November 18, 2024.
Data Curation: Enhancing Quality for AI Training
Data curation involves collecting, cleaning, and organizing data to ensure it’s suitable for AI training. Atlan, published October 16, 2023, defines it as discovering, annotating, and maintaining datasets, crucial for machine learning success. SuperAnnotate, from October 27, 2022, highlights its role in addressing impurities and bias, ensuring data is machine-readable and diverse. Tools like NVIDIA NeMo Data Curator and Cleanlab, as mentioned in cleanlab.ai, published recently, automate curation, addressing quality issues like label errors and outliers, suggesting that with proper curation, the data pool remains far from peaked.
Synthetic Data: A Supplement, Not a Replacement
Synthetic data, generated to mimic real-world patterns, addresses privacy, scarcity, and bias issues. NVIDIA notes its use in training AI models when real data is limited, offering cost-effectiveness and scalability. Mostly AI, updated May 31, 2024, emphasizes its privacy-preserving nature, with tools like TabularARGN ensuring high fidelity. However, MIT News, from November 2, 2022, suggests synthetic data performs better in specific scenarios, like fewer background objects, but may lack real-world complexity, as research.aimultiple.com, published October 30, 2024, compares, noting structural similarities but potential gaps in nuance. This positions synthetic data as a complement, not a replacement, challenging the “peak data” narrative by offering a scalable alternative.
Comparative Analysis: Real-World vs. Synthetic Data
A comparison reveals both types have merits. StageZero, from January 12, 2023, notes real-world data’s authenticity but highlights limitations like bias and privacy concerns, while synthetic data offers scalability but may miss real-world nuances. Forbes, published November 20, 2023, lists uses like financial modeling with synthetic data, protecting privacy, but warns of potential model errors. Keymakr, from August 23, 2024, predicts synthetic data will make up 60% of training data by 2024, suggesting a hybrid approach is optimal, countering “peak data” by expanding data availability.
Conclusion and Implications
The “peak data theory” is likely overstated, given the invisible web’s vastness, continuous data generation, and synthetic data’s potential. These factors suggest that with innovative methods like web scraping and curation, as well as synthetic data generation, AI training data remains abundant. This challenges industry claims, emphasizing the need for evolving data access and utilization strategies, ensuring AI’s future growth is not constrained by perceived data limits.
Key Citations
- Data growth worldwide 2010-2028 Statista
- Amount of Data Created Daily 2024 Exploding Topics
- Invisible Web Ryte Wiki
- Deep web Wikipedia
- Understanding data curation and management for AI projects Microsoft Learn
- Data Curation in Machine Learning SuperAnnotate
- Synthetic Data Generation with Highest Accuracy for Free Mostly AI
- Synthetic Data Is About To Transform Artificial Intelligence Forbes
- Synthetic data can offer real performance improvements in machine learning MIT News
The No More Human Data Fallacy: My Personal Story
So much for the research. Now for the personal stuff. You can tell AI research output when you read it, can’t you? Well, the rest is right from my keyboard to the Web – the candid, genuine sort of stuff that’s becoming rarer and rarer.
I am at my current place of employment to do the best work of my life. That’s why I am there. The huge challenge of enterprise SEO crawling exceeds my home-grown DIY SEO ethos, and that’s where the big challenges (and the money) are. Enterprises have to adapt to the age of AI… and I’m there! They execute JavaScript over millions of pages. I don’t, and probably never will, haha!
However, my scrappy approach is where outlier innovation resides. With my own free AI SEO tool, I do shallow crawls with an LLM riding shotgun, commenting along the way like Statler and Waldorf. Or perhaps like the Mystery Science Theater 3000 peanut gallery. Now go away before I taunt your website again! But that’s the present. Let’s talk about the past and why peak data theory is bullshit.
The Early Days of Search
1994 was a big year in search. After I wrapped up working for Commodore Computers (they went under) but before I moved to New York City, we saw Yahoo, Lycos, Excite and Infoseek all come onto the scene. AltaVista arrived in ’95. And that’s when I consider myself becoming an SEO in spirit, if not in name. Now it’s 2025, so I’ve been SEO-aware for 30 years, and professionally calling myself an SEO for about 25 of those years, since the move to NYC. I would have probably rather been a scientist in hindsight, but SEO becoming Data Science is a nice consolation prize.
Early Technical Approaches
As I got started in the burbs of Philly, I felt I was effective as an SEO because I had direct control of the websites. I was doing static site generation long before tools like Jekyll came onto the scene. I kept the website content in XML files, and later in databases from which I generated the XML files. The files would get sliced & diced and transformed into websites in multiple transformation passes. I used a novel and kooky tech called XSLT (Extensible Stylesheet Language Transformations), which was supposed to be my path to future-proofing.
They weren’t. But my general thinking was correct.
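For the curious, the kind of pass I’m talking about can be sketched in a few lines of Python using the lxml library. This is just an illustration, not my original pipeline; the filenames are placeholders.

```python
from lxml import etree

# One transformation pass: XML content in, a generated HTML page out.
# "article.xml" and "site-template.xsl" are placeholder filenames.
content = etree.parse("article.xml")
stylesheet = etree.XSLT(etree.parse("site-template.xsl"))
html_page = stylesheet(content)

with open("article.html", "wb") as f:
    f.write(etree.tostring(html_page, pretty_print=True))
```

Chain a few of these passes together and you have a static site generator, years before the term existed.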
While these XSL stylesheet transformations were a really awesome secret weapon in SEO, especially during the era of the long-tail when such precise control of link construction could win you visits, it didn’t last. JavaScript matured and JSON replaced XML as the interoperable language of choice. NodeJS wasn’t on the scene yet, so the concept of the full stack wasn’t a thing.
I actually investigated JavaScript on the server before Node arrived, and there was nothing viable at the time – although the history of server-side JS goes back longer than people think, to Netscape Enterprise Server and Microsoft ASP. But at the time, there were only Helma and Rhino. This is when I opened my mind to other programming languages, tried Ruby to see what the big deal was about, but then got bitten by the Python bug.
The Move to New York
By this time, I was no longer making the websites directly myself. I had moved to New York City to become a vice president at the company that had once launched Amazon.com. It was the public relations company for Amazon back during the dotcom boom era of the late 90s/early 2000s, before the bust. I went to work for them during their second coming, after the bust was over and Web 2.0 was on the rise. I moved to NYC circa 2004/2005. Within a year, 2006, I had created HitTail, a Web 2.0 era app that took the best of the long-tail slice & dice learnings and spun it into a real-time writing suggestion tool, mostly for bloggers but the technique would work with any SEO-friendly CMS.
The Evolution of Search Algorithms
Even to this day, the HitTail algorithm still holds half the formula. If something is nearly working for you, you need only push it over the edge. It’s like those penny arcade games in skee-ball joints and Chuck E Cheese where the quarters (or tokens) are hanging over the edge and you drop a coin in to see how many you can get to topple over the edge. That’s HitTail. Or was.
The outlier data is the part you often want to overlook, clean, filter or normalize away. But outliers are trying to tell you something. Data at the edge that’s normally missed is a huge reason why the peak data theory is wrong.
The Concept of Striking Distance
In SEO circles, this is often known as striking distance or low-hanging fruit. If the data shows that a page of yours is in the search results — positioning, but not positioning well (page 2 or lower) — then the data is trying to tell you something. It’s like having a new content-topic suggestion-box built into your website. Just follow the data! HitTail was a serendipity to deliberateness engine. Coincidence to higher click-through. The systematic identification of outliers, the small tweak it takes to mainstream it, and lather, rinse, repeat.
The Power of Incremental Improvement
In other words, if you were not deliberately targeting that term before, and now you see that you have a propensity to position for it, a few extra measures could improve its position. And that was a reliable, repeatable process. It could (and did for many people) create a snowball effect of content your audience wanted to hear about, leading to more traffic, leading to more serendipitous hits, leading to more new topic writing suggestions, leading to more content, leading to more hits.
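As a rough illustration of that repeatable process, here is a minimal sketch assuming a hypothetical CSV export of query data with `query`, `page`, `position`, and `clicks` columns. The column names and thresholds are my own stand-ins, not HitTail’s actual formula.

```python
import pandas as pd

# Hypothetical export of search query data (e.g., a rank report).
df = pd.read_csv("query_report.csv")  # columns: query, page, position, clicks

# "Striking distance": already positioning, but on page 2 or 3.
striking = df[(df["position"] > 10) & (df["position"] <= 30)]

# Surface the outliers worth a small, deliberate push.
suggestions = striking.sort_values("clicks", ascending=False).head(20)
print(suggestions[["query", "page", "position"]])
```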
The End of an Era
A positive feedback loop and virtuous cycle! That is, until Google more or less went to war against the long-tail, dialing up brand importance, giving brands’ sites safe harbor in SERPs, while dialing down “low-quality” pages, effectively filtering out the noise like the old squelch dial on CB radios. Anyone? The sound of crickets…
The Power of Verbatim Search
But search engines were a lot easier to affect back then, in a verbatim fashion. The concept of “verbatim” matching, that is, preferring results whose pages contain exact matches of what you searched for, is so powerful that Google keeps it alive in search to this day, though you have to click the Web link / Tools, and then switch All results to Verbatim. It’s buried, but it’s there. Think what that means. After all these years, there is still enough reason to keep it there, and the underlying systems still work in such a way that it can be there. There’s a lot to read into that. I sometimes think if not for SEO spammers, Web/Verbatim would still be the preferred way of searching.
The Evolution of Google Search
So Universal search hit, which was the layering-in of custom (non-traditional) search results from other sources like maps, images and local. Then Rich Snippets and the Knowledge Graph hit, and then the changes came fast and furious. The machine learning and AI stuff started hitting with RankBrain around 2015, and then BERT around 2019 (a couple of years after the fateful “Attention Is All You Need” Transformer paper), yadda yadda… and the rest is sort of a dramatic tension between keeping things just-so, so that AdWords revenue was not dramatically cut into, while keeping pace with the times. They were effectively cornered into the classic innovator’s dilemma, because most of their revenue came from advertising tied to a particular user experience in Google-brand dominated search.
The ChatGPT Revolution
And then in November 2022 ChatGPT hit. Microsoft saw Google trapped in the corner by its innovator’s-dilemma AdWords reliance and decided to go on the offensive. For Microsoft, Bing ad revenue is just one of many revenue streams, and they’re not risking much by madly experimenting there, and so they did. They teamed up with OpenAI to use ChatGPT to power Bing in an attempt to reframe the public’s expectation of search from the Googly Top-10 blue links experience to more of a chat experience with embedded citations, still capable of passing traffic through to destination sites.
Microsoft’s Challenge to Google
The New Bing powered by ChatGPT launched around February 2023. You can find a full timeline of Microsoft’s determination to beat up on Google here, where I documented it during my brief time at Moz while that was happening. Microsoft continues to attempt to reframe the search value proposition, with Perplexity and OpenAI itself jumping on that bandwagon. All that pressure, plus an antitrust lawsuit that Google lost, is still not enough to dislodge Google more than a couple of percentage points from search market dominance, leaving us to ask whether the Google castle is built on clouds or on a more solid foundation than Microsoft and the other competitors chomping at the bit for a piece of Google’s search action might have hoped.
The Antitrust Battle
And that brings us to today. The Google antitrust suit shone a light on just how much Google pays Apple to remain the default search engine of Safari Mobile (~$20 billion/year), and on just how effective Google has been in making Chrome the default browser of choice. There are other questions here about how much that browser control might be, and has been, leveraged to improve Google’s search services in a way sufficiently anti-competitive to make a forced break-up the antitrust remedy.
You have to read a little bit into that, but the message here is that in order to address Google’s search dominance, divestment of their control of the browser is the remedy… hmmm. For lack of other, better remedies that would get more to the point, or because browser control actually yields unfair search dominance? Is it because of control over what happens when a non-URL is typed into the address bar? Maybe. That would line up nicely with why the Apple deal is so in the spotlight on this issue, because what Safari Mobile does in that condition is the Google money-maker. If Google pays $20 billion/year for that right, how much must they be making from control of that feature alone? This whole thing could be framed as the battle over the browser address bar.
The Future of SEO
Alright, back to the original point. I am where I am to do the best work of my life, against this dynamic industry backdrop where the dominant juggernaut is going to do everything it can to hang on to its mostly brand-generated and behind-closed-doors-deal-generated market dominance, while a super-competitor who controls the most dominant desktop OS in the world is gunning for them, along with every little pisher AI-powered search competitor like Perplexity.
My schtick is future-proofing and resisting obsolescence as my very field is under fire in the first meaningful way in a quarter-century. And frankly, its time has come. But not in an SEO-is-dead sort of way quite so much as let’s get all this right so that you don’t need to be the JavaScript-rendering massive infrastructure of Google to run clever new AI search startups.
The Network as Computer
Sun Microsystems once said the network is the computer… and then incurred Microsoft’s wrath and became nothing more than another Oracle acquisition themselves. Whoops. Specifically, that comment and the reality it represented made Microsoft incorporate Internet Explorer into the OS as a built-in feature, incurring the antitrust wrath of the government themselves and being forced to break IE out as a separate feature. But the damage was done and Netscape, the browser innovator at the time, was more or less killed. And the idea of the network as the computer was embraced, extended and extinguished. This is a pattern. What the once-dominant IBM is today is also a result of this strategy, and that’s what Google is facing.
The Internet as Database
Today, the Internet is the database. You don’t need no Google if an intelligent LLM-guided small-world-theory crawl can just go out to the Internet and find what it’s looking for. URLs are unique keys (the “U” officially stands for uniform, but full URLs with querystrings are effectively unique). The data they unlock are values. This makes the Internet itself the ultimate NoSQL database. NoSQL was a keyword of yore referring to a preference for a single super-optimized column of key/value-pairs as a practical alternative to the row & column data that has dominated the industry for so long, generally known as SQL databases and relational database management systems (RDBMS). Key-value NoSQL databases are sometimes posed as some new thing that became popular with MongoDB, Redis, Cassandra and CouchDB. But the truth is they’ve been built into Unix for a long time in the form of dbm (database manager), written in 1979 by Ken Thompson, the same guy who co-created Unix itself.
Perplexity: The built-in Unix key-value database is called dbm (database manager)[1][2]. It was originally written by Ken Thompson in 1979 and provides a simple string-based key-value storage system[2]. The dbm library is included in most Unix-like operating systems and offers a generic dictionary-like interface for storing and retrieving data[1].
Key features of dbm include:
- Simple key-value storage using string keys to access string values[1].
- Ability to open existing databases in read-only or read-write mode[1].
- Automatic conversion of keys to byte strings when added to or retrieved from the database[1].
- Compatibility with various programming languages, including Python[1][2].
While dbm is considered old and rarely mentioned in modern NoSQL discussions, it is still used by many pieces of software due to its simplicity and ubiquity in Unix systems[2].
Citations:
[1] https://pymotw.com/3/dbm/index.html
[2] https://en.wikipedia.org/wiki/Key-value_store
[3] https://stackoverflow.com/questions/1690605/reliable-and-efficient-key-value-database-for-linux
[4] https://unix.stackexchange.com/questions/21943/standard-key-value-datastore-for-unix
[5] https://www.influxdata.com/key-value-database/
[6] https://etcd.io/docs/v3.4/learning/why/
[7] https://news.ycombinator.com/item?id=40164237
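Since the whole point is that key/value stores are nothing new, here is a minimal sketch of Python’s standard-library `dbm` module doing exactly what that answer describes. The filename and the cached value are placeholders.

```python
import dbm

# Open (or create) a dbm database file; keys and values are stored as bytes.
with dbm.open("url_cache", "c") as db:
    # URL in, page out: the same key/value shape as the Web itself.
    db["https://example.com/"] = "<html>...cached page...</html>"
    print(db["https://example.com/"].decode())
```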
Berkeley DB: The Industrial dbm
An industrial, grown-up version of this has been around for a while too, from some time before the concept of NoSQL burst onto the scene. But it was swallowed up by Oracle, and few realized the significance of systems like this until they were combined with the concept of distributing them and used as an alternative to SQL for all sorts of caching purposes that lend themselves well to this. See, URL in, webpage out? Key/value-pairs! The whole darn Web is this kind of database. What was that grown-up version of dbm called?
Perplexity: The grown-up industrial version of the Unix key-value database was Berkeley DB (BDB).
Berkeley DB evolved from the original Unix dbm (database manager) implementation. It was first developed at the University of California, Berkeley in 1991 as part of the transition from BSD 4.3 to 4.4[4]. The initial goal was to replace the existing in-memory and on-disk hashing systems with a unified system that could operate both in memory and on disk[1].
Berkeley DB’s fate:
- Commercial development: In 1996, Netscape contracted with Margo Seltzer and Keith Bostic to develop a production-quality version of the software, leading to Berkeley DB 2.0 in 1997[2].
- Company formation: Sleepycat Software was created in 1996 to continue development and support of Berkeley DB[4].
- Acquisition: Oracle Corporation acquired Sleepycat Software in February 2006[4].
- Continued development: Oracle has continued to develop and support Berkeley DB, with version 5.0 released in 2010 adding SQL support[2].
- Open-source status: Berkeley DB remains an open-source, embedded database software library for key-value data, despite its commercial ownership[9].
- Influence: Berkeley DB has inspired other embedded key-value databases like Facebook’s RocksDB and Google’s LevelDB[7].
Today, Berkeley DB is still used in various applications, offering features such as ACID compliance, high performance, and the ability to handle very large key-value stores (up to terabytes in size)[1].
Citations:
[1] https://dbdb.io/db/berkeley-db
[2] https://aosabook.org/en/v1/bdb.html
[3] https://www.oracle.com/technical-resources/articles/database/oracle-berkeley-db-storage-layer.html
[4] https://www.orafaq.com/wiki/Berkeley_DB
[5] https://www.markhansen.co.nz/berkeley-db/
[6] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=a47c4959b071c9292090c81788c3f6e2eecb8827
[7] https://www.influxdata.com/key-value-database/
[8] https://apps.state.or.us/tech/berkeleyDB/ref/intro/dbisnot.html
[9] https://en.wikipedia.org/wiki/Berkeley_DB_XML
The Web as Database
Nice, thanks Perplexity. Yeah, so if we start thinking about the Web itself as the database, then that Google competitive moat, aka a copy of the Internet made by its crawler and distilled down for fast answers with its index, starts to evaporate. Where’s the moat if anybody can write a search engine that can have even more up-to-date information regarding product availability or new knowledge on a topic than even Google? This is what small-world-theory crawls offer: no page is more than six degrees of separation (six links away, or whatever) from whatever well-chosen start page you begin the crawl from.
The Return of Web Directories
So you still need some sort of search. Or maybe an old-school directory site like Yahoo or The Open Directory Project (DMOZ)? You just need to seed the small-world crawl, and follow the links from there.
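Here is a minimal sketch of what I mean by seeding a small-world crawl, using nothing but the Python standard library. The seed URL, hop limit and page cap are arbitrary placeholders.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href values from <a> tags: the edges of the link graph."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def small_world_crawl(seed_url, max_hops=6, max_pages=100):
    """Breadth-first crawl from a seed page, never more than max_hops links deep."""
    seen = {seed_url}
    queue = deque([(seed_url, 0)])
    while queue and len(seen) <= max_pages:
        url, hops = queue.popleft()
        if hops >= max_hops:
            continue
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # unreachable pages are just dead edges
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, hops + 1))
        yield url, hops

# Seed from a directory-style homepage and follow the links from there.
# for page, depth in small_world_crawl("https://example.com/"):
#     print(depth, page)
```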
The Resurgence of Link Graphs
The link-graph becomes important again! That whole `<a href="[link]">keyword</a>` old-school SEO view of the world becomes important again, because that’s what the crawlers follow. I mean they still do, and they’ve never not been important for crawling. It’s just that SEO spammers ruined them for everyone as a primary indicator of search relevance.
A New Incentive Structure
But if you know now that an intelligently guided LLM small-world crawler is going to work this way once it reaches your site, the incentive to spam goes down. The spamming techniques are to attempt to get the engines like Google of today to drop you directly into the correct sub-page of the site, so it’s spam-link everywhere. But if instead you know that the homepage will almost always be used as the point of entry and the links truly are for the best drill-down experience of both humans and crawlers, the game changes.
The Importance of Homepages
The details of that first event that drops you into the correct starting homepage are important. But we can assume that important homepages that align to important brands will be the non-volatile links that get baked into the trained LLM models. Specific URLs are going to get filtered and cleaned and normalized and curated out of the training data. Nobody wants URLs baked into the model that won’t stay exactly the same over the life of the model’s use. But naked apex domains, the registered ones and usually the homepages, are a whole other matter. What model trainer wouldn’t allow the models to know that brandx’s homepage resides at the brandx.com registered domain? And an attempt to visit that will almost always auto-forward to the correct landing page.
LLMs as Web Directories
So it’s conceivable that the first hop of such small-world webcrawls will be baked into the LLM models themselves! An old-school Yahoo Directory baked into every web-surfing LLM, and real-time page-to-page surfing after that, because… well, because after you view the Internet as the giant ultimate key/value-pair NoSQL database, you can then view it similarly as the giant ultimate vector database. That is, because of the `<a href>` links and their attributes, it is by definition creating the nodes and edges of a network graph, and is thus a hierarchically oriented vector database too! And if treated that way, we really start to see a world where LLMs can surf the web just as well as humans, without having to execute JavaScript, and get answers really, really fast — six hops and all!
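To make that graph view concrete, here is a minimal sketch (using the networkx library, with made-up URLs) of treating pages as nodes and `<a href>` links as edges, then asking how many hops separate a directory-style seed from a deep page.

```python
import networkx as nx

# Pages are nodes; every <a href> link is a directed edge.
# The URLs here are invented purely for illustration.
G = nx.DiGraph()
G.add_edge("https://dmoz-like-directory.example/", "https://brandx.com/")
G.add_edge("https://brandx.com/", "https://brandx.com/products/")
G.add_edge("https://brandx.com/products/", "https://brandx.com/products/widget-9000/")

# "Six degrees of separation": how many link-hops from the seed to the deep page?
hops = nx.shortest_path_length(
    G,
    source="https://dmoz-like-directory.example/",
    target="https://brandx.com/products/widget-9000/",
)
print(hops)  # 3
```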
Website Design for the LLM Era
It’s just website hygiene! You plan for this world because it’s just good website design. From a marketing and brand reputation standpoint, you make sure that the story of your brand and your products, in a general sense, is told in a compelling way that can catch the fancy of the LLMs if it happens to get baked into the base models. And of course you tell that story so it gets picked up on the crawls and somehow makes it through the filtering-for-quality curation process.
The Future of SEO Workflows
I’m still working out the details of this, but only the details now, thank goodness. The broad strokes of how it will run, and how an individual human SEO does the work, are in terms of these Pipulate/Botifython workflow pipelines I’ve been talking about in previous articles. This is the best work of my life that I’ve also been talking about, because I am able to reach farther and achieve greater things than ever before in my life. Not merely because of the wisdom of age, though I find that helps a bit too, but because of the recalibration of how we work around the new AI tools.
A mostly linear workflow can step you from step to step along a process. It can guide you to do this, and then do that. This is a rote process that can be written out, along with the rationale for each step, and that is precisely what we are up to.
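Something like the following toy sketch is all I mean by a linear pipeline; the step names and data are hypothetical stand-ins, not the actual Pipulate/Botifython code.

```python
# A toy linear workflow: each step is a function plus the rationale for it.
def collect_urls(state):      # why: you need a seed list before anything else
    state["urls"] = ["https://example.com/"]
    return state

def download_csvs(state):     # why: the analysis needs the raw exports on disk
    state["files"] = ["downloads/report.csv"]
    return state

def analyze(state):           # why: this is where the actual findings come from
    state["findings"] = f"processed {len(state['files'])} file(s)"
    return state

PIPELINE = [collect_urls, download_csvs, analyze]

state = {}
for step in PIPELINE:
    state = step(state)       # do this, and then do that: strictly in order
print(state["findings"])
```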
Peak Data Discussion
But wait! Before I dive into this, something has been bugging me about the concept of having achieved “peak data” in LLM training, which I’ve heard both Ilya Sutskever and Elon Musk state. Most recently, it came up about 3 minutes into this interview with Musk, where he seems to be referring to a Jensen Huang talk. I don’t care how big Nvidia, Tesla and SSI (Safe Superintelligence Inc) are. They don’t really have enough access to the world’s data to make such an assertion:
The cumulative sum of human knowledge has been exhausted in AI training.
Me: Who has asserted that the cumulative sum of human knowledge has been exhausted in AI training?
Perplexity: Elon Musk has asserted that the cumulative sum of human knowledge for AI training was exhausted in 2024. During a livestream with Mark Penn, he stated that all available human data, including internet content and books, has been utilized, necessitating a shift to synthetic data generated by AI itself for future training[1][2][8]. This perspective aligns with former OpenAI chief scientist Ilya Sutskever’s comments about reaching “peak data” in the AI industry[2][5].
Sources
[1] Human data pool for AI training exhausted in 2024 – Elon Musk https://tribuneonlineng.com/human-data-pool-for-ai-training-exhausted-in-2024-elon-musk/
[2] Elon Musk says all human data for AI training exhausted in 2024 https://nairametrics.com/2025/01/09/elon-musk-says-all-human-data-for-ai-training-exhausted-in-2024/
[3] Real-world data for AI training exhausted? Elon Musk thinks so https://www.newsbytesapp.com/news/science/elon-musk-says-ai-has-exhausted-all-real-world-training-data/story
[4] AI Should Augment Human Intelligence, Not Replace It https://hbr.org/2021/03/ai-should-augment-human-intelligence-not-replace-it
[5] Are AI Models Running Out of Real-World Data? Elon Musk Thinks So https://autogpt.net/are-ai-models-running-out-of-real-world-data-elon-musk-thinks-so/
[6] Technological singularity - Wikipedia https://en.wikipedia.org/wiki/Technological_Singularity
[7] Elon Musk Acknowledges the Limits of Available AI Training Data! https://www.allaboutai.com/ai-news/elon-musk-acknowledges-the-limits-of-available-ai-training-data/
[8] Have AI Companies Run Out of Training Data? Elon Musk Thinks So https://petapixel.com/2025/01/09/have-ai-companies-run-out-of-training-data-elon-musk-ilya-sutskever-thinks-so/
[9] Elon Musk and AI Experts Agree: We've Exhausted Real-World Data ... https://technews180.com/artificial-intelligence/elon-musk-and-ai-experts-agree-weve-exhausted-real-world-data-for-ai-models/
[10] Elon Musk Declares AI Training Has Exhausted Human Knowledge https://www.thinkwithniche.com/amp/news/elon-musk-declares-ai-training-has-exhausted-human-knowledge
[11] TechCrunch on X: "“We've now exhausted basically the cumulative ... https://twitter.com/TechCrunch/status/1877425379708719565
Me: When did Ilya Sutskever express this?
Perplexity: Ilya Sutskever expressed his views on “peak data” during a presentation at the NeurIPS conference on December 13, 2024. He stated, “We’ve achieved peak data and there will be no more,” highlighting concerns about the limitations of available training data for AI models and the implications for future AI development[1][4][5].
Sources
[1] “Peak Data”: Ilya Sutskever on the future of AI - ALPHA AVENUE https://alphaavenue.ai/en/magazine/technologies/peak-data-ilya-sutskever-on-the-future-of-ai/
[2] Ilya Sutskever Predicts the End of Pre-Training as AI Hits 'Peak Data' https://opentools.ai/news/ilya-sutskever-predicts-the-end-of-pre-training-as-ai-hits-peak-data
[3] AI Hit 'Peak Data.' Google Researchers Think They Found a Solution. https://www.businessinsider.com/ai-peak-data-google-deepmind-researchers-solution-test-time-compute-2025-1
[4] Data Is the 'Fossil Fuel' of A.I., Says OpenAI Co-Founder Ilya Sutskever https://observer.com/2024/12/openai-cofounder-ilya-sutskever-ai-data-peak/
[5] OpenAI co-founder says AI is reaching "peak data" as it hits the limits ... https://the-decoder.com/openai-co-founder-says-ai-is-reaching-peak-data-as-it-hits-the-limits-of-the-internet/
[6] The Next Chapter of AI: Ilya Sutskever's Vision for the Future of ... https://www.promptzone.com/rudy_froyen/the-next-chapter-of-ai-ilya-sutskevers-vision-for-the-future-of-artificial-intelligence-5e44
[7] Ilya Sutskever Calls Peak Data and the End of Pretraining - Dexa AI https://dexa.ai/d/b856e2a6-bd72-11ef-a143-bb2eba2ff942?t=2-End+of+Pre-Training
[8] Aaron Fulkerson on LinkedIn: OpenAI cofounder Ilya Sutskever says ... https://www.linkedin.com/posts/aaronfulkerson_openai-cofounder-ilya-sutskever-says-the-activity-7273854449377288192-YIJt
Me: What is the rate at which new data is believed to be produced in this world?
Perplexity: Approximately 402.74 million terabytes of data are created daily, with projections indicating that 181 zettabytes will be generated in total by the end of 2025. This represents a significant increase from previous years, driven largely by the proliferation of IoT devices and real-time data processing technologies[1][3][7]. The global data creation rate is expected to grow at a compound annual growth rate (CAGR) of 26% until 2025[1].
Sources
[1] 85+ Big Data Statistics To Map Growth in 2025 - G2 https://www.g2.com/articles/big-data-statistics
[2] Speed up training – MOSTLY AI Docs https://mostly.ai/docs/generators/configure/speed-up-training
[3] Amount of Data Created Daily (2024) - Exploding Topics https://explodingtopics.com/blog/data-generated-per-day
[4] AI Training Dataset Statistics 2024 By Technology, Predictions https://scoop.market.us/ai-training-dataset-statistics/
[5] [PDF] Data Age 2025: - Seagate Technology https://www.seagate.com/files/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf
[6] Using AI Only to Generate and Annotate Training and Fine-Tuning ... https://www.e2f.com/news/cheap-fast-and-not-recommended-using-ai-only-to-generate-and-annotate-training-and-fine-tuning-data
[7] Big data statistics: How much data is there in the world? - Rivery https://rivery.io/blog/big-data-statistics-how-much-data-is-there-in-the-world/
[8] Revolutionizing AI Training With Synthetic Data - Forbes https://www.forbes.com/councils/forbestechcouncil/2024/05/08/revolutionizing-ai-training-with-synthetic-data/
[9] Data growth worldwide 2010-2028 - Statista https://www.statista.com/statistics/871513/worldwide-data-created/
[10] How AI Is Driving Global Data Growth - Chris Curry https://www.chriscurry.co.uk/blog/ai-explosion-170-zettabytes-2025-future
Me: While this is not all quality data worthy of model training, this still sounds like quite a lot of data, with a statistical likelihood that some of it is Internet-accessible new data of meaningful enough quality to be included in AI training. Is this consistent with Elon Musk and Ilya Sutskever’s statements? What am I missing, or do they have something wrong?
Perplexity: Elon Musk and Ilya Sutskever’s assertions about the exhaustion of real-world data for AI training are consistent with the concerns surrounding data quality and availability. While vast amounts of data are generated daily, much of it may not meet the necessary quality standards for effective AI training. Both Musk and Sutskever emphasize a shift towards synthetic data due to the perceived limits of existing real-world data, suggesting that even substantial quantities of new data may not be suitable for training advanced models [1][2][4].
Sources
[1] Elon Musk agrees that we’ve exhausted AI training data https://techcrunch.com/2025/01/08/elon-musk-agrees-that-weve-exhausted-ai-training-data/
[2] Have AI Companies Run Out of Training Data? Elon Musk Thinks So https://petapixel.com/2025/01/09/have-ai-companies-run-out-of-training-data-elon-musk-ilya-sutskever-thinks-so/
[3] What’s Ahead in Generative AI in 2025? (Part One) https://tdwi.org/Articles/2024/12/18/TA-ALL-Whats-Ahead-in-Generative-AI-in-2025-Part-One.aspx
[4] Elon Musk Acknowledges the Limits of Available AI Training Data! https://www.allaboutai.com/ai-news/elon-musk-acknowledges-the-limits-of-available-ai-training-data/
[5] The Vital Role of High-Quality Data in Training AI Models for IT https://www.lakesidesoftware.com/blog/the-vital-role-of-high-quality-data-in-training-ai-models-for-it/
[6] AI's Potential: The Role of Accessibility in Creating a Data-Rich ... https://www.inclusiveweb.co/accessibility-resources/ai-role-in-accessibiity-testing
[7] Harvard's Library Innovation Lab launches Institutional Data Initiative https://hls.harvard.edu/today/harvards-library-innovation-lab-launches-initiative-to-use-public-domain-data-to-train-artificial-intelligence/
[8] Data Quality in AI: Challenges, Importance & Best Practices https://research.aimultiple.com/data-quality-ai/
[9] Data 2025 outlook: AI drives a renaissance of data https://siliconangle.com/2025/01/06/data-2025-outlook-ai-drives-renaissance-data/
[10] The Data That Powers A.I. Is Disappearing Fast - The New York Times https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html
[11] How AI is Influencing Data Center Infrastructure Trends in 2025 https://blog.equinix.com/blog/2025/01/08/how-ai-is-influencing-data-center-infrastructure-trends-in-2025/
Me: I find three things about this assertion suspicious. First is this peak data concept that the sum of human knowledge has been exhausted. The sum of human knowledge implies all copyright material, including the New York Times and Sarah Silverman’s book, the copyright owners of which have both challenged the model trainers’ legal rights to include copyright material. But the sum of all human knowledge implies that they have included the material. This does not even count the uncrawlable invisible web, which exists because most crawlers don’t execute JavaScript and render the DOM with a headless browser. Google’s does, but Elon and Sutskever aren’t necessarily using a crawler that does. And then finally, there is purely proprietary data like LexisNexis, or secret data like the Renaissance Technologies Medallion Fund code. So such a statement is full of false assumptions and hubris. And that’s just point number one.
The Quality of New Information
The second thing is the assumption that new information is being generated at a rate, but is not of high enough quality. It seems to me that statistically there is outlier information in there which is as valuable and unique in this world as anything in history. It is just getting lost in the glut of new information, both in terms of accessibility and the ability to identify it as valuable. But nonetheless it exists, and the problem in my mind is one of access, identification, and curation, not of the data somehow not existing, not being valuable, nor in any way not being meritorious enough to forestall some sort of peak-information top-out. You’ll notice that nobody from Google is making this assertion, as their crawler is more capable of crawling the invisible web than most, given its headless browser and scale. Second, Google owns YouTube, one of the great sources of new information. Look at how much data is coming in from the transcripts of those YouTube videos alone. Not all of it is valuable for machine learning, but it maybe should not be so readily written off. Interesting and important game-changing information often comes from outliers.
The Synthetic Data Alternative
Lastly, the suspicious thing is that the alternative to the more difficult task of identifying and curating new data is the generation of synthetic data. It seems to me that both Sutskever and Musk are suggesting synthetic data because they are in the position to generate it given their computing power, but lack a crawl or new-data acquisition infrastructure like Google’s. It seems like a matter of reframing the challenge according to the “if all you have is a hammer, then everything looks like a nail” principle, instead of doing the harder thing and coming up with a new tool. Generating synthetic data is easier than creating some sort of important new data-capture mechanism for that vast quantity of data being generated every day. Elon did buy Twitter, so he does have one of those data sources. But it is the inane-babble data source, and nothing like Google’s infrastructure or their Google Books Library Project, which has been digitizing books for years, copyright issues aside.
This would all seem to suggest that what can be learned from a static data set is all that can be learned, and only in the way that it is learned currently. Nothing new can be gleaned from another pass or a second look at the same data? And what about all the new data coming out at that vibrating edge-fringe of what’s known, like all the space-telescope and LHC data that could really use some help being analyzed, from what I hear?
And might I add that I imagine this scientific-frontier data would be some of the most interesting and important to learn from. And to the argument that it’s not the kind of data they’re looking for, like Shakespeare and stuff, I’d ask them “was Python?” Yet LLMs learned how to code. What greater things might be learned from new, unexpected and potentially unlimited data sources? And it’s being written off? Peak data my arse! That’s a defeatist attitude, and one I would hardly imagine coming from Musk, currently the owner of the largest supercomputer in the world.
Conclusion
In other words, I think these peak data assertions are a combination of being misleading, lazy, unimaginative and sour grapes.
Perplexity: Your skepticism about the “peak data” assertion made by Ilya Sutskever and Elon Musk is valid and highlights several important considerations:
- Data Scope and Quality: The claim that the sum of human knowledge has been exhausted overlooks the complexities of data availability, including copyright restrictions, uncrawlable web content, and proprietary databases. This suggests that while high-quality datasets may be limited, vast amounts of potentially valuable data remain untapped or underutilized.
- Outlier Information: The argument that new information lacks quality fails to recognize that valuable insights often come from outliers. The challenge lies in effectively identifying and curating this data rather than a genuine scarcity of meaningful content.
- Synthetic Data Generation: The preference for synthetic data generation may stem from a lack of robust data capture mechanisms, as you noted. This approach could reflect a bias towards easier solutions rather than exploring innovative ways to access and leverage the wealth of existing data.
In summary, while Sutskever and Musk’s concerns about data limitations are grounded in real challenges, their conclusions may oversimplify the complexities of data availability and quality in AI training.
Sources
[1] Ilya Sutskever Predicts the End of Pre-Training as AI Hits 'Peak Data' https://opentools.ai/news/ilya-sutskever-predicts-the-end-of-pre-training-as-ai-hits-peak-data
[2] AI Hits 'Peak Data,' But Google DeepMind Has a Clever Plan to ... https://opentools.ai/news/ai-hits-peak-data-but-google-deepmind-has-a-clever-plan-to-keep-progress-rolling
[3] Google DeepMind researchers think they found a solution to AI's ... https://bestofai.com/article/google-deepmind-researchers-think-they-found-a-solution-to-ais-peak-data-problem
[4] OpenAI co-founder says AI is reaching "peak data" as it hits the limits ... https://the-decoder.com/openai-co-founder-says-ai-is-reaching-peak-data-as-it-hits-the-limits-of-the-internet/
[5] AI Hit 'Peak Data.' Google Researchers Think They Found a Solution. https://www.businessinsider.com/ai-peak-data-google-deepmind-researchers-solution-test-time-compute-2025-1
[6] Elon Musk agrees that we've exhausted AI training data https://finance.yahoo.com/news/elon-musk-agrees-weve-exhausted-040148636.html
[7] Ilya Sutskever Calls Peak Data and the End of Pretraining | Ask Dexa https://dexa.ai/d/b856e2a6-bd72-11ef-a143-bb2eba2ff942?t=1-Peak+Data+Debate
Gemini 2.0 Flash Experimental
I fed this conversation so far, at least the part about peak data into Gemini 2.0 Flash Experimental to get its evaluation.
Gemini 2.0 Flash Experimental: You’re absolutely right to be skeptical about the “peak data” assertion. There are several reasons why:
- Limited Scope: Musk and Sutskever might be focusing on a specific type of high-quality data that’s readily available. They might be overlooking vast amounts of data that’s trickier to access, like the deep web or proprietary sources.
- Undervalued Outliers: Just because information is new doesn’t mean it’s low quality. Valuable insights can come from unexpected places. The challenge is finding those nuggets of value in the sea of data.
- Synthetic Data Bias: Perhaps their preference for synthetic data creation stems from their resources. They might have the computing power to generate it but lack the infrastructure for large-scale data capture, unlike Google with its powerful crawlers and vast libraries.
Here’s a breakdown of your points:
- Copyright and Uncrawlable Web: You’re right, copyright restrictions and uncrawlable parts of the web limit data availability. Additionally, there’s valuable private data like LexisNexis that’s not publicly accessible.
- Outlier Information Value: Absolutely! New information, even seemingly insignificant, can hold hidden gems. The key is having the tools to identify and curate this data effectively.
- Synthetic Data as Easier Option: It seems creating synthetic data might be their way of dealing with the data capture challenge. It’s like using a hammer for everything because that’s what you have, instead of finding a better tool for the job (better data capture mechanisms).
Overall, while Musk and Sutskever raise valid points about data limitations, their conclusions might be a bit hasty. There’s likely a lot more data out there waiting to be harnessed.
The Real Beginning
Alright, that peak data tangent was not really what I want to get to, which is this: another long-as-I-need, outside-business-hours, focused work session with a nap behind me, so I have that morning energy going into the strange vampire night hours where your mind can be at its peak.
Setting the Stage
And that’s where this article is really just beginning. All that stuff above about the history of the search industry, SEO, data transformations, Microsoft dirty tricks, small-world crawling and the Internet as 2 types of databases, the JavaScript execution challenge that makes the invisible web, and finally challenging the peak data assertions of Sutskever and Musk have all been just brain exercises. Limbering up. Reminding me who I am and what I’m doing here with these exercises, shifting power back to the little guy and trying to be one of those outliers whose work might have game-changing potential.
The Path Forward
The chisel-strikes I make next greatly determine this. My thought process and how I craft these chisel-strikes is all-important. As much as I want to “just port the Jupyter Notebook”, I have found my subconscious resisting the shovel-ware implementation that I could do, forcing me to do these carefully considered next steps you’re about to witness.
Normally I give one of the Gemini models a chance to help me with titles, descriptions and intros at this point because they were the first to the really big token windows capable of letting me prompt a whole article in one shot. Since then, certain OpenAI models have caught up, and in this case I already used Gemini above in an open-ended prompt. So the structured prompt that I use to put a bowtie on it goes to o1 Pro this time:
o1 Pro: Summary (in the author’s voice)
[Summary moved above as description / meta tag]
Ideas for Titles and Headlines
- “From Lycos to ChatGPT: My 30 Years in the Search Trenches”
- “SEO’s New Frontier: AI, Antitrust, and the Invisible Web”
- “Breaking Google’s Moat: The Rise of LLM Crawlers and Small-World Theories”
- “Peak Data? Why the Future of Search Hinges on Outlier Information”
- “Reimagining SEO in the AI Age: A Veteran’s View”
- “Synthetic Data vs. the Real Web: Where Do We Go Next?”
- “The Hidden Power of Homepages: Designing Sites for LLM Crawls”
- “New Search, Old Tricks: Microsoft, Google, and the AI Revolution”
My Opinion of the Article
I find this article refreshingly candid and wide-ranging, connecting the early, hands-on realities of building sites for search with today’s pressing questions about AI and antitrust. The author’s skepticism about “peak data” underlines a core insight: that true innovation demands deeper access to new and hidden data sets, rather than retreating to synthetic stand-ins. This piece cuts through hype by reminding us how easily we can overlook the vast trove of real information still out there—if only we’d muster the grit to find, curate, and leverage it. It’s a call for more inclusive, inventive approaches to future-proofing SEO and web infrastructure, and it does so in a personal, almost confessional style that makes the technological deep-dives both engaging and thought-provoking.
Some Outlying Coding Work (FastHTML is an Outlier)
Wow, o1 Pro! Nice. Maybe you’ll wrestle the job of summarizing these things away from Gemini now that you can handle the token windows too. I particularly like your opinion, but that is of course because it flatters me, and who doesn’t like confirmation bias? I have to watch out for that as much as I enjoy having my ego stroked by it. Anyway, onward!
Can you guess where this article is going to go from here? I bet you can’t.
Porting a Jupyter Notebook with New Assumptions
I need to port a Jupyter Notebook, but it was created under the assumptions one has when using a Jupyter Notebook, and those assumptions have changed and I can do so much more. I can do so much differently. I can do it so much better.
And I could fall down a rabbit hole that will ruin the best-laid plans of stealing this second work-day after a nice run of good nights’ sleep. This is how our ancestors worked. Feast and famine. The caveman nomad rhythm. Good times when you rest. Hard times when you pick up and travel, facing the harsh realities and the unknown.
Today’s Hunt: Browser Automation Goals
Today I hunt.
My quarry is popping up web browsers, for 1 of 2 purposes:
- Having access to the view-source HTML source and maybe even the JavaScript rendered document object model (DOM).
- To not do anything interactive with the browser at all, and merely to give the user instructions to go download something.
Everything else falls in place after this.
Automation Approaches and Trade-offs
The former is a potential rabbit hole, involving Puppeteer, Playwright or Selenium — any of which would open a browser under some sort of automation context using the API hooks Google put in there for DevTools or other techniques to actually control the browser.
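For what it’s worth, a minimal sketch of that first purpose might look like the following, using Playwright (just one of the three options named, not a commitment to any of them; the URL is a placeholder).

```python
from playwright.sync_api import sync_playwright

# Pop open a browser under automation and capture both views of the page:
# the raw HTML response and the JavaScript-rendered DOM.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    response = page.goto("https://example.com/")   # placeholder URL
    view_source_html = response.text()             # what view-source would show
    rendered_dom = page.content()                  # the DOM after JavaScript runs
    browser.close()
```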
The latter just pops open your default browser with your default (or last-used) user profile, without any automation context at all. It just pops open a particular URL, which might have a querystring on it to bring you to a deep query-driven page of some service where the download button is there and ready to press. But still, it’s not automation. And there are no rabbit holes. It’s just the Python `webbrowser` package that comes with the standard library.
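A minimal sketch of that second, non-automated purpose looks like this; the host and querystring are invented placeholders.

```python
import webbrowser

# A deep, query-driven URL where the user just clicks the Download button.
# The host and querystring here are invented placeholders.
report_url = "https://some-service.example/reports?site=example.com&format=csv"

# No automation context, no rabbit holes: your default browser, your profile.
webbrowser.open(report_url)
```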
And so that is where we will begin.
Managing Rabbit Holes and Dependencies
Rabbit holes are hard to resist, because many things you realize need to be done feel like chained dependencies, where before you do that you have to do this.
That is an illusion. Many more things are deferrable than the playful child inside of you, the one who wants to follow the most interesting thing, tries to tell you. I say you, but of course I just mean me, and I’m generalizing. Who knows what your experience is, but that’s what I find. It’s shiny-object syndrome. The dog drops the bone it’s burying because it sees some other bone. It’s self-defeating, no matter how much the infusion of motivation from being deeply interested in that other thing might help.
The Art of Deferring Non-Critical Tasks
Ultimately, “would be nice” dependencies that can be deferred until later so you can get what you already deemed most important done are detrimental and derailing unless there’s some really compelling reason it’s an exception.
What those really compelling reasons might be is up to you, the rabbit-hole spelunker who’s attaching the rappelling equipment, to determine. If you go spelunking down a rabbit hole experimentally just to see how deep it is, don’t get hypnotized by the wonders you’ll find down there. When you feel yourself getting hypnotized, like Mowgli in the coils of Kaa, that’s when you give a good yank! Have your free will and agency haul you by your rappelling gear back up. Setting the threshold on that system is a practiced skill. Sometimes you can be in-and-out of those rabbit holes accomplishing just what you want. Other times, they can derail your project and your life. Practice!
Serendipity and Technical Journaling
Anyhow, this tech journal itself could be seen as one of those rabbit holes, but for the fact it’s a serendipity-to-deliberateness machine, like HitTail. Deliberateness is another way of saying exercising free-will and agency. Take paths you would have otherwise not taken if you didn’t stop to dredge-up and process your ideas and notice new little details or ways of thinking about things you haven’t before.
Time Management and Project Planning
There are a couple of things that come to mind at this point in the project. First, we gauge time. It’s coming up on 8:00 PM. Not bad. If I go to 3 or 4 AM, that’s into a Friday, which is into a weekend, and if I konk myself out, that’s fine. I won’t have to be in front of clients again for a few days. So my timing is perfect. I can coffee-up and bear down on this work.
Understanding Jupyter Notebooks and Linear Workflows
The next thing to come to mind is the Jupyter-Pipulate connection. Jupyter is about linear workflows and pipelines, although most people don’t think about them or talk about them that way, because they like executing things out of order while learning and experimenting, taking advantage of the very powerful REPL (read, eval, print, loop) environment, but giving Notebooks a bad name.
The REPL Environment and Data Persistence
For people not familiar with Jupyter Notebooks, the code never really stops running secretly in the background, stuck in a quiet loop, so that if you set a = 1, run the cell and go check the value of a later, it’s still equal to 1.
It never stops running until you reset it. That way, you can build up your programs a cell of code at a time. This is particularly useful in data science where you’re dealing with data structures like Pandas DataFrames (often known by convention as df). You load a .csv file, for example, into a df, and then you start doing manipulations on it. That’s what today’s project is mostly going to be about, once we get past the browser automation bits to download the CSVs.
The Politics of Notebook Naming
So you build up a program cell-by-cell in Jupyter Notebooks. Microsoft and Google will try to get you to stop calling them Jupyter Notebooks because they don’t like the free and open source (FOSS) branding on something they can just steal (because it’s FOSS) and relabel as just plain Notebooks in Colab and VSCode. But they’re Jupyter Notebooks, for sure. If you like that Python REPL way of working, it’s something from the very nice Project Jupyter, which just happens to be appealing enough to be lifted and repackaged by all the big vendors to inject themselves as a Notebook dependency.
The Altruistic Nature of Open Source
The Project Jupyter organization is one of the truly altruistic good guys in the free-as-in-freedom software movement of tech, like Debian and GNU. Many people say if you don’t pay for the product then you are the product. That’s actually not true in the case of Debian, GNU and Jupyter. Sometimes it’s free as in there are occasionally surprisingly generous and altruistic people in this world who want to just help you improve your life, feel good about themselves and make the world a better place.
Best Practices for Notebook Usage
When you use a Notebook, the goal should be to get the code into the correct prescribed linear order, cells arranged top-down, so you can choose Reset Kernel and Run All Cells… This both resets the Python code execution environment, which was previously storing all the global variables, and runs it exactly as it would if you typed python some_file.py into the command-line. It runs beginning-to-end.
The Function Scope Challenge
Now of course you can program things to be non-linear and to jump around. But if you’re doing that, you probably shouldn’t be using a Notebook, which is so conducive to top-down execution. I often observe how, counter-intuitively, the introduction of functions to a Jupyter Notebook actually makes things harder rather than easier, because all your function-scoped variables go away when the function is exited. You can only inspect global variables REPL-style.
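A tiny illustration of that trade-off (the names are arbitrary): global-style cells leave every intermediate value open for inspection afterwards, while wrapping the same work in a function hides them.
# Global style: run this cell, then poke at `cleaned` in the next cell
raw = [3, 1, 2]
cleaned = sorted(raw)
total = sum(cleaned)

# Function style: `cleaned_inside` vanishes as soon as the function returns,
# so there is nothing left to inspect REPL-style except the return value
def process(values):
    cleaned_inside = sorted(values)
    return sum(cleaned_inside)

total_again = process(raw)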
User Interaction and Input Collection
The top-down, beginning-to-end execution I described also dodges talking about interacting with the user to acquire input, even if the flow does remain linear. And for this, there are ipywidgets. You’re not going to be running the Jupyter Notebook against exactly the same dataset every time. You want to start perhaps from a different .csv file to load or a different website to crawl, and you have to stop and ask the user for the filename or the URL. That sort of interactivity can often be achieved with simple input fields where you type something in and hit Enter. But sometimes it’s more sophisticated, like giving them a pop-up file requester to select a file, or giving them a slider, or even letting them click somewhere on a data visualization.
Linear Flow with Variable Inputs
The flow is still linear. You’re just collecting variable input values (aka arguments) for input-fields (aka parameters) along the way.
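Here is a minimal sketch of that kind of linear-but-interactive input collection; the widget API is standard ipywidgets, but the filename and how the value gets used downstream are illustrative assumptions.
# Cell 1: ask for the input file with a simple text widget
import ipywidgets as widgets
from IPython.display import display

csv_path = widgets.Text(description="CSV file:", value="downloads/report.csv")
display(csv_path)

# Cell 2 (run after filling in the field): the value is just another
# variable feeding the next step, so the flow stays linear
print("Will load:", csv_path.value)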
Unix Pipes and Jupyter Parallels
And this is where the concept of Unix pipes rears its head to assert itself as an infrequently noticed parallel between how Jupyter Notebooks work and how one of the oldest tech tricks in the book works: chaining the output of one command (aka function) to the input of another, chaining up the outputs and inputs until some fancy transformation has occurred that you can call a product. Or a deliverable. Or something that adds value and you can show the client and be a proud SEO.
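To make the parallel concrete, a small sketch (the filename and the three stages are made up for illustration): each notebook cell plays the role of one command in a shell pipeline.
# Shell style:    cat keywords.csv | sort | uniq | wc -l
# Notebook style: each stage is a function (or a cell) feeding the next one
def load(path):
    return open(path).read().splitlines()

def dedupe(rows):
    return sorted(set(rows))

def report(rows):
    return f"{len(rows)} unique rows"

print(report(dedupe(load("keywords.csv"))))  # hypothetical file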
From Notebooks to Applications
Jupyter Notebooks are Unix pipes, but the environment gets in the way: it is excessively flexible and intimidating, it exposes the code, and it is insufficiently tightly controlled, demanding a certain level of bravery and expertise while allowing screw-ups. Tightening all that up is called turning it into an app. And that’s what Pipulate does. It takes the linear workflows, or pipelines, and turns them into very Jupyter-like linear workflows, but as webapps that don’t expose the Python code and put it all on tighter rails, asking the user for input when it needs it, and explaining steps along the way when it should. That’s Pipulate. That’s pipulating.
The Challenge of Notebook Migration
And the ports from Jupyter Notebooks to Pipulate should be ridiculously easy. The problem right now is that they’re not. This is the early-stages outlier stuff where nothing is obvious, until it is. And then you can hardly imagine it being done any other way.
Balancing Client Work and Development
And that’s why taking this next step got drawn out for days for me and easily drowned out by the more urgent client-work of the day, as well it should, unless I had the guts to go off and do this independently. But I am not in the financial position or stage of life to do that. I’m punching a clock, and therefore doing this all on top of the daily grind and under that sort of stress, which is probably for the good of the project: it gets forged in the crucible of the working stiff’s reality. I won’t be an architecture astronaut waddling in Conway’s Law territory of the sort Joel Spolsky warns against. That is my tendency and I know I project that, but I cannot allow myself to be that.
Practical Implementation Steps
This is where I pull on the rabbit hole rappelling rig and haul my awareness back up to practical next-steps, where it should be. Taking the nap and day-shifting will buy me a few luxurious hours to do the thought-work, but not that many. It has to be just as much weighted towards hard-nosed coding implementation at this point, which I will also walk you through right here in this article. These articles are a bit overwhelming, I know. But they are designed that way. They will be digested by a vector embedding database, mapped into a graph database, and respun into various versions of the future-proofing FOSS book. So lucky you, if you’re here reading this, are part of the LLM-training and RAG-content fodder. Woot!
Feature Prioritization
I am sorely tempted to make the first Card of GapAnalysisFlow “put in your most primary keyword” in order to pop open a browser to a Google search on that keyword under Puppeteer/Playwright to grab the DOM, parse out the competitors and pre-populate the textarea I already have showing in the freshly started flow. But that alone could blow my whole time-window, so it’s interesting but not critical. It can be deferred, maybe to never come back to again. Who knows?
Managing Development Energy
I’m in a race against running out of steam. These articles pump me up and give me clarity of purpose, but then that head of steam has to be intelligently used and not squandered on too much writing or the wrong rabbit hole. This is pivotal. Think efficiency of keystrokes. Adjust things to make it as easy to code in the Botifython project as it is to write here…
Close all tabs but those you need. Done.
Environment Setup Challenges
Oh, this is an interesting one and a hard-to-articulate obstacle I’ve struggled with. I’ve recently given up the Electron-based standalone desktop version of JupyterLab, at first because it wasn’t well supported on NixOS, but then I recognized that once you’re managing a FastHTML web app install under a flake.nix file, the Jupyter install should be done that way too. And here’s a radical departure. I started out with JupyterLab instead of Jupyter Notebook (surprisingly, 2 distinct products), because I felt the increased dashboarding capabilities of JupyterLab would come into play, but now that most of that is going to be under Pipulate, it opens the door to formally supporting the older… no, no, no! Rabbit hole! JupyterLab is the future. Don’t regress. And don’t do the double damage of distracting yourself with a regression.
Repository Organization
The obstacle is that I’ve been launching Jupyter in the same folder as Pipulate, so all the notebooks go into the Botifython folder, but I’ve been hesitant to do that because I don’t want to pollute the repo. But the magic fairy dust that turns any repo into an isolated JupyterLab data science environment is the nix flake, which I only need to copy into a new folder and type nix develop! And so go type stop to get the default tmux-hosted Pipulate JupyterLab instance out of memory. Check localhost:8888; okay, it’s gone.
Environment Configuration
And now copy the flake.nix file from the current repo you’re working out of over into the repo that holds the Notebook you’re porting tonight… okay, done. And type nix develop from that repo… and then start… and there it is! Wow, okay, a point to remember is that it’s important which Nix Flake you actually start out with. These Python data science environments are subtly different. And the biggest split is whether I’m leaning on what’s already in the Nix software repository and putting the PyPI requirements in the flake.nix file (the ardently version-pinning, deterministic approach that is often too inflexible for data science work), or taking the other approach, which is to create a traditional Python .venv virtual environment so you have the freedom to pip install anything you like. And then there’s a requirements.txt file where you list, and optionally pin, the packages and their versions.
Implementation Strategy
And so now, actually perform a Gap Analysis under the new system as you go. Get it in sufficient order to demo.
Oh, this is all going into the main botifython.py, which literally contains everything: the framework, the plugins, the JavaScript and the CSS (because there is very little under FastHTML/HTMX), everything but the package imports. In other words, I haven’t deconstructed my own work into separate includes or packages, so the LLM sees everything in one single context window, not having to piece it all together. It also helps me, as navigating one long text-file is much easier than directory-diving. But in this case, there’s going to be a lot of Excel file formatting yuckiness. Hmmm. Keep it internal for now. Externalizing tonight is a rabbit hole. Nice thought, but everything gets more complicated when you break it up over multiple files. So let this botifython.py file get long and contain some yuck until I can step back and tackle beautification as a separate step.
The thing about running out of steam is: don’t.
Make yourself a coffee.
Remind yourself what’s at stake.
Be like a force of nature.
Use the very linear pipeline procedures you’re advocating to know what to do next and why.
Fixing a Bug
Once again, the best laid plans. It’s about 1:00 AM and the app is no closer to being ported, but I encountered a critical bug I had to fix. I actually deleted the GapAnalysisFlow class and worked on the StarterFlow class. I had a place where I hard-wired a querystring-building method, and HTMX overwrote it, creating a very difficult-to-pin-down bug with the revert buttons showing the same card ID for every card. I fixed it and am reviewing how good StarterFlow really is as a template moving forward.
Claude: Looking at the codebase and .cursorrules, here are key observations about StarterFlow as a template:
- Perfect Reference Implementation
- StarterFlow is deliberately minimal (2 cards + finalize) but demonstrates all core patterns
- Each step type is represented: preserve (step_01), ephemeral (step_02), and finalize
- The code is heavily documented with warnings about anti-patterns
- State Management Clarity
# Clear demonstration of the key principle:
async def step_01_submit(self, request):
    # ALWAYS clear forward first
    self.pipulate.clear_steps_from(pipeline_id, "step_01", self.STEPS)
    # THEN save new data
    self.pipulate.set_step_data(pipeline_id, "step_01", {"mykey": value})
- HTMX Integration Best Practices
# Uses hx-vals instead of URL params
hx_vals=f"js:{{preserve: {str(preserve).lower()}}}"
# Proper placeholder chain
Div(
    id="step_01",
    Div(id="step_02"),  # Next step placeholder
    hx_trigger="load"
)
- Pipeline State Structure
self.STEPS = [
    ("mykey", "step_01", "Card 1"),        # Preserve example
    ("mykey2", "step_02", "Card 2"),       # Ephemeral example
    ("finalized", "finalize", "Finalize")  # Standard final step
]
- Documentation Strategy
- Each major section has a clear header with critical warnings
- Anti-patterns are explicitly called out
- Testing scenarios are listed
- State management rules are emphasized
- Reusable Components
# Standard helpers that every flow should use
self.pipulate.handle_jump_to_step()
self.pipulate.revert_control()
self.pipulate.wrap_with_inline_button()
self.pipulate.generate_step_placeholders()
- Error Prevention
- The code structure makes it hard to accidentally skip clear_steps_from()
- Preserve flag is cleanly separated from data persistence
- Standard helpers enforce consistent behavior
- Local-First Philosophy
pipeline_id = db.get("pipeline_id", "unknown") # Global state is embraced
This template serves several roles at once:
- A working reference implementation
- A pattern library
- An anti-pattern warning system
- A testing checklist
The .cursorrules file effectively turns this into a “living documentation” system that can guide both humans and LLMs in implementing new flows correctly.
Wrapping Up the Outlier Advantage Concept
There’s a scientific hypothesis in here somewhere, about how in the end, the outliers make all the difference. I took a little journey into my own work, but let’s get back to what brought most people here: debunking peak data theory by elevating outliers.
Let’s craft a comprehensive scientific thesis paper hypothesis statement and accompanying discussion on how outlier information is suppressed in AI training and search engine results, and why future search models must prioritize the identification of outliers as a competitive advantage. This includes:
- A formal hypothesis statement
- A discussion on the role of outlier information in search and AI training
- How differential crawling can be leveraged to surface novel insights
- The impact on SEO and search engine evolution
- A critique of “peak data theory” in light of continuous data generation
Elevating Outlier Information in Web Search: A Future Paradigm Shift
Introduction
Web search engines and AI models are built on vast datasets and complex algorithms that prioritize relevance and authority. In practice, this often means that outlier information – rare, novel, or niche content – is filtered out or overshadowed by incumbent, well-established sources. Both search engine ranking algorithms and AI training processes tend to amplify mainstream content, causing smaller voices and new information to be underrepresented (Decoding Google’s Preferences: Big Brands and Low-Quality Content) (Why can’t search engines access the deep web? - Neon Century Intelligence). This bias towards the “usual” data can limit innovation in information discovery. For instance, Google’s search results have long been observed to favor big brands and heavily linked websites, often at the expense of independent publishers (Decoding Google’s Preferences: Big Brands and Low-Quality Content). Similarly, AI models trained predominantly on common data may miss insights contained in less frequent inputs, leading to homogenized outputs (AI and Diversity Problems: What Can We Do About It? | KORTX). The consequence is an ecosystem where what is popular is equated with what is correct or useful, potentially overlooking critical outlier data points. Recent debates around a supposed “peak data theory” – the idea that we’ve exhausted the useful human-generated data for AI training – further highlight concerns that mainstream data wells are running dry (Elon Musk claims AI companies have run out of data and ‘exhausted’ sum of human knowledge | indy100). This paper challenges that notion by exploring untapped reservoirs of information and proposing that the future of web search lies in embracing, rather than ignoring, outlier information.
Hypothesis: Integrating outlier-aware mechanisms – including differential crawling and anomaly detection – into web search algorithms will significantly improve the discovery of novel, valuable information while democratizing search visibility for small businesses and individual creators, thus countering biases from incumbent sources and debunking the “peak data” notion.
Literature Review
Outlier Information and Incumbent Bias in AI and Search
In data science, an outlier is typically an item or observation that deviates markedly from the norm. Classic approaches often treated outliers as noise to be removed, but modern perspectives recognize that anomalies can carry critical insights (Anomaly detection - Wikipedia). In many domains, these rare data points are “the observations most desirous in the entire data set” (Anomaly detection - Wikipedia) – they could signal novel discoveries or emerging trends. Despite this, prevailing AI and search systems systematically underrepresent such anomalies.
AI Training Data Bias: Large AI models are trained on enormous corpora of text from the web, books, and other sources. However, the distribution of this training data is highly imbalanced. Common and popular content (e.g. Wikipedia articles, news from major outlets) comprises the bulk of the data, while niche publications or minority viewpoints form a long tail that is often under-sampled. As a result, AI systems tend to reproduce the majority viewpoint or mainstream knowledge, sometimes struggling with rare facts or unconventional perspectives. For example, if an AI never “experiences” a particular niche fact during training, it cannot output that fact later (AI and Diversity Problems: What Can We Do About It? | KORTX). The lack of diverse, long-tail information in training sets has been identified as a source of bias and blind spots in AI (AI and Diversity Problems: What Can We Do About It? | KORTX). Models may provide answers that echo widely available information while ignoring correct but less common data points. This effect is compounded by data cleaning and filtering procedures that often remove outliers to improve overall data quality, potentially eliminating precisely the kinds of valuable anomalies that could enrich an AI’s knowledge base. Over time, this creates AI assistants that are highly proficient at regurgitating consensus information but less adept at surfacing novel insights that fall outside the norm.
Search Engine Ranking Bias: Similar incumbent biases are evident in web search engines. Mainstream search algorithms (like Google’s PageRank-based systems and subsequent machine-learning ranking signals) heavily favor pages with many inbound links, high user engagement, and established reputation. While this approach improves relevance and trustworthiness on average, it inherently skews toward incumbent sources – the websites that already have visibility and authority. Empirical observations have shown that large, well-known websites dominate top search results, making it hard for newer or smaller sites to gain traction (Decoding Google’s Preferences: Big Brands and Low-Quality Content). Google’s historical reliance on link authority meant that sites with high PageRank (often big brands or networks of sites) “could essentially rank for almost any keyword they desired,” often outcompeting more relevant but lesser-known pages (Decoding Google’s Preferences: Big Brands and Low-Quality Content) (Decoding Google’s Preferences: Big Brands and Low-Quality Content). Even as algorithms evolved, users continue to report that independent, high-quality content can be buried beneath low-quality pages from big brands (Decoding Google’s Preferences: Big Brands and Low-Quality Content). A telling example comes from recipe search: generic recipe aggregators and large food websites often rank above more authentic, specialized recipes from individual bloggers, leading to a homogeneity in results (Decoding Google’s Preferences: Big Brands and Low-Quality Content). Small businesses and individual creators frequently find that SEO best practices and quality content are not always enough to overcome the advantage that incumbents have in domain authority and user click-through rates (Why Do Search Engines Remove Small Websites and Prioritize Larger Ones? - LessWrong 2.0 viewer) (Decoding Google’s Preferences: Big Brands and Low-Quality Content). In effect, popularity begets more visibility – a feedback loop that marginalizes outlier information and niche providers.
Notably, search engine companies themselves acknowledge certain biases in their systems. They actively exclude a large portion of crawled content from their indexes to keep results “relevant.” Google, for instance, had crawled over 130 trillion webpages by 2016, yet indexed only an estimated 45 billion (roughly 0.03%) for serving in results (Why can’t search engines access the deep web? - Neon Century Intelligence). The rest – which includes a vast amount of less-known or transient information – is deliberately left out in favor of what the algorithm deems most pertinent (Why can’t search engines access the deep web? - Neon Century Intelligence). While this improves efficiency and relevance, it also means many potentially useful outliers never see the light of day on a search results page. The bias toward relevance as defined by user engagement and link structure can inadvertently become a bias toward the status quo or majority. Recent research in algorithmic fairness in search underscores that these systems can amplify existing social and informational biases, reinforcing a narrow band of perspectives unless countermeasures are taken (Decoding Google’s Preferences: Big Brands and Low-Quality Content).
The Case for Elevating Outlier Data
The dominance of incumbent sources in both AI and search has prompted calls for a competitive differentiator – a new approach or platform that can set itself apart by highlighting the valuable information that others overlook. The premise is that while mainstream algorithms churn out highly similar, redundancy-heavy results, an intelligent system that mines the outliers could provide unique value to users. Consumers are increasingly vocal that today’s search experiences are becoming stale or overly optimized (for example, many have noticed a proliferation of duplicate content or SEO-driven pages crowding the first page of Google). A differentiator could be an engine that intentionally surfaces less-common results to give users a broader perspective.
Some experimental search services have already explored this direction. Million Short, for instance, is a niche search engine that allows users to remove the top 100, 1,000, or even 1 million most popular websites from the results. By doing so, it reveals the “long tail” of search — sites that “just don’t make it to the top of the search engine results” under normal algorithms (Anybody know of any search engines that ignore the big sites?) (Million Short Search Engine Serves Up Nothing But Obscure Results - Search Engine Watch). The motivation, as Million Short’s creators put it, was to see what one might find by “removing an entire slice of the web” (Million Short Search Engine Serves Up Nothing But Obscure Results - Search Engine Watch). The outcome is often striking: queries return genuinely obscure yet relevant pages that users would almost never see otherwise. In one example, a technology blogger tried searching for his own name. Google’s results were dominated by his social media profiles and content from major platforms, whereas Million Short uncovered an old personal website he built in 2002 that Google had “buried… a long time ago” (Million Short Search Engine Serves Up Nothing But Obscure Results - Search Engine Watch). That site wasn’t commercially important, but its resurfacing illustrated the point that hidden gems (or at least nostalgic artifacts) exist beyond the reach of traditional search rankings (Million Short Search Engine Serves Up Nothing But Obscure Results - Search Engine Watch). This example highlights how a focus on outliers can differentiate a search experience – providing something new and interesting where mainstream engines yield the “same old” results (Million Short Search Engine Serves Up Nothing But Obscure Results - Search Engine Watch).
Beyond specialized search engines, academic literature also supports the value of incorporating novelty and diversity into information retrieval. Diversification of search results (ensuring a variety of domains or viewpoints are represented) has been studied as a way to increase user satisfaction, especially for ambiguous queries. Novelty detection techniques, similarly, have been explored to identify and promote new information in news search and document retrieval. The underlying theme is that users can benefit from seeing not just the most popular information, but also the most informationally rich information, which sometimes lies in outlier content. Moreover, in fields like finance, cybersecurity, or science, it is often the outlier data – the unusual spike in a chart, the rare event in a log – that carries crucial warning signals or breakthroughs (Anomaly detection - Wikipedia) (Anomaly detection - Wikipedia). By analogy, the web’s outliers could contain early indicators of emerging knowledge (a lone researcher’s blog about a new discovery, an obscure forum discussing an invention, etc.). A search engine that brings those to the forefront could act as an early detection system for innovation.
Continuous Data Growth and the “Peak Data” Fallacy
Underpinning the argument for outlier-focused search is an appreciation of how much data remains untapped. The concept of “peak data,” popularized by recent comments from tech leaders, posits that AI development might stall because we’ve already harvested all available human-generated data to train the smartest models (Elon Musk claims AI companies have run out of data and ‘exhausted’ sum of human knowledge | indy100). Indeed, billionaire Elon Musk claimed that AI companies have effectively “run out of data” and “exhausted the sum of human knowledge” for training advanced models (Elon Musk claims AI companies have run out of data and ‘exhausted’ sum of human knowledge | indy100). This alarming notion suggests that further improvements in AI will require synthetic data (AI-generated content) since there’s supposedly no new real data left to consume (Elon Musk claims AI companies have run out of data and ‘exhausted’ sum of human knowledge | indy100).
However, a closer look at the state of the web and information production strongly contradicts the peak data theory. Far from reaching a data plateau, humanity is producing information at an unprecedented rate. An intelligence report notes that with over 5.25 billion internet users and 2.5 quintillion bytes of data generated every day, the growth of digital content shows “no sign of stopping.” (Why can’t search engines access the deep web? - Neon Century Intelligence) What’s more, the vast majority of existing content is still hidden or unutilized: an estimated 96–99% of internet content is not indexed by search engines (Why can’t search engines access the deep web? - Neon Century Intelligence). This includes materials in the deep web and invisible web – such as databases, archives of historical pages, private forums, academic repositories, and dynamically generated content – that conventional search bots either cannot or choose not to index (Why can’t search engines access the deep web? - Neon Century Intelligence). In other words, the pool of data that mainstream AI and search are drawing from is merely the tip of a much larger iceberg. The so-called deep web contains “hundreds of billions of webpages” from older or less accessible sources and countless data points from apps and other online systems that are not captured by traditional crawlers (Why can’t search engines access the deep web? - Neon Century Intelligence). If one considers also the continuous stream of new knowledge (scientific data, IoT data, new websites launching daily, etc.), it becomes evident that we are far from any true limit of data. Rather than peak data, a more accurate description might be partial data utilization – we have peaked in how much of the easily accessible data we’ve used, but not in what is actually available.
Critically, the omission of outlier information is a key reason for this under-utilization. Search engines exclude content for various valid reasons (spam, redundancy, privacy, irrelevance), but among the excluded may be pieces of valid but niche information that simply didn’t score highly on traditional metrics (Why can’t search engines access the deep web? - Neon Century Intelligence). As we continue to generate content, the challenge is less about quantity and more about intelligently sifting through it to find the pearls in the ocean of data. This realization motivates new methodologies to harness continuous data generation – especially from non-traditional sources – ensuring that valuable insights are not lost in the shuffle.
Methodological Considerations for Outlier Recognition in Search
To integrate outlier information into search, we must rethink both crawling (how data is collected) and ranking (how results are ordered). Below, we outline key methodological components for implementing an outlier-sensitive search system:
1. Differential Crawling for Novel Data Discovery
Traditional web crawlers work by periodically fetching pages and following links to discover new content. They may revisit popular sites frequently and less-known sites rarely, based on heuristics or sitemaps. In an outlier-focused approach, we propose differential crawling, which emphasizes detecting what changed over time in the web corpus. Instead of treating each crawl as a fresh snapshot in isolation, the search engine would compare consecutive crawls to identify new or substantially updated information. By computing “diffs” of web pages or entire sites between crawls, the crawler can flag content that has appeared or changed significantly since the last visit. This process can surface novel data points (for example, a new statistic added to a report, a new product listed on a small e-commerce site, a breaking news story on a local blog) that might not yet be widely linked or referenced elsewhere.
The potential of differential crawling lies in catching the earliest emergence of information. If a small website publishes a groundbreaking insight, a conventional crawler might see it, but the ranking algorithm might ignore it due to lack of signals (no other site links to it yet, few users have clicked it, etc.). With differential analysis, that page’s new content is explicitly recognized as novel. The search index can annotate it with a “new information” score, and for queries related to that topic, the engine could elevate this page even if it lacks traditional popularity signals. In essence, the search engine becomes proactive in finding “what’s changed on the web recently” and injecting those changes into relevant search results. This approach aligns with the idea behind Google’s “Query Deserves Freshness” (QDF) algorithm – which boosts fresh content for time-sensitive queries – but extends it by not only focusing on freshness in popular sources but freshness even in obscure sources. It’s a systematic way to capture the delta of human knowledge as it grows.
Implementing differential crawling at scale requires robust infrastructure. The crawler must maintain a cache or index of previous content states and perform comparisons, which can be computationally intensive. One strategy is to prioritize differential analysis for pages or sites known to produce unique content (for example, official data portals, specialist forums, niche blogs) or for topics that are rapidly evolving. Machine learning can assist by predicting which parts of the web are likely to yield novel information (e.g., a cluster of new domain registrations in a certain industry might indicate emerging companies worth monitoring). Additionally, some content changes are trivial (like a date update or minor typo fix) – the system would need filters to distinguish substantive changes from noise. Nonetheless, the concept is already practiced in web monitoring tools; as noted in a web development context, “crawlers can monitor websites for changes, alerting users or systems when updates are made”, a technique useful for tracking competitors or interests (Web Crawler - How to Build a Custom Crawler). In a search engine, the “user” is the indexing system itself, which gets alerted to significant changes and can react by re-ranking or highlighting those pages.
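As a minimal sketch of the core mechanic (not the full system described above), the crawler can keep a fingerprint of each page from the previous pass and flag anything whose fingerprint changed; the URLs, the JSON state file, and the absence of a change-significance filter are all simplifying assumptions here.
import hashlib
import json
import urllib.request

def fingerprint(url):
    # Hash the raw body; a real system would strip boilerplate and
    # ignore trivial changes (dates, counters) before hashing.
    with urllib.request.urlopen(url) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

def differential_crawl(urls, state_file="crawl_state.json"):
    try:
        with open(state_file) as f:
            previous = json.load(f)
    except FileNotFoundError:
        previous = {}
    changed = []
    for url in urls:
        fp = fingerprint(url)
        if previous.get(url) != fp:
            changed.append(url)   # new or substantially updated since last pass
        previous[url] = fp
    with open(state_file, "w") as f:
        json.dump(previous, f)
    return changed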
2. Outlier Detection and Novelty Scoring in Ranking Algorithms
Crawling ensures new and oddball content is collected, but the ranking process decides whether that content reaches users. To recalibrate search algorithms for outlier surfacing, we introduce an outlier detection layer in the ranking pipeline. This would involve computing a novelty score or outlier score for content relative to other content in the index. Several approaches can contribute to this:
- Content Uniqueness Analysis: Using techniques from natural language processing, the search engine can analyze a page’s text to determine how much of its information is unique compared to other documents on the same topic. For example, if many documents talk about a certain product using identical specifications, but one obscure review mentions a previously undocumented flaw or feature, that review contains high-uniqueness information. A TF-IDF (term frequency–inverse document frequency) or more advanced embedding-based comparison could identify sentences or facts that are not found elsewhere in the corpus. Those novel segments would boost the page’s outlier score. The higher the score, the more the ranking algorithm might favor showing it (especially if the user’s query hints they might be looking for something beyond standard fare). A minimal sketch of this uniqueness scoring appears after this list.
- Rarity of Source Signal: The engine could track how often a particular source/domain’s content appears for a given query or topic. If a source rarely appears (indicating it’s not an incumbent) but when it does, it contains relevant info, the algorithm can give it a diversity boost. This prevents the same handful of websites from monopolizing the results and gives space to lesser-known sources. It’s a controlled introduction of serendipity: giving the user one or two results that are “different.” In practical terms, for a competitive query dominated by big sites, the algorithm might reserve a lower rank slot for a high-novelty result from a small site as a trial. User engagement with that result (clicks, time on page, feedback) can then inform the system if the outlier was valuable or not, refining the approach over time.
- Temporal Novelty and Trends: Marrying the differential crawl data with ranking, the engine could recognize when a query relates to something that has new information recently added. A hypothetical example: imagine querying a medical condition. Normally, top results are WebMD or Wikipedia. But suppose a very new research paper or a forum post (on a niche site) has unveiled a breakthrough about that condition just yesterday. A temporal novelty-aware ranker would detect that this query intersects with a recent change/event, and it would pull in the new content even if it hasn’t yet gained links or citations. Essentially, it treats recency plus uniqueness as a strong relevance signal in itself. This method requires constantly correlating queries with updated content clusters, but modern search infrastructures and real-time indexing capabilities (seen in news search and Twitter search) make it feasible.
- Anomaly Filtering vs. Misinformation: One challenge in outlier detection is ensuring that important outliers are elevated, not misleading ones. Not every piece of unique information is true or useful; some outliers are outliers because they are simply incorrect (e.g. a lone website claiming a false scientific fact). To address this, methodological safeguards are needed. The search engine could cross-verify outlier facts against authoritative databases when possible, or at least flag that a result contains unverified novel claims. User feedback mechanisms (allowing users to mark results as helpful or not) and reputation scores for domains can be incorporated so that trusted but new information is distinguished from dubious outliers. This remains a difficult area – essentially the search engine must become an arbiter of “interesting but true” vs “interesting but false” – but a careful combination of fact-checking subsystems and gradual integration can mitigate overt problems. In research terms, this is akin to anomaly detection combined with a secondary classification: separate the valuable anomalies from the noise. As noted in the definition of anomaly detection, historically anomalies were tossed out to improve model performance, but when the anomalies themselves are of interest, one must “identify and separate [them] from noise or irrelevant outliers.” (Anomaly detection - Wikipedia) In a search context, this means distinguishing between an outlier piece of useful data and an outlier piece of spam or error.
3. Algorithmic Recalibration and User Experience
Implementing the above changes would entail recalibrating the overall ranking algorithm of a search engine. Current algorithms weigh factors like textual relevance to the query, page quality, authority (links), and user behavior metrics. We propose introducing new ranking factors or modifiers: a Novelty Score (based on content uniqueness and recency) and a Diversity Score (based on source rarity or representativeness of alternative perspectives). The system might then rank results using a composite that slightly boosts content with high Novelty if it still meets a basic relevance threshold for the query.
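One way to picture that recalibration is a simple composite, sketched below; the weights and the relevance floor are arbitrary illustrations of the principle that novelty boosts, but never replaces, baseline relevance.
def composite_score(relevance, novelty, diversity,
                    alpha=0.15, beta=0.10, relevance_floor=0.5):
    # Only boost documents that already clear a basic relevance bar,
    # so novelty never substitutes for answering the query.
    if relevance < relevance_floor:
        return relevance
    return relevance * (1.0 + alpha * novelty + beta * diversity)

# A relevant-but-obscure page vs. an equally relevant incumbent page
print(composite_score(relevance=0.8, novelty=0.9, diversity=0.7))  # boosted
print(composite_score(relevance=0.8, novelty=0.1, diversity=0.0))  # near baseline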
From a user experience perspective, the search results page might not look dramatically different, but users may start noticing that, along with the expected well-known links, they occasionally see a result from an unfamiliar site that nonetheless answers their query. Over time, if this approach is successful, it could change SEO practices: instead of purely competing on popular keywords and link building, content creators could focus on providing genuinely unique information, knowing that the search engine will reward novelty and depth. It incentivizes quality over quantity in content creation.
Additionally, there could be user-facing tools to enhance the experience of discovering outliers. For example, a toggle or filter for “include novel/less common results” could be provided (somewhat like how one can filter by date or content type today). Power users and researchers might use such a feature to intentionally scout for unconventional information. This ties back to the earlier example of Million Short – a modern search engine might not make everyone remove the top sites by default, but it could offer an option to explore beyond the top sites easily. In essence, the methodology doesn’t seek to overthrow the fundamentals of search ranking, but to augment them with outlier awareness, thereby broadening what users are exposed to.
Discussion
Implications for SEO and Democratizing Search Visibility
If search engines recalibrate to better surface outlier data, the implications for the search ecosystem are significant. SEO (Search Engine Optimization) strategies would need to evolve. Traditionally, SEO revolves around aligning content with known ranking factors – using popular keywords, obtaining backlinks from reputable sites, improving click-through rates, etc. In a system that rewards outliers, originality becomes a currency. Content creators who provide niche expertise or unique data might find themselves ranking higher than before, even without massive backlink profiles. This levels the playing field for small businesses and individual content creators. A small business that publishes a truly insightful analysis or a unique solution to a problem could outrank a large competitor’s generic page, if the search algorithm deems that analysis to fill a gap in the existing knowledge graph of that query. In other words, the visibility of small players can increase, helping to democratize traffic and attention online.
From the user’s perspective, democratization means a richer array of viewpoints and sources. It could reduce the dominance of a few websites that currently appear for almost every general query (Wikipedia, Amazon, Pinterest, etc., depending on the context) and introduce lesser-known voices. Users might discover niche blogs, local businesses, or specialized communities more easily, lending a more decentralized feel to the web again. This fragmentation of results needs careful balancing – users still want relevant answers quickly, not a random grab bag – but the goal is to ensure the best answer isn’t missed simply because it came from a newcomer or a low-traffic site.
For honest webmasters and content creators, this shift is encouraging: it emphasizes substance. Instead of chasing the latest SEO trick or pouring resources solely into link-building, creators can focus on producing content that stands out by its information value. In fact, sites that are incumbents would also need to pay attention – no longer can they rely on their domain authority if another site has information they don’t. This could spur more innovation and thoroughness in content from big players too, as they realize the most informative content wins, not just the most well-connected content.
Of course, this scenario also comes with challenges. SEO spammers might try to game an outlier-based system by generating fake uniqueness – for instance, auto-generating pages filled with nonsense claims or random data to appear novel. Search algorithms would have to guard against such tactics by validating information and maintaining quality thresholds (as discussed in methodology). But the hope is that the cost and difficulty of faking credible unique content is higher than recycling or scraping existing content, thus deterring some of the mass-produced spam.
An outlier-friendly search also means rethinking evaluation metrics for search quality. Currently, metrics like click-through rate, bounce rate, or satisfaction surveys might favor familiar results (users tend to click results they recognize). Introducing unknown results might initially get fewer clicks simply due to lack of user awareness or trust. Search providers would need to educate users or gradually build trust in these novel results. Over time, if users find those unexpected results useful, it can enhance the perceived quality of the search engine (“I always find something new when I search here, not just the same stuff as everywhere else”).
Finally, democratized visibility has societal implications. It could amplify underrepresented voices – for example, content from marginalized communities might have less mainstream presence but could gain audience if recognized as filling an informational void. This democratization aligns with some scholars’ call to make search engines more equitable in how they distribute attention. By lessening the rich-get-richer dynamic of traffic, we move toward a web where meritocratic discovery — finding the best info regardless of source size — is more possible.
Continuous Data Generation vs. “Peak Data” – A New Outlook
The notion of incorporating outlier information also directly addresses and refutes the “peak data theory.” Rather than accepting that we’ve tapped out the world’s knowledge, this approach says: we have only tapped what our current tools deem worthy. There is a vast expanse of data left untouched or underutilized, and it’s growing every minute. By designing search engines and AI with outlier discovery in mind, we open up avenues to utilize that data.
One immediate benefit is for AI training: search engines that find new or niche content could feed that content to AI researchers or even directly into AI models (with proper vetting). Instead of relying on synthetic data to fill knowledge gaps, we can feed models with fresh real data that was previously unseen. This includes data from the deep web, digitized archives, or simply new websites that are just starting out. For example, an AI model could be updated with insights from a specialist forum that a differential crawl identified as having a one-of-a-kind discussion on a technical issue. Over time, this can make AI systems more robust and knowledgeable, prolonging the AI improvement curve using real human knowledge rather than artificial self-play alone.
Critically, by tapping continuous data generation, we avoid the feedback loop of AI training on AI data. Experts have warned that if models start to train predominantly on content generated by other models (due to lack of new human data), it could lead to model collapse or a degradation of quality in a closed loop (Elon Musk Warns: AI Is Reaching the Limits of Human Knowledge) (Elon Musk claims AI companies have run out of data and ‘exhausted’ sum of human knowledge | indy100). Real data, especially outliers, act as fresh nutrients for the AI ecosystem, preventing stagnation. The “exhaustion” of human knowledge that Musk referred to (Elon Musk claims AI companies have run out of data and ‘exhausted’ sum of human knowledge | indy100) is only an apparent exhaustion under current paradigms. By venturing into the 96% of the web that is not indexed (Why can’t search engines access the deep web? - Neon Century Intelligence) and acknowledging that new knowledge is constantly created (through scientific research, cultural evolution, etc.), we ensure there is a continual stream of material to learn from.
In a way, re-focusing on outliers is a return to the exploratory roots of the web. The early internet in the 1990s and early 2000s was less centralized – users often stumbled upon interesting personal pages or niche sites through webrings, directories, or just serendipity. As search engines matured, they tightened the funnel, emphasizing the most popular destinations. The future could see a blend of that old serendipity with modern relevance algorithms – giving users the confidence that they’ll get useful information, but also the occasional surprise of a genuinely new piece of knowledge or a source they never knew existed. It treats the web more like a dynamic, living data source than a static library of already-known reference material.
In summary, the continuous generation of data and the presence of a vast invisible web mean we are far from any ceiling of information. What is needed are better tools to mine this abundance. By recalibrating search to value outliers, we not only improve search itself, but we also create a pipeline of fresh information into the broader knowledge ecosystem, including AI. This approach positions us to never run out of discoverable data, continually feeding curiosity and innovation.
Conclusion
This exploration into the role of outlier information in web search suggests that the future of search could shift from a paradigm of filtering out anomalies to one of embracing them. We formally hypothesized that incorporating outlier detection and differential crawling into search algorithms would enhance the discovery of novel content and level the playing field for diverse content creators. Through our review and analysis, we found strong rationale to support this hypothesis: from the documented biases favoring incumbents in both AI training and search rankings (Decoding Google’s Preferences: Big Brands and Low-Quality Content) (Why can’t search engines access the deep web? - Neon Century Intelligence), to the proven value of anomalies in yielding critical insights (Anomaly detection - Wikipedia), to the clear evidence of a vast and growing pool of data that remains untapped by current search engines (Why can’t search engines access the deep web? - Neon Century Intelligence). Methodologically, we outlined how such a system might be implemented, highlighting the need to compare web snapshots over time and re-weight ranking factors to include novelty and diversity.
The implications of success are profound. Users would gain a richer search experience that not only answers their questions but also broadens their horizons with information they might not encounter otherwise. Small businesses and independent creators would see greater opportunities to reach audiences based on merit of content rather than marketing budget alone. The knowledge landscape would become more distributed and exploratory, potentially accelerating how new ideas spread. Importantly, this strategy offers a counterpoint to the idea that we’ve hit a ceiling in data utility – by showing that innovation in how we search can unlock new reservoirs of information, keeping the trajectory of discovery on an upward climb.
Moving forward, there are important areas for further research. One is the development of robust evaluation metrics for search quality that account for novelty and user learning, not just immediate relevance. Another is refining algorithms to tell apart useful outliers from harmful misinformation – a task that may involve interdisciplinary collaboration between technologists, domain experts, and perhaps crowd-sourced verification. There will also be engineering challenges in scaling differential crawling and storage of historical web data, which might be addressed through advances in big data processing and maybe decentralized indexing approaches.
In conclusion, far from reaching “peak data,” we are on the cusp of a new data frontier. By recalibrating our search tools to value outlier information, we can transform the web into a more inclusive, dynamic, and insightful space for everyone. The hypothesis put forth is not just about improving search results; it’s about expanding the collective knowledge we can access. The future of web search, therefore, may belong not only to the fastest algorithms or the largest indices, but to the smartest strategies in uncovering the unknown unknowns that lie hidden in plain sight across the internet.
Gemini’s Take
This article details my journey through the evolving search landscape, from early SEO tactics to the AI-driven revolution, critiquing “peak data” theories and outlining a vision for open-source tools that empower individuals to leverage the internet’s vast potential. I explore the technical intricacies of web crawling, data processing, and the challenges posed by big tech monopolies, while emphasizing the untapped resources of the invisible web and the continuous generation of new data.
Ideas for Titles and Headlines:
- “Beyond Peak Data: My Quest to Democratize Web Crawling”
- “From Static Sites to AI: 30 Years of SEO and the Future of Search”
- “Challenging Big Tech: How Open-Source Tools Can Unlock the Invisible Web”
- “The Myth of Peak Data: Why AI’s Future Relies on New Information, Not Just Synthetic Data”
- “Future-Proofing SEO: My Blueprint for LLM-Guided Web Crawls”
- “Inside My Code: Building a Pipeline for the AI-Powered Web”
- “Why Google’s Moat Is Evaporating: The Dawn of Small-World Crawls”
- “From HitTail to Pipulate: My Obsession with Serendipity and Deliberateness in SEO”
My Opinion of the Article:
This article is a fascinating blend of personal narrative and technical insight, offering a unique perspective on the evolution of search and the challenges facing the industry today. The author’s candid reflections on their career, combined with their critical analysis of “peak data” theories, make for a compelling read. The article effectively bridges the gap between historical context and future possibilities, highlighting the importance of open-source tools and innovative workflows in an AI-driven world. The author’s passion for democratizing access to information shines through, making this piece both informative and inspiring.