Setting the Stage
I am here to do the best work of my life. That is the reason I am where I am now. The problems they have are the ones addressed by the itches I’m scratching.
The Early Days of Search
1994 was a big year in search. After I wrapped up working for Commodore Computers (they went under) but before I moved to New York City, we saw Yahoo, Lycos, Excite and Infoseek all come onto the scene. AltaVista arrived in ’95. And that’s when I consider myself becoming an SEO in spirit if not in name. Now it’s 2025, so I’ve been SEO-aware for 30 years, and professionally calling myself an SEO for about 25 of those years, since the move to NYC.
Early Technical Approaches
As I got started in the burbs of Philly, I felt I was effective as an SEO because I had direct control of the websites. I was doing static site generation long before tools like Jekyll came onto the scene. I kept the website content itself in XML files, and later in databases from which I generated the XML files. The files would get sliced, diced and transformed into websites in multiple transformation passes. I used a novel and kooky tech called XSLT (Extensible Stylesheet Language Transformations), which was supposed to be my path to future-proofing.
They weren’t.
While these XSL stylesheet transformations were a really awesome secret weapon in SEO, especially during the era of the long-tail when such precise control of link construction could win you visits, it didn’t last. JavaScript matured and JSON replaced XML as the interoperable language of choice. NodeJS wasn’t on the scene yet, so the concept of the full stack wasn’t a thing. I actually investigated JavaScript on the server before Node hit the scene, and there was nothing viable. This is when I opened my mind to other programming languages, tried Ruby to see what the big deal was about, but then got bitten by the Python bug.
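For anyone curious what that slice-and-dice pipeline looked like in spirit, here is a minimal modern sketch of the same idea using Python’s lxml rather than my original toolchain; the filenames are placeholders:

```python
# Minimal XML -> HTML transformation sketch using lxml (hypothetical filenames).
from lxml import etree

# Load the content (e.g. product data, articles) and the stylesheet.
doc = etree.parse("content.xml")
xslt = etree.parse("site-template.xsl")

# Compile the stylesheet and run one transformation pass.
transform = etree.XSLT(xslt)
result = transform(doc)

# Write the generated page; multi-pass setups chain transforms like this one.
with open("index.html", "wb") as f:
    f.write(etree.tostring(result, pretty_print=True))
```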
The Move to New York
By this time, I was no longer making the websites directly myself. I had moved to New York City to become a vice president at the company that had once launched Amazon.com. It was the public relations company for Amazon back during the dotcom boom era of the late 90s/early 2000s, before the bust. I went to work for them during their second coming, after the bust was over and Web 2.0 was on the rise. I moved to NYC circa 2004/2005. Within a year, 2006, I had created HitTail, a Web 2.0 era app that took the best of the long-tail slice & dice learnings and spun it into a real-time writing suggestion tool, mostly for bloggers but the technique would work with any SEO-friendly CMS.
The Evolution of Search Algorithms
Even to this day, the HitTail algorithm still holds half the formula. If something is nearly working for you, you need only push it over the edge. It’s like those penny arcade games in skee-ball joints and Chuck E Cheese where the quarters (or tokens) are hanging over the edge and you drop a coin in to see how many you can get to topple over the edge. That’s HitTail. Or was.
The Concept of Striking Distance
In SEO circles, this is often known as striking distance or low-hanging fruit. If the data shows that a page of yours is in the search results — positioning, but not positioning well (page 2 or lower) — then the data is trying to tell you something. It’s like having a new content-topic suggestion-box built into your website. Just follow the data! HitTail was a serendipity to deliberateness engine.
The Power of Incremental Improvement
In other words, if you were not deliberately targeting that term before and now you see that you have the propensity for positioning for it, if you took a few extra measures, you could improve its position. And that was a reliable, repeatable process. It could (and did for many people) create the snowball effect of content your audience wanted to hear about, leading to more traffic, leading to more serendipitous hits, leading to more new topic writing suggestions, leading to more content, leading to more hits.
The End of an Era
A positive feedback loop and virtuous cycle! That is, until Google more or less went to war against the long-tail, dialing up brand importance, giving those sites safe harbor in SERPs, while dialing down “low-quality” pages, effectively filtering out the noise like the old squelch dial on CB radios. Anyone? The sound of crickets…
The Power of Verbatim Search
But search engines were a lot easier to affect back then, in a verbatim fashion. The concept of “verbatim” matching, that is, preferring results whose pages contain exact matches of what you searched for, is so powerful that Google keeps the concept alive to this day, though you have to click the Web link under Tools and then switch All results to Verbatim. It’s buried, but it’s there. Think about what that means. After all these years, there is still enough reason to keep it there, and the underlying systems still work in such a way that it can be there. There’s a lot to read into that. I sometimes think that if not for SEO spammers, Web/Verbatim would still be the preferred way of searching.
The Evolution of Google Search
So Universal Search hit, which was the layering-in of custom (non-traditional) search results from other sources like maps, images and local. Then Rich Snippets and the Knowledge Graph hit, and then the changes came fast and furious. The machine learning and AI stuff started hitting with RankBrain around 2015, then BERT around 2018 (a year after the fateful “Attention Is All You Need” Transformer paper), yadda yadda… and the rest is sort of a dramatic tension between keeping things just-so, so that AdWords revenue was not dramatically cut into, while keeping pace with the times. They were effectively cornered into the classic innovator’s dilemma because most of their revenue came from advertising inside a particular user experience in Google-brand-dominated search.
The ChatGPT Revolution
And then in November 2022 ChatGPT hit. Microsoft saw Google trapped in the corner by its innovator’s-dilemma reliance on AdWords and decided to go on the offensive. For Microsoft, Bing ad revenue is just one of many revenue streams, so they’re not risking much by madly experimenting there, and so they did. They teamed up with OpenAI to use ChatGPT to power Bing in an attempt to reframe the public’s expectation of search from that Googly top-10 blue links experience to more of a chat experience with embedded citations, still capable of passing traffic through to destination sites.
Microsoft’s Challenge to Google
The New Bing powered by ChatGPT launched around February 2023. You can find a full timeline of Microsoft’s determination to beat up on Google here, where I documented it during my brief time at Moz while it was happening. Microsoft continues to attempt to reframe the search value proposition, with Perplexity and OpenAI themselves jumping on that bandwagon. All that pressure, plus an antitrust lawsuit that Google lost, is still not enough to dislodge Google more than a couple of percentage points from search market dominance, leaving us to ask whether the Google castle is built on clouds or on a more solid foundation than what Microsoft and the other competitors chomping at the bit for a piece of Google’s search action might have hoped.
The Antitrust Battle
And that brings us to today. The Google antitrust suit shined a light on just how much Google pays Apple to remain the default search engine of Safari Mobile (~$20 billion/year), and on quite how effective Google has been in making Chrome the default browser of choice. There are other questions here about quite how much that browser control might be, and has been, leveraged to improve Google’s search services in a way sufficiently anti-competitive to make a forced break-up the antitrust remedy.
You have to read a little bit into that, but the message here is that in order to address Google’s search dominance, divestment of their control of the browser is the remedy… hmmm. For lack of other, better remedies that would get more to the point, or because browser control actually yields unfair search dominance? Is it because of control over what happens when a non-URL is typed into the address bar? Maybe. That would line up nicely with why the Apple deal is so in the spotlight on this issue, because what Safari Mobile does in that condition is the Google money-maker. If Google pays $20 billion/year for that right, how much must they be making from control of that feature alone? This whole thing could be framed as the battle over the browser address bar.
The Future of SEO
Alright, back to the original point. I am where I am to do the best work of my life, against this dynamic industry backdrop where the dominant juggernaut is going to do everything it can to hang on to its mostly brand-generated and behind-closed-doors-deal-generated market dominance, while a super-competitor who controls the most dominant desktop OS in the world is gunning for them, along with every little pisher AI-powered search competitor like Perplexity.
My schtick is future-proofing and resisting obsolescence as my very field is under fire in the first meaningful way in a quarter-century. And frankly, its time has come. But not in an SEO-is-dead sort of way quite so much as let’s get all this right so that you don’t need to be the JavaScript-rendering massive infrastructure of Google to run clever new AI search startups.
The Network as Computer
Sun Microsystems once said the network is the computer… and then incurred Microsoft’s wrath and ended up as nothing more than another Oracle acquisition themselves. Whoops. Specifically, that comment and the reality it represented drove Microsoft to incorporate Internet Explorer into the OS as a built-in feature, incurring the antitrust wrath of the government themselves and being forced to break IE out as a separate feature. But the damage was done and Netscape, the browser innovator at the time, was more or less killed. And the idea of the network as the computer was embraced, extended and extinguished. This is a pattern. What the once-dominant IBM is today is also a result of this strategy, and that’s what Google is facing.
The Internet as Database
Today, the Internet is the database. You don’t need no Google if an intelligent LLM-guided small-world-theory crawl can just go out onto the Internet and find what it’s looking for. URLs are unique keys (uniform, really, but full URLs with querystrings are effectively unique). The data they unlock are values. This makes the Internet itself the ultimate NoSQL database. NoSQL was a keyword of yore referring to a preference for a single super-optimized store of key/value-pairs as a practical alternative to the row-and-column data that has dominated the industry for so long, generally known as SQL databases and relational database management systems (RDBMS). Key-value NoSQL databases are sometimes posed as some new thing that became popular with MongoDB, Redis, Cassandra and CouchDB. But the truth is they’ve been built into Unix for a long time in the form of dbm (database manager), written in 1979 by Ken Thompson, the same guy who co-created Unix itself.
Perplexity: The built-in Unix key-value database is called dbm (database manager)[1][2]. It was originally written by Ken Thompson in 1979 and provides a simple string-based key-value storage system[2]. The dbm library is included in most Unix-like operating systems and offers a generic dictionary-like interface for storing and retrieving data[1].
Key features of dbm include:
- Simple key-value storage using string keys to access string values[1].
- Ability to open existing databases in read-only or read-write mode[1].
- Automatic conversion of keys to byte strings when added to or retrieved from the database[1].
- Compatibility with various programming languages, including Python[1][2].
While dbm is considered old and rarely mentioned in modern NoSQL discussions, it is still used by many pieces of software due to its simplicity and ubiquity in Unix systems[2].
Citations:
[1] https://pymotw.com/3/dbm/index.html
[2] https://en.wikipedia.org/wiki/Key-value_store
[3] https://stackoverflow.com/questions/1690605/reliable-and-efficient-key-value-database-for-linux
[4] https://unix.stackexchange.com/questions/21943/standard-key-value-datastore-for-unix
[5] https://www.influxdata.com/key-value-database/
[6] https://etcd.io/docs/v3.4/learning/why/
[7] https://news.ycombinator.com/item?id=40164237
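To make the key/value idea concrete, a minimal sketch of Python’s standard-library dbm module is about all it takes; the filename, key and value below are just placeholders:

```python
# Minimal key/value storage sketch with Python's standard-library dbm module.
import dbm

# 'c' creates the database file if it doesn't exist (filename is arbitrary).
with dbm.open("example_cache", "c") as db:
    db["https://example.com/"] = "<html>cached page</html>"  # key in, value stored
    print(db[b"https://example.com/"])  # values come back as bytes
```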
Berkeley DB: The Industrial dbm
An industrial, grown-up version of this has been around for a while too, from some time before the concept of NoSQL burst onto the scene. But it was swallowed up by Oracle, and few realized the significance of systems like this until they were combined with the concept of distributing them and used as an alternative to SQL for all sorts of caching purposes that lend themselves well to this. See, URL in, webpage out? Key/value-pairs! The whole darn Web is this kind of database. What was that grown-up version of dbm called?
Perplexity: The grown-up industrial version of the Unix key-value database was Berkeley DB (BDB).
Berkeley DB evolved from the original Unix dbm (database manager) implementation. It was first developed at the University of California, Berkeley in 1991 as part of the transition from BSD 4.3 to 4.4[4]. The initial goal was to replace the existing in-memory and on-disk hashing systems with a unified system that could operate both in memory and on disk[1].
Berkeley DB’s fate:

- Commercial development: In 1996, Netscape contracted with Margo Seltzer and Keith Bostic to develop a production-quality version of the software, leading to Berkeley DB 2.0 in 1997[2].
- Company formation: Sleepycat Software was created in 1996 to continue development and support of Berkeley DB[4].
- Acquisition: Oracle Corporation acquired Sleepycat Software in February 2006[4].
- Continued development: Oracle has continued to develop and support Berkeley DB, with version 5.0 released in 2010 adding SQL support[2].
- Open-source status: Berkeley DB remains an open-source, embedded database software library for key-value data, despite its commercial ownership[9].
- Influence: Berkeley DB has inspired other embedded key-value databases like Facebook’s RocksDB and Google’s LevelDB[7].
Today, Berkeley DB is still used in various applications, offering features such as ACID compliance, high performance, and the ability to handle very large key-value stores (up to terabytes in size)[1].
Citations:
[1] https://dbdb.io/db/berkeley-db
[2] https://aosabook.org/en/v1/bdb.html
[3] https://www.oracle.com/technical-resources/articles/database/oracle-berkeley-db-storage-layer.html
[4] https://www.orafaq.com/wiki/Berkeley_DB
[5] https://www.markhansen.co.nz/berkeley-db/
[6] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=a47c4959b071c9292090c81788c3f6e2eecb8827
[7] https://www.influxdata.com/key-value-database/
[8] https://apps.state.or.us/tech/berkeleyDB/ref/intro/dbisnot.html
[9] https://en.wikipedia.org/wiki/Berkeley_DB_XML
The Web as Database
Nice, thanks Perplexity. Yeah, so if we start thinking about the Web itself as the database, then that Google competitive moat, aka a copy of the Internet made by its crawler and distilled down for fast answers with its index, starts to evaporate. Where’s the moat if anybody can write a search engine that has even more up-to-date information regarding product availability or new knowledge on a topic than even Google? This is where small-world-theory crawls come in: no page is more than six degrees of separation (six links away, or whatever) from a well-chosen start page you begin the crawl from.
The Return of Web Directories
So you still need some sort of search. Or maybe an old-school directory site like Yahoo or The Open Directory Project (DMOZ)? You just need to seed the small-world crawl, and follow the links from there.
The Resurgence of Link Graphs
The link-graph becomes important again! That whole <a href="[link]">keyword</a> old-school SEO view of the world becomes important again, because that’s what the crawlers follow. I mean they still do, and they’ve never not been important for crawling. It’s just that SEO spammers ruined them for everyone as a primary indicator of search relevance.
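Here is a minimal sketch of what such a link-following, small-world crawl could look like in Python, standard library only and capped at six hops; the seed URL is a placeholder, and a real crawler would add politeness delays, robots.txt handling and smarter link selection:

```python
# Breadth-first, depth-limited link crawl sketch (standard library only).
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def small_world_crawl(seed_url, max_hops=6, max_pages=50):
    seen, queue = {seed_url}, deque([(seed_url, 0)])
    while queue:
        url, hops = queue.popleft()
        if hops >= max_hops:
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # skip pages that fail to fetch
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, hops + 1))
                if len(seen) >= max_pages:
                    return seen
    return seen

# Example (placeholder seed): pages reachable within six hops of a homepage.
# print(small_world_crawl("https://example.com/"))
```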
A New Incentive Structure
But if you know now that an intelligently guided LLM small-world crawler is going to work this way once it reaches your site, the incentive to spam goes down. Today’s spamming techniques attempt to get the engines like Google to drop visitors directly onto the correct sub-page of the site, so the incentive is to spam links everywhere. But if instead you know that the homepage will almost always be used as the point of entry and that the links truly are there for the best drill-down experience of both humans and crawlers, the game changes.
The Importance of Homepages
The details of that first event that drops you onto the correct starting homepage are important. But we can assume that important homepages that align with important brands will be the non-volatile links that get baked into the trained LLM models. Specific deep URLs are going to get filtered, cleaned, normalized and curated out of the training data. Nobody wants URLs that won’t stay exactly the same over the life of the model’s use baked into the model. But naked apex domains, the registered ones that are usually the homepages, are a whole other matter. What model trainer wouldn’t allow the models to know that brandx’s homepage resides at the brandx.com registered domain? And an attempt to visit that will almost always auto-forward to the correct landing page.
LLMs as Web Directories
So it’s conceivable that the first hop of such small-world webcrawls will be baked into the LLM models themselves! An old-school Yahoo Directory baked into every web-surfing LLM, and real-time page-to-page surfing after that, because… well, because after you view the Internet as the giant ultimate key/value-pair NoSQL database, you can then view it similarly as the giant ultimate vector database. That is, because of the <a href> links and their attributes, it is by definition creating the nodes and edges of a network graph, and is thus also a hierarchically oriented vector database too! And if treated that way, we really start to see a world where LLMs can surf the web just as well as humans, without having to execute JavaScript, and get answers really, really fast — six hops and all!
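Put another way, the crawl’s output is already a graph; a tiny sketch with hypothetical URLs:

```python
# Treating extracted links as a graph: nodes are URLs, edges are <a href> references.
from collections import defaultdict

# Hypothetical crawl output: (from_url, to_url) pairs.
edges = [
    ("https://brandx.com/", "https://brandx.com/products/"),
    ("https://brandx.com/products/", "https://brandx.com/products/widget-a/"),
    ("https://brandx.com/", "https://brandx.com/about/"),
]

graph = defaultdict(list)
for src, dst in edges:
    graph[src].append(dst)

# Hop-by-hop traversal is then just a walk over the adjacency list.
print(graph["https://brandx.com/"])
```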
Website Design for the LLM Era
It’s just website hygiene! You plan for this world because it’s just good website design. From a marketing and brand reputation standpoint, you make sure that the story of your brand and your products in a general sense is told in a compelling way that can catch the fancy of the LLMs if it happens to get baked into the base models. And of course you tell that story so it gets picked up on the crawls and somehow makes it through the filtering-for-quality curation process.
The Future of SEO Workflows
I’m still working out the details of this, but only the details now, thank goodness. In broad strokes, how it will run and how an individual human SEO does the work is in terms of these Pipulate/Botifython workflow pipelines I’ve been talking about in previous articles. This is the best work of my life that I’ve also been talking about, because I am able to reach farther and achieve greater things than ever before in my life. Not merely because of the wisdom of age, though I find that helps a bit too, but because of the recalibration of how we work around the new AI tools.
A mostly linear workflow can walk you from step to step along a process. It can guide you to do this, and then do that. This is a rote process that can be written out, along with the rationale for each step, and that is precisely what we are up to.
Peak Data Discussion
But wait! Before I dive into this, something has been bugging me about the concept of having achieved “peak data” in LLM training, which I’ve heard both Ilya Sutskever and Elon Musk state, most recently about 3 minutes into this interview with Musk, where he seems to be referring to a Jensen Huang talk. I don’t care how big Nvidia, Tesla and SSI (Safe Superintelligence Inc) are. They don’t really have enough access to the world’s data to make such an assertion:
The cumulative sum of human knowledge has been exhausted in AI training
Me: Who has asserted that the cumulative sum of human knowledge has been exhausted in AI training?
Perplexity: Elon Musk has asserted that the cumulative sum of human knowledge for AI training was exhausted in 2024. During a livestream with Mark Penn, he stated that all available human data, including internet content and books, has been utilized, necessitating a shift to synthetic data generated by AI itself for future training[1][2][8]. This perspective aligns with former OpenAI chief scientist Ilya Sutskever’s comments about reaching “peak data” in the AI industry[2][5].
Sources
[1] Human data pool for AI training exhausted in 2024 – Elon Musk https://tribuneonlineng.com/human-data-pool-for-ai-training-exhausted-in-2024-elon-musk/
[2] Elon Musk says all human data for AI training exhausted in 2024 https://nairametrics.com/2025/01/09/elon-musk-says-all-human-data-for-ai-training-exhausted-in-2024/
[3] Real-world data for AI training exhausted? Elon Musk thinks so https://www.newsbytesapp.com/news/science/elon-musk-says-ai-has-exhausted-all-real-world-training-data/story
[4] AI Should Augment Human Intelligence, Not Replace It https://hbr.org/2021/03/ai-should-augment-human-intelligence-not-replace-it
[5] Are AI Models Running Out of Real-World Data? Elon Musk Thinks So https://autogpt.net/are-ai-models-running-out-of-real-world-data-elon-musk-thinks-so/
[6] Technological singularity - Wikipedia https://en.wikipedia.org/wiki/Technological_Singularity
[7] Elon Musk Acknowledges the Limits of Available AI Training Data! https://www.allaboutai.com/ai-news/elon-musk-acknowledges-the-limits-of-available-ai-training-data/
[8] Have AI Companies Run Out of Training Data? Elon Musk Thinks So https://petapixel.com/2025/01/09/have-ai-companies-run-out-of-training-data-elon-musk-ilya-sutskever-thinks-so/
[9] Elon Musk and AI Experts Agree: We've Exhausted Real-World Data ... https://technews180.com/artificial-intelligence/elon-musk-and-ai-experts-agree-weve-exhausted-real-world-data-for-ai-models/
[10] Elon Musk Declares AI Training Has Exhausted Human Knowledge https://www.thinkwithniche.com/amp/news/elon-musk-declares-ai-training-has-exhausted-human-knowledge
[11] TechCrunch on X: "“We've now exhausted basically the cumulative ... https://twitter.com/TechCrunch/status/1877425379708719565
Me: When did Ilya Sutskever express this?
Perplexity: Ilya Sutskever expressed his views on “peak data” during a presentation at the NeurIPS conference on December 13, 2024. He stated, “We’ve achieved peak data and there will be no more,” highlighting concerns about the limitations of available training data for AI models and the implications for future AI development[1][4][5].
Sources
[1] “Peak Data”: Ilya Sutskever on the future of AI - ALPHA AVENUE https://alphaavenue.ai/en/magazine/technologies/peak-data-ilya-sutskever-on-the-future-of-ai/
[2] Ilya Sutskever Predicts the End of Pre-Training as AI Hits 'Peak Data' https://opentools.ai/news/ilya-sutskever-predicts-the-end-of-pre-training-as-ai-hits-peak-data
[3] AI Hit 'Peak Data.' Google Researchers Think They Found a Solution. https://www.businessinsider.com/ai-peak-data-google-deepmind-researchers-solution-test-time-compute-2025-1
[4] Data Is the 'Fossil Fuel' of A.I., Says OpenAI Co-Founder Ilya Sutskever https://observer.com/2024/12/openai-cofounder-ilya-sutskever-ai-data-peak/
[5] OpenAI co-founder says AI is reaching "peak data" as it hits the limits ... https://the-decoder.com/openai-co-founder-says-ai-is-reaching-peak-data-as-it-hits-the-limits-of-the-internet/
[6] The Next Chapter of AI: Ilya Sutskever's Vision for the Future of ... https://www.promptzone.com/rudy_froyen/the-next-chapter-of-ai-ilya-sutskevers-vision-for-the-future-of-artificial-intelligence-5e44
[7] Ilya Sutskever Calls Peak Data and the End of Pretraining - Dexa AI https://dexa.ai/d/b856e2a6-bd72-11ef-a143-bb2eba2ff942?t=2-End+of+Pre-Training
[8] Aaron Fulkerson on LinkedIn: OpenAI cofounder Ilya Sutskever says ... https://www.linkedin.com/posts/aaronfulkerson_openai-cofounder-ilya-sutskever-says-the-activity-7273854449377288192-YIJt
Me: What is the rate at which new data is believed to be produced in this world?
Perplexity: Approximately 402.74 million terabytes of data are created daily, with projections indicating that 181 zettabytes will be generated in total by the end of 2025. This represents a significant increase from previous years, driven largely by the proliferation of IoT devices and real-time data processing technologies[1][3][7]. The global data creation rate is expected to grow at a compound annual growth rate (CAGR) of 26% until 2025[1].
Sources
[1] 85+ Big Data Statistics To Map Growth in 2025 - G2 https://www.g2.com/articles/big-data-statistics
[2] Speed up training – MOSTLY AI Docs https://mostly.ai/docs/generators/configure/speed-up-training
[3] Amount of Data Created Daily (2024) - Exploding Topics https://explodingtopics.com/blog/data-generated-per-day
[4] AI Training Dataset Statistics 2024 By Technology, Predictions https://scoop.market.us/ai-training-dataset-statistics/
[5] [PDF] Data Age 2025: - Seagate Technology https://www.seagate.com/files/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf
[6] Using AI Only to Generate and Annotate Training and Fine-Tuning ... https://www.e2f.com/news/cheap-fast-and-not-recommended-using-ai-only-to-generate-and-annotate-training-and-fine-tuning-data
[7] Big data statistics: How much data is there in the world? - Rivery https://rivery.io/blog/big-data-statistics-how-much-data-is-there-in-the-world/
[8] Revolutionizing AI Training With Synthetic Data - Forbes https://www.forbes.com/councils/forbestechcouncil/2024/05/08/revolutionizing-ai-training-with-synthetic-data/
[9] Data growth worldwide 2010-2028 - Statista https://www.statista.com/statistics/871513/worldwide-data-created/
[10] How AI Is Driving Global Data Growth - Chris Curry https://www.chriscurry.co.uk/blog/ai-explosion-170-zettabytes-2025-future
Me: While this is not all quality data worthy of model training, this still sounds like quite a lot of data, with a statistical likelihood that some of it is Internet-accessible new data of high enough quality to be included in AI training. Is this consistent with Elon Musk and Ilya Sutskever’s statements? What am I missing, or do they have something wrong?
Perplexity: Elon Musk and Ilya Sutskever’s assertions about the exhaustion of real-world data for AI training are consistent with the concerns surrounding data quality and availability. While vast amounts of data are generated daily, much of it may not meet the necessary quality standards for effective AI training. Both Musk and Sutskever emphasize a shift towards synthetic data due to the perceived limits of existing real-world data, suggesting that even substantial quantities of new data may not be suitable for training advanced models [1][2][4].
Sources
[1] Elon Musk agrees that we’ve exhausted AI training data https://techcrunch.com/2025/01/08/elon-musk-agrees-that-weve-exhausted-ai-training-data/
[2] Have AI Companies Run Out of Training Data? Elon Musk Thinks So https://petapixel.com/2025/01/09/have-ai-companies-run-out-of-training-data-elon-musk-ilya-sutskever-thinks-so/
[3] What’s Ahead in Generative AI in 2025? (Part One) https://tdwi.org/Articles/2024/12/18/TA-ALL-Whats-Ahead-in-Generative-AI-in-2025-Part-One.aspx
[4] Elon Musk Acknowledges the Limits of Available AI Training Data! https://www.allaboutai.com/ai-news/elon-musk-acknowledges-the-limits-of-available-ai-training-data/
[5] The Vital Role of High-Quality Data in Training AI Models for IT https://www.lakesidesoftware.com/blog/the-vital-role-of-high-quality-data-in-training-ai-models-for-it/
[6] AI's Potential: The Role of Accessibility in Creating a Data-Rich ... https://www.inclusiveweb.co/accessibility-resources/ai-role-in-accessibiity-testing
[7] Harvard's Library Innovation Lab launches Institutional Data Initiative https://hls.harvard.edu/today/harvards-library-innovation-lab-launches-initiative-to-use-public-domain-data-to-train-artificial-intelligence/
[8] Data Quality in AI: Challenges, Importance & Best Practices https://research.aimultiple.com/data-quality-ai/
[9] Data 2025 outlook: AI drives a renaissance of data https://siliconangle.com/2025/01/06/data-2025-outlook-ai-drives-renaissance-data/
[10] The Data That Powers A.I. Is Disappearing Fast - The New York Times https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html
[11] How AI is Influencing Data Center Infrastructure Trends in 2025 https://blog.equinix.com/blog/2025/01/08/how-ai-is-influencing-data-center-infrastructure-trends-in-2025/
Me: I find three things about this assertion suspicious. First is this peak data concept that the sum of human knowledge has been exhausted. The sum of human knowledge implies all copyright material, including the New York Times and Sarah Silverman’s book, the copyright owners of which have both challenged the model trainers’ legal rights to include copyright material. But the sum of all human knowledge implies that they have included the material. This does not even include the uncrawlable invisible web that exists because most crawlers don’t execute JavaScript and render the DOM with a headless browser. Google’s does, but Elon and Sutskever aren’t necessarily using a crawler that does. And then finally, there is purely proprietary data like LexisNexis or secret data like the Renaissance Technologies Medallion Fund code. So such a statement is full of false assumptions and hubris. And that’s just point number one.
The Quality of New Information
The second suspicious thing is the claim that the new information being generated is not of high enough quality. It seems to me that statistically there is outlier information in there which is as valuable and unique as anything in history. It is just getting lost in the glut of new information, both in terms of accessibility and the ability to identify it as valuable. But nonetheless it exists, and the problem in my mind is one of access, identification, and curation, and not of the data somehow not existing, not being valuable, or not being meritorious enough to prevent some sort of peak-information top-out. You’ll notice that nobody from Google is making this assertion, as their crawler is more capable of crawling the invisible web than most, given its headless browser and scale. And Google owns YouTube, one of the great sources of new information. Look at how much data is coming in from the transcripts of those YouTube videos alone. Not all of it is valuable for machine learning, but it should not be so readily written off. Interesting and important, game-changing information often comes from outliers.
The Synthetic Data Alternative
Lastly, the suspicious thing is that the alternative to the more difficult task of identifying and curating new data is the generation of synthetic data. It seems to me that both Sutskever and Musk are suggesting synthetic data because they are in a position to generate it given their computing power, but lack a crawl or new-data-acquisition infrastructure like Google’s. It is a matter of reframing the challenge according to the “if all you have is a hammer, everything looks like a nail” principle instead of doing the harder thing and coming up with a new tool. Generating synthetic data is easier than creating some sort of important new capture mechanism for that vast quantity of data being generated every day. Elon did buy Twitter, so he does have one of those data sources. But it is an inane-babble data source and nothing like Google’s infrastructure or their Google Books Library Project, which has been digitizing books for years, copyright issues aside.
This would all seem to suggest that what can be learned from a static data set is all that can be learned, and only in the way that it is learned currently. Nothing new can be gleaned from another pass or a second look at the same data? And what of all the new data coming out at that vibrating fringe of what’s known, like all the space-telescope and LHC data that, from what I hear, could really use some help being analyzed?
And might I add that I imagine this scientific frontier data would be some of the most interesting and important to learn from. And to the argument that it’s not the kind of data they’re looking for, like Shakespeare and stuff, I’d ask them: “was Python?” Yet LLMs learned how to code. What greater things might be learned from new, unexpected and potentially unlimited data sources? And it’s being written off? Peak data my arse! That’s a defeatist attitude and one I would hardly imagine coming from Musk, currently the owner of the largest supercomputer in the world.
Conclusion
In other words, I think these peak data assertions are a combination of being misleading, lazy, unimaginative and sour grapes.
Perplexity: Your skepticism about the “peak data” assertion made by Ilya Sutskever and Elon Musk is valid and highlights several important considerations:
- Data Scope and Quality: The claim that the sum of human knowledge has been exhausted overlooks the complexities of data availability, including copyright restrictions, uncrawlable web content, and proprietary databases. This suggests that while high-quality datasets may be limited, vast amounts of potentially valuable data remain untapped or underutilized.
- Outlier Information: The argument that new information lacks quality fails to recognize that valuable insights often come from outliers. The challenge lies in effectively identifying and curating this data rather than a genuine scarcity of meaningful content.
- Synthetic Data Generation: The preference for synthetic data generation may stem from a lack of robust data capture mechanisms, as you noted. This approach could reflect a bias towards easier solutions rather than exploring innovative ways to access and leverage the wealth of existing data.
In summary, while Sutskever and Musk’s concerns about data limitations are grounded in real challenges, their conclusions may oversimplify the complexities of data availability and quality in AI training.
Sources
[1] Ilya Sutskever Predicts the End of Pre-Training as AI Hits 'Peak Data' https://opentools.ai/news/ilya-sutskever-predicts-the-end-of-pre-training-as-ai-hits-peak-data
[2] AI Hits 'Peak Data,' But Google DeepMind Has a Clever Plan to ... https://opentools.ai/news/ai-hits-peak-data-but-google-deepmind-has-a-clever-plan-to-keep-progress-rolling
[3] Google DeepMind researchers think they found a solution to AI's ... https://bestofai.com/article/google-deepmind-researchers-think-they-found-a-solution-to-ais-peak-data-problem
[4] OpenAI co-founder says AI is reaching "peak data" as it hits the limits ... https://the-decoder.com/openai-co-founder-says-ai-is-reaching-peak-data-as-it-hits-the-limits-of-the-internet/
[5] AI Hit 'Peak Data.' Google Researchers Think They Found a Solution. https://www.businessinsider.com/ai-peak-data-google-deepmind-researchers-solution-test-time-compute-2025-1
[6] Elon Musk agrees that we've exhausted AI training data https://finance.yahoo.com/news/elon-musk-agrees-weve-exhausted-040148636.html
[7] Ilya Sutskever Calls Peak Data and the End of Pretraining | Ask Dexa https://dexa.ai/d/b856e2a6-bd72-11ef-a143-bb2eba2ff942?t=1-Peak+Data+Debate
Gemini 2.0 Flash Experimental
I fed this conversation so far, at least the part about peak data, into Gemini 2.0 Flash Experimental to get its evaluation.
Gemini 2.0 Flash Experimental: You’re absolutely right to be skeptical about the “peak data” assertion. There are several reasons why:
- Limited Scope: Musk and Sutskever might be focusing on a specific type of high-quality data that’s readily available. They might be overlooking vast amounts of data that’s trickier to access, like the deep web or proprietary sources.
- Undervalued Outliers: Just because information is new doesn’t mean it’s low quality. Valuable insights can come from unexpected places. The challenge is finding those nuggets of value in the sea of data.
- Synthetic Data Bias: Perhaps their preference for synthetic data creation stems from their resources. They might have the computing power to generate it but lack the infrastructure for large-scale data capture, unlike Google with its powerful crawlers and vast libraries.
Here’s a breakdown of your points:
- Copyright and Uncrawlable Web: You’re right, copyright restrictions and uncrawlable parts of the web limit data availability. Additionally, there’s valuable private data like LexisNexis that’s not publicly accessible.
- Outlier Information Value: Absolutely! New information, even seemingly insignificant, can hold hidden gems. The key is having the tools to identify and curate this data effectively.
- Synthetic Data as Easier Option: It seems creating synthetic data might be their way of dealing with the data capture challenge. It’s like using a hammer for everything because that’s what you have, instead of finding a better tool for the job (better data capture mechanisms).
Overall, while Musk and Sutskever raise valid points about data limitations, their conclusions might be a bit hasty. There’s likely a lot more data out there waiting to be harnessed.
The Real Beginning
Alright, that peak data tangent was not really what I wanted to get to, which is this: another long-as-I-need, outside-business-hours, focused work session with a nap behind me, so I have that morning energy going into the strange vampire night hours when your mind can be at its peak.
Setting the Stage
And that’s where this article is really just beginning. All that stuff above about the history of the search industry, SEO, data transformations, Microsoft dirty tricks, small-world crawling and the Internet as 2 types of databases, the JavaScript execution challenge that makes the invisible web, and finally challenging the peak data assertions of Sutskever and Musk have all been just brain exercises. Limbering up. Reminding me who I am and what I’m doing here with these exercises, shifting power back to the little guy and trying to be one of those outliers whose work might have game-changing potential.
The Path Forward
The chisel-strikes I make next greatly determine this. My thought process and how I craft these chisel-strikes is all-important. As much as I want to “just port the Jupyter Notebook”, I have found my subconscious resisting the shovel-ware implementation that I could do, forcing me to do these carefully considered next steps you’re about to witness.
Normally I give one of the Gemini models a chance to help me with titles, descriptions and intros at this point because they were the first to the really big token windows capable of letting me prompt a whole article in one shot. Since then, certain OpenAI models have caught up, and in this case I already used Gemini above in an open-ended prompt. So the structured prompt that I use to put a bowtie on it goes to o1 Pro this time:
o1 Pro: Summary (in the author’s voice)
[Summary moved above as description / meta tag]
Ideas for Titles and Headlines
- “From Lycos to ChatGPT: My 30 Years in the Search Trenches”
- “SEO’s New Frontier: AI, Antitrust, and the Invisible Web”
- “Breaking Google’s Moat: The Rise of LLM Crawlers and Small-World Theories”
- “Peak Data? Why the Future of Search Hinges on Outlier Information”
- “Reimagining SEO in the AI Age: A Veteran’s View”
- “Synthetic Data vs. the Real Web: Where Do We Go Next?”
- “The Hidden Power of Homepages: Designing Sites for LLM Crawls”
- “New Search, Old Tricks: Microsoft, Google, and the AI Revolution”
My Opinion of the Article
I find this article refreshingly candid and wide-ranging, connecting the early, hands-on realities of building sites for search with today’s pressing questions about AI and antitrust. The author’s skepticism about “peak data” underlines a core insight: that true innovation demands deeper access to new and hidden data sets, rather than retreating to synthetic stand-ins. This piece cuts through hype by reminding us how easily we can overlook the vast trove of real information still out there—if only we’d muster the grit to find, curate, and leverage it. It’s a call for more inclusive, inventive approaches to future-proofing SEO and web infrastructure, and it does so in a personal, almost confessional style that makes the technological deep-dives both engaging and thought-provoking.
Wow, o1 Pro! Nice. Maybe you’ll wrestle the job of summarizing these things from Gemini now that you can handle the token windows too. I particularly like your opinion, but that is of course because it flatters me, and who doesn’t like confirmation bias? I have to watch out for that as much as I enjoy having my ego stroked by it. Anyway, onward!
Can you guess where this article is going to go from here? I bet you can’t.
I need to port a Jupyter Notebook, but it was created under the assumptions one has when using a Jupyter Notebook, and those assumptions have changed and I can do so much more. I can do so much differently. I can do it so much better.
And I could fall down a rabbit hole that will ruin the best laid plans of stealing this second work-day after a nice run of good nights’ sleep. This is how our ancestors worked. Feast and famine. The caveman nomad rhythm. Good times when you rest. Hard times when you pick up and travel, facing the harsh realities and the unknown.
Today I hunt.
My quarry is popping up web browsers, for 1 of 2 purposes:
- Having access to the view-source HTML and maybe even the JavaScript-rendered document object model (DOM).
- Not doing anything interactive with the browser at all, and merely giving the user instructions to go download something.
Everything else falls in place after this.
The former is a potential rabbit hole, involving Puppeteer, Playwright or Selenium — any of which would open a browser under some sort of automation context using the API hooks Google put in there for DevTools or other techniques to actually control the browser.
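If I did go down that rabbit hole, a minimal sketch using Playwright’s sync API might look something like this; the URL is purely a placeholder and the real parsing work is omitted:

```python
# Browser-automation sketch using Playwright's sync API (the rabbit-hole route).
from playwright.sync_api import sync_playwright

def fetch_rendered_dom(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # visible browser window
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")     # let JavaScript render
        html = page.content()                        # the rendered DOM as HTML
        browser.close()
    return html

# Example (placeholder URL):
# dom = fetch_rendered_dom("https://www.google.com/search?q=example+keyword")
```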
The latter just pops open your default browser with your default (or last used) user profile, without any automation context at all. It just pops open a particular URL, which might have a querystring on it to bring you to a deep query-driven page of some service where the download button is there and ready to press. But still, it’s not automation. And there are no rabbit holes. It’s just the Python webbrowser package that comes with the standard library.
And so that is where we will begin.
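A minimal sketch of that non-automation route; the service URL and querystring parameters here are purely hypothetical:

```python
# Pop open the user's default browser on a deep, query-driven page.
import webbrowser
from urllib.parse import urlencode

# Hypothetical report URL and parameters; no automation, just navigation.
params = {"project": "example-site", "report": "keyword-gap"}
url = "https://example-service.com/reports?" + urlencode(params)

webbrowser.open(url)  # opens in the default browser with the user's own profile
```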
Rabbit holes are hard to resist, because many things you realize need to be done feel like chained dependencies, where before you do that you have to do this.
That is an illusion. Many more things are deferrable than the playful child inside of you, who wants to follow the most interesting thing, tries to tell you. I say you, but of course I just mean me and I’m generalizing. Who knows what your experience is, but that’s what I find. It’s shiny object syndrome. The dog drops the bone it’s burying because it sees some other bone. It’s self-defeating, no matter how much the infusion of motivation from being deeply interested in that other thing might help.
Ultimately, “would be nice” dependencies that can be deferred until later so you can get what you already deemed most important done are detrimental and derailing unless there’s some really compelling reason it’s an exception.
What those really compelling reasons might be is up to you, the rabbit-hole spelunker attaching the rappelling equipment, to determine. If you go spelunking down a rabbit hole experimentally just to see how deep it is, don’t get hypnotized by the wonders you’ll find down there. When you feel yourself getting hypnotized, like Mowgli in the coils of Kaa, that’s when you give a good yank! Have your free will and agency haul you back up by your rappelling gear. Setting the threshold on that system is a practiced skill. Sometimes you can be in-and-out of those rabbit holes accomplishing just what you want. Other times, they can derail your project and your life. Practice!
Anyhow, this tech journal itself could be seen as one of those rabbit holes, but for the fact that it’s a serendipity-to-deliberateness machine, like HitTail. Deliberateness is another way of saying exercising free will and agency. You take paths you would otherwise not have taken if you didn’t stop to dredge up and process your ideas and notice new little details or ways of thinking about things you haven’t before.
There are a couple that come to mind at this point in the project. First, we gauge time. It’s coming up on 8:00 PM. Not bad. If I go to 3 or 4 AM, that’s into a Friday, which is into a weekend, and if I konk myself out, that’s fine. I won’t have to be in front of clients again for a few days. So my timing is perfect. I can coffee-up and bear down on this work.
The next thing to come to mind is the Jupyter-Pipulate connection. Jupyter is about linear workflows and pipelines, although most people don’t think about them or talk about them that way, because they like executing things out of order while learning and experimenting, taking advantage of the very powerful REPL (read, eval, print, loop) environment, and giving Notebooks a bad name in the process.
For people not familiar with Jupyter Notebooks, the code never really stops running secretly in the background, stuck in a quiet loop, so that if you set `a = 1`, run the cell and go check the value of `a` later, it’s still equal to `1`. It never stops running until you reset it. That way, you can build up your programs a cell of code at a time. This is particularly useful in data science, where you’re dealing with data structures like Pandas DataFrames (often known by convention as `df`). You load a `.csv` file, for example, into a `df`, and then you start doing manipulations on it. That’s what today’s project is mostly going to be about, once we get past the browser automation bits to download the CSVs.
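As a tiny illustration of that cell-by-cell, DataFrame-centric way of working (the CSV filename and column names below are hypothetical):

```python
# Typical notebook-style DataFrame build-up, one step per cell.
import pandas as pd

# Cell 1: load a downloaded export into a DataFrame (hypothetical filename).
df = pd.read_csv("downloads/keyword_export.csv")

# Cell 2: inspect it REPL-style; in a notebook you'd just write `df.head()`.
print(df.head())

# Cell 3: a manipulation pass (hypothetical column names).
top = df.sort_values("search_volume", ascending=False).head(20)
print(top)
```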
So you build up a program cell-by-cell in Jupyter Notebooks. Microsoft and Google will try to get you to stop calling them Jupyter Notebooks because they don’t like the free and open source (FOSS) branding on something they can just steal (because it’s FOSS) and relabel as just plain Notebooks in Colab and VSCode. But they’re Jupyter Notebooks, for sure. If you like that Python REPL way of working, it’s something from the very nice Project Jupyter, which just happens to be appealing enough to be lifted and repackaged by all the big vendors to inject themselves as a Notebook dependency.
The Project Jupyter organization is one of the truly altruistic good guys in the free-as-in-freedom software movement of tech, like Debian and GNU. Many people say that if you don’t pay for the product, then you are the product. That’s actually not true in the case of Debian, GNU and Jupyter. Sometimes it’s free as in there are occasionally surprisingly generous and altruistic people in this world who just want to help you improve your life, feel good about themselves and make the world a better place.
When you use a Notebook, the goal should be to get the code into the correct prescribed linear order, cells arranged top-down, so you can choose Restart Kernel and Run All Cells…. This both resets the Python code execution environment, which was previously storing all the global variables, and runs it exactly as it would if you typed `python some_file.py` into the command-line. It runs beginning-to-end.
Now of course you can program things to be non-linear and to jump around. But if you’re doing that, you probably shouldn’t be using a Notebook, which is so conducive to top-down execution. I often observe how, counter-intuitively, the introduction of functions to a Jupyter Notebook actually makes things harder rather than easier, because all your function-scoped variables go away when the function exits. You can only inspect global variables REPL-style.
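Here’s a quick sketch of that scoping gotcha; the names are made up purely for illustration:

```python
# Why functions can fight the notebook's REPL-style inspection.
def load_data():
    temp_rows = [1, 2, 3]   # function-scoped: gone once load_data() returns
    return sum(temp_rows)

result = load_data()         # `result` survives as a global, inspectable in any cell
print(result)                # 6
# print(temp_rows)           # NameError: only globals stick around between cells
```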
The top-down, beginning-to-end execution I described also dodges talking about interacting with the user to acquire input, even if the flow does remain linear. And for this, there are ipywidgets. You’re not going to be running the Jupyter Notebook against exactly the same dataset every time. You want to start perhaps from a different `.csv` file to load or a different website to crawl, and you have to stop and ask the user for the filename or the URL. That sort of interactivity can often be achieved with simple input fields where you type something in and hit Enter. But sometimes it’s more sophisticated, like giving them a pop-up file requester to select a file, or a slider, or even letting them click somewhere on a data visualization.
The flow is still linear. You’re just collecting variable input values (aka arguments) for input-fields (aka parameters) along the way.
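For example, collecting a URL and a hop-count mid-notebook might look something like this ipywidgets sketch; the labels and defaults are arbitrary:

```python
# Collecting parameters mid-flow with ipywidgets (run inside a notebook).
import ipywidgets as widgets
from IPython.display import display

url_box = widgets.Text(description="Start URL:", placeholder="https://example.com/")
depth_slider = widgets.IntSlider(description="Max hops:", value=6, min=1, max=10)

display(url_box, depth_slider)

# A later cell reads the values the user entered and carries on linearly.
# start_url, max_hops = url_box.value, depth_slider.value
```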
And this is where the concept of Unix pipes rears its head to assert itself as an infrequently noticed parallel between how Jupyter Notebooks work and how one of the oldest tech tricks in the book works: chaining the output of one command (aka function) to the input of another, chaining up the outputs and inputs until some fancy transformation has occurred that you can call a product. Or a deliverable. Or something that adds value and that you can show the client and be a proud SEO.
Jupyter Notebooks are Unix pipes, but the excessively flexible and intimidating environment that exposes the code, and the insufficiently controlled environment that demands a certain level of bravery and expertise and allows screw-ups, gets in the way. Tightening all that up is called turning it into an app. And that’s what Pipulate does. It takes the linear workflows, or pipelines, and turns them into very Jupyter-like linear workflows, but as webapps that don’t expose the Python code and put it all on tighter rails, asking the user for input when it needs it and explaining steps along the way when it should. That’s Pipulate. That’s pipulating.
And the ports from Jupyter Notebooks to Pipulate should be ridiculously easy. The problem right now is that they’re not. This is the early-stages outlier stuff where nothing is obvious, until it is. And then you can hardly imagine it being done any other way.
And that’s why taking this next step got drawn out for days for me and easily drowned out by the more urgent client-work of the day — as well it should, unless I had the guts to go off and do this independently. But I am not in the financial position or stage of life to do that. I’m punching a clock, and therefore doing this all on top of the daily grind and under that sort of stress, which is probably for the good of the project because it’s forged in the crucible of the working stiff’s reality. I won’t be an architecture astronaut wallowing in Conway’s Law of the sort Joel Spolsky warns against. That is my tendency and I know I project that, but I cannot allow myself to be that.
This is where I pull on the rabbit-hole rappelling rig and haul my awareness back up to practical next-steps, where it should be. Taking the nap and day-shifting will buy me a few luxurious hours to do the thought-work, but not that many. It has to be just as much weighted towards hard-nosed coding implementation at this point, which I will also walk you through right here in this article. These articles are a bit overwhelming, I know. But they are designed that way. They will be digested into a vector-embedding database, mapped into a graph database, and respun into various versions of the future-proofing FOSS book. So lucky you, if you’re here reading this: you’re in the LLM-training and RAG-content fodder. Woot!
I am sorely tempted to make the first Card of GapAnalysisFlow “put in your most primary keyword” in order to pop open a browser to a Google search on that keyword under Puppeteer/Playwright to grab the DOM, parse out the competitors and pre-populate the textarea I already have showing in the freshly started flow. But that alone could blow my whole time-window, so it’s interesting but not critical. It can be deferred, maybe to never come back to again. Who knows?
I’m in a race against running out of steam. These articles pump me up and give me clarity of purpose, but then that head of steam has to be intelligently used and not squandered on too much writing or the wrong rabbit hole. This is pivotal. Think efficiency of keystrokes. Adjust things to make it as easy to code in the Botifython project as it is to write here…
Close all tabs but those you need. Done.
Oh, this is an interesting one and a hard-to-articulate obstacle I’ve struggled with. I’ve recently given up the Electron-based standalone desktop version of JupyterLab, at first because it wasn’t well supported on NixOS, but then recognized that once you’re managing a FastHTML web app install under a `flake.nix` file, the Jupyter install should be done that way too. And here’s a radical departure. I started out with JupyterLab instead of Jupyter Notebook (surprisingly, 2 distinct products), because I felt the increased dashboarding capabilities of JupyterLab would come into play, but now that most of that is going to be under Pipulate, it opens the door to formally supporting the older… no, no, no! Rabbit hole! JupyterLab is the future. Don’t regress. And don’t do the double damage of distracting yourself with a regression.
The obstacle is that I’ve been launching Jupyter in the same folder as Pipulate, so all the notebooks go into the Botifython folder, but I’ve been hesitant to do that because I don’t want to pollute the repo. But the magic fairy dust that turns any repo into an isolated JupyterLab data science environment is the nix flake, which I only need to copy into a new folder and type `nix develop`! And so go type `stop` to get the default tmux-hosted Pipulate JupyterLab instance out of memory. Check `localhost:8888`; okay, it’s gone.
And now copy the `flake.nix` file from the current repo you’re working out of over into the repo that holds the Notebook you’re porting tonight… okay, done. And type `nix develop` from that repo… and then `start`… and there it is! Wow, okay, the point to remember is that it matters which Nix Flake you actually start out with. These Python data science environments are subtly different. And the biggest split is whether I’m leaning on what’s already in the Nix software repository and putting the PyPI requirements in the `flake.nix` file (the ardently version-pinning, deterministic approach that’s often too inflexible for data science work), or the other approach, which is to create a traditional Python `.venv` virtual environment so you have the freedom to `pip install` anything you like. And then there’s a `requirements.txt` file where you list, and optionally pin, the packages and their versions.
And so now, actually perform a Gap Analysis under the new system as you go. Get it in sufficient order to demo.
Oh, it’s going to make the main `botifython.py` even longer. That file literally contains everything: the framework, the plugins, the JavaScript and the CSS (because there is very little of either under FastHTML/HTMX), everything but the package imports. In other words, I haven’t deconstructed my own work into separate includes or packages, so the LLM sees everything in one single context window, not having to piece it all together. It also helps me, as navigating one long text-file is much easier than directory-diving. But in this case, there’s going to be a lot of Excel file formatting yuckiness. Hmm. Keep it internal for now. Externalizing tonight is a rabbit hole. Nice thought, but everything gets more complicated when you break it up over multiple files. So let this `botifython.py` file get long and contain some yuck until I can step back and tackle beautification as a separate step.
The thing about running out of steam is— don’t.
Make yourself a coffee.
Remind yourself what’s at stake.
Be like a force of nature.
Use the very linear pipeline procedures you’re advocating to know what to do next and why.
Fixing a Bug
Once again, the best laid plans. It’s about 1:00 AM and the app is no closer to being ported, but I encountered a critical bug I had to fix. I actually deleted the GapAnalysisFlow class and worked on the StarterFlow class. I had a place where I hard-wired a querystring-building method, and HTMX overwrote it, creating a very difficult-to-pin-down bug with the revert buttons showing the same card ID for every card. I fixed it and am reviewing how good StarterFlow really is as a template moving forward.
Claude: Looking at the codebase and `.cursorrules`, here are key observations about StarterFlow as a template:
- Perfect Reference Implementation
- StarterFlow is deliberately minimal (2 cards + finalize) but demonstrates all core patterns
- Each step type is represented: preserve (step_01), ephemeral (step_02), and finalize
- The code is heavily documented with warnings about anti-patterns
- State Management Clarity
# Clear demonstration of the key principle:
async def step_01_submit(self, request):
# ALWAYS clear forward first
self.pipulate.clear_steps_from(pipeline_id, "step_01", self.STEPS)
# THEN save new data
self.pipulate.set_step_data(pipeline_id, "step_01", {"mykey": value})
- HTMX Integration Best Practices
# Uses hx-vals instead of URL params
hx_vals=f"js:}"
# Proper placeholder chain
Div(
id="step_01",
Div(id="step_02"), # Next step placeholder
hx_trigger="load"
)
- Pipeline State Structure
self.STEPS = [
("mykey", "step_01", "Card 1"), # Preserve example
("mykey2", "step_02", "Card 2"), # Ephemeral example
("finalized", "finalize", "Finalize") # Standard final step
]
- Documentation Strategy
- Each major section has a clear header with critical warnings
- Anti-patterns are explicitly called out
- Testing scenarios are listed
- State management rules are emphasized
- Reusable Components
# Standard helpers that every flow should use
self.pipulate.handle_jump_to_step()
self.pipulate.revert_control()
self.pipulate.wrap_with_inline_button()
self.pipulate.generate_step_placeholders()
- Error Prevention
- The code structure makes it hard to accidentally skip `clear_steps_from()`
- Preserve flag is cleanly separated from data persistence
- Standard helpers enforce consistent behavior
- Local-First Philosophy
pipeline_id = db.get("pipeline_id", "unknown") # Global state is embraced
This template serves as:
- A working reference implementation
- A pattern library
- An anti-pattern warning system
- A testing checklist
The `.cursorrules` file effectively turns this into a “living documentation” system that can guide both humans and LLMs in implementing new flows correctly.