Regarding your post yesterday, "Are LLMs running out of data?": the National Archives holds 13.5 billion pieces of paper, and only 240 million of them are digitized. Some of this material is classified or otherwise restricted, but surely we can do better than less than 2%.
NARA says it aims to increase this to 500 million by the end of 2026. Even with an overly generous $1/record estimate, it makes sense to me for someone to digitize much of the remaining 13 billion records, though the incentives are tricky for private actors. Perhaps a consortium of AI companies could make it work. It's a pure public good, so I would be happy with a federal appropriation.
Admittedly, I have a parochial interest in seeing specific parts digitized, namely those on mid-20th-century federal housing policy. Nonetheless, the supply of data for AI is obviously elastic, and there is some delicious low-hanging fruit available.
The National Archives is probably the biggest untapped source of extant data, but there are hundreds of billions more pages around the world.
The post Thomas Storrs on elastic data supply (from my email) appeared first on Marginal REVOLUTION.