Discussion
We provide a theoretical framework for defining “knowledge collapse”, whereby dependence on generative AI such as large language models may lead to a reduction in the long tails of knowledge. Our simulation study suggests that such harm can be mitigated to the extent that (a) we are aware of the possible value of niche, specialized and eccentric perspectives that may be neglected by AI-generated data and continue to seek them out, (b) AI systems are not recursively interdependent, as occurs if they use other AI-generated content as inputs or suffer from other generational effects, and (c) AI-generated content is as representative as possible of the full distribution of knowledge.
Each of these suggests practical implications for how to manage AI adoption. First, while our work does not justify an outright ban, safeguards should be put in place against widespread or complete reliance on AI models. For every hundred people who read a one-paragraph summary of a book, there should be a human somewhere who takes the time to sit down and read it, in hopes that she can then provide feedback on distortions or simplifications introduced elsewhere. One extension to the model would be to allow for generational change but endogenize the choice of public subsidies to protect ‘tail’ knowledge. This is arguably what is done by governments that support academic and artistic endeavors that would otherwise be underprovided by the private market. Protecting the diversity of information also means paying attention to the effect of AI adoption on the revenue streams of journalists who produce, and do not merely transmit, information (e.g. Cagé, 2016).
Secondly, there is an obvious need to avoid building recursively dependent AI systems (e.g. where one LLM or agent provides answers based on another AI-generated summary, etc.) and thereby playing an LLM-mediated game of ‘telephone’. At a minimum, this requires a concerted effort to distinguish human- from AI-generated data. Preserving access to ‘unmediated’ texts, such as through a well-conceived retrieval augmented generation approach, can preserve the long-tails of knowledge (Delile et al., 2024), as may generating multiple results and re-ranking (Li et al., 2023).
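The ‘telephone’ dynamic can be illustrated with a toy sketch. This is not the simulation model used in the study: it assumes, purely for illustration, that knowledge is a one-dimensional Gaussian, that each generation of model is fitted only to the previous generation's output, and that this output omits anything more than `cutoff` standard deviations from the centre. The function name and all parameter values are hypothetical.

```python
import random
import statistics


def telephone_chain(generations=5, n_samples=2000, cutoff=1.5, seed=0):
    """Return the fitted standard deviation after each generation."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: the assumed 'true' distribution
    stds = [std]
    for _ in range(generations):
        # Draw output from the current generation's fitted model.
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        # Assumed distortion: output drops anything far from the centre.
        kept = [x for x in samples if abs(x - mean) <= cutoff * std]
        # The next generation is fitted only to the filtered output.
        mean = statistics.fmean(kept)
        std = statistics.stdev(kept)
        stds.append(std)
    return stds


if __name__ == "__main__":
    for generation, std in enumerate(telephone_chain()):
        print(f"generation {generation}: std = {std:.3f}")
```

Under these assumptions the fitted spread shrinks with every generation, because each model inherits only the centre of its predecessor's distribution. Mixing some unfiltered draws from the original distribution back into `kept` would be one way to experiment with the effect of preserving access to ‘unmediated’ sources.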
Finally, while much recent attention has been on the problem of LLMs misleadingly presenting fiction as fact (hallucination), this may be less of an issue than the problem of representativeness across a distribution of possible responses. Hallucination of verifiable, concrete facts is often easy to correct for. Yet many real-world questions do not have well-defined, verifiably true and false answers. If a user asks, for example, “What causes inflation?” and an LLM answers “monetary policy”, the problem isn’t one of hallucination, but of the failure to reflect the full distribution of possible answers to the question, or at least provide an overview of the main schools of economic thought.
This concern could be taken into account in the design of reinforcement learning from human feedback and related approaches to shaping model outputs, since humans may by default prefer simple, monolithic answers over those that represent the diversity of perspectives. Particular care is also needed when AI is used in education, to ensure that students consider not only the veracity of AI-generated answers but also their variance, representativeness, and biases, that is, the extent to which they represent the full distribution of possible answers to a question.
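One rough way to make representativeness concrete is to sample several answers to the same question, assign each to a category, and measure how widely the answers spread over a reference set of categories. The sketch below is an assumption-laden illustration rather than a method proposed in this paper: the category list `SCHOOLS` is hypothetical, the labelling step is taken as given, and normalized Shannon entropy is only one of many possible diversity measures.

```python
import math
from collections import Counter

# Hypothetical answer categories for "What causes inflation?"; illustrative only.
SCHOOLS = ["monetary policy", "demand shocks", "supply shocks", "expectations"]


def answer_diversity(labels, reference):
    """Return (coverage, normalized_entropy) of labels over the reference set."""
    # Count only answers that fall into one of the reference categories.
    counts = Counter(label for label in labels if label in reference)
    total = sum(counts.values())
    if total == 0 or len(reference) < 2:
        return 0.0, 0.0
    # Coverage: share of reference categories that appear at least once.
    coverage = len(counts) / len(reference)
    # Shannon entropy of the answer shares, scaled to lie in [0, 1].
    probs = [count / total for count in counts.values()]
    entropy = sum(-p * math.log(p) for p in probs)
    return coverage, entropy / math.log(len(reference))
```

Ten sampled answers that all say “monetary policy” would score low on both measures, while answers spread evenly over the four categories would score close to one on both. Neither number says anything about whether an answer is true, which is the point of treating representativeness separately from hallucination.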
The scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) demonstrate the advantage of training LLMs on the maximum amount of (quality) data. A valuable empirical question is therefore whether this leads to increasing or decreasing diversity within the training data (which raises the related problem of the lack of transparency in the data used to train models). There are many diverse texts that could be included to expand the corpus, but in practice market-focused participants may seek out the texts with the lowest marginal cost (conditional on quality). This might exacerbate a reliance on texts that are not representative of the general public, for example if social media texts are easy to collect but do not reflect the perspective of people who lack access to social media or who self-select out of it. Or, optimistically, companies with a global audience might be incentivized to seek out “low and very-low resource languages” (e.g. Gemini Team et al., 2023) and perhaps even the viewpoints and cultural perspectives of diverse users. Consideration should be given to ensuring and encouraging such diverse inputs as well as to monitoring the diversity of outputs.
References
Abdollahpouri, H.; Mansoury, M.; Burke, R.; Mobasher, B.; and Malthouse, E. 2021. User-centered Evaluation of Popularity Bias in Recommender Systems. In Proceedings of the 29th ACM Conference on User Modeling, Adaptation and Personalization, UMAP ’21, 119–129. New York, NY, USA: Association for Computing Machinery.
Angelucci, C.; Cagé, J.; and Sinkinson, M. forthcoming. Media Competition and News Diets. American Economic Journal: Microeconomics.
Arora, S.; Ge, R.; Liang, Y.; Ma, T.; and Zhang, Y. 2017. Generalization and Equilibrium in Generative Adversarial Nets (GANs). In Proceedings of the 34th International Conference on Machine Learning, 224–232. PMLR. ISSN: 2640-3498.
Bakshy, E.; Rosenn, I.; Marlow, C.; and Adamic, L. 2012. The role of social networks in information diffusion. In Proceedings of the 21st international conference on World Wide Web, WWW ’12, 519–528. New York, NY, USA: Association for Computing Machinery.
Ball, Z., and Lewis, K. 2018. Mass Collaboration Project Recommendation Within Open-Innovation Design Networks. Journal of Mechanical Design 141(021105).
Banerjee, A. V. 1992. A Simple Model of Herd Behavior. The Quarterly Journal of Economics 107(3):797–817.
Barberá, P. 2020. Social media, echo chambers, and political polarization. In Social Media and Democracy: The State of the Field, Prospects for Reform. Cambridge University Press. 34–55.
Barbieri, N.; Bonchi, F.; and Manco, G. 2013. Topic-aware social influence propagation models. Knowledge and Information Systems 37(3):555–584.
Barrat, A.; Barthélemy, M.; and Vespignani, A. 2008. Dynamical Processes on Complex Networks. Cambridge University Press.
Bikhchandani, S.; Hirshleifer, D.; and Welch, I. 1998. Learning from the Behavior of Others: Conformity, Fads, and Informational Cascades. The Journal of Economic Perspectives 12(3):151–170. Publisher: American Economic Association.
Bohacek, M., and Farid, H. 2023. Nepotistically Trained Generative-AI Models Collapse. arXiv:2311.12202 [cs].
Boone, C.; Carroll, G. R.; and van Witteloostuijn, A. 2002. Resource Distributions and Market Partitioning: Dutch Daily Newspapers, 1968 to 1994. American Sociological Review 67(3):408–431.
Brynjolfsson, E.; Hu, Y. J.; and Smith, M. D. 2006. From Niches to Riches: Anatomy of the Long Tail. MIT Sloan Management Review.
Brynjolfsson, E.; Hu, Y. J.; and Smith, M. D. 2003. Consumer Surplus in the Digital Economy: Estimating the Value of Increased Product Variety at Online Booksellers. Management Science 49(11):1580–1596.
Cagé, J. 2016. Saving the media: Capitalism, crowdfunding, and democracy. Harvard University Press.
Cagé, J. 2020. Media competition, information provision and political participation: Evidence from French local newspapers and elections, 1944–2014. Journal of Public Economics 185:104077.
Castellano, C.; Fortunato, S.; and Loreto, V. 2009. Statistical physics of social dynamics. Reviews of modern physics 81(2):591.
Centola, D. 2010. The Spread of Behavior in an Online Social Network Experiment. Science 329(5996):1194–1197. Publisher: American Association for the Advancement of Science.
Chen, C., and Shu, K. 2023. Combating Misinformation in the Age of LLMs: Opportunities and Challenges. arXiv:2311.05656 [cs].
Chen, L.; Chen, P.; and Lin, Z. 2020. Artificial Intelligence in Education: A Review. IEEE Access 8:75264–75278.
Chen, L.; Razniewski, S.; and Weikum, G. 2023. Knowledge Base Completion for Long-Tail Entities. arXiv:2306.17472 [cs].
Cherniack, S. 1994. Book Culture and Textual Transmission in Sung China. Harvard Journal of Asiatic Studies 54(1):5–125. Publisher: Harvard-Yenching Institute.
Chonka, P.; Diepeveen, S.; and Haile, Y. 2023. Algorithmic power and African indigenous languages: search engine autocomplete and the global multilingual Internet. Media, Culture & Society 45(2):246–265. Publisher: SAGE Publications Ltd.
Christian, B. 2021. The alignment problem: How can machines learn human values? Atlantic Books.
Cinelli, M.; De Francisci Morales, G.; Galeazzi, A.; Quattrociocchi, W.; and Starnini, M. 2021. The echo chamber effect on social media. Proceedings of the National Academy of Sciences 118(9):e2023301118.
Cinus, F.; Minici, M.; Monti, C.; and Bonchi, F. 2022. The Effect of People Recommenders on Echo Chambers and Polarization. Proceedings of the International AAAI Conference on Web and Social Media 16:90–101.
Das, D.; De Langis, K.; Martin-Boyle, A.; Kim, J.; Lee, M.; Kim, Z. M.; Hayati, S. A.; Owan, R.; Hu, B.; Parkar, R.; Koo, R.; Park, J.; Tyagi, A.; Ferland, L.; Roy, S.; Liu, V.; and Kang, D. 2024. Under the Surface: Tracking the Artifactuality of LLM-Generated Data. arXiv:2401.14698 [cs].
Delile, J.; Mukherjee, S.; Van Pamel, A.; and Zhukov, L. 2024. Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge. arXiv:2402.12352 [cs].
Dittmar, J. E. 2011. Information Technology and Economic Change: The Impact of The Printing Press. The Quarterly Journal of Economics 126(3):1133–1172.
Dohmatob, E.; Feng, Y.; Yang, P.; Charton, F.; and Kempe, J. 2024. A Tale of Tails: Model Collapse as a Change of Scaling Laws. arXiv:2402.07043 [cs].
Douglas, S. J. 2002. Mass media: From 1945 to the present. A Companion to Post-1945 America 78–95.
Eisenstein, E. L. 1980. The Printing Press as an Agent of Change. Cambridge University Press.
Festinger, L.; Schachter, S.; and Back, K. 1950. Social pressures in informal groups; a study of human factors in housing. Oxford, England: Harper. Pages: 240.
Fisher, L. 2024. UK government to trial ‘red box’ AI tools to improve ministerial efficiency.
Gao, C.; Wang, S.; Li, S.; Chen, J.; He, X.; Lei, W.; Li, B.; Zhang, Y.; and Jiang, P. 2023. CIRS: Bursting Filter Bubbles by Counterfactual Interactive Recommender System. ACM Transactions on Information Systems 42(1):14:1–14:27.
Gemini Team, G.; Anil, R.; Borgeaud, S.; Wu, Y.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A. M.; Hauth, A.; et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
Gesi, J.; Shen, X.; Geng, Y.; Chen, Q.; and Ahmed, I. 2023. Leveraging Feature Bias for Scalable Misprediction Explanation of Machine Learning Models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 1559–1570. ISSN: 1558-1225.
Goldenberg, J.; Libai, B.; and Muller, E. 2001. Talk of the Network: A Complex Systems Look at the Underlying Process of Word-of-Mouth. Marketing Letters 12(3):211–223.
Goodfellow, I. 2016. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160.
Graham, R. 2023. The ethical dimensions of Google autocomplete. Big Data & Society 10(1):20539517231156518. Publisher: SAGE Publications Ltd.
Grice, H. P. 1975. Logic and conversation. In Speech acts. Brill. 41–58.
Gruhl, D.; Guha, R.; Liben-Nowell, D.; and Tomkins, A. 2004. Information diffusion through blogspace. Proceedings of the 13th international conference on World Wide Web 491–501. Conference Name: WWW04: The 2004 World Wide Web Conference (in conjunction with ACM Conference on Electronic Commerce [EC’04]).
Guo, Y.; Shang, G.; Vazirgiannis, M.; and Clavel, C. 2023. The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text. arXiv:2311.09807 [cs].
Hackforth, R. 1972. Plato: Phaedrus. Cambridge University Press.
Havelock, E. A. 2019. The Literate Revolution in Greece and Its Cultural Consequences. Princeton University Press.
Hegel, G. W. F. 2018. Hegel: The phenomenology of spirit. Oxford University Press.
Heidari, A.; Jafari Navimipour, N.; Dag, H.; and Unal, M. 2023. Deepfake detection using deep learning methods: A systematic and comprehensive review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery e1520.
Henrich, J. 2004. Demography and cultural evolution: How adaptive cultural processes can produce maladaptive losses—the tasmanian case. American Antiquity 69(2):197–214.
Herder, J. G. 2024. Ideas for the Philosophy of the History of Mankind. Princeton University Press.
Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; Casas, D. d. L.; Hendricks, L. A.; Welbl, J.; Clark, A.; et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; and Choi, Y. 2020. The Curious Case of Neural Text Degeneration. arXiv:1904.09751 [cs].
Jamieson, K. H., and Cappella, J. N. 2008. Echo Chamber: Rush Limbaugh and the Conservative Media Establishment. Oxford University Press.
Jiang, R.; Chiappa, S.; Lattimore, T.; György, A.; and Kohli, P. 2019. Degenerate Feedback Loops in Recommender Systems. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 383–390.
Kandpal, N.; Deng, H.; Roberts, A.; Wallace, E.; and Raffel, C. 2023. Large Language Models Struggle to Learn Long-Tail Knowledge. In Proceedings of the 40th International Conference on Machine Learning, 15696–15707. PMLR. ISSN: 2640-3498.
Kant, I. 1933. Critique of pure reason (Norman Kemp Smith, translator). New York: The Modern Library.
Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T. B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; and Amodei, D. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
Karlaš, B.; Dao, D.; Interlandi, M.; Li, B.; Schelter, S.; Wu, W.; and Zhang, C. 2022. Data Debugging with Shapley Importance over End-to-End Machine Learning Pipelines. arXiv:2204.11131 [cs].
Keijzer, M. A., and Mäs, M. 2022. The complex link between filter bubbles and opinion polarization. Data Science 5(2):139–166.
Kermack, W. O., and McKendrick, A. G. 1927. A contribution to the mathematical theory of epidemics. Proceedings of the royal society of london. Series A, Containing papers of a mathematical and physical character 115(772):700–721.
Khodadi, M.; Allahyari, A.; Vagnozzi, S.; and Mota, D. F. 2020. Black holes with scalar hair in light of the event horizon telescope. Journal of Cosmology and Astroparticle Physics 2020(09):026.
Klug, D.; Qin, Y.; Evans, M.; and Kaufman, G. 2021. Trick and please. A mixed-method study on user assumptions about the TikTok algorithm. In Proceedings of the 13th ACM Web Science Conference 2021, 84–92.
Kramár, J.; Lieberum, T.; Shah, R.; and Nanda, N. 2024. AtP*: An efficient and scalable method for localizing LLM behaviour to components. arXiv:2403.00745 [cs].
Kuhn, T. S. 1997. The structure of scientific revolutions, volume 962. University of Chicago press Chicago.
Kürnsteiner, P.; Wilms, M. B.; Weisheit, A.; Gault, B.; Jägle, E. A.; and Raabe, D. 2020. High-strength Damascus steel by additive manufacturing. Nature 582(7813):515–519. Publisher: Nature Publishing Group.
Layton, B. 1989. The significance of basilides in ancient christian thought. Representations 28:135–151.
Li, H.; Ning, Y.; Liao, Z.; Wang, S.; Li, X. L.; Lu, X.; Brahman, F.; Zhao, W.; Choi, Y.; and Ren, X. 2023. In Search of the Long-Tail: Systematic Generation of Long-Tail Knowledge via Logical Rule Guided Search. arXiv:2311.07237 [cs].
Lin, A.; Wang, J.; Zhu, Z.; and Caverlee, J. 2022. Quantifying and Mitigating Popularity Bias in Conversational Recommender Systems. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, CIKM ’22, 1238–1247. New York, NY, USA: Association for Computing Machinery.
Mannheim, K. 1952. The sociological problem of generations. Essays on the Sociology of Knowledge 306:163–195.
Melis, G.; György, A.; and Blunsom, P. 2022. Mutual information constraints for Monte-Carlo objectives to prevent posterior collapse especially in language modelling. The Journal of Machine Learning Research 23(1):75:3266–75:3301.
Mesoudi, A., and Whiten, A. 2008. The multiple roles of cultural transmission experiments in understanding human cultural evolution. Philosophical Transactions of the Royal Society B: Biological Sciences 363(1509):3489. Publisher: The Royal Society.
Mokyr, J. 2011. The Gifts of Athena: Historical Origins of the Knowledge Economy. In The Gifts of Athena. Princeton University Press.
Nash, L. L. 1978. Concepts of Existence: Greek Origins of Generational Thought. Daedalus 107(4):1–21. Publisher: The MIT Press.
Nazer, L. H.; Zatarah, R.; Waldrip, S.; Ke, J. X. C.; Moukheiber, M.; Khanna, A. K.; Hicklen, R. S.; Moukheiber, L.; Moukheiber, D.; Ma, H.; and Mathur, P. 2023. Bias in artificial intelligence algorithms and recommendations for mitigation. PLOS Digital Health 2(6):e0000278. Publisher: Public Library of Science.
Nettle, D., and Romaine, S. 2000. Vanishing Voices: The Extinction of the World’s Languages. Oxford University Press.
Nowak, A.; Szamrej, J.; and Latané, B. 1990. From private attitude to public opinion: A dynamic theory of social impact. Psychological Review 97(3):362–376. Place: US Publisher: American Psychological Association.
Ong, W. J. 2013. Orality and Literacy: 30th Anniversary Edition. Routledge.
Opdahl, A. L.; Tessem, B.; Dang-Nguyen, D.-T.; Motta, E.; Setty, V.; Throndsen, E.; Tverberg, A.; and Trattner, C. 2023. Trustworthy journalism through ai. Data & Knowledge Engineering 146:102182.
O’Reilly, T. 2005. What Is Web 2.0.
Pariser, E. 2011. The filter bubble: What the Internet is hiding from you. penguin UK.
Pfister, D. S. 2011. The Logos of the Blogosphere: Flooding the Zone, Invention, and Attention in the Lott Imbroglio. Argumentation and Advocacy 47(3):141–162.
Russo, L., et al. 2003. The forgotten revolution: how science was born in 300 BC and why it had to be reborn. Springer Science & Business Media.
Seymour, L. M.; Maragh, J.; Sabatini, P.; Di Tommaso, M.; Weaver, J. C.; and Masic, A. 2023. Hot mixing: Mechanistic insights into the durability of ancient Roman concrete. Science Advances 9(1):eadd1602. Publisher: American Association for the Advancement of Science.
Sharma, N.; Liao, Q. V.; and Xiao, Z. 2024. Generative Echo Chamber? Effects of LLM-Powered Search Systems on Diverse Information Seeking. arXiv:2402.05880 [cs].
Shumailov, I.; Shumaylov, Z.; Zhao, Y.; Gal, Y.; Papernot, N.; and Anderson, R. 2023. The curse of recursion: Training on generated data makes models forget.
Smith, L., and Sørensen, P. 2000. Pathological Outcomes of Observational Learning. Econometrica 68(2):371–398. eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/1468-0262.00113.
Su, Y.; Lan, T.; Wang, Y.; Yogatama, D.; Kong, L.; and Collier, N. 2022. A Contrastive Framework for Neural Text Generation. In Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; and Oh, A., eds., Advances in Neural Information Processing Systems, volume 35, 21548–21561. Curran Associates, Inc.
Taleb, N. N. 2007. Black Swans and the Domains of Statistics. The American Statistician. Publisher: Taylor & Francis.
Tversky, A., and Kahneman, D. 1973. Availability: A heuristic for judging frequency and probability. Cognitive Psychology 5(2):207–232.
Weil, P. 2008. Overlapping Generations: The First Jubilee. Journal of Economic Perspectives 22(4):115–134.
Wendler, C.; Veselovsky, V.; Monea, G.; and West, R. 2024. Do Llamas Work in English? On the Latent Language of Multilingual Transformers. arXiv:2402.10588 [cs].
Wu, Z.; Geiger, A.; Icard, T.; Potts, C.; and Goodman, N. 2023. Interpretability at Scale: Identifying Causal Mechanisms in Alpaca. Advances in Neural Information Processing Systems 36:78205–78226.
Wu, T. 2011. The master switch: The rise and fall of information empires. Vintage.
Zamora Bonilla, J. P. 2006. Science Studies and the Theory of Games. Perspectives on Science 14(4):525–557.
Zamora-Bonilla, J. 2010. What Games Do Scientists Play? Rationality and Objectivity in a Game-Theoretic Approach to the Social Construction of Scientific Knowledge. In Suárez, M.; Dorato, M.; and Rédei, M., eds., EPSA Epistemology and Methodology of Science: Launch of the European Philosophy of Science Association. Dordrecht: Springer Netherlands. 323–332.