Opening up ChatGPT: Evidence-based measures of openness and transparency in instruction-tuned large language models

Now that the first excitement about ChatGPT is dying down, people are catching on to the risks of relying on closed and proprietary models that may stop being supported overnight or may change in undocumented ways. Good news: we’ve tracked developments in this field and there are now over 20 alternatives with varying degrees of openness, most of them more transparent than ChatGPT.

Ah yes, LLaMA 2, you may think, I heard about that one. Nope — by our measures that is literally the least open model currently available. This is from the company that has had no qualms experimenting on millions of Facebook users without consent and wrecking the mental health and body images of further untold millions on Instagram. They don’t deserve an ounce of trust when they release an “open” language model that they trained, using god knows which training data and bespoke “Meta reward modelling” based on over a million undisclosed data points.*

So anyway, I have two things to share:

  1. We’ve published a short paper on why openness is important and how to assess it across a wide range of dimensions, from availability to technical documentation and access methods. We’ve used LLM+RLHF architectures as a timely test case but the framework is more generally applicable to current generative AI and ML releases.
  2. Along with the paper comes a live tracker backed by a crowd-sourced repository that enables us to keep up with this fast-evolving field. For instance, while the paper was published before LLaMA 2 was released, it’s right there on our live tracker, and as I mentioned above it’s not looking good: it is the least open model of all ~20 ‘open’ models currently available.

In this post I’ll give some personal background on this project. First off, credits to first author Andreas Liesenfeld, who pitched this idea as early as January 2023, when there were at best a handful of open alternatives. I confess I was skeptical at first (mostly because one of our problems is that we have too many paper ideas), but we decided to keep tracking this space until March, when a CUI paper deadline provided a useful anchoring point to make a go/no-go decision. Obviously it was a go: by then, there were >10 alternatives, at revision time at the end of May we added another 5, and the live tracker currently features over 20 in total, with several more on our radar (you can help us out if you want!).

My personal interest in LLMs goes back to at least 2020, when I saw GPT-3 and blogged about the unstoppable tide of uninformation this kind of language model was bound to cause if released to the masses. A year ago, in September 2022, I used GPT-3 in teaching, first impressing my (undergrad) students with some smooth text output and then letting them demystify the system for themselves by giving them assignments to probe the limits and poke holes in its abilities (they were very good at that).

So in November 2022 I watched the release and breathless press coverage of ChatGPT with mild exasperation (and made a prediction: this creates a whole new market for generative AI detection). And I started worrying more about the closed and proprietary nature of ChatGPT. By cynically giving folks ‘free research access’ and keeping all prompts, what OpenAI was doing was harvesting human collective intelligence at unprecedented scale. Truly OpenAI gives with one hand and takes away with the other.

Our paper calls out OpenAI for their exploitative practices and highlights the role of publicly funded science in building alternatives that can be used responsibly in fundamental research and in education. As it stands, ChatGPT is unfit for responsible use. Fortunately, there are now enough alternatives that nobody has to resort to it, except as the cynical prototype of maximally closed corporate profiteering, right at the very bottom of our live tracker.

Critical and constructive contributions

A final note of clarification. Opening up ChatGPT is important, not because this technology is so beneficial or offers a good model of language (it does not), but because we can only effectively understand and responsibly limit it when it is open sourced: when we can audit its data and document its harms; when we can examine the reinforcement learning methods to study the contribution of human labour; when we can tinker with models to test the consequences of using synthetic data and relying on automatic evaluation methods.

From what I’ve seen so far I doubt that bringing this technology into the world will be a net benefit to humanity. Many current harms have been pointed out by bright minds all around the globe. Let me single out Emily Bender, Timnit Gebru, Margaret Mitchell, Angelina McMillan-Major and their teams in particular. Not only have they articulated some of the most important fundamental critiques of large language models (Bender et al. 2021), they have also made immense constructive contributions towards doing things better, by spelling out frameworks for model cards (e.g., Mitchell et al. 2019) and for datasheets and data statements (e.g., Gebru et al. 2021, McMillan-Major et al. 2023). Also, Abeba Birhane and co-authors like Vinay Uday Prabhu and Emmanuel Kahembwe (Birhane et al. 2021) deserve major credit for showing how to do the incredibly important work of auditing datasets used in current machine learning. Only some models are currently open enough to allow this kind of auditing, but it is absolutely critical for any responsible use.

Many of these elements are directly incorporated into our framework as dimensions on which projects can claim openness points. The BLOOMZ model, an outcome of an audacious year-long global collaboration under the moniker ‘BigScience Workshop’, currently tops our list as a project with exemplary openness on all fronts. We’ve been quite impressed by it (and it is no surprise that some of the same names mentioned above are involved in it), even if openness doesn’t mean there are no problems with language models. This project shows what scientists can do when they put their minds to working together, and when openness and transparency are included as design criteria right from the start.
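For readers who like to see the idea in code, here is a minimal sketch of how "openness points" across dimensions could be tallied and ranked. Everything in it is an assumption made for illustration: the three-level scale and its point values, the dimension names, and the placeholder projects are hypothetical and are not taken from the paper or the live tracker.

```python
from enum import Enum


class Openness(Enum):
    """Hypothetical three-level rating; the point values are an assumption."""
    CLOSED = 0.0
    PARTIAL = 0.5
    OPEN = 1.0


def openness_score(ratings: dict[str, Openness]) -> float:
    """Sum the points a project earns across all rated dimensions."""
    return sum(level.value for level in ratings.values())


def rank_projects(projects: dict[str, dict[str, Openness]]) -> list[tuple[str, float]]:
    """Return (project, score) pairs sorted from most to least open."""
    scored = [(name, openness_score(ratings)) for name, ratings in projects.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder projects and dimension names, invented for this example.
projects = {
    "project_a": {
        "training_data": Openness.OPEN,
        "model_weights": Openness.OPEN,
        "documentation": Openness.PARTIAL,
    },
    "project_b": {
        "training_data": Openness.CLOSED,
        "model_weights": Openness.PARTIAL,
        "documentation": Openness.CLOSED,
    },
}

ranking = rank_projects(projects)
for name, score in ranking:
    print(f"{name}: {score}")
```

A simple sum treats every dimension as equally important, which is itself a design choice; a real assessment might weight dimensions differently or report them side by side rather than collapsing them into one number.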

Our work is only a small pebble in this larger stream of work towards responsible, transparent, and accountable AI/ML. As we write:

Openness is not the full solution to the scientific and ethical challenges of conversational text generators. Open data will not mitigate the harmful consequences of thoughtless deployment of large language models, nor the questionable copyright implications of scraping all publicly available data from the internet. However, openness does make original research possible, including efforts to build reproducible workflows and understand the fundamentals of LLM + RLHF architectures. Openness also enables checks and balances, fostering a culture of accountability for data and its curation, and for models and their deployment. We hope that our work provides a small step in this direction.

Thanks for reading!

Paper | Live tracker | Repository


* At least that’s what the July 18, 2023 version of their own report hosted on their own server said. I try not to cite stuff that is not clearly version controlled; here is their link.

  • Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. Virtual Event Canada: ACM.
  • Birhane, Abeba, Vinay Uday Prabhu, and Emmanuel Kahembwe. 2021. “Multimodal Datasets: Misogyny, Pornography, and Malignant Stereotypes.” arXiv.
  • Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. “Datasheets for Datasets.” Communications of the ACM 64 (12): 86–92.
  • Gebru, Timnit, Emily M. Bender, Angelina McMillan-Major, and Margaret Mitchell. 2023. “Statement from the Listed Authors of Stochastic Parrots on the ‘AI Pause’ Letter.”
  • Liesenfeld, Andreas, Alianda Lopez, and Mark Dingemanse. 2023. “Opening up ChatGPT: Tracking Openness, Transparency, and Accountability in Instruction-Tuned Text Generators.” In ACM Conference on Conversational User Interfaces (CUI ’23). Eindhoven.
  • McMillan-Major, Angelina, Emily M. Bender, and Batya Friedman. 2023. “Data Statements: From Technical Concept to Community Practice.” ACM Journal on Responsible Computing.
  • Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. “Model Cards for Model Reporting.” In Proceedings of the Conference on Fairness, Accountability, and Transparency, 220–29. FAT* ’19. New York, NY, USA: Association for Computing Machinery.
