
Democratizing AI with open-source language models

May 17, 2023

This article was contributed by Koen Vervloesem

When OpenAI made its chatbot ChatGPT available to the public in November 2022, it immediately became a hit. However, despite the company's name, the underlying algorithm isn't open. Furthermore, ChatGPT users require a connection to OpenAI's cloud service and face usage restrictions. In the meantime, several open-source or freely available alternatives have emerged, with some even able to run on consumer hardware. Although they can't match ChatGPT's performance yet, rapid advancements are occurring in this field, to the extent that some people at the companies developing these artificial intelligence (AI) models have begun to worry.

ChatGPT is presented as a bot that users interact with; it generates human-like text based on their input. OpenAI has fine-tuned it specifically for engaging in conversations and providing contextually appropriate responses. It's capable of handling a variety of tasks, such as generating content or ideas, translating languages, answering questions, and even providing suggestions for code in various programming languages. ChatGPT also responds to follow-up questions, challenges incorrect premises, and rejects inappropriate requests. How does this work? Under the hood, ChatGPT uses a neural network trained on vast amounts of text to generate new content based on the input it receives. Think of it as an advanced form of autocomplete.

A neural network is a learning algorithm inspired by how our brains function. It consists of a large number of nodes, known as neurons, that receive input from other neurons and perform a mathematical function to calculate their output, which then goes to other neurons. Each input has a weight attached to it that determines how much that value contributes to the result. The neural network's architecture (how the layers of neurons are connected) and its weights determine its functionality.
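To make this concrete, here is a minimal sketch in Python (with arbitrary, made-up numbers) of what a single artificial neuron computes: a weighted sum of its inputs, passed through a nonlinear activation function:

    import math

    def neuron(inputs, weights, bias):
        # Weighted sum of the inputs, plus a bias term; each weight
        # determines how much the corresponding input contributes.
        total = sum(x * w for x, w in zip(inputs, weights)) + bias
        # A nonlinear activation function (here the logistic sigmoid)
        # squashes the sum into an output between 0 and 1.
        return 1 / (1 + math.exp(-total))

    # Three inputs coming from other neurons, with arbitrary weights:
    print(neuron([0.5, 0.1, 0.9], [0.4, -0.6, 0.2], bias=0.1))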

The network starts with random weights; therefore, when it receives text as input (we're glossing over some details, such as how text is encoded in numbers), its output is also random. As a result, the network has to be trained using training data: an input text with a corresponding output text. Each time the training data input enters the network, its output is compared with the training data's corresponding output. The weights are then adjusted to decrease the difference between the predicted and correct output. In this way, the network undergoes a learning process until it becomes proficient at predicting text.
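The following sketch shows the core idea of one training step for a single, bare linear neuron; it is a deliberately simplified illustration (real networks use backpropagation through many layers of neurons), nudging each weight in the direction that reduces the difference between the predicted and correct output:

    def train_step(inputs, weights, target, learning_rate=0.01):
        # Prediction with the current weights.
        prediction = sum(x * w for x, w in zip(inputs, weights))
        error = prediction - target
        # The gradient of the squared error with respect to each weight
        # is proportional to error * input; move each weight against it.
        return [w - learning_rate * error * x
                for x, w in zip(inputs, weights)]

    weights = [0.0, 0.0, 0.0]        # start from arbitrary weights
    for _ in range(1000):            # repeated passes over one example
        weights = train_step([0.5, 0.1, 0.9], weights, target=1.0)
    print(weights)                   # the prediction is now close to 1.0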

Such a large neural network capable of generating human-like text is called a large language model (LLM). It typically has billions to several hundred billion weights, also known as parameters. ChatGPT is based on large language models developed by OpenAI, with names like GPT-3.5 or (the most recent version) GPT-4. GPT stands for Generative Pre-trained Transformer; it is a specific type of large language model, introduced by OpenAI in 2018 and based on the Transformer architecture that Google invented in 2017. Since GPT-3.5, OpenAI hasn't disclosed the size of its models; GPT-3 (released in May 2020) had 175 billion parameters and was trained on 570GB of text.

When large language models are trained on a broad range of data, as is the case with GPT-3.5 and GPT-4, they are also known as foundational models. Their broad training makes them adaptable to various tasks. A foundational model can be fine-tuned by training the model (or part of it) with new data for a specific task or a specific subject-matter domain. This is what OpenAI has done with ChatGPT: it has fine-tuned its foundational GPT models with conversations in which humans played both the user and the AI role. The result is a model specifically fine-tuned to follow a user's instructions and provide human-like responses.
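As a purely hypothetical illustration of what one such fine-tuning conversation could look like (real data sets use their own schemas; the structure and field names here are invented), humans provide both sides of the exchange:

    # One invented fine-tuning record: humans wrote both the "user"
    # and the "assistant" side of the conversation.
    conversation = [
        {"role": "user",
         "content": "What is a large language model?"},
        {"role": "assistant",
         "content": "A neural network with billions of parameters, "
                    "trained on large amounts of text to predict text."},
    ]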

BLOOM

Companies developing large language models lack incentives to open-source their models and the code to run them, since training the models requires significant computing power and financial investment. To make the development of these models sustainable, companies need to be able to build a profitable business around them. OpenAI aims to do this by offering the paid ChatGPT Plus subscription and its pay-per-use API access.

Last year, the situation changed with BLOOM (BigScience Large Open-science Open-access Multilingual Language Model), which is freely available. This large language model was the result of a global collaboration involving over a thousand scientists from more than 250 institutions participating as volunteers in the BigScience collective. The project was started by Hugging Face, a company that provides a machine-learning platform, with contributions from NVIDIA, Microsoft, and the French research institution CNRS.

Development of BLOOM occurred entirely in public. The model was trained for 3.5 months on the Jean Zay supercomputer in Paris, utilizing 384 NVIDIA A100 Tensor Core GPUs each with 80GB of RAM. The data set comprised 1.6TB of text (341 billion words) in 46 human languages and 13 programming languages, and the model had 176 billion parameters, which is comparable to GPT-3.

BigScience developed a new license to publish BLOOM: the Responsible AI License (RAIL). Its main purpose is to minimize risks arising from irresponsible use of the model. According to the license, users are not allowed to use the model to generate false information with the intent to harm others; nor may they neglect to disclose that the generated text was machine-generated. RAIL is not an open-source license and does not appear on the list of OSI-approved licenses. Developers can use the BLOOM model with Hugging Face's Apache-2-licensed transformers library in their own code, subject to the terms of RAIL.

Smaller large language models

A downside of BLOOM is that it's still too large for convenient local use. In principle, anyone could download the 330GB model, but using it would require substantial hardware. BigScience also released smaller versions of the model, and this trend of smaller models was continued by others. In February, Meta announced a language model called LLaMA, which is available in versions with sizes of seven billion, 13 billion, 33 billion, and 65 billion parameters. According to the developers, the 13B version performs as well as OpenAI's GPT-3, while being a factor of ten smaller. And unlike GPT-3, which requires multiple A100 GPUs to operate, LLaMA-13B needs only one GPU to achieve the same performance.

Meta trained LLaMA on publicly available data sets, such as Wikipedia and Common Crawl. The code to run LLaMA is GPLv3-licensed, but to obtain the full weights of the model, users were required to fill out a form and agree to a "non-commercial bespoke license". Moreover, Meta proved to be quite selective in granting access. But within a week, the weights were leaked on BitTorrent, and LLaMA kickstarted the development of a lot of derivatives. Stanford University introduced Alpaca 7B, based on the LLaMA model with seven billion parameters and supplemented with instructions based on OpenAI's text-davinci-003 model of the GPT-3.5 family. Both the data set and the model were released under the CC BY-NC 4.0 license and thus do not permit commercial use. One reason for this is that OpenAI's terms of use disallow the development of models that compete with OpenAI.

Subsequently, the open research organization Large Model Systems Organization published Vicuna, a LLaMA-based model fine-tuned on 70,000 user conversations with ChatGPT. The conversations were collected using ShareGPT, a Google Chrome browser extension designed to easily share ChatGPT conversations. According to the researchers, Vicuna achieves 90% of ChatGPT's quality and outperforms LLaMA and Alpaca in 90% of cases. Both the code (Apache 2 license) and the weights (13 billion parameters, subject to LLaMA's license) have been made public. However, since Vicuna is based on LLaMA and on output from ChatGPT, commercial use is not allowed.

This issue prompted US software company Databricks to develop an open-source large language model suitable for commercial use: Dolly 2.0. It is based on EleutherAI's Pythia model with 12 billion parameters, which was trained on the Pile text data set; Databricks then fine-tuned it on 15,000 instructions with answers. To create those instructions, the company engaged more than 5,000 of its employees. Dolly is trained on open-ended questions, closed questions, extracting factual information from texts, summarizing texts, brainstorming, classification, and creative writing tasks, all in English only. The 23.8GB dolly-v2-12b model can be downloaded from Databricks' page on Hugging Face. The model uses the MIT license, while the databricks-dolly-15k data set is published under the CC BY-SA 3.0 license.

Following this, Stability AI, the creator of the Stable Diffusion open-source model for generating images, published its own family of large language models: StableLM, under a CC BY-SA 4.0 license. Additionally, MosaicML introduced its MPT-7B family of open-source, commercially usable large language models (some of them Apache 2 licensed). Another interesting development is BigCode, a project kickstarted by ServiceNow Research and Hugging Face to develop large language models for completing and writing code from other code and natural language descriptions. Their first model, StarCoder, has been trained on permissively licensed data from GitHub and uses the OpenRAIL license, an updated version of the Responsible AI License.

Crowdsourcing open data sets

With new open-source (or freely available) language models emerging regularly (many of which can be found in the awesome-totally-open-chatgpt repository), various organizations have started considering ways to streamline the development of data sets for training models. One such organization is LAION (Large-scale Artificial Intelligence Open Network), a non-profit research organization aiming to democratize AI. With Open Assistant, it plans to develop large language models capable of running on consumer hardware.

Open Assistant is still under development and currently focuses mainly on collecting data sets with the help of users. The project already boasts a data set of 600,000 interactions, contributed by 13,000 volunteers. Anyone can lend a hand in this endeavor, as explained in the documentation. For example, users are tasked with grading an answer provided by another person, based on criteria such as quality or politeness. Another task involves offering an answer in the role of a chatbot to a user's request. Volunteers may also be asked to select the best response from two possible answers. Tasks are available not only in English, but in many other languages as well. The researchers intend to train language models using the data set generated by these volunteer tasks. It's worth noting that OpenAI takes a similar approach with ChatGPT: the company pays (low-wage) contractors to assist in training its language model and to help identify toxic content.
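As a sketch of what one such crowdsourced task could produce (the structure and field names below are invented for illustration; the actual Open Assistant data format may differ), a volunteer's preference between two candidate replies becomes one training record:

    # Invented example of a comparison task and the volunteer's verdict.
    task = {
        "prompt": "How do I check free disk space on Linux?",
        "replies": ["Run 'df -h' in a terminal.",
                    "You should buy a bigger disk."],
    }
    verdict = {
        "preferred_reply": 0,   # index of the better reply
        "quality": 4,           # grades on scales such as quality
        "politeness": 5,        # or politeness
    }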

Running language models on consumer hardware

Running large language models with tens to hundreds of billions of parameters on consumer hardware is not feasible. However, with the trend of smaller language models, initiated by LLaMA, operating ChatGPT-like software on a PC becomes possible. An important project for running these models is Georgi Gerganov's MIT-licensed llama.cpp. It enables users to run LLaMA, Alpaca, and other LLaMA-based models locally on their computers. It runs entirely on the CPU, which is made possible by applying 4-bit quantization to the models: the precision of the weights is reduced to four bits, cutting memory consumption and computational complexity. Llama.cpp supports Linux, macOS, and Windows. Users can chat with the model via a command-line interface.
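As a rough sketch of the idea behind 4-bit quantization (llama.cpp's actual q4_0 format works per block of weights, with one scale factor per block; the code below uses a single scale over all weights for simplicity), each floating-point weight is mapped to one of 16 integer levels:

    def quantize_4bit(weights):
        # Map each float to one of 16 levels (-8..7) via a scale factor.
        scale = max(abs(w) for w in weights) / 7
        if scale == 0:
            scale = 1.0   # all-zero weights simply quantize to zero
        return [max(-8, min(7, round(w / scale))) for w in weights], scale

    def dequantize(quantized, scale):
        # Approximate reconstruction of the original weights.
        return [q * scale for q in quantized]

    q, scale = quantize_4bit([0.12, -0.53, 0.97, -0.08])
    print(dequantize(q, scale))  # close to, but not exactly, the inputs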

The model has to be downloaded first, for example from Hugging Face's model hub, and then converted into a specific format; instructions for doing this are provided in llama.cpp's README file. As a test, I downloaded the 8.14GB model ggml-vic13b-q4_0.bin, which was already in the correct format for llama.cpp. The program loads the model fully into RAM, so it's crucial to ensure that the computer has sufficient free RAM.

After compiling llama.cpp, running the model in interactive mode is as simple as:

    $ ./main -m models/ggml-vic13b-q4_0.bin -i

On my somewhat dated laptop with a four-core Intel Core i7-10510U CPU at 1.8GHz and 24GB of RAM, this took 20 seconds to start up. After posing the question "What is the most common way of transportation in Amsterdam?", the Vicuna model began to generate its response word by word, taking 15 minutes to complete the task. The resulting answer was vague but not incorrect, resembling a text from a travel guide:

Amsterdam, the capital city of the Netherlands, is known for its picturesque canals and bustling bike culture. It's no surprise that the most common way to get around Amsterdam is by bicycle. However, there are also plenty of other transportation options available in the city, including buses, trams, trains, and even boats. Whether you prefer two wheels or four, there's a mode of transportation in Amsterdam that will suit your needs.

Various projects have emerged that aim to make using open-source language models more akin to a ChatGPT-like experience. For example, Nathan Sarrazin has been working on Serge, a web-based interface for llama.cpp. It runs in a Docker container, allowing users to download a model from a list, choose its settings, and then initiate a chat. Another similar undertaking is oobabooga's text-generation-webui, which supports both CPUs and GPUs for running models. Using GPUs can significantly improve the performance.

Is open-source AI gaining an edge?

In early May, a document penned by Google's Luke Sernau was leaked: "We have no moat, and neither does OpenAI". The author contends that the open-source community has been rapidly catching up with commercial efforts: "They are doing things with $100 and 13B params that we struggle with at $10M and 540B. And they are doing so in weeks, not months." Sernau also clearly recognizes the advantages of the open-source development model. After detailing the numerous innovations that have occurred within a month of LLaMA's weights being leaked, he notes that anyone can tinker: "Many of the new ideas are from ordinary people." The barrier to entry to contribute to these open-source large language models is just "one person, an evening, and a beefy laptop".

Sernau continues with lessons for Google (and OpenAI), focusing on LoRA, a technique that accelerates fine-tuning of models and is already used by many open-source projects in this domain. Thanks to LoRA, almost anyone with an idea can generate a fine-tuned model in under a day and for around $100. "At that pace, it doesn't take long before the cumulative effect of all of these fine-tunings overcomes starting off at a size disadvantage", he said, adding that focusing on maintaining some of the largest models on the planet actually puts Google at a disadvantage. At the end, he even made the case for opening up Google's large language models. He perceives Meta as the clear winner in all of this, because most open-source innovation is happening on top of its LLaMA model architecture.
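To make the LoRA (Low-Rank Adaptation) idea mentioned above concrete: the pre-trained weight matrices are frozen, and only two small low-rank matrices, whose product is added to each frozen matrix, are trained. Here is a toy sketch of the concept (not any particular library's API; the sizes are made up):

    import random

    n, r = 8, 2   # toy sizes; real layers have n in the thousands

    # Frozen pre-trained weight matrix W (a random stand-in here).
    W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]

    # Trainable low-rank factors A (n x r) and B (r x n): fine-tuning
    # touches only 2*n*r numbers instead of n*n.
    A = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(n)]
    B = [[0.0] * n for _ in range(r)]  # starts at zero, so the adapted
                                       # model initially equals the original

    def adapted_weight(i, j):
        # The effective weight is W + A @ B, computed entry by entry.
        return W[i][j] + sum(A[i][k] * B[k][j] for k in range(r))

Because the low-rank factors are so small, they are cheap to train, store, and share, which is what allows a fine-tuned variant to be produced on modest hardware.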

If Sernau is right, it means that language models could become a commodity, fueled by the innovative nature and fast-paced development model of the open-source AI community. This would enable researchers, non-profit organizations, and small businesses to access AI capabilities without depending on cloud-based services or expensive subscription fees. Looking back at the numerous language models that have been published during the last few months, we can wonder how long it will take before we have some usable, completely open-source AI assistants to help us with our daily tasks.



Democratizing AI with open-source language models

Posted May 17, 2023 19:53 UTC (Wed) by davidstrauss (guest, #85867) [Link]

> nor may they neglect to disclose that the generated text was machine-generated.

For the license to impose terms on use of the generated result, the AI model's authors (read: copyright holders/licensors) would need some claim on generated content. However, in the US, a machine/algorithm (or its owner) never gains such a claim on content that the machine has processed or generated.

For generated content, many lawyers are skeptical that OpenAI even holds the rights that they "grant" back to users. Consensus seems to be that the generated work is, at most, a derivative work of the prompts/inputs.

So, I suspect this disclosure requirement is unlikely to be enforceable in the US, at least to a meaningful level. I could possibly see it enforced on the first party that directly used the AI to generate the content, but it's not clear how those terms would carry over to someone else using the generated content.

This situation is similar to how a proprietary compiler's license can't require royalties for distribution of the compiled code.

(Standard "I am not a lawyer"/"this is not legal advice" disclaimer, though I have decades of experience in software licensing topics.)

Democratizing AI with open-source language models

Posted May 18, 2023 1:43 UTC (Thu) by developer122 (guest, #152928) [Link]

Such licences are also strongly reminiscent of "ethical" licences that impose restrictions on the use of software, ranging from "no using for military purposes" to "only use this software for the greater good" and other such nebulous concepts.

Even if these are enforceable, they are absolute garbage. It's a colossal reach for a software licence to restrict activities one can undertake that don't pertain to the software itself. They fall into the same hole the FSF did with the AGPL, trying to use a licence (technology) to solve a societal problem.

When all you have is a hammer, everything looks like a nail.

Democratizing AI with open-source language models

Posted May 17, 2023 22:00 UTC (Wed) by IanKelling (subscriber, #89418) [Link]

So Dolly 2.0, the "open-source language model", is based on multiple data sets, and the large one is not freely licensed. The authors say, roughly, that they probably don't have the right to distribute some of it, but that training a model is fair use, so don't worry about it ( https://arxiv.org/abs/2304.01373 ).

Up until now, if someone called a non-software work open source, I'd expect any generated output to have its source data published under a free license, as the natural translation of the meaning of open source; plus, the free culture definition specifies something like that: "all underlying source data should be available alongside the work itself under the same conditions" https://freedomdefined.org/Definition.

But now, if you google for "open source language model", it seems to have quickly become a widespread term that means that at least the large data set is not under a free software license or open source license or free culture license. The reason for this seems to basically be that freely licensed datasets are much smaller and less useful, so the people making these models have decided to promote a new definition that is different and degraded in an important way, and they seem to be doing it successfully. Am I understanding this right?

Democratizing AI with open-source language models

Posted May 17, 2023 22:30 UTC (Wed) by IanKelling (subscriber, #89418) [Link]

And a complicating factor in all this is that the ai models are meant to be modified, so the idea of source and binary from software and even most generated free culture works are not equivalent.

Democratizing AI with open-source language models

Posted May 18, 2023 0:49 UTC (Thu) by flussence (subscriber, #85566) [Link]

> The reason for this seems to basically be that freely licensed datasets are much smaller and less useful, so the people making these models have decided to promote a new definition that is different and degraded in an important way, and they seem to be doing it successfully. Am I understanding this right?

Yes, you are. The reason they don't publish the training data is because they'd open themselves up to be sued for statutory copyright infringement amounting to some large multiple of the entire GDP of the planet, and they know it. It was never theirs to use in the first place.

Democratizing AI with open-source language models

Posted May 18, 2023 5:32 UTC (Thu) by IanKelling (subscriber, #89418) [Link]

Well, that is one very common case. But I noticed that in the case of the Pythia data set, they are distributing it, and they will just remove parts of it if someone tells them that they hold the copyright and don't permit distribution.

They seem to be distinct cases, and maybe having the data set even under a nonfree license is like having free source code because, for example, if you have a 10TB data set of individual sentences, then for the purpose of model creation you don't modify individual sentences, you add or remove them.

Democratizing AI with open-source language models

Posted May 18, 2023 13:48 UTC (Thu) by dvrabel (subscriber, #9500) [Link]

Your brain is a "large language model" trained over years from a huge body of copyrighted works. Is the output of your brain copyrighted by those original works? Of course not, so why should a mathematical model trained in a conceptually similar way be any different?

Democratizing AI with open-source language models

Posted May 18, 2023 14:41 UTC (Thu) by mb (subscriber, #50428) [Link]

>Is the output of your brain copyrighted by those original works? Of course not

Of course it can be.
If you learn a pattern and reproduce it, it might be copyrighted by the original copyright owners, if sufficient amounts are copied.
There are no hard criteria here. Whether it is a copyright violation has to be decided on a case by case basis.
Neither a brain nor a neural network magically erases copyright.

Democratizing AI with open-source language models

Posted May 18, 2023 14:56 UTC (Thu) by smurf (subscriber, #17840) [Link]

It might be helpful to remember that copyright is a rather recent invention, and the Disney extensions (copyright shall not be forever, but extending it by 20 years every 20 years appears to be OK – the copyright to The Mouse! shall!! not!!! expire!!!1!) even more so.

That being said, the problem seems to be that the language models are large enough to "remember" specific works, but not structured enough to also remember the authors, let alone tell the user that use of the text might be restricted.

Democratizing AI with open-source language models

Posted May 18, 2023 15:05 UTC (Thu) by epa (subscriber, #39769) [Link]

Surely after running the model, you fuzzy-string-match its output against the entire training corpus, and warn the user if it seems to be parroting an existing work.

Democratizing AI with open-source language models

Posted May 18, 2023 15:09 UTC (Thu) by mb (subscriber, #50428) [Link]

Correct.

Another problem with judging brains and neural networks equally w.r.t. copyright is:

If you use a brain to rewrite (re-implement) some pre-existing work into something that does exactly the same thing, then this is a huge effort. It is allowed, and no copyright is transferred over to the new work, but it is a huge effort. So it won't be done unless somebody really wants to invest time and/or money into it. And then it's perfectly fine that the new author owns all copyright on the new thing.

If you use the Copyright-Eraser-3000 neural network to rewrite (re-implement) some pre-existing work into something that does exactly the same thing, then it's essentially just the click of a button. It takes no effort. It would render copyright useless, because you could always take any copyrighted material and convert it into something equivalent with copyrights removed within seconds.

So we must either
- judge brains and neural networks differently.
- or get rid of copyright law entirely.

Democratizing AI with open-source language models

Posted May 19, 2023 14:35 UTC (Fri) by Lennie (subscriber, #49641) [Link]

I guess someone has to make multiple AI agents, where one reads the original and generates a description, and the other reads the description and generates a new work. Aka clean-room re-implementation, and copyright bypassed.

Democratizing AI with open-source language models

Posted May 19, 2023 18:58 UTC (Fri) by mb (subscriber, #50428) [Link]

That would not fix the problem in any way.
Adding an intermediate language step and splitting up the machine learning part into two independent instances that pass the intermediate code between them does not increase the effort significantly. It's still a click of a button in Copyright-Eraser-3001 (Cleanroom edition).

The copyright concepts around clean-room-reimplementation and simple reimplementation do not work at all with machine learning.

Democratizing AI with open-source language models

Posted May 23, 2023 6:50 UTC (Tue) by Lennie (subscriber, #49641) [Link]

Wikipedia defines: "Typically, a clean-room design is done by having someone examine the system to be reimplemented and having this person write a specification. This specification is then reviewed by a lawyer to ensure that no copyrighted material is included. The specification is then implemented by a team with no connection to the original examiners."

So I guess the lawyer step was missing?

Or are you referring to something else?

The clean-room implementation would get far enough away from the original copyrighted work, assuming it was done correctly, but maybe you are referring to the situation where the output might be a re-implementation of someone else's copyrighted work (because the model was trained on other works).

Democratizing AI with open-source language models

Posted May 23, 2023 18:47 UTC (Tue) by mb (subscriber, #50428) [Link]

> Or are you referring to something else?

Yes. The problem is that the creators of copyright law never had AI in mind.
So it's not made for it.

It's time to either think of a way to adapt copyright law, or to get rid of it entirely. Because if we just let machine learning models remove copyright and be ok with that, it renders copyright law completely useless. I could then always just take foreign copyrighted A and make my copyrighted B out of it very quickly and cheaply. What's copyright good for then?

> So I guess the lawyer step was missing?

Yes. If you keep a non-trivial human step, I would say it's probably fine then. The machine steps can be seen as (de-)compilation then.

Democratizing AI with open-source language models

Posted May 23, 2023 19:24 UTC (Tue) by Lennie (subscriber, #49641) [Link]

> Because if we just let machine learning models remove copyright and be ok with that, it renders copyright law completely useless. I could then always just take foreign copyrighted A and make my copyrighted B out of it very quickly and cheaply. What's copyright good for then?

That was exactly what I was implying... did we just figure out that in a year copyright's current role will be useless?

Democratizing AI with open-source language models

Posted May 24, 2023 5:38 UTC (Wed) by smurf (subscriber, #17840) [Link]

Huh.

Copyright has, or originally had, a point. The point was to prevent publishers from ripping off authors by printing verbatim copies of their work without paying them.

The point is not, and never was, to prevent me from reading your work and then writing something in the same style, or using the general concepts, or whatever.

So what the heck is the problem? Sure, a machine could rewrite Stephen King's latest blockbuster until it's no longer immediately recognizable as a copy. So could I. Approximately nobody would buy that, however.

So what's the problem we're trying to solve here?

Democratizing AI with open-source language models

Posted May 24, 2023 7:24 UTC (Wed) by mb (subscriber, #50428) [Link]

If I can remove copyright from a program by clicking a button in the Copyright-Remover-3000 AI app, then the rip-off protection is useless and ineffective.
Rewriting by a human takes significant amounts of resources. That is the only reason why the rip-off protection works today.

Democratizing AI with open-source language models

Posted May 24, 2023 8:42 UTC (Wed) by paulj (subscriber, #341) [Link]

The original point of copyright was for publishers - note *not* authors - to re-assert the oligopoly they previously enjoyed over the printing of books.

The argument "won't someone think of the poor authors?" was made by the *printers*, but for completely self-serving reasons. The printers never cared for the authors before when they had their oligopoly by guild and royal decree, nor did they once they got copyright. They just made authors sign over the rights.

Democratizing AI with open-source language models

Posted May 24, 2023 11:37 UTC (Wed) by farnz (subscriber, #17727) [Link]

If you're interested in the background to this comment, look up the "Statute of Anne", the first copyright law. As paulj says, it wasn't about the authors at all - it was a consequence of the Stationers' Company trying to recreate their monopoly on printing after it became legal for anyone to own a printing press.

Democratizing AI with open-source language models

Posted May 31, 2023 12:27 UTC (Wed) by JohnFA (guest, #165385) [Link]

This seems like a potentially very significant issue for Open Source?

If code can be reimplemented by LLMs and escape copyright protection, this would neutralise copyleft.

Are you aware of any other discussions around this issue? Is it as serious as it sounds?

Democratizing AI with open-source language models

Posted Jun 1, 2023 20:46 UTC (Thu) by floppus (guest, #137245) [Link]

It's certainly an issue that has been discussed quite a bit, including here on LWN, in the context of GitHub's Copilot.

There's certainly no consensus about whether this sort of practice (laundering algorithms via large-scale AI models to escape copyright) is legal, nor any consensus about whether it ought to be legal. In practice it seems many people are doing it and not worrying about whether it's legal or not.

Does AI threaten the concept of copyleft? Yes, clearly. But copyleft was never (or shouldn't have been) the end goal. Will AI ultimately be harmful or beneficial to *software freedom*? That remains to be seen.


Democratizing AI with open-source language models

Posted Jun 3, 2023 16:39 UTC (Sat) by JohnFA (guest, #165385) [Link]

Can you elaborate on how AI might be beneficial to software freedom?

Perhaps by making it possible for more people to usefully contribute to Open Source projects?

Democratizing AI with open-source language models

Posted Jun 4, 2023 9:21 UTC (Sun) by joib (subscriber, #8541) [Link]

If the legal interpretation is going to be that LLM-produced content (code, prose, whatever) is not a derivative work of the training material, that would more or less make copyright irrelevant. Does it matter that Windows is proprietary if you can tell an LLM to "make me an OS with a flashy GUI and a win32-compatible ABI"?

Democratizing AI with open-source language models

Posted Jun 4, 2023 9:54 UTC (Sun) by mb (subscriber, #50428) [Link]

You don't have the Windows source code as training material.
Just writing (or in this case generating) a compatible OS is not a derived work (as in copyright), of course. See ReactOS.

The problem is the other way around: You can train on Open Source software source code and create equivalent proprietary software from it. That is where the question whether this process would erase copyright comes up.

Democratizing AI with open-source language models

Posted Jun 4, 2023 14:55 UTC (Sun) by kleptog (subscriber, #1183) [Link]

> The problem is the other way around: You can train on Open Source software source code and create equivalent proprietary software from it. That is where the question whether this process would erase copyright comes up.

Copyright is about copying. Either the output looks substantially like the input, or it doesn't. Whether it was generated by an LLM or by a person typing isn't relevant. This is a "colour of your bits" thing.

Democratizing AI with open-source language models

Posted Jun 4, 2023 15:51 UTC (Sun) by mb (subscriber, #50428) [Link]

Yes, that's essentially what I said.
You put your colored bits into the model and they come out in a transformed way on the other end.
If that processing erases copyright from those bits, then copyright law has essentially become useless.

Democratizing AI with open-source language models

Posted Jun 4, 2023 20:27 UTC (Sun) by kleptog (subscriber, #1183) [Link]

> You put your colored bits into the model and they come out in a transformed way on the other end. If that processing erases copyright from those bits, then copyright law has essentially become useless.

The point is that the copyright status (ie colour) of something cannot be determined just by looking at the bits, so any processing you do cannot "erase" any copyright either. Any argument ending with "so copyright law becomes useless" is bunk, because courts will simply make it not useless.

It's like those people who claimed to make a copyright remover by running the bits through an obfuscation algorithm and then reversing it. That's just not how it works.

Democratizing AI with open-source language models

Posted Jun 4, 2023 20:51 UTC (Sun) by mb (subscriber, #50428) [Link]

> so any processing you do cannot "erase" any copyright either.

I do not think this is what the big players think.
Currently large language models are trained on Open Source code and they emit source code without a license transfer. That is what all those "programming AIs" do.

>It's like those people who claimed to make a copyright remover by running the bits through an obfuscation algorithm and then reversing it. That's just not how it works.

I fully agree. But the big large language model developers apparently disagree.

Democratizing AI with open-source language models

Posted Jun 5, 2023 5:00 UTC (Mon) by smurf (subscriber, #17840) [Link]

> The point is that the copyright status (ie colour) of something cannot be determined just by looking at the bits

true

> so any processing you do cannot "erase" any copyright either.

false. Japan just declared, by fiat of law, that "learning isn't stealing", and thus anything reproduced by an AI isn't copyrighted – at least not by the owners of the copyrights of whatever the AI learned from.

https://m-cacm.acm.org/news/273479-japan-goes-all-in-copyright-doesnt-apply-to-ai-training/fulltext

Democratizing AI with open-source language models

Posted Jun 5, 2023 10:07 UTC (Mon) by kleptog (subscriber, #1183) [Link]

> false. Japan just declared, by fiat of law, that "learning isn't stealing", and thus anything reproduced by an AI isn't copyrighted – at least not by the owners of the copyrights of whatever the AI learned from.

> https://m-cacm.acm.org/news/273479-japan-goes-all-in-copyright-doesnt-apply-to-ai-training/fulltext

The way I'm reading it is that it's similar to the EU position: training on public data isn't a-priori copyright violation, just like browsing the web isn't violating anyone's copyright. It doesn't (AFAICT) say anything about the copyright status of the *output* of the model. Just like the law doesn't state that anything typed by a human is automatically free of copyright violations.

Democratizing AI with open-source language models

Posted Jun 5, 2023 13:31 UTC (Mon) by smurf (subscriber, #17840) [Link]

You're right in that the output's copyright status is not determined. However, that's a different problem which doesn't seem to have a good commonly-accepted answer yet – though IMHO if you manage to convince an AI model to output something that looks like art, you're the artist – and thus you get to hold the copyright.

Back to the topic: The point here is that the output's status is expressly NOT determined by – and/or does not depend on – the copyright status of its input, or any part thereof.

Democratizing AI with open-source language models

Posted Jun 5, 2023 14:26 UTC (Mon) by kleptog (subscriber, #1183) [Link]

> You're right in that the output's copyright status is not determined. However, that's a different problem which doesn't seem to have a good commonly-accepted answer yet – though IMHO if you manage to convince an AI model to output something that looks like art, you're the artist – and thus you get to hold the copyright.

I'm not clear why this is controversial: if you convince Photoshop to produce something that looks like art, you're the artist and you get to hold the copyright (assuming it's original). It's a tool; you manipulate the tool to produce art, so you're an artist. That it's an ML model is irrelevant.

Now, if an ML model only produces minor variations of existing artwork, then it's not a useful model. (Or it might be the world's greatest compression algorithm.) If you manipulate Photoshop to produce an exact replica of a Mondriaan, then that's your problem, not the tool's.

> Back to the topic: The point here is that the output's status is expressly NOT determined by – and/or does not depend on – the copyright status of its input, or any part thereof.

Well sure, but that doesn't mean the output is free of copyright. Colourless models can produce coloured output. There is no "copyright-erasing" going on because, as you say, the processing done does not determine the copyright status.

Democratizing AI with open-source language models

Posted Jun 6, 2023 10:58 UTC (Tue) by JohnFA (guest, #165385) [Link]

Quite possibly you are correct, but if this is the case, I don't see why this is good for software openness.

In the version of the future you are imagining, there is much reduced incentive to make source code available, if you have developed something special, since then an LLM can simply reimplement it, and others can free ride.

I do not think supporters of Open Source should be celebrating this.

Through copyleft, copyright has been a huge net positive for Open Source (though perhaps not a net positive over all, but that is debatable, because Open Source is so important.)

Without copyright, the situation for source code is in danger of degenerating into something like what we have with data sets: https://lu.is/blog/2016/09/21/copyleft-attribution-and-da...

Democratizing AI with open-source language models

Posted Jun 6, 2023 15:36 UTC (Tue) by Wol (subscriber, #4433) [Link]

> In the version of the future you are imagining, there is much reduced incentive to make source code available, if you have developed something special, since then an LLM can simply reimplement it, and others can free ride.

That's not the future, that's today.

I can see your copyrighted work, have a lightbulb moment, and re-implement it much better than you did.

The sad fact (from the position of copyright maximalists) is that having the idea is the hard part. With novels, that's fine, but with functional software as soon as someone sees what it can do, they can copy the functionality far more easily than it cost you to have the original idea, and there's NOTHING you can do about it.

Cheers,
Wol

Democratizing AI with open-source language models

Posted Jun 7, 2023 10:31 UTC (Wed) by geert (subscriber, #98403) [Link]

Patent it?

Democratizing AI with open-source language models

Posted Jun 6, 2023 16:29 UTC (Tue) by kleptog (subscriber, #1183) [Link]

> In the version of the future you are imagining, there is much reduced incentive to make source code available, if you have developed something special, since then an LLM can simply reimplement it, and others can free ride.

Let me see if I'm understanding this correctly. Your argument is that if you write a program to scratch your own itch, you would be less likely to make it open-source because you think other people with the same itch could use an LLM to solve the problem instead?

I don't see how that could be true, instead I'm imagining that you'd scratch your itch, using an LLM to get there faster, and open-source it so other people can also use it. And they in turn use LLMs to make it even better faster and scratch even more itches. Seems to me LLMs and Open Source complement each other nicely, the combination beating each individually.

Frankly, I think LLMs work better for open source because the number of people with itches that need scratching, but who currently cannot code, is vastly larger than businesses with proprietary software can possibly support.

(When they get good enough, that is: LLMs are good at making convincing-looking code snippets, but they're not there yet for larger projects.)

It feels a bit like the argument: if we ever invent replicators then people will stop making art and that would be bad. Dude, we'd have *replicators*, the possibilities for art would explode.

> Through copyleft, copyright has been a huge net positive for Open Source (though perhaps not a net positive over all, but that is debatable, because Open Source is so important.)

Copyleft has played its part, but there is a huge amount of non-copylefted open source code out there. I'm in the "only running code has value" camp: if it's not solving actual problems, it's just a bunch of oddly low-entropy bits. At my work the bean-counters assign value to the code I write, but in my opinion it is valueless to anyone outside the business. It has value only because of the organisation it is embedded in, so that real-world problems are solved.

And it only works because of the mountains of open source code it uses.

Also, just because an LLM may be colourless, the prompts that go into it are not. You can make it produce any output you want by asking it to repeat what you say. The output is definitely a derived work of the prompt. I don't see how the current situation changes much for copyright of source code. Only the expression was ever covered, not the ideas themselves. If we get to the situation where merely expressing an idea is enough to get a functional implementation, we'll have invented magic.

Democratizing AI with open-source language models

Posted Jun 7, 2023 14:29 UTC (Wed) by JohnFA (guest, #165385) [Link]

I agree with your prediction regarding the effect of LLMs on those motivated primarily by itch scratching. Seems like it should be mostly helpful there.

I was thinking more of Open Source projects that come out of the commercial sector. Apache 2.0 seems to be a favourite there due to the patent license. Without copyright protection, there's no way to enforce a patent license clause.

(n.b. I should have said avoiding copyright is the problem that LLMs bring, as I see it, as this affects not only copyleft but other contractual things such as patent licenses. So in other words there could be a problem not just with non-permissive licenses, but also with certain permissive ones such as Apache 2.0)

Democratizing AI with open-source language models

Posted May 21, 2023 4:46 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

> the copyright to The Mouse! shall!! not!!! expire!!!1!

PSA: The (oldest) copyright to The Mouse will expire on January 1, 2024, and neither Disney nor Congress have shown the slightest interest in doing anything about it. Disney has even publicly acknowledged[1] that it will happen. It was a useful example for a while, but we will soon need to start using a new one.

[1]: https://www.nytimes.com/2022/12/27/business/mickey-mouse-...

Democratizing AI with open-source language models

Posted May 21, 2023 7:52 UTC (Sun) by smurf (subscriber, #17840) [Link]

Well, they have done plenty in the past.

The Disney example will serve. After all, while another extension seems to be off the table, nobody is actually working on reverting (some of) this atrocity either.

NB, according to Wikipedia the worst example is Mexico. Life+100 years. Ugh.

Democratizing AI with open-source language models

Posted May 19, 2023 14:11 UTC (Fri) by Lennie (subscriber, #49641) [Link]

If you read a page and can reproduce it word for word then you've created a work that is copyrighted by the original author(s).

Democratizing AI with open-source language models

Posted May 31, 2023 11:49 UTC (Wed) by Rudd-O (guest, #61155) [Link]

> But now, if you google for "open source language model", it seems to have quickly become a widespread term that means that at least the large data set is not under a free software license or open source license or free culture license. The reason for this seems to basically be that freely licensed datasets are much smaller and less useful, so the people making these models have decided to promote a new definition that is different and degraded in an important way, and they seem to be doing it successfully. Am I understanding this right?

Yes, and this is profoundly tragic.

I would feel confident pinning the blame for the *origin* of this epistemic degeneration on OpenAI — whose stated intentions were to be an open project for building AI... then promptly reversed course.

Democratizing AI with open-source language models

Posted May 18, 2023 4:42 UTC (Thu) by pabs (subscriber, #43278) [Link]

I wonder if any of these open models meet the Debian Deep Learning Team's Machine Learning Policy:

https://salsa.debian.org/deeplearning-team/ml-policy/

Democratizing AI with open-source language models

Posted May 18, 2023 18:14 UTC (Thu) by IanKelling (subscriber, #89418) [Link]

That is a good relevant link. Thanks.

Democratizing AI with open-source language models

Posted May 18, 2023 11:28 UTC (Thu) by wtarreau (subscriber, #51152) [Link]

Nice summary of the current situation!

Another point that makes open-source models progress faster is that their performance depends a lot on their size, because a lot of parameters have to be evaluated for each produced word. And while the early players had deep pockets to use hundreds or thousands of GPUs, individuals do not have an A100 GPU to play with this code, so they're willing to make compromises on the model's size so that it fits in their GPU or main memory and runs fast enough to be usable. That's how we're now seeing a lot of work being done to try to optimize the accuracy of small models. Vicuna 13B is exceptionally good even in 4 bits and can be run on an 8GB machine with a moderate CPU. A few weeks ago, TogetherComputer released RedPajama 3B (3 billion parameters), which is amazing for this size, and they're still finishing the training of the 7B model.

What I suspect is that models will more and more specialize in certain knowledge areas, in order to be smaller, faster, and more accurate in those areas, with fewer hallucinations, and that they will complement each other better (exactly like humans in fact: you don't expect an electrician to be a good plumber, and nobody would want such a dual-competence person if it made them work slower). We'll see, but these are very interesting times. I think that these new language models will have at least as high an impact on society as smartphones had, if not more.

Democratizing AI with open-source language models

Posted May 18, 2023 11:29 UTC (Thu) by karim (subscriber, #114) [Link]

Thanks for this well explained and researched article.

I think it's wonderful that the open source M.O. is enabling the commoditization of generating and running LLMs/AI. That's awesome. It'll eventually (short-?, mid-?, long-? term) make sure that the cost of such capabilities is as low as possible. In and of itself I think this is a comforting and worthwhile goal. Both at a personal level and a org/corp level, it'll enable local use of LLMs on private data-sets. That's huge.

Yet, still, the most "powerful" models are likely to come from those organizations with the largest of data sets. This will likely mean the existing dominant consumer-facing mass-market behemoths, including Google, FB/Meta, Twitter, etc. will remain on top. Regardless of whether it open sources its underlying tech or not, think of the models that Google, for example, can actually generate by training its AIs on just the collective set of videos found on YouTube ... keeping in mind that it also has access to data from Gmail, Android devices, Gdrive, maps/navigation, etc. Same with FB/Meta. Training discussions just based on social network interaction (no matter what your opinion might be about this) is going to be quite interesting as well.

In short, it's likely that the commoditization of LLM engines through open source is going to ensure that the ensuing battle is about data sets. This will be interesting to watch.

Democratizing AI with open-source language models

Posted May 20, 2023 17:16 UTC (Sat) by rra (subscriber, #99804) [Link]

I think there's a widespread assumption right now that larger is better, but part of what I gathered from the article is that it's far from clear this is truly the case. Those large commercial models from companies like OpenAI and Google have a massive problem with inventing totally fake information (for which their marketing departments are desperately trying to get everyone to use the term "hallucination" instead of the more accurate term "bullshit"). I suspect there are a lot of causes for that and some of them are inherent in the structure of the model, but it seems likely that part of the problem is that shoveling everything on the Internet, true or false, into the model produces a model that is not weighted for truth over falsity.

Smaller models aren't as splashy because they are less likely to be able to hold a general conversation about any topic while being confidently wrong. But unless the goal is to replace business executives and politicians with AI, it's not clear that application is useful for more than entertainment. Smaller models carefully trained on more accurate and vetted data may be more useful for practical applications that don't make for good viral social media but that do help solve real-world problems.

Democratizing AI with open-source language models

Posted May 20, 2023 17:46 UTC (Sat) by Wol (subscriber, #4433) [Link]

Or in other words, "Garbage In Garbage Out" is as true for AI as it is for anything else. :-)

Cheers,
Wol

Democratizing AI with open-source language models

Posted May 20, 2023 18:06 UTC (Sat) by smurf (subscriber, #17840) [Link]

Most people recognize obvious garbage/fantasy/lies. They've been labelled as such, implicitly, since early childhood. Somehow, your fantasy pet dragon doesn't elicit quite the same reaction from your parents as a real-world cat, esp. when it has kicked the vase off the living room table; also, you might have noticed that grown-ups are much more concerned about the monsters at the IRS than the ones under your bed.

There are no labels whatsoever on the relevant parts of the AI training data. It's all a heap of text scraped off the WWW. At best, some labels are haphazardly glued on after the model has been trained, but that's much too superficial. You obviously can't get deep categorization that way, and thus can't prevent the model from hallucinating. Or spewing bullshit. Take your pick.

Democratizing AI with open-source language models

Posted Jul 5, 2023 20:42 UTC (Wed) by nix (subscriber, #2304) [Link]

> I suspect there are a lot of causes for that and some of them are inherent in the structure of the model, but it seems likely that part of the problem is that shoveling everything on the Internet, true or false, into the model produces a model that is not weighted for truth over falsity.
Another problem is that though the models clearly know what text is statistically likely to follow other text, that doesn't mean they understand anything about grammar. In particular it is trivial to construct examples showing that even the largest current models don't know what the word "not" means. If they don't understand what it means to say something is the case as opposed to saying something is not the case -- if they are happy to consider each equally likely and the word "not" just as another token rather than as something with substantial semantic effect -- I don't see how they can possibly ever not bullshit.

Democratizing AI with open-source language models

Posted Jul 5, 2023 21:47 UTC (Wed) by kleptog (subscriber, #1183) [Link]

> Another problem is that though the models clearly know what text is statistically likely to follow other text, that doesn't mean they understand anything about grammar.

This is not really a problem (yet). Humans learning their first language learn to use it correctly without knowing anything about grammar. The grammar of a language is after all defined by "how people actually use the words" and not some list of rules someone drew up somewhere. Ergo, an LLM is learning language exactly the same way humans do.

Of course grammar evolves over time as people use it, and the risk is that if LLMs start talking to each other a lot, they might start evolving a grammar that is distinct from what most people use. While linguistically interesting, this would not be useful for the purpose of interacting with most people.

The main problem is that LLMs have no real world references, so things like first, tallest, north, left, scale, distance, weight, etc are all a mystery to it. Leading to weird conversations where it can correctly give you the altitude of two cities, but then get wrong which is higher. At some point though someone is going to successfully couple an LLM with an analytic engine that does understand these things.

Though it does touch on a question I asked my AI course lecturer years ago: is it possible for an AI to truly understand human language if it doesn't live for years in a corporeal body interacting with the real world? How can it otherwise understand what heavy, light, bright, dark, pain, hunger or fear are? And related: would humans still be as empathic if we didn't have mirror neurons?

Democratizing AI with open-source language models

Posted Jul 6, 2023 5:28 UTC (Thu) by smurf (subscriber, #17840) [Link]

Besides mirror neurons we humans have a heap of more-or-less-special-purpose structures in our brains. For instance there's one in our vision system that routes human faces to one place in the brain and not-faces to someplace else. If that's damaged you can't recognize faces, period.

The same thing seems to apply to true/false, reality/fiction, fair/unfair, and a number of other concepts which babies understand even before they properly learn language. The same thing applies to some mammals, and (with variations) even some non-mammals.

My point is: if you want to teach any of that to an AI, you need to have parts that recognize those special-purpose concepts in its neural structure and THEN train it to apply the language model TO that structure.

One useful analog here can be seen in last year's AI day video from Tesla, where they reported how their driving model works. They added specific structures for the AI to "remember" things like occluded cars or people because, surprise, just throwing a bunch of labelled images at it doesn't cause it to come up with that mechanism on its own.

Evolution did (and we play peek-a-boo with our kids so that they link that evolutionary structure to the real world), but it has its own way of remembering and optimizing basic neuronal network architecture – DNA. Training some random GPT model or two, or a thousand, on random data from the 'net is unlikely to replicate that even if it's labelled appropriately, which it currently mostly isn't.

Democratizing AI with open-source language models

Posted Jul 10, 2023 15:22 UTC (Mon) by nix (subscriber, #2304) [Link]

> Besides mirror neurons we humans have a heap of more-or-less-special-purpose structures in our brains.
And whether or not you think humans have special-purpose structures in our brains for any purpose (yes, it should be dead obvious to anyone that we do, but some people particularly in machine learning persist in claiming that there is none and that our brains are undifferentiated in any way that matters: so let's take that as read), it is clear that human children are better at learning language than other organisms with brains that grow up around a lot of it (say, domestic dogs): even if a few of them can pick up a few words, nothing else has managed to pick up grammar: the most you get is disordered word salad, the like of which you only see from human first-language learners if brain-damaged. So there is *something* different about what human children are or do that makes them especially good at learning language compared to every other animal on Earth -- and one thing that is definitely true is that all human languages are created almost entirely by generations of children and are thus optimized to be learned by children.

There is thus no guarantee that non-human non-children can do whatever it is humans do, and there is even less guarantee that, if they *do* produce something that looks like language, they are using the same mechanisms to learn or generate it that people do. I would venture to suggest that the presence of "glitch tokens" and their utterly bizarre behaviour makes it quite clear that LLMs don't know what language is, don't know what words are, and don't even know what it means to spell something or repeat it back to you, so their actual level of knowledge is lower than a two-year-old human child's, even if they're very good at faking it. No human child who knew what spelling was would reply to 'spell " petertodd" back to me' with 'L-E-I-L-A-N'. GPT-4 doesn't even realise it has done anything wrong, because it can't realise anything at all.
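
The mechanism behind those glitch tokens is easy to see for yourself: the model never receives letters at all, only integer ids from a byte-pair-encoding vocabulary. A quick check with OpenAI's tiktoken library (assuming it is installed; r50k_base is the GPT-2/GPT-3-era vocabulary in which the glitch tokens were found):

    # LLMs see token ids, not characters, which is why "spell this
    # back" can fail so strangely.
    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")
    for text in [" petertodd", " hello", "spelling"]:
        ids = enc.encode(text)
        print(f"{text!r} -> {ids} ({len(ids)} token(s))")
    # When a string maps to a single id, the model has no access to its
    # individual characters; any "spelling" of it is a statistical guess.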

Democratizing AI with open-source language models

Posted Jul 10, 2023 21:03 UTC (Mon) by Wol (subscriber, #4433) [Link]

> There is thus no guarantee that non-human non-children can do whatever it is humans do, and there is even less guarantee that, if they *do* produce something that looks like language, they are using the same mechanisms to learn or generate it that people do.

Chances are they do use the same mechanisms, given that gene commonality is pretty high across species... Birds and mammals all communicate with sound, so we all have mechanisms for making sounds and for identifying them. What probably makes humans unique is that we have a voice-box, with which we can make a massive range of sounds.

Also, there's no guarantee that the ability to make assorted sounds evolves together with the ability to recognise those sounds and associate meaning with them, although there is clear evidence that this has happened in other species. Two examples I'm aware of: whales can communicate over ocean distances - when we don't make noises that interfere, whales are known to send messages across the Atlantic. And starlings - one of the excellent mimics of the bird world - are known to associate certain sounds with human behaviour. Go back 40 years and the trimphone was very popular in the UK. There are thousands of recorded instances of starlings mimicking said phone - clearly KNOWING that the human in the garden would go running back into the house. They were obviously doing it for the amusement (to them) factor. Who's to say that won't evolve into something we would call language (if they haven't got it already)?

Cheers,
Wol

Democratizing AI with open-source language models

Posted Jul 11, 2023 4:51 UTC (Tue) by smurf (subscriber, #17840) [Link]

That's not a counter-argument. The key word is "grammar", and there seems to be no indication that any animal can learn that.

Without grammar, you can't talk about talking. You can't even talk about interacting with the world: you don't teach your child "if you see a fire, drop everything, shout 'Fire!' and run to the exit" by demonstration, do you? You simply tell them, or you show them a picture or movie. The AI doesn't, and indeed cannot, distinguish between the picture and the world, as multiple prompt jailbreaks have demonstrated.

Input to the LLMs contains a lot of examples of how to spell, yet they didn't learn the concept of "spelling". And that's the lowest-level example. You don't learn that from one training run, no matter how large the model is. You learn it the way evolution did: by iterating on the *structure* of the model, not just its parameters. (And, like evolution, you need to run a million iterations across a million samples each in order to get anywhere.)
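
A toy version of "iterating on the structure" might look like the following sketch, where train_and_score() is a stand-in for an actual (and enormously expensive) training run and the genome is just a list of layer sizes:

    # Toy neuroevolution loop: mutate architectures, train each briefly,
    # keep the best.  Real runs would need the millions of iterations
    # mentioned above.
    import random

    def train_and_score(layers: list[int]) -> float:
        # Stand-in for: build a network with these layer sizes, train
        # it, return validation accuracy.  Here, a fake fitness score.
        return -abs(len(layers) - 4) + random.random()

    population = [[32, 32] for _ in range(8)]
    for generation in range(20):
        survivors = sorted(population, key=train_and_score, reverse=True)[:4]
        mutants = [s + [random.choice([16, 32, 64])] for s in survivors]
        population = survivors + mutants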

That, or somebody designs a neural network architecture, or (I suspect) several interconnected ones, that might be able to understand grammar from first principles. Good luck; we still have no idea what something like that should look like.

Meanwhile, the current crop of models is a local optimum. You can play around with them however you like, but they are unlikely ever to extract an understanding of sentence structure from a billion example sentences, or of spatial relationships from a billion example 2D pictures. It's all just random bits. In fact, that's how AI image generation works – you transform a heap of random bits into something not quite as random and repeat the process until you get a nice but meaningless-to-the-AI image.
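
That last sentence is a fair description of diffusion sampling, and the control flow really is that small. In this schematic sketch, denoise() is a stand-in for the trained network (a real sampler such as DDPM also re-injects a little fresh noise at each step):

    # Schematic diffusion sampling: start from pure noise, repeatedly
    # replace it with something slightly less random.
    import numpy as np

    def denoise(image: np.ndarray, step: int) -> np.ndarray:
        # A real model predicts the noise to remove, conditioned on the
        # prompt; this placeholder just damps the values.
        return image * 0.95

    def sample(shape=(64, 64, 3), steps=50) -> np.ndarray:
        image = np.random.randn(*shape)      # "a heap of random bits"
        for step in reversed(range(steps)):  # gradually less random
            image = denoise(image, step)
        return image

    picture = sample()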

Democratizing AI with open-source language models

Posted Jul 11, 2023 11:38 UTC (Tue) by kleptog (subscriber, #1183) [Link]

Agreed: even OpenAI says LLMs aren't going to get better just by throwing more data at them. Further advancement will require linking them with other things.

However, I would argue that LLMs understand grammar just as well as most people do: they can produce grammatically correct sentences nearly 100% of the time, but they would not be able to describe the difference between an adjective and an adverb, or explain why it's a "big brown dog" and not a "brown big dog".

ISTM, though, that LLMs do solve one of the biggest stumbling blocks for general AI. Until now, getting a computer to understand complex sentences and context, and to resolve some of the inherent ambiguity in language, was really hard. Now we're much, much closer. That said, what exactly the new "hard problem" is won't be clear for a few years, and in the meantime open-source models give a whole generation of students a level playing field to start from.

I think what's mostly needed is linking an LLM with some kind of reliable memory so that it doesn't just make stuff up, plus some kind of super-ego that monitors its sub-modules for odd behaviour and corrects them: the "shit, what I'm about to say will have serious long-term consequences, close mouth now" feature.
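
One concrete form of such a reliable memory is retrieval augmentation: keep trusted facts outside the model, look up the ones relevant to a question, and paste them into the prompt so the model completes from them rather than inventing. A minimal sketch, with llm_complete() again a hypothetical stand-in for a real model call:

    # Minimal retrieval-augmented sketch: "memory" is a document store
    # searched per question; the model is told to answer only from it.
    MEMORY = [
        "ChatGPT was released to the public in November 2022.",
        "GPT-3 was trained on 570GB of text.",
    ]

    def llm_complete(prompt: str) -> str:
        return "..."  # imagine a real model call here

    def retrieve(question: str, k: int = 1) -> list[str]:
        # Toy relevance score: count shared words.  Real systems use
        # embedding vectors and a nearest-neighbour index instead.
        words = set(question.lower().split())
        return sorted(MEMORY,
                      key=lambda d: -len(words & set(d.lower().split())))[:k]

    def answer(question: str) -> str:
        facts = "\n".join(retrieve(question))
        return llm_complete(f"Using only these facts:\n{facts}\n"
                            f"Answer, or say you don't know: {question}")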

Democratizing AI with open-source language models

Posted Jul 11, 2023 18:42 UTC (Tue) by mpr22 (subscriber, #60784) [Link]

> What probably makes humans unique is that we have a voice-box, with which we can make a massive range of sounds.

It's all the bits and pieces above the voice box that allow human speech to feature such a bewildering array of sounds, from open back unrounded vowels to bilabial clicks by way of velar ejectives, palatal approximants, voiced alveolar lateral affricates, and dental plosives.

Textual AIs beyond chatting

Posted May 18, 2023 12:12 UTC (Thu) by smurf (subscriber, #17840) [Link]

Language models are just the first step, though. Is there anything out there beyond the ubiquitous chatbot?

For images we have the "here's a rough outline of the contents, and some hints about the style I want, now draw a pretty picture" AIs, Midjourney and Stable Diffusion and such, with selective refinement like inpainting.

Is there anything out there that does the same thing with text? For example, takes an outline and some examples of style to produce some text that's actually coherent? Like, highlight a terse description of a scene or two and ask the machine to "expand this to ten pages and introduce Joe Evilmonger"?

Textual AIs beyond chatting

Posted May 18, 2023 14:46 UTC (Thu) by hkario (subscriber, #94864) [Link]

> Is there anything out there that does the same thing with text? For example, takes an outline and some examples of style to produce some text that's actually coherent? Like, highlight a terse description of a scene or two and ask the machine to "expand this to ten pages and introduce Joe Evilmonger"?

That's what ChatGPT can already do; you just have to ask. Tell it to give you short answers, and it will provide a sentence or two. Tell it to generate long paragraphs based on a concept, and it will do that instead.

The same goes for error checking, grammar, typos, and so on: ask it to proofread the text that follows and it will give you suggestions.
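
For anyone who would rather script this than use the web interface, the same steering works through OpenAI's API. This sketch uses the openai package's pre-1.0 ChatCompletion interface (current as of this article) and assumes an API key is configured; only the instruction text changes between short answers, long expansions, and proofreading:

    # Steering output length and style is just prompting.
    import openai  # pre-1.0 interface; expects OPENAI_API_KEY to be set

    def ask(instruction: str, text: str) -> str:
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "system", "content": instruction},
                      {"role": "user", "content": text}])
        return resp["choices"][0]["message"]["content"]

    scene = "Two rebels meet at night on the bridge."
    print(ask("Answer in one or two sentences.", "What is inpainting?"))
    print(ask("Expand this outline into ten pages and introduce a new "
              "villain, Joe Evilmonger.", scene))
    print(ask("Proofread the following text and list corrections.", scene))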

Democratizing AI with open-source language models

Posted May 19, 2023 2:27 UTC (Fri) by aiethics.abhishek (subscriber, #164948) [Link]

Something else to consider in this discussion is how we think about democratization in the first place. We mostly focus on the underlying data and on whether trained weights are released, but democratization can take on other meanings: for example, democratizing profits, use, and governance, as articulated here.

Democratizing AI with open-source language models

Posted May 19, 2023 9:32 UTC (Fri) by tzafrir (subscriber, #11501) [Link]

From what I understand, making "AI" a commodity means that it is not a product that you can control, gatekeep, and profit from.

(Or rather: the higher end will still require beefy servers, but "good enough" will become available to just about any free-software project or startup, just as Linux is.)

Will that actually happen? That remains to be seen. But that is the sense of the word here.

Democratizing AI with open-source language models

Posted May 21, 2023 1:15 UTC (Sun) by NightMonkey (subscriber, #23051) [Link]

> A neural network is a learning algorithm inspired by how our brains function.

So, yeah. I know, you might think this is semantics, but it should really say:

> A neural network is a learning algorithm inspired by a model of how brains function.

Yes, there are still too many unknowns for us to simply declare that we know how the brain works. :) Computational models of neurological processes are thus very likely wrong insofar as they claim to mimic the brain and its components. I just have to say this because the hubris in this field of computing is breathtakingly large, outstripping what the medical and biological sciences support. Cheers, and thanks for a nice read. :)

Democratizing AI with open-source language models

Posted May 21, 2023 1:24 UTC (Sun) by NightMonkey (subscriber, #23051) [Link]

Don't miss that there are also lots of poorly paid humans, hired in countries without a history of worker protection, assisting the proprietary "A.I." shops by trimming the rough edges off their bots' output...

https://time.com/6247678/openai-chatgpt-kenya-workers/

Can open frameworks and corpora help here? I'm not so sure myself... But I think this should be considered when comparing the two licensing and "intellectual property" models.

Democratizing AI with open-source language models

Posted May 24, 2023 16:24 UTC (Wed) by jezuch (subscriber, #52988) [Link]

I'm late to the party but I have to note that reading this, especially the last part, eerily reminded me of https://qntm.org/mmacevedo

Democratizing AI with open-source language models

Posted Jul 5, 2023 20:46 UTC (Wed) by nix (subscriber, #2304) [Link]

See also the sort-of-sequel _Driver_, in _Valuable Humans in Transit and Other Stories_. Why is that the title? That would be telling.


Copyright © 2023, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds