

Disclaimer: I am not a lawyer; this is not legal analysis, nor legal advice, just an opinion.

Mark Pilgrim, the original creator of chardet, recently opened a GitHub issue on the project after finding that it had been relicensed from the original LGPL to the MIT license, a process generally referred to as license laundering. The current maintainer, Dan Blanchard, rewrote the project “from scratch” using Claude, changed its license, and assigned the copyright to himself, claiming this was an independent implementation done in a clean-room environment.

Yet the instructions (line 311) suggest that Claude must have accessed existing source files, so how clean that room was is quite debatable:

Context: The registry maps every supported encoding to its metadata. Era assignments MUST match chardet 6.0.0’s chardet/metadata/charsets.py at https://raw.githubusercontent.com/chardet/chardet/f0676c0d6a4263827924b78a62957547fca40052/chardet/metadata/charsets.py

Fetch that file and use it as the authoritative reference for which encodings belong to which era. Do not invent era assignments.

And again, later in the same file (line 2643), the test files are pulled from the existing copyleft-licensed repo in order to validate the implementation. Maybe not exactly the cleanest room. Anyway, regardless of the outcome of this particular situation, what are the implications for open source? I’d like to explore this example, because it shows what could become an appealing option for copyleft open source maintainers given the right (or wrong) set of circumstances and opportunities.

Why Would You Do This?

Who benefits from changing a copyleft license to a permissive one, such as the MIT or BSD license? Well, those who are inconvenienced by copyleft, such as developers who want to bundle a copyleft-licensed package in a binary or distribute it in an otherwise opaque manner. Dan Blanchard gives the example that there were talks about merging chardet into the Python standard library, but this couldn’t happen because of the copyleft license’s incompatibility with Python’s permissive license. This conversation was supposedly on Twitter, though, and the thread has been lost, so who really knows?

This distribution restriction is no big deal for dynamically linked libraries, but it is a bit of a gray area for Python, where libraries aren’t dynamically linked but rather pulled in source form into the program that uses them.

Rather than dealing with lawyers or legal departments to find out if ${MEGACORP} could purchase a license, the perceived path of least resistance is pursued instead: beg the developer to change to a different license, which coincidentally happened in this chardet issue in 2014.

In short: these copyleft licenses aren’t an issue unless you’re distributing a closed-source program that relies on a copyleft-licensed dependency. These licenses exist to ensure the software remains free (as in speech) and doesn’t get compiled away into a closed binary blob. If you don’t like that, don’t use it. For chardet, I wonder if Mark Pilgrim was contacted when there were talks about merging the library into Python. After all, as the copyright holder, he is the one who could have relicensed the project.

Authorship

Who has the authorship claim in a long-running project like chardet? Is it the person who has maintained the project for the last twelve years, or the person who owns the copyright? According to Dan Blanchard:

[…] it is pretty wild to me how people are saying I am abusing the name of a thing that I have been the near sole contributor and maintainer of for over 12 years. If my understanding of history is correct (because some of this does pre-date me), Mark ported the original from C in 2006. Then he deleted the repository in ~2011. Then other people recreated the repository, and there was briefly a Python 3 fork called charade that @sigmavirus24 created. I took over after leading an effort to merge the charade and chardet repos back into one codebase in ~2013. Since then every release has been put out by me. It’s not like this was a thing I just popped into last week.

Dan has put a lot of time and effort into this project over the last twelve years, but can he claim authorship? I’m not so sure. This code, even though the GitHub repo belongs to Dan, is owned by Mark, the copyright holder, and is a fork of his original work, licensed under the LGPL. True, it’s a fork that doesn’t bear much resemblance to the original, but until this very latest change, the copyright was owned by Mark Pilgrim and the code was licensed under the LGPL, version 2.1. It has been Dan’s choice to work on this project, knowing full well the limitations this presented.

Tabula Rasa

Why wouldn’t the maintainer create a brand-new, API-compatible library licensed under the MIT license and leave the old one as-is, in maintenance mode? Call the thing chardet2, and link it in the header of the original library. Output a deprecation warning on install, linking to the new PyPI package, and you’re done. This is a common pattern, so why not here? Regardless of the test coverage, a full-on rewrite is one hell of a risky move. Of course, that project would need to see adoption and wouldn’t inherit the GitHub stars, existing installs, or other benefits that the base library has. Maybe that’s my cynicism speaking, but I don’t think this would have been a talking point if it had happened.
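For illustration, here is a minimal sketch of what that deprecation notice could look like in Python. The `chardet2` name and its PyPI URL are hypothetical (they only echo the suggestion above), and `warn_deprecated` is a made-up helper, not anything from the real chardet codebase. Since pip does not run package code when installing a wheel, the sketch assumes the warning is raised when the old package is imported, which is the usual stand-in for "on install".

```python
import warnings

# Hypothetical successor package, named after the "chardet2" suggestion above.
SUCCESSOR = "chardet2"
SUCCESSOR_URL = f"https://pypi.org/project/{SUCCESSOR}/"  # assumed URL, for illustration only


def warn_deprecated(old_name: str = "chardet") -> None:
    """Emit a DeprecationWarning pointing users at the successor package."""
    warnings.warn(
        f"{old_name} is in maintenance mode; new development happens in "
        f"{SUCCESSOR} ({SUCCESSOR_URL}).",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )
```

The old package would call `warn_deprecated()` from its top-level `__init__.py`. Python hides `DeprecationWarning` in many contexts by default, so a maintainer who wants every user to see the message might pick a more visible category such as `FutureWarning`.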

However, doesn’t the wiped repository, now filled with AI-generated code, provide a clean break here? Well, even with the provided evidence of low code similarity, the issues mentioned in the opening segment remain. The break didn’t seem too clean here.
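To give a sense of what a "low similarity" figure typically measures, here is a rough sketch using Python’s standard `difflib`. This is an assumption for illustration only, not the method used to produce the evidence referenced above, and the `similarity` helper and toy inputs are made up. The point is that such a ratio compares text: it says nothing about which files were fetched or consulted while the new text was being written.

```python
import difflib


def similarity(old_source: str, new_source: str) -> float:
    """Return a 0..1 ratio of how many lines two source texts share, in order."""
    old_lines = old_source.splitlines()
    new_lines = new_source.splitlines()
    return difflib.SequenceMatcher(None, old_lines, new_lines).ratio()


# Toy inputs, not taken from chardet.
old = "def detect(data):\n    return guess(data)\n"
new = "def detect(data):\n    return probe(data)\n"
print(similarity(old, new))
```

Renaming identifiers or reordering functions is enough to drive a line-based ratio like this down, which is why a low number alone doesn’t settle how clean the room was.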

This also raises the question: can AI-generated code be copyright-free or copyright-neutral? It’s evident from the generated code that Claude knew what chardet was, indicating it was trained on its data. Yet the US Supreme Court recently declined to take up the issue of applying copyright to AI art generated without human direction, potentially setting a precedent that all such works could be in the public domain. The wider implications of this, of course, remain to be seen, and it’s likely that the US Supreme Court is waiting for lower courts to battle it out, or for the technology to mature a bit longer. This is a new frontier, and this chardet situation could potentially be another one, but for software.

Precedent = Set

Even if nothing comes of this specific example, precedent has been set. Projects can be wiped in situ, maintaining API compatibility, and rewritten from scratch by AI agents. There’s not much stopping a rogue maintainer of any copyleft project from doing the same thing. And it will likely happen again soon.

The question then becomes: are we okay with this as a community? In the case of chardet, not only was Mark Pilgrim’s authorship wiped, but also the contributions of many other people; nothing but AI-generated code remains. Even if the maintainer has the right to do this, is this the right thing to do? Is it the kind of thing we want to see more of?

If this becomes okay, software becomes a cheap commodity unlike any other in our culture, treated unlike any other human creative endeavor. How likely are people who care about attribution and potential compensation through commercial licenses to keep their works online for free? If copyleft licenses can be laundered away, it becomes difficult to continue justifying putting hard work online for free just to train LLMs and potentially get shafted.
