Clicking through to https://git-annex.branchable.com/no_llm_code/
It looks like git after 2.22 was dropped because it took an LLM commit. Same with ghc.
If I have to choose between this or git and the latest ghc, I think I'm going to just wait for someone to fork annex.
I don't even feel strongly one way or the other on AI stuff; pragmatically, I'm just not going to stop using the most widely used version controller, or Haskell, just for some guy's (forkable, AGPL licensed) hobby project.
This is a hill many people will choose to die on.
And they shan't be missed.
They will absolutely be missed, maybe not by any individual, but the impact of them leaving will be felt. People who are willing to go to bat for code quality, and who are also careful about copyright and the community aspect of open source, are why this whole thing worked in the first place.
Was this done by manually reviewing commit messages? I think it would be interesting/useful to have a tool that could use some basic heuristics about LLM generated code to detect code-blobs even if they are not explicitly called out in a commit message.
When I was reading this I thought of writing a quick and dirty CLI tool that checks commit co-authors. It wouldn't be perfect, but it would eliminate a good chunk of low-hanging fruit.
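For what it's worth, a rough sketch of that co-author check might look like the following. It only catches commits that openly declare a tool in a `Co-authored-by:` trailer. The marker list and the function names (`ai_coauthors`, `scan_repo`) are my own placeholders, not anything git-annex actually uses.

```python
import re
import subprocess

# Hypothetical marker list: substrings to look for in co-author trailers.
# Adjust to whatever tools you care about; this is not an authoritative list.
AI_MARKERS = ("claude", "copilot", "chatgpt", "cursor", "aider")

# Matches "Co-authored-by: Name <email>" lines, case-insensitively.
TRAILER_RE = re.compile(r"^co-authored-by:\s*(.+)$", re.IGNORECASE | re.MULTILINE)


def ai_coauthors(message, markers=AI_MARKERS):
    """Return co-author trailer values that contain any marker substring."""
    hits = []
    for value in TRAILER_RE.findall(message):
        value = value.strip()
        if any(marker in value.lower() for marker in markers):
            hits.append(value)
    return hits


def scan_repo(repo_path="."):
    """Yield (commit_hash, matching_coauthors) for each flagged commit."""
    # %H = commit hash, %B = raw message body; %x1f and %x1e emit separator
    # bytes so that multi-line messages can be split apart reliably.
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%H%x1f%B%x1e"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    ).stdout
    for record in out.split("\x1e"):
        record = record.strip()
        if not record:
            continue
        commit_hash, _, body = record.partition("\x1f")
        hits = ai_coauthors(body)
        if hits:
            yield commit_hash, hits
```

Calling `list(scan_repo("path/to/clone"))` would give you the flagged commits. Anyone who strips the trailer before committing sails right through, so this really is just the low-hanging fruit.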
Just like with writing, any kind of AI detection is going to be inaccurate to the point of snake oil.
LLM detection in writing is basically today's polygraph test pseudoscience. There was a blog a while ago where someone fed classic literature into one and it was detected as probably AI.
I'm not sure that is the case in this instance. Certainly general writing is a lot more variable and harder to classify, and on the other extreme certain one-line code changes don't have enough information to say anything. However, a blob with a 500+ line code change and 200+ lines of comments is a dead ringer for some of the current class of LLMs. That isn't to say this behavior couldn't be obfuscated, but some basic categorization could probably separate the majority of human-authored commits from AI commits. Heck, you could probably train an AI to detect commit style just by using pre-2022 code archives and existing known-to-be-AI edits/commits.
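To make the size heuristic concrete, here is a crude sketch that counts added code lines and added comment lines in a unified diff and flags anything past the 500/200 numbers mentioned above. The comment prefixes and the function names are assumptions for illustration. It will misfire on things like C preprocessor lines, and it is not a real classifier.

```python
# Hypothetical, deliberately crude list of single-line comment prefixes.
COMMENT_PREFIXES = ("#", "//", "--")


def diff_stats(diff_text):
    """Return (added_code_lines, added_comment_lines) for a unified diff."""
    code = comments = 0
    for line in diff_text.splitlines():
        # Only count added lines; skip the "+++ b/file" header lines.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        content = line[1:].strip()
        if not content:
            continue  # ignore blank added lines
        if content.startswith(COMMENT_PREFIXES):
            comments += 1
        else:
            code += 1
    return code, comments


def looks_bulk_generated(diff_text, min_code=500, min_comments=200):
    """Flag diffs that add at least min_code code lines and min_comments comments."""
    code, comments = diff_stats(diff_text)
    return code >= min_code and comments >= min_comments
```

You could feed it the output of `git show <commit>` for each commit. Vendored code and big human-written refactors will trip it too, so treat a hit as "go look at this", not as proof.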
It's not just "the code itself looks LLM generated" - it's also LOC/hr by a particular author which suggests vibe coding. You could look at the author's github contributions to identify time periods when the author was generating code at super-human speeds. Combine the two signals and you might get something better than a pseudoscience?
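A minimal sketch of the LOC/hr signal, assuming you have already extracted `(author, timestamp, lines_added)` tuples from the history yourself. The 1000 lines/hour threshold and the function names are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def hourly_loc(commits):
    """Sum lines added per (author, hour) bucket.

    commits: iterable of (author, datetime, lines_added) tuples.
    """
    buckets = defaultdict(int)
    for author, when, lines_added in commits:
        hour = when.replace(minute=0, second=0, microsecond=0)
        buckets[(author, hour)] += lines_added
    return dict(buckets)


def suspicious_hours(commits, max_loc_per_hour=1000):
    """Return sorted (author, hour, total) entries above the hypothetical threshold."""
    return sorted(
        (author, hour, total)
        for (author, hour), total in hourly_loc(commits).items()
        if total > max_loc_per_hour
    )
```

The obvious caveat: commit timestamps say when something was committed, not when it was written, so someone who works offline for a week and pushes one big commit looks superhuman too. That is why it only makes sense in combination with the other signal.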
An agent doesn't have to be perfect to be useful. If it can find clear examples of stuff you don't want to see in a (potential) dependency quickly, that will save you time. Give it search tools and some policies, then have it go find things. You then check them out, ask followups.
Agents as super-powered (re)search assistants are underrated.
Maybe an LLM could be used to check for this :)
We are all figuring this new technology out and people will make mistakes. It would seem like an overreaction to swear things off completely because of a single commit and reversion. Look for patterns in dependencies and your own work.
I think this is a fair and normal reaction to AI slop. A lot of work, though. I think OSS projects are at serious risk of implosion due to the vigilance required, which honestly may end up being a fool's errand anyway.
But maybe we are thinking about it backward. Have you ever wondered why there is so much "free software"? Beware of strangers bearing gifts.
I have always wondered about, and been suspicious of, people who are so eager for you to use their software. Which isn't to say OSS isn't high quality. I'm just saying that maybe when people are pushing free software on you they are kind of in it for themselves.
As for what's next: me personally, last year I pulled all my personal repos, about 80 of them, off of Bitbucket and now self-host them all. I think OSS projects should set up a paywall and charge money to create PRs.
Like 10-100 bucks per PR to cover the cost of the extra vigilance. Also I could see migrations away from GitHub, to AI-free dependency hosting or something like that. It's an interesting challenge. But it's not insurmountable.
Either paywall OSS projects or take them off the interwebs. Also, one option I don't think the OP explored is forking and freezing the dependencies. Huge maintenance burden, but it's better than source corruption.
Also use fewer dependencies. Maybe set a limit of 5.