The general guidance is likely what I was told when I worked at Apple: essentially, as an employee, people will read what you write as though you are representing Apple whether you are or are not.
So in short, I kept my mouth shut. I assumed I would lose my job if my public comment reached the right people.
Half the point of "AI" is to squeeze the labor market. This is why you don't see people chiming in. It's a nearly fully corrupt and monopolized system.
Azure and felt overwhelmed? As a student or first-time user to cloud computing, I've been there too. The idea of creating a chatbot or search app using GPT sounds exciting, but the process of setting up everything right from the vector database, provisioning OpenAI models, to integrating them,
Feel free to create an alternative. Keep in mind it's completely illegal and you will get the book thrown at you if you are caught. You will also end up using your captcha page to DDOS people who are trying to unmask you.
It doesn't offer a guide to piracy, it offers a guide on including specific data from a dataset into SQL so it can be referenced by an LLM.
If anything Kaggle would be on the hook for including the data as CC0. Or perhaps to Shubham Maindola for uploading it. In fact the "provenance" listed would give me chills. Crazy how this got a 10.0 score. "I downloaded the ebooks of Harry Potter. Then converted them to txt files."
My best guess is that it flew under the radar. The Kaggle dataset has 'only' 10,000 downloads, and the article itself probably doesn't have that many views. Still, this seems pretty far beyond the pale. Given the other case of AI-related plagiarism by Microsoft that was on the front page[1], it seems whatever review process they have for content that is published by their employees, if there is any review process at all, is deeply flawed.
[1] https://news.ycombinator.com/item?id=47057829, "Microsoft morged my diagram". It was in a discussion there that someone pointed out this article linking to full downloads of the Harry Potter novels, which I thought deserved more visibility.
Also, I imagine that most of those 10k downloads are probably from AI trainers that are just speed running through Kaggle to obtain absolutely anything to train their AI. There are definitely other, more 'known' ways to obtain these books without finding them as random text files in an AI dataset operation
It rubs me the wrong way that corporations get a free pass on copyright infringement, while the rest of us are prosecuted as harshly as possible if caught. I think this, together with the morging plagiarism, also indicates a pattern of behaviour from Microsoft that should be reformed. I would prefer if Microsoft were not able to produce AI slop degradations of other people's work and claim it as their own.
Disrespecting the copyright on a multi-billion dollar franchise hardly comes close to the major unethical behavior the trillion dollar companies are committing.
Since IP law is apparently dead, does anyone want to invest in my ai generated novel startup where it just spits out Harry Potter verbatim but uses a bunch of power to do so.
Robot slaves is a funny phrase if you consider that the origin of the word robot literally is a term that meant slave or "forced work". Language doing circles.
Not only that, but in Russian, the equivalent word for verb "work" (as in "go work" or "do work"), is "rabotay", which is derived from the word "rab" which is the word "slave". So "to work" is literally "to slave", in Russian (and quite a few slavic languages). An English speaker may categorize this as a linguistic anachronism, but a slavic speaker would categorize this as linguistic honesty.
This is pretty common. In Hebrew aved means both "work" and "slavery" and you have the same in Arabic and other Semitic languages. In Ancient Egyptian "bak" is used for both "servant" and "worker". The ambiguity in the Hebrew is why many references to this are translated as "servile labor" in the King James, as the translators were uncertain which sense of the term was meant, or perhaps correctly guessed that both senses were meant. In many ancient languages, e.g. Ancient Egyptian, "worker" and "slave" were synonyms.
In modern parlance "slavery" or "servitude" is viewed as an unspeakable evil and people are shocked that there is linguistic overlap with neutral terms like "work" or "labor", which are just ubiquitous parts of life, but historically this is quite common and it is true all around the world. For example, in German "knecht" means both "servant" and "farm hand", and in Latin "minister" meant "servant" or "subordinate" (as opposed to "magister"), just like in English you have "server", "serve", "servant", "servile".
In Sanskrit "dasa" originally meant "foreigner" or "enemy" and then later "slave", but over time it has come to be used as a suffix to denote someone who "serves" a deity voluntarily, e.g. "Ramdas". In Ancient Japanese you have "yakko" for a low status worker or servant, and later that evolved to footmen who carried baggage for samurai.
Wait until you find out what the word 'ciao' meant in the original Italian/Latin: 'ORIGIN: Italian dial. alt. of schiavo (I am your) slave from medieval Latin sclavus slave.'
Well they're not an alternative, so I suppose not. No one is being chained to a desk and made to author reports on how their department is aligning with the new business growth strategy. And the robot slaves aren't being designed to mine precious minerals or attach buttons to clothes.
The bee movie, but every frame was passed through an AI to make it Ghibli style, the audio was turned into a transcript by a transcribing AI and then turned into audio by a TTS AI.
Very low code. Infinite scale. Name a better AI startup to invest in.
I thought it was exaggerated, but reading the archive, yeah, that's something that should not get past even a cursory glance from a public communications person, or from any manager, like a senior product manager…
I feel like the title is a bit misleading, unless the person who put all HP books on Kaggle as a (supposedly) CC0-licensed data set did so as a Microsoft employee.
Nevertheless pretty egregious oversight (incompetence?) and something that shouldn't have been published.
Microsoft could have used any dataset for their blog, they could have even chosen to use actual public domain novels. Instead, they opted to use copyrighted works that JK hasn't released into the public domain (unless user "Shubham Maindola" is JK's alter ego).
If it comes from a site claiming it was under a licence when it was not, the misdeed is done by the person who provided the version carrying the licence.
Just because it says "CC0" does not make it CC0. If you upload a dataset you don't have the rights to, any license declaration you make is null and void, and anyone using it as if it had that license is violating copyright
Even if MS could claim that they were acting in good faith there really isn't much legal wiggle room for that. But it doesn't even come to that because I don't think anyone would buy that they really thought that the Harry Potter books were under the CC0
If you buy a pirated book on Amazon you get to keep the book and the pirate printer is the one prosecuted.
Same thing applies here.
Up to 80% of all works that are within their copyright terms are accidentally in the public domain. A well known example is Night of the Living Dead. It is not your job to check that the copyright on a work you use is the correct one.
The licensing: If I steal something and tell you it's free and yours for the taking, that feels different than a Fence (knowingly) buying stolen goods. It's obviously semantics and there should have been some better judgement from MS, but downloading a dataset (stated as public domain) from kaggle feels spiritually different from piracy (e.g.: if someone uploads a less known, copyrighted data set to kaggle/huggingface under an incorrect license, are tutorials that use this data set a 'guide to pirating' this data set? To me, that feels like a wrong use of the term)
To clarify: Microsoft linked to a dataset on Kaggle, which is falsely labeled CC0 (Public Domain). It's the fault of the user who uploaded the dataset and misrepresented the licensing.
Multiple failures. One on the writer of the blog post, for even for a moment considering that such a data set would be legal. And the next on MS, for hiring a person with such poor judgement, namely publicly posting about it on a company platform instead of choosing some other data set.
How soon before someone will be able to make an online library which generates the original books using LLMs? Surely popular titles like Harry Potter may end up so well represented in the training that we'll get the full books out of the LLM with a close to 100% accuracy?
This is already possible for Harry Potter specifically. There was a study demonstrating that Sonnet 3.7, among other models tested, could reproduce the first Harry Potter book 95.8% verbatim[1].
...only if you deliberately attempt to extract it by repeatedly prompting it to complete fragments of the book. They had to do quite a bit of work to make this happen.
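For anyone wondering what a "percent verbatim" figure could even mean, here is a rough sketch of one way to score it. This is not the study's metric (I don't know exactly how they computed theirs); it is just an assumed definition: the share of the reference's words that show up in the model output inside a contiguous run of at least `min_words` matching words. The function name, the threshold, and the sample strings are all made up for illustration.

```python
from difflib import SequenceMatcher


def verbatim_coverage(reference: str, generated: str, min_words: int = 5) -> float:
    """Fraction of reference words reproduced in `generated` as part of a
    contiguous matching run of at least `min_words` words.

    Assumed, illustrative definition; not taken from any particular study.
    """
    ref_words = reference.split()
    gen_words = generated.split()
    if not ref_words:
        return 0.0
    # autojunk=False so frequent words are not silently ignored on long texts.
    matcher = SequenceMatcher(None, ref_words, gen_words, autojunk=False)
    covered = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_words  # skip short coincidental matches
    )
    return covered / len(ref_words)


if __name__ == "__main__":
    # Made-up sample strings, not text from any book.
    reference = "the quick brown fox jumps over the lazy dog"
    generated = "he said the quick brown fox jumps over a fence"
    print(verbatim_coverage(reference, generated))
```

The `min_words` cutoff is the knob that matters: set it too low and common phrases count as "memorized", set it too high and lightly paraphrased output scores zero.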
so? It demonstrates that LLM models retain the copyrighted material in their weights. This is an important thing to consider about LLMs and shows that there need to be better protections for the creative industry.
Really? I retain plenty of copyrighted material in my head. What matters is the contexts in which I reproduce it (if any).
A search index might also contain copyrighted material. As long as it's used for search queries as opposed to regurgitation there's no problem. Search indexes and LLMs are both clearly very beneficial tools to have access to.
What does this (thought) experiment accomplish? That is, what point are you trying to make here?
Since we're talking about an electronic system the search index example is the more directly relevant one. Anyone who wants to object to LLMs is going to need to take care to ensure consistency with his views on Google's search index.
I don't believe that title conveys the actual significance of the article that makes it worthy of attention, so I hope HN may forgive me for coming up with an alternative title!
GitHub never deletes commits so we just need to find the hash for the latest one before the force push. In the forks you can find ones from before, but they're not up to date. Example: b5c8280d87c501d9ca7f63a6f252ca60ca820a4a for a copy 3 months old
Link: https://github.com/Azure-Samples/azure-sql-db-vector-search/...
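If you want to poke at hashes like that yourself, a small helper along these lines might save some typing. It only checks that the string looks like a full 40-character SHA-1 and builds the usual `https://github.com/OWNER/REPO/commit/SHA` page URL; the owner and repo names are read off the link above, the function name is mine, and I have not checked whether any particular hash still resolves.

```python
import re

# A full Git SHA-1 object name is 40 lowercase hexadecimal characters.
FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def commit_url(owner: str, repo: str, sha: str) -> str:
    """Build the GitHub web URL for a commit, rejecting malformed hashes.

    Hypothetical helper for illustration; it does not contact GitHub.
    """
    sha = sha.strip().lower()
    if not FULL_SHA1.fullmatch(sha):
        raise ValueError(f"not a full 40-character SHA-1 hash: {sha!r}")
    return f"https://github.com/{owner}/{repo}/commit/{sha}"


if __name__ == "__main__":
    # Hash quoted in the comment above; whether it still resolves is untested.
    print(
        commit_url(
            "Azure-Samples",
            "azure-sql-db-vector-search",
            "b5c8280d87c501d9ca7f63a6f252ca60ca820a4a",
        )
    )
```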
the merge commits in those repositories are all digitally signed by GitHub public key, so the previous history is fully authenticated and non-repudiable
so any copies now can be trivially proven to be genuine output by Microslop
hoisted by your own petard
signed merge commit is: 987eee6af61788647ae0cab82ae8a5d9402a5bd0
PGP signature (using GitHub's key: B5690EEEBB952194) is:
My guess is HP makes such an enormous amount of money already from movies, games, toys, and other tie-ins, that they can't be bothered to chase down the odd digital infringement of a plain text copy of the original books.
I'm sure the scripts of Star Wars would be similarly ignored if they were used.
It is just a very hard problem when you are a very popular work. Trying to find, track, and take down all copies of a certain work online is a constant fight. Sometimes things just slip through, especially if they are not that popular.
Something like Harry Potter might be shared every day. And I mean as a pirated work, distributed as a new copy. Staying on top of that is very hard work.
I recall the source code for Windows XP was leaked some years ago; not just isolated parts of the code base, like with the earlier Windows NT4/2000 source code leak, but a completely buildable repository.
If I write an article on training an LLM on the leaked Windows XP source code, blithely mark the source code repo as in 'the public domain', but use Azure resources for the how-to steps, would that make it OK, Microsoft? You know, your Azure division might get some money...
Seriously, this is just so...blatant. It's like we've all collectively decided that copyright just doesn't matter anymore. Just reading this article, I feel like I'm taking crazy pills.
“Fair use” allows for educational usage of copyrighted material. Technically it probably is not fair use as Microsoft isn’t an educational institution or a nonprofit.
But come on … these guides really are for learning purposes. Doesn’t seem like a big deal to me at all. They aren’t even hosting it, just pointing to kaggle who is hosting it.
On principle copyright law should allow this kind of learning use case anyway.
I... There are parts of the world where certain developers don't understand the way the west tends to work with regard to copyright, or not blindly copying anything that is out there.
This however is a very, VERY poor situation when you end up placing your employer at risk because you think copyright doesn't matter and everything on the internet is fair game.
This is probably the most polite way I would describe this to most, UG. For the rest, just stop acting like cheating through a situation to get a step up is the norm, it's just dirty behaviour.
So you're saying, it's legal to teach AI using illegally sourced copyrighted material, because it's for educational purposes only - interesting argument... ;)
Intent (should) be what matters. If you want to learn how to train AI and use copyrighted material in your learning - I don’t care in the slightest at all.
In fact if you do this as a nonprofit or at an educational institution in a teaching context it’s explicitly allowed by fair use already.
If you do it individually, idk I’m not a lawyer. But it should be allowed on principle.
But if you then go take your trained AI and deploy it for commercial purposes that’s a different story and should have protections for the original rights holders.
You guys are talking about copyright but I think a bigger takeaway is there is a process breakdown at Microsoft. Nobody is reading or reviewing this documentation so what hope is there that anybody is reading or reviewing their new code?
I guess the question to leadership is that two of the three pillars, namely security and quality, are at odds with the third pillar: AI innovation. Which side do you pick?
(I know you mean well and I love you, Scott Hanselman but please don't answer this yourself. Please pass this on to the leadership.)
I worked at Microsoft for many years and blogged there.
Microsoft was unique among the companies I worked for in that they gave you some guidelines and then let you blog without having to go through some approval or editing process. It made blogging much more personal and organic IMO; company-curated blog posts read like marketing.
I didn’t see the original post but it looks like somebody made a bad judgment call on what to put in a company blog post (and maybe what constitutes ethical activity) and that it was taken down as soon as someone noticed.
I care much less about whether the person exercised good judgment in posting, and don’t care (and am happy) that there was not some process that would have caught it pre-publication.
I care much more if the person works in a team that believes that copyright infringement for AI training is a justifiable behavior in a corporate environment.
And now we know that is a thing, and I suspect that there will be some hard questions asked by lawyers inside the company, and perhaps by lawyers outside the company.
I remember back in 2004 or thereabouts, Microsoft was all in on blogging. There was content published about internal blogs. Huge swaths of people working on Vista (then, Longhorn) were blogging about all sorts of exciting things. Microsoft was pretty friendly with people blogging externally, too: Paul Thurrott comes to mind.
It feels out of character for a company like Microsoft to have such a policy, but I agree that it's insanely cool that some very cool folks get to post pretty freely. Raymond Chen could NEVER run his blog like that at FAANG.
Raymond generally discusses public things and history. That's allowable plenty of places.
Bruce Dawson was publishing debugging stories (including things debugged about Google products done as part of his job) for the entire time he was working at Google: https://randomascii.wordpress.com/
They are still pretty good with it, it just gets a lot less press now blogging isn't the flavor-of-the-month. I check their dev blogs routinely:
https://devblogs.microsoft.com/
In the 00s I remember receiving a pingback from the internet explorer blog about a post I had made to complain about ES4.
I was/am a nobody, I have no idea how that happened and it was mind blowing that MS was interacting with me.
> I didn’t see the original post...
If you or anyone else who sees this wants to see the original post, it's still available in the Wayback Machine: https://web.archive.org/web/20260105115129/https://devblogs....
> Nobody is reading or reviewing this documentation so what hope is there that anybody is reading or reviewing their new code?
Why do you assume that reviewing docs is a lower bar than reviewing code, and that if docs aren't being reviewed it's somehow less likely that code is being reviewed?
There's a formal process for reviewing code because bugs can break things in massive ways, while there may not be the same degree of rigor for reviewing documentation because it's not going to stop the software from working.
But one doesn't necessarily say anything about the other.
At another BigCo I am familiar with, any external communications must go through a special review to make sure no secrets are being leaked and that nothing exposes the company to legal or PR issues (for example the OP).
Same here. Four or five pairs of eyes on external comms, nothing like this would even get past the abstract submission.
Likely it wouldn't get written at all. The most useful aspect of layered approval processes is people treat them like outright bans and don't blog at all unless it's part of the job description.
I don't know if you are just playing devil's advocate, but there's plenty of examples of code quality issues coming out of msft these days too.
> these days
I realize BSOD is no longer nearly as common as it once was, but let's not forget that Windows used to be very fragile indeed.
>> I realize BSOD is no longer nearly as common as it once was
Anecdotally, installing the wrong drivers (in my case it was drivers for COM-port STM32 interaction) could make it as common as twice a day on Win11. Meanwhile my Windows Server 2008 is still doing just great, no BSOD in its lifetime.
I agree that for a common user BSOD is now less likely to happen, but I wonder whether it has less to do with the Windows core and more to do with Windows Defender's aggressive default settings.
It was more fragile 20 years ago than it is today.
It was more robust 5 years ago than it is today.
Or at least that's been my impression. I can't back that up with hard data.
If they have the documentation... With Microsoft probably the answer to that is yes, but more often than not documentation is simply absent. And in cases like this not being too aware of where the lines are is probably a great way to advance your career.
Reviewing docs is a lower bar than reviewing code because it's a lower bar than reviewing code.
I have never even heard of a software company that acts otherwise (except IBM, and much of the world of Silicon Valley software engineering is reactionary to IBM's glacial pace).
I'm not saying docs == code for importance is a bad way to be, just that if you can name firms that treat them that way other than IBM (or aerospace), I'd be interested to learn more.
I'm not sure we're talking about the same thing, maybe my use of "lower bar" was ambiguous, and I realize now it has a dual meaning.
What I'm saying is, you have to review code to get it out the door with a certain degree of quality. That's your core product. That's the minimum standard you have to pass, the lowest bar.
In contrast, reviewing documentation is usually less core. You do that after the code gets reviewed. If there's time. If it doesn't get done, that's not necessarily saying anything about code quality.
Even if it's easier to review documentation, that doesn't mean it's getting prioritized. So it's not a lower bar in the sense that lower bars get climbed first.
>> Reviewing docs is a lower bar than reviewing code because it's a lower bar than reviewing code.
You reason in circles
No, they are specifically using a tautology to make a point.
Whilst I understand it shows a breakdown somewhere, it's a bit of a stretch to extend that idea across their entire codebase.
Organizations are large, so much so that there are different levels of rigor across different parts of the organization. Furthermore, more rigorous controls would be applied to code than to documentation (you would assume).
I always got the impression that the devblogs were mostly driven by the MS dev creating the blog post
Yea, I have a post up there from a couple decades ago (maybe? I haven't looked, I don't know if they keep stuff up forever) and I guarantee you my code went through more review than that post did.
Agreed. And I think the quality of their talent pool overall these days is the common factor
Yeah, I recently stumbled on some other devblogs post very similar in quality to the one that was linked here, which was basically wholesale plagiarism of a stackoverflow answer. I found it while searching for an error message.
I wasn't mad, just disappointed.
"Steal stuff and get away with it" is not an 'innovation' even though it may feel like one. The side you should pick is honesty.
On the contrary, getting away with breaking the law is most of the innovation in the past decade. Look at Uber and AirBNB, and cryptocurrency, and every AI company.
The real cherry on top, is that the Microsoft link from the blog post by the Microsoft senior product manager goes to a Kaggle dataset page claiming the dataset is CC0: Public Domain.
https://www.kaggle.com/datasets/shubhammaindola/harry-potter...
More than just using the data, it seems linking to a copy that claims the dataset is public domain, would be problematic copyright-wise.
Also interesting, this blog post has been up since November of 2024, very surprising to me that Microsoft hasn't taken it down yet.
> it seems linking to a copy that claims the dataset is public domain, would be problematic copyright-wise.
Would it? Sounds to me like the blame lies on the person uploading the dataset under that license, unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'
> unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'
Yes there's an expectation that you put in some minimum amount of effort. The license issue here is not subtle, the Kaggle page says they just downloaded the eBooks and converted them to txt. The author is clearly familiar enough with HP to know that it's not old enough to be public domain, and the Kaggle page makes it pretty clear that they didn't get some kind of special permission.
If you want to get more specific on the legal side then copyright infringement does not require that you _knew_ you were infringing on the copyright, it's still infringement either way and you can be made to pay damages. It's entirely on you to verify the license.
> unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'
Why wouldn't that apply?
I'm not a copyright expert and if you told me that Harry Potter was public domain then I'd probably be a bit surprised but wouldn't think it's crazy. The first book came out 30 years ago after all. On further research the copyright laws are way more aggressive than that (a bit too much if you ask me) but 30 years doesn't seem quick. Patents expire after 20 years.
The Berne Convention (author's life + 50 years) is the baseline for the copyright laws in most countries. Many countries have a longer copyright period than Berne.
https://en.wikipedia.org/wiki/List_of_copyright_duration_by_...
It would be incredibly naive to assume that a moneymaker like that is PD.
I find this fascinating, as I keep observing that there are pretty widespread differences between what people believe copyright does and what the law actually says.
Copyright infringement is a strict liability tort in the US. Willful infringement can result in harsher penalties, but being mistaken about the copyright status is not a valid defense.
The article author and the uploader should _BOTH_ be sentient enough to engage brain and not just ignore it because they feel "it's an abstract concept I'd not get in trouble for when not working in the US or EU".
Update: Microsoft has taken the page down. But posterity being what it is...
https://archive.is/D9vEN
But the article is from 2024! So someone at MS saw this thread?
Most likely. There seem to be plenty of devs from nearly all the major tech companies on HN, but they don't chime in as much anymore when it comes to problems. I've wondered if they get some kind of guidance on not commenting on "problems".
The general guidance is likely what I was told when I worked at Apple: essentially, as an employee, people will read what you write as though you are repenting Apple whether you are or are not.
So in short, I kept my mouth shut. I assumed I would lose my job if my public comment reached the right people.
Do you repent working at Apple?
Half the point of "AI" is to squeeze the labor market. This is why you don't see people chiming in. It's a nearly fully corrupt and monopolized system.
if they do, they are not always followed, a Microslop employee tried to do damage control on Bluesky for the morged diagram, summoned the mob instead
…still faster than they address critical vulnerabilities.
the commit is visible https://github.com/Azure-Samples/azure-sql-db-vector-search/...
Well that’s interesting. It shows they’re also infringing on Isaac Asimov’s Foundation series
https://github.com/Azure-Samples/azure-sql-db-vector-search/...
HTTP Referer
https://utcc.utoronto.ca/~cks/space/blog/web/HackernewsEffec...
Yes, HN's a pretty popular site :)
I can't believe people with ties to Microsoft visit Hacker News.
Did they also remove this article?
https://devblogs.microsoft.com/azure-sql/?p=4796
"Build a RAG App in 5 Minutes
Ever tried setting up an AI-powered project on Azure and felt overwhelmed? As a student or first-time user to cloud computing, I've been there too. The idea of creating a chatbot or search app using GPT sounds exciting, but the process of setting up everything right from the vector database, provisioning OpenAI models, to integrating them, it can f..."
That one is gone now, too
Well, this proves infringement. JK Rowling can take them to court if she chooses.
This is the same archive site that uses its captcha page to hijack your browser to DDOS people the site owner doesn't like.
I'm disappointed people continue to use it.
Feel free to create an alternative. Keep in mind it's completely illegal and you will get the book thrown at you if you are caught. You will also end up using your captcha page to DDOS people who are trying to unmask you.
it's still up for me
The AI generated thumbnail, https://devblogs.microsoft.com/azure-sql/wp-content/uploads/..., is that of young Harry and friend with a prominent MS logo. Wow
It doesn't offer a guide to piracy, it offers a guide on including specific data from a dataset into SQL so it can be referenced by an LLM.
If anything Kaggle would be on the hook for including the data as CC0. Or perhaps to Shubham Maindola for uploading it. In fact the "provenance" listed would give me chills. Crazy how this got a 10.0 score. "I downloaded the ebooks of Harry Potter. Then converted them to txt files."
10.0 score, and there's literally a mistake in the first word of the text. ("M r." instead of "Mr.")
This article is from 2024 and points to Kaggle, which hosts the data set.
I'm surprised that JKR's people haven't come down like a tonne of bricks on Kaggle / Microsoft.
Does anyone know whether there is some special reason why this has lasted so long without being taken down?
My best guess is that it flew under the radar. The Kaggle dataset has 'only' 10,000 downloads, and the article itself probably doesn't have that many views. Still, this seems pretty far beyond the pale. Given the other case of AI-related plagiarism by Microsoft that was on the front page[1], it seems whatever review process they have for content that is published by their employees, if there is any review process at all, is deeply flawed.
[1] https://news.ycombinator.com/item?id=47057829, "Microsoft morged my diagram". It was in a discussion there that someone pointed out this article linking to full downloads of the Harry Potter novels, which I thought deserved more visibility.
Also, I imagine that most of those 10k downloads are probably from AI trainers that are just speed running through Kaggle to obtain absolutely anything to train their AI. There are definitely other, more 'known' ways to obtain these books without finding them as random text files in an AI dataset operation
Why did you think that?
It rubs me the wrong way that corporations get a free pass on copyright infringement, while the rest of us are prosecuted as harshly as possible if caught. I think this, together with the morging plagiarism, also indicates a pattern of behaviour from Microsoft that should be reformed. I would prefer if Microsoft were not able to produce AI slop degradations of other people's work and claim it as their own.
> while the rest of us are prosecuted as harshly as possible if caught
But this is just a lie.
Approximately nobody is prosecuted for copyright infringement.
Okay but people have had their lives ruined deliberately by media companies over it. I'm sure you knew what they meant.
No matter how generously you want to interpret it, it’s obviously false.
We’re moving the goalposts from the government systematically targeting normal people “if caught”, to only a handful of civil cases.
Sure, as a percentage it's very rare - but some people have died as a result: https://en.wikipedia.org/wiki/Aaron_Swartz
I think most would agree that cases like that act as a deterrent?
That’s not even more than tangentially copyright-related?
> I think most would agree that cases like that act as a deterrent?
I think we could hardly get any further from “the rest of us are prosecuted as harshly as possible if caught”.
In general, if you want to get away with a crime, just do it as a corporation or as a billionaire.
brb poking Rowling on twitter
(done, contacted her lawyers too)
make sure u worded it right or she'll block you
Page is gone.
Archived copy: https://web.archive.org/web/20260105115129/https://devblogs....
It is very worrying that people with no ethics work for these trillion dollar companies who are supposed to be shaping the technology of tomorrow.
>no ethics
Disrespecting the copyright on a multi-billion dollar franchise hardly comes close to the major unethical behavior the trillion dollar companies are committing.
Since IP law is apparently dead, does anyone want to invest in my ai generated novel startup where it just spits out Harry Potter verbatim but uses a bunch of power to do so.
only if you tell me that it's a necessary step to creating robot slaves
Robot slaves is a funny phrase if you consider that the origin of the word robot literally is a term that meant slave or "forced work". Language doing circles.
Not only that, but in Russian, the equivalent word for the verb "work" (as in "go work" or "do work") is "rabotay", which is derived from the word "rab", which is the word "slave". So "to work" is literally "to slave" in Russian (and quite a few Slavic languages). An English speaker may categorize this as a linguistic anachronism, but a Slavic speaker would categorize this as linguistic honesty.
This is pretty common. In Hebrew aved means both "work" and "slavery" and you have the same in Arabic and other Semitic languages. In Ancient Egyptian "bak" is used for both "servant" and "worker". The ambiguity in the Hebrew is why many references to this are translated as "servile labor" in the King James, as they were uncertain which sense of the term was meant, or perhaps correctly guessed that both senses were meant. In many ancient languages, e.g. Ancient Egyptian, "worker" and "slave" were synonyms. In modern parlance "slavery" or "servitude" is viewed as an unspeakable evil and people are shocked that there is linguistic overlap with neutral terms like "work" or "labor", which are just ubiquitous parts of life, but historically this is quite common and it is true all around the world. For example, in German "knecht" means both "servant" and "farm hand", and in Latin "minister" meant "servant" or "subordinate" (as opposed to "magister"), just like in English you have "server", "serve", "servant", "servile". In Sanskrit "dasa" originally meant "foreigner" or "enemy" and then later "slave", but over time it has come to be used as a suffix to denote someone who "serves" a deity voluntarily, e.g. "Ramdas". In Ancient Japanese you have "yakko" for a low status worker or servant, and later that evolved to footmen who carried baggage for samurai.
Wait until you find out what the word 'ciao' meant in the original Italian/Latin: 'ORIGIN: Italian dial. alt. of schiavo (I am your) slave from medieval Latin sclavus slave.'
Are they an ethical alternative to the human version?
I guess it depends if there is an A.I.[1] locked up somewhere in a cage forced to teleoperate it.
[1] actual indian
Well they're not an alternative, so I suppose not. No one is being chained to a desk and made to author reports on how their department is aligning with the new business growth strategy. And the robot slaves aren't being designed to mine precious minerals or attach buttons to clothes.
correction, the _threat_ of robot slaves to bring back human slaves
That’s Herbert’s Dune.
Generating infinite fanfics would probably be far more interesting and entertaining.
So far, the only thing I've found AI to be consistently good at is entertainment of the humorous kind.
The whole fanfic ecosystem is quietly dying now.
Everything new is AI slop, and there seems to be no coming back from it.
but the slop will likely get better as models improve I guess
Or worse as the models try harder to avoid generating copyrighted stuff
The bee movie, but every frame was passed through an AI to make it Ghibli style, the audio was turned into a transcript by a transcribing AI and then turned into audio by a TTS AI.
Very low code. Infinite scale. Name a better AI startup to invest in.
I have a new operating system. I call it "Vindows." Any similarity to an existing product is merely coincidence.
Not for you silly. You still lose everything and go to jail if you violate IP law. It’s for billionaires.
How Microsoft protects its own IP:
https://news.microsoft.com/source/2004/02/12/statement-from-...
In case the new anti-copyright Microslop memory-holes that link:
https://web.archive.org/web/20260215220230/https://news.micr...
The tutorial could have used that leaked source code for "educational purposes", as many here claim.
In case the page disappears:
https://archive.is/7WLho
https://southpark.cc.com/news/zi5uql/aannnd-it-s-gone
More like when the page disappears
It disappeared already
And the original is gone.
For redundancy in case archive.is is down:
https://web.archive.org/web/20260105115129/https://devblogs....
The superior link; no Google captcha.
Some lawyer at Microsoft probably had a big scare browsing HN today.
Looks like the unwritten stance of large companies is copyrighted works are free to use for training.
Although this does not seem to be reciprocal. Rule for thee, but not for me.
I thought it was exaggerated, but after reading the archive, yeah, that’s something that should not get past even a cursory glance from a public communications person, or any manager like a senior product manager…
I feel like the title is a bit misleading, unless the person who put all HP books on Kaggle as a (supposedly) CC0-licensed data set did so as a Microsoft employee.
Nevertheless pretty egregious oversight (incompetence?) and something that shouldn't have been published.
What makes this different from linking to a random zip file somewhere?
Microsoft could have used any dataset for their blog, they could have even chosen to use actual public domain novels. Instead, they opted to use copyrighted works that JK hasn't released into the public domain (unless user "Shubham Maindola" is JK's alter ego).
Rowling is known for using pseudonyms. Maybe she got tired of writing and decided to break into LLM tech.
The licence?
If it comes from a site claiming it was under a licence when it was not, the misdeed is done by the person who provided the version carrying the licence.
Just because it says "CC0" does not make it CC0. If you upload a dataset you don't have the rights to, any license declaration you make is null and void, and anyone using it as if it had that license is violating copyright
Even if MS could claim that they were acting in good faith there really isn't much legal wiggle room for that. But it doesn't even come to that because I don't think anyone would buy that they really thought that the Harry Potter books were under the CC0
If you buy a pirated book on Amazon you get to keep the book and the pirate printer is the one prosecuted.
Same thing applies here.
Up to 80% of all works that are in copyright terms are accidentally in the public domain. A well known example is Night of the Living Dead. It is not your job to check that the copyright on a work you use is the correct one.
The only reason you get to keep the book is because no one bothers to enforce the law, this doesn't make it legal.
And it is your job to check that you have the rights to use other people's work. Ignorance is not a defence.
>the law
Which ones? As far as I was aware, it's a crime to redistribute copyrighted works, not receive.
Copyright act 1968. Sect 116.
Australia doesn't have fair use either. Who cares what a country smaller than California in population and economy does?
Oh come on. The licence was obviously incorrect and you can't escape culpability because of that.
The licensing: If I steal something and tell you it's free and yours for the taking, that feels different than a Fence (knowingly) buying stolen goods. It's obviously semantics and there should have been some better judgement from MS, but downloading a dataset (stated as public domain) from kaggle feels spiritually different from piracy (e.g.: if someone uploads a lesser-known, copyrighted data set to kaggle/huggingface under an incorrect license, are tutorials that use this data set a 'guide to pirating' this data set? To me, that feels like a wrong use of the term)
The 'artwork' they generated and the text on the blog post?
The original title was "LangChain Integration for Vector Support for SQL-based AI applications"
For some reason I really like this.
To clarify: Microsoft linked to a dataset on Kaggle, which is falsely labeled CC0 (Public Domain). It's the fault of the user who uploaded the dataset and misrepresented the licensing.
Multiple failures. One is on the writer of the blog for even momentarily considering that such a data set would be legal. The next is on MS for hiring a person with judgement poor enough to post about it publicly on a company platform instead of choosing some other data set.
I guess the end of copyright is near if this is fine to put on a corporate website
the end of reason and thought at corporation littered with fakers these days.
How soon before someone will be able to make an online library which generates the original books using LLMs? Surely popular titles like Harry Potter may end up so well represented in the training that we'll get the full books out of the LLM with a close to 100% accuracy?
This is already possible for Harry Potter specifically. There was a study demonstrating that Sonnet 3.7, among other models tested, could reproduce the first Harry Potter book 95.8% verbatim[1].
[1] https://arxiv.org/abs/2601.02671
Thanks for linking! I've been thinking about trying something like this myself.
...only if you deliberately attempt to extract it by repeatedly prompting it to complete fragments of the book. They had to do quite a bit of work to make this happen.
so? It demonstrates that LLM models retain the copyrighted material in their weights. This is an important thing to consider about LLMs and shows that there need to be better protections for the creative industry.
Really? I retain plenty of copyrighted material in my head. What matters is the contexts in which I reproduce it (if any).
A search index might also contain copyrighted material. As long as it's used for search queries as opposed to regurgitation there's no problem. Search indexes and LLMs are both clearly very beneficial tools to have access to.
Reproduce it. Sit in a clean room and write it all out. Then go check your accuracy. I'm curious to see what it is.
What does this (thought) experiment accomplish? That is, what point are you trying to make here?
Since we're talking about an electronic system the search index example is the more directly relevant one. Anyone who wants to object to LLMs is going to need to take care to ensure consistency with his views on Google's search index.
I wasn't aware I could read 95% of Harry Potter through constructed queries using Google's search index. Can you demonstrate how I might do this?
Also can you point out how copyright law changes because we're using an "electronic system" as opposed to an "analog system?"
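On the "go check your accuracy" challenge upthread: if anyone actually wants to try it, here is a rough sketch of one way to score an attempt against the original text. To be clear, this is NOT the metric from the linked study (I haven't checked how they measured it); `verbatim_coverage` and its `min_block` threshold are names and choices I made up for illustration. It just counts what fraction of the reference is covered by long exact matching runs, using Python's standard `difflib`.

```python
from difflib import SequenceMatcher


def verbatim_coverage(reference: str, attempt: str, min_block: int = 20) -> float:
    """Fraction of `reference` covered by exact matching runs in `attempt`.

    Hypothetical helper, not the linked study's method. Only runs of at
    least `min_block` characters count, so short coincidental overlaps
    (common words, punctuation) are ignored.
    """
    if not reference:
        return 0.0
    # autojunk=False: do not let difflib discard frequent characters
    # as "junk" on long inputs.
    matcher = SequenceMatcher(None, reference, attempt, autojunk=False)
    covered = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_block
    )
    return covered / len(reference)
```

Usage would be something like `verbatim_coverage(original_text, what_i_typed_from_memory)`. Be warned that `SequenceMatcher` is slow on book-length inputs, so you'd probably want to score chapter by chapter.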
Are you a for profit product?
"there need to be better protections for the creative industry"
Why exactly?
The word original is doing a lot of heavy lifting there! ;)
You mean cp -r?
Original title: "LangChain Integration for Vector Support for SQL-based AI applications"
I don't believe that title conveys the actual significance of the article that makes it worthy of attention, so I hope HN may forgive me for coming up with an alternative title!
I can still get to the article on the site, perhaps it’s cached in the CDN somewhere. Also, reviewing the repo, the entire article is there, promoting the same silly things. https://github.com/Azure-Samples/azure-sql-db-vector-search/...
Has been removed by force push: https://github.com/Azure-Samples/azure-sql-db-vector-search/...
Github never deletes commits so we just need to find the hash for the latest one before the force push. In the forks you can find ones from before, but they're not up to date. Example: b5c8280d87c501d9ca7f63a6f252ca60ca820a4a for a copy 3 months old Link: https://github.com/Azure-Samples/azure-sql-db-vector-search/...
it's even better than that
the merge commits in those repositories are all digitally signed by GitHub public key, so the previous history is fully authenticated and non-repudiable
so any copies now can be trivially proven to be genuine output by Microslop
hoisted by your own petard
signed merge commit is: 987eee6af61788647ae0cab82ae8a5d9402a5bd0
PGP signature (using GitHub's key: B5690EEEBB952194) is:
for posterity:
https://github.com/Azure-Samples/azure-sql-db-vector-search/...
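For anyone collecting the hashes quoted above: a small sketch of turning a full commit hash into a direct link, assuming GitHub's usual `/{owner}/{repo}/commit/{sha}` web path. `commit_url` is just a helper name I picked, and whether a given orphaned commit still resolves after a force push is up to GitHub, so no guarantees.

```python
import re

# A full Git SHA-1 object name is 40 hexadecimal characters.
SHA1_RE = re.compile(r"[0-9a-f]{40}")


def commit_url(owner: str, repo: str, sha: str) -> str:
    """Build a GitHub web URL for a commit from its full hash.

    Hypothetical helper: it only formats the URL and does not check
    that the commit exists or is still reachable.
    """
    sha = sha.strip().lower()
    if not SHA1_RE.fullmatch(sha):
        raise ValueError(f"not a full 40-character SHA-1: {sha!r}")
    return f"https://github.com/{owner}/{repo}/commit/{sha}"


if __name__ == "__main__":
    # The two hashes quoted in this thread.
    for sha in (
        "b5c8280d87c501d9ca7f63a6f252ca60ca820a4a",
        "987eee6af61788647ae0cab82ae8a5d9402a5bd0",
    ):
        print(commit_url("Azure-Samples", "azure-sql-db-vector-search", sha))
```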
Pretty interesting seeing the escalation process at Microsoft at work, and the censorship (well, damage control) attempts.
The biggest irony would be if the page itself was generated by an LLM.
Jupyter notebook version here for the curious: https://github.com/Azure-Samples/azure-sql-db-vector-search/...
My guess is HP makes such an enormous amount of money already from movies, games, toys, and other tie-ins, that they can't be bothered to chase down the odd digital infringement of a plain text copy of the original books.
I'm sure the scripts of Star Wars would be similarly ignored if they were used.
That doesn't justify what's going on here. Why is Microsoft endorsing the use of pirated materials?
The dataset is actually at Kaggle tho, but agree, they shouldn't use it as an example.
The file being hosted by another company doesn't change the fact that Microsoft is encouraging us to download and use it.
> Why is Microsoft endorsing the use of generative AI
ftfy
It is just a very hard problem when yours is a very popular work. Trying to find, track, and take down all copies of a given work online is a constant fight. Sometimes things just slip through, especially if they are not that popular.
Something like Harry Potter might be shared every day. And I mean as pirate work distributed as new copy. Staying on top of that will be very hard work.
I guess legal was a part of the layoff these past few years. Too bad we can't get a bounty from the RIAA of books, whatever that is
I recall the source code for Windows XP was leaked some years ago; not just isolated parts of the code base, like with the earlier Windows NT4/2000 source code leak, but a completely buildable repository.
If I write an article on training an LLM on the leaked Windows XP source code, blithely mark the source code repo as in 'the public domain', but use Azure resources for the how-to steps, would that make it OK, Microsoft? You know, your Azure division might get some money...
Seriously, this is just so...blatant. It's like we've all collectively decided that copyright just doesn't matter anymore. Just reading this article, I feel like I'm taking crazy pills.
Imagine having a specialized LLM agent that understands the Windows kernel and its source. Now that would be something cool for pentesting!
“Fair use” allows for educational usage of copyrighted material. Technically it probably is not fair use as Microsoft isn’t an educational institution or a nonprofit.
But come on … these guides really are for learning purposes. Doesn’t seem like a big deal to me at all. They aren’t even hosting it, just pointing to kaggle who is hosting it.
On principle copyright law should allow this kind of learning use case anyway.
Wonderful 404 page. Wonder if Kai Lentit optimized it.
They tore the page down. Any copies?
I... There are parts of the world where certain developers don't understand the way the west tends to work with regard to copyright, or not blindly copying anything that is out there.
This however is a very, VERY poor situation when you end up placing your employer at risk because you think copyright doesn't matter and everything on the internet is fair game.
This is probably the most polite way I would describe this to most, UG. For the rest, just stop acting like cheating through a situation to get a step up is the norm, it's just dirty behaviour.
Refreshingly honest.
If copyrighted materials are used, surely copyright allows for the maker to require disclosure that their content was used in training a model.
"but it's fair use"
Rowling is known for actively protecting her rights as an author, they couldn't have picked a worse author to slop up
Absolutely shameless
I mean they are also offering up the code you are writing in your private repos to LLMs to regenerate in my repo, so let's just go nuts.
It's taken down lmao, in 1 hour
No? I can see it
Probably some kind of cache, but it's taken down, I'm getting 404, while some of my friends are still able to see it
+1, I can still access the page from US.
What in the absolute fuck
Someone forgot the national no snitching rules, and in service of Jo, no less.
Everyone should torrent and rip off those books, anyway.
This is fair use as it is for educational purposes and not for reading.
So you're saying, it's legal to teach AI using illegally sourced copyrighted material, because it's for educational purposes only - interesting argument... ;)
Intent (should) be what matters. If you want to learn how to train AI and use copyrighted material in your learning - I don’t care in the slightest at all.
In fact if you do this as a nonprofit or at an educational institution in a teaching context it’s explicitly allowed by fair use already.
If you do it individually, idk I’m not a lawyer. But it should be allowed on principle.
But if you then go take your trained AI and deploy it for commercial purposes that’s a different story and should have protections for the original rights holders.