Very interesting. The Go toolchain has an (off by default) telemetry system. For Go 1.23, I added the runtime.SetCrashOutput function and used it to gather field reports containing stack traces for crashes in any running goroutine. Since we enabled it over a year ago in gopls, our LSP server, we have discovered hundreds of bugs.
Even with only about 1 in 1000 users enabling telemetry, it has been an invaluable source of information about crashes. In most cases it is easy to reconstruct a test case that reproduces the problem, and the bug is fixed within an hour. We have fixed dozens of bugs this way. When the cause is not obvious, we "refine" the crash by adding if-statements and assertions so that after the next release we gain one additional bit of information from the stack trace about the state of execution.
However there was always a stubborn tail of field reports that couldn't be explained: corrupt stack pointers, corrupt g registers (the thread-local pointer to the current goroutine), or panics dereferencing a pointer that had just passed a nil check. All of these point to memory corruption.
In theory anything is possible if you abuse unsafe or have a data race, but I audited every use of unsafe in the executable and am convinced they are safe. Proving the absence of data races is harder, but nonetheless races usually exhibit some kind of locality in what variable gets clobbered, and that wasn't the case here.
In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).
As a programmer I've been burned too many times by prematurely blaming the compiler or runtime for mistakes in one's own code, so it took a long time to gain the confidence to suspect the foundations in this case. But I recently did some napkin math (see https://github.com/golang/go/issues/71425#issuecomment-39685...) and came to the conclusion that the surprising number of inexplicable field reports--about 10/week among our users--is well within the realm of faulty hardware, especially since our users are overwhelmingly using laptops, which don't have parity memory.
I would love to get definitive confirmation though. I wonder what test the Firefox team runs on memory in their crash reporting software.
A 5 part thread where they say they're "now 100% positive" the crashes are from bitflips, yet not a single word is spent on how they're supposedly detecting bitflips other than just "we analyze memory"?
I'm glad to see somebody is getting some data on this, I feel bad memory is one of the most underrated issues in computing generally. I'd like to see a more detailed writeup on this, like a short whitepaper.
It is rumored heavily on HN that when the first employee of Google, Craig Silverstein was asked about his biggest regret, he said: "Not pushing for ECC memory."
One of the nicest guys I have met. Was an intern at Google at that time, firing off mapreduces then (2003-2004) was quite a blast. The Peter Weinberger theme T-shirt too.
It's true that in the very early days Google used cheap computers without ECC memory, and this explains the desire for checksums in older storage formats such as RecordIO and SSTable, but our production machines have used ECC RAM for a long time now.
Also a polite reminder that most of those crashes will be concentrated on machines with faulty memory so the naive way of stating the statistic may overestimate its impact to the average user. For the average user this is the difference between 4/5 crashes are from software bugs and 5/5 crashes are from software bugs, and for a lot of people it will still be 5/5
> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects! If I subtract crashes that are caused by resource exhaustion (such as out-of-memory crashes) this number goes up to around 15%.
Crashes caused by resource exhaustion are still software bugs in Firefox. At least on sane operating systems where memory isn't over-comitted.
The next logical step would be to somehow inform users so they could take action to replace the bad memory. I realize this is a challenge given the anonymized nature of the crash data, but I might be willing to trade some anonymity in exchange for stability.
The easy solution for that is to just do that analysis locally...
Firefox doesn't submit the full core dumps anyhow for this exact reason and therefore needs to do some preprocessing in any case.
>> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects!
I find this impossible to believe.
If this were so all devs for apps, games, etc... would be talking about this but since this is the first time I'm hearing about this I'm seriously doubting this.
>> This is a bit skewed because users with flaky hardware will crash more often than users with functioning machines, but even then this dwarfs all the previous estimates I saw regarding this problem.
Might be the case, but 10% is still huge.
There imo has to be something else going on. Either their userbase/tracking is biased or something else...
Try running two instances of Firefox in parallel with different profiles, then do a normal quit / close operation on one after any use. Demons exist here.
Errors may be caused by bad seating/contact in the slots or failing memory controllers (generally on the CPU nowadays) but if you have bad sticks they're generally done for.
People I think are overindexing on this being about "Bad hardware".
We have long known that single bit errors in RAM are basically "normal" in terms of modern computers. Google did this research in 2009 to quantify the number of error events in commodity DRAM https://static.googleusercontent.com/media/research.google.c...
They found 25,000 to 70,000 errors per billion
device hours per Mbit and more than 8% of DIMMs affected
by errors per year.
At the time, they did not see an increase in this rate in "new" RAM technologies, which I think is DDR3 at that time. I wonder if there has been any change since then.
A few years ago, I changed from putting my computer to sleep every night, to shutting it down every night. I boot it fresh every day, and the improvements are dramatic. RAM errors will accumulate if you simply put your computer to sleep regularly.
There is DRAM which is mildly defective but got past QC.
There are power suppliers that are mildly defective but got past QC.
There are server designs where the memory is exposed to EMI and voltage differences that push it to violate ever more slightly that push it past QC.
Hardware isn't "good" or "bad", almost all chips produced probably have undetected mild defects.
There are a ton of causes for bitflips other than cosmic rays.
For instance, that specific google paper you cited found a 3x increase in bitflips as datacenter temperature increased! How confident are you the average Firefox user's computer is as temperature-controlled as a google DC?
It also found significantly higher rates as RAM ages! There are a ton of physical properties that can cause this, especially when running 24/7 at high temperatures.
Based on what data?
According to their reporting they have around 200 Million monthly users, which seems compatible with 470k crashes a week?
See <https://data.firefox.com/dashboard/user-activity>
The nuance here is of cause that there are a bunch of people using multiple browsers.
Also I mean there are a lot of people using browsers on the world
For my part I'm not sure I recall a crash having daily driven firefox in quite some time. I'd suspect that the large number of bit errors might be driven by a small number of poor hardware clients.
Very interesting. The Go toolchain has an (off by default) telemetry system. For Go 1.23, I added the runtime.SetCrashOutput function and used it to gather field reports containing stack traces for crashes in any running goroutine. Since we enabled it over a year ago in gopls, our LSP server, we have discovered hundreds of bugs.
Even with only about 1 in 1000 users enabling telemetry, it has been an invaluable source of information about crashes. In most cases it is easy to reconstruct a test case that reproduces the problem, and the bug is fixed within an hour. We have fixed dozens of bugs this way. When the cause is not obvious, we "refine" the crash by adding if-statements and assertions so that after the next release we gain one additional bit of information from the stack trace about the state of execution.
However there was always a stubborn tail of field reports that couldn't be explained: corrupt stack pointers, corrupt g registers (the thread-local pointer to the current goroutine), or panics dereferencing a pointer that had just passed a nil check. All of these point to memory corruption.
In theory anything is possible if you abuse unsafe or have a data race, but I audited every use of unsafe in the executable and am convinced they are safe. Proving the absence of data races is harder, but nonetheless races usually exhibit some kind of locality in what variable gets clobbered, and that wasn't the case here.
In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).
As a programmer I've been burned too many times by prematurely blaming the compiler or runtime for mistakes in one's own code, so it took a long time to gain the confidence to suspect the foundations in this case. But I recently did some napkin math (see https://github.com/golang/go/issues/71425#issuecomment-39685...) and came to the conclusion that the surprising number of inexplicable field reports--about 10/week among our users--is well within the realm of faulty hardware, especially since our users are overwhelmingly using laptops, which don't have parity memory.
I would love to get definitive confirmation though. I wonder what test the Firefox team runs on memory in their crash reporting software.
A 5 part thread where they say they're "now 100% positive" the crashes are from bitflips, yet not a single word is spent on how they're supposedly detecting bitflips other than just "we analyze memory"?
> last year we deployed an actual memory tester that runs on user machines after the browser crashes.
He doesn't explain anything indeed but presumably that code is available somewhere.
I'm glad to see somebody is getting some data on this, I feel bad memory is one of the most underrated issues in computing generally. I'd like to see a more detailed writeup on this, like a short whitepaper.
It is rumored heavily on HN that when the first employee of Google, Craig Silverstein was asked about his biggest regret, he said: "Not pushing for ECC memory."
One of the nicest guys I have met. Was an intern at Google at that time, firing off mapreduces then (2003-2004) was quite a blast. The Peter Weinberger theme T-shirt too.
It's true that in the very early days Google used cheap computers without ECC memory, and this explains the desire for checksums in older storage formats such as RecordIO and SSTable, but our production machines have used ECC RAM for a long time now.
Also a polite reminder that most of those crashes will be concentrated on machines with faulty memory so the naive way of stating the statistic may overestimate its impact to the average user. For the average user this is the difference between 4/5 crashes are from software bugs and 5/5 crashes are from software bugs, and for a lot of people it will still be 5/5
> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects! If I subtract crashes that are caused by resource exhaustion (such as out-of-memory crashes) this number goes up to around 15%.
Crashes caused by resource exhaustion are still software bugs in Firefox. At least on sane operating systems where memory isn't over-comitted.
Memory isn't the only resource.
The next logical step would be to somehow inform users so they could take action to replace the bad memory. I realize this is a challenge given the anonymized nature of the crash data, but I might be willing to trade some anonymity in exchange for stability.
The easy solution for that is to just do that analysis locally... Firefox doesn't submit the full core dumps anyhow for this exact reason and therefore needs to do some preprocessing in any case.
>The next logical step would be to somehow inform users so they could take action to replace the bad memory.
This isn't really feasible: have you looked at memory prices lately? The users can't afford to replace bad memory now.
>> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects!
I find this impossible to believe.
If this were so all devs for apps, games, etc... would be talking about this but since this is the first time I'm hearing about this I'm seriously doubting this.
>> This is a bit skewed because users with flaky hardware will crash more often than users with functioning machines, but even then this dwarfs all the previous estimates I saw regarding this problem.
Might be the case, but 10% is still huge.
There imo has to be something else going on. Either their userbase/tracking is biased or something else...
Try running two instances of Firefox in parallel with different profiles, then do a normal quit / close operation on one after any use. Demons exist here.
is there a way to get the memory tester he mentioned? Is it open source? Once Ram goes bad is there a way or recovering it or is it toasted forever?
You can map known-bad memory regions to avoid using them.
https://www.memtest86.com/blacklist-ram-badram-badmemorylist...
https://www.memtest86.com/
Errors may be caused by bad seating/contact in the slots or failing memory controllers (generally on the CPU nowadays) but if you have bad sticks they're generally done for.
People I think are overindexing on this being about "Bad hardware".
We have long known that single bit errors in RAM are basically "normal" in terms of modern computers. Google did this research in 2009 to quantify the number of error events in commodity DRAM https://static.googleusercontent.com/media/research.google.c...
They found 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year.
At the time, they did not see an increase in this rate in "new" RAM technologies, which I think is DDR3 at that time. I wonder if there has been any change since then.
A few years ago, I changed from putting my computer to sleep every night, to shutting it down every night. I boot it fresh every day, and the improvements are dramatic. RAM errors will accumulate if you simply put your computer to sleep regularly.
There is DRAM which is mildly defective but got past QC.
There are power suppliers that are mildly defective but got past QC.
There are server designs where the memory is exposed to EMI and voltage differences that push it to violate ever more slightly that push it past QC.
Hardware isn't "good" or "bad", almost all chips produced probably have undetected mild defects.
There are a ton of causes for bitflips other than cosmic rays.
For instance, that specific google paper you cited found a 3x increase in bitflips as datacenter temperature increased! How confident are you the average Firefox user's computer is as temperature-controlled as a google DC?
It also found significantly higher rates as RAM ages! There are a ton of physical properties that can cause this, especially when running 24/7 at high temperatures.
470k crashes in a week? Considering how low their market share is, that would suggest every install crashes several times a day... I gotta call bs.
Based on what data? According to their reporting they have around 200 Million monthly users, which seems compatible with 470k crashes a week? See <https://data.firefox.com/dashboard/user-activity>
2% worldwide? https://gs.statcounter.com/browser-market-share
Granted, they're probably just as accurate as netcraft. /shrug
The nuance here is of cause that there are a bunch of people using multiple browsers. Also I mean there are a lot of people using browsers on the world
For my part I'm not sure I recall a crash having daily driven firefox in quite some time. I'd suspect that the large number of bit errors might be driven by a small number of poor hardware clients.
Wouldn't it be more likely the faulty machines are crashing pretty often.
470k crashes / week
67k crashes / day
claim: "Given # of installs is X; every install must be crashing several times a day"
We'll translate that to: "every install crashes 5 times a day"
67k crashes day / 5 crashes / install
12k installs
Your claim is there's 12k firefox users? Lol