All our servers and company laptops went down at pretty much the same time. Laptops have been bootlooping to blue screen of death. It’s all very exciting, personally, as someone not responsible for fixing it.
Apparently caused by a bad CrowdStrike update.
Edit: now being told we (who almost all generally work from home) need to come into the office Monday as they can only apply the fix in-person. We’ll see if that changes over the weekend…
Reading into the updates some more… I’m starting to think this might just destroy CrowdStrike as a company altogether. Between the mountain of lawsuits almost certainly incoming and the total destruction of any public trust in the company, I don’t see how they survive this. Just absolutely catastrophic on all fronts.
The number of servers running Windows out there is depressing to me
>Make a kernel-level antivirus
>Make it proprietary
>Don’t test updates… for some reason??
I mean, I know it’s easy to be critical, but this was my exact thought: how the hell didn’t they catch this in testing?
I have had numerous managers tell me there was no time for QA in my storied career. Or documentation. Or backups. Or redundancy. And so on.
Just always make sure you have some evidence of them telling you to skip these.
There’s a reason I still use lots of email in the age of IM. Permanent records, please. I will email a record of in person convos or chats on stuff like this. I do it politely and professionally, but I do it.
A lot of people really need to get into the habit of doing this.
“Per our phone conversation earlier, my understanding is that you would like me to deploy the new update without any QA testing. As this may potentially create significant risks for our customers, I just want to confirm that I have correctly understood your instructions before proceeding.”
If they try to call you back and give the instruction over the phone, then just be polite and request that they reply to your email with their confirmation. If they refuse, say “Respectfully, if you don’t feel comfortable giving me this direction in writing, then I don’t feel comfortable doing it,” and then resend your email but this time loop in HR and legal (if you’ve ever actually reached this point, it’s basically down to either them getting rightfully dismissed, or you getting wrongfully dismissed, with receipts).
My engineering prof in uni was big on journals/log books for CYA, and it’s stuck with me. I write down everything I do during the day: research, findings, etc. Easily the best bit of advice I ever got.
>Permanent records, please.
The issue with this is that a lot of companies have a retention policy that only retains emails for a particular period, after which they’re deleted unless there’s a critical reason why they can’t be (e.g. to comply with a legal hold). It’s common to see 2, 3 or 5 year retention policies.
Unless their manager works in Boeing.
There’s some holes in our production units
Software holes, right?
…
…software holes, right?
Move fast and break things! We need things NOW NOW NOW!
Push it onto the technical debt pile. Then never pay off the technical debt
Completely justified reaction. A lot of the time tech companies and IT staff get shit for stuff that, in practice, can be really hard to detect before it happens. There are all kinds of issues that can arise in production that you just can’t test for.
But this… This has no justification. An issue this immediate, this widespread, would have instantly been caught with even the most basic of testing. The fact that it wasn’t raises massive questions about the safety and security of Crowdstrike’s internal processes.
>most basic of testing
“I ran the update and now shit’s proper fucked”
That would have been sufficient to notice this update’s borked
I think when you are this big you need to roll out any updates slowly, checking along the way that all is good.
The failure here is much more fundamental than that. This isn’t a “no way we could have found this before we went to prod” issue, this is a “five minutes in the lab would have picked it up” issue. We’re not talking about some kind of “Doesn’t print on Tuesdays” kind of problem that’s hard to reproduce or depends on conditions that are hard to replicate in internal testing, which is normally how this sort of thing escapes containment. In this case the entire repro is “Step 1: Push update to any Windows machine. Step 2: THERE IS NO STEP 2”
There’s absolutely no reason this should ever have affected even one single computer outside of Crowdstrike’s test environment, with or without a staged rollout.
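To put numbers on “five minutes in the lab”: even a dumb smoke-test gate like the sketch below would have flagged a build that bluescreens every Windows box it touches. This is not anyone’s actual pipeline; the VM name, the deploy script, and the timings are all invented for illustration.

```python
# Hypothetical "five minutes in the lab" gate: push the candidate build to one
# throwaway Windows VM and refuse to ship if the box doesn't come back.
# The VM name, the deploy script, and the timings are all invented here.
import subprocess
import sys
import time

TEST_VM = "win-canary-01"     # hypothetical lab VM
BOOT_WAIT_SECONDS = 300       # give it time to apply the update and reboot
PING_ATTEMPTS = 5

def vm_responds(host):
    """True if the host answers a single ping (Windows-style '-n 1')."""
    result = subprocess.run(["ping", "-n", "1", host], capture_output=True)
    return result.returncode == 0

def main():
    # Step 1: push the candidate build to the lab VM.
    # 'deploy_update.py' is a stand-in for whatever internal tool does the push.
    subprocess.run([sys.executable, "deploy_update.py", "--host", TEST_VM], check=True)

    # Step 2: wait out the reboot, then check the machine is still alive.
    time.sleep(BOOT_WAIT_SECONDS)
    for _ in range(PING_ATTEMPTS):
        if vm_responds(TEST_VM):
            print("Lab VM survived the update; OK to widen the rollout.")
            return
        time.sleep(30)

    print("Lab VM never came back. Do NOT ship this build.")
    sys.exit(1)

if __name__ == "__main__":
    main()
```

The entire class of “bricks every machine on boot” failures dies at a gate like that.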
God damn, this is worse than I thought… This raises further questions… Was there NO testing at all??
Tested on Windows 10S
My guess is they did testing but the build they tested was not the build released to customers. That could have been because of poor deployment and testing practices, or it could have been malicious.
Such software would be a juicy target for bad actors.
Agreed, this is the most likely sequence of events. I doubt it was malicious, but definitely could have occurred by accident if proper procedures weren’t being followed.
Yes. And Microsoft’s
How exactly is Microsoft responsible for this? It’s a kernel level driver that intercepts system calls, and the software updated itself.
This software was crashing Linux distros last month too, but that didn’t make headlines because it affected fewer machines.
From what I’ve heard, didn’t the issue happen not solely because of the CS driver, but because of an MS update that was rolled out at the same time, and the changes that update made caused the CS driver to go haywire? If that’s the case, there’s not much MS or CS could have done to test it beforehand, especially if both updates rolled out at around the same time.
I’ve seen zero suggestion of this in any reporting about the issue. Not saying you’re wrong, but you’re definitely going to need to find some sources.
Are there any links to this?
My apologies I thought this went out with a MS update
From what I’ve heard, and to play devil’s advocate, it coincided with Microsoft pushing out a security update at basically the same time, which caused the issue. So it’s possible that they didn’t have a way to test it properly, because they didn’t have the update at hand before it rolled out. So the fault wasn’t only a bug in the CS driver, but in the driver’s interaction with the new Windows update, which they didn’t have.
How sure are you about that? Microsoft very dependably releases updates on the second Tuesday of the month, and their release notes show if updates are pushed out of schedule. Their last update was on schedule, July 9th.
I’m not. I vaguely remember seeing it in some posts and comments, and it would explain things pretty well, so I kind of took it as likely. In hindsight, you’re right, I shouldn’t have been spreading hearsay. Thanks for the wake-up call, honestly!
You left out
>Pushed a new release on a Friday
You left out > Profit
Oh… Wait… Hang on a sec.
Lots of security systems are kernel-level (at least partially); this includes SELinux and AppArmor, by the way. It’s a necessity for these things to actually be effective.
You missed the most important line:
>Make it proprietary
Never do updates on a Friday.
Yeah my plans of going to sleep last night were thoroughly dashed as every single windows server across every datacenter I manage between two countries all cried out at the same time lmao
I always wondered who even used Windows Server, given how marginal its market share is. Now I know, from the news.
Marginal? You must be joking. A vast amount of servers run on Windows Server. Where I work alone we have several hundred and many companies have a similar setup. Statista put the Windows Server OS market share over 70% in 2019. While I find it hard to believe it would be that high, it does clearly indicate it’s most certainly not a marginal percentage.
I’m not getting an account on Statista, and I agree that its market share isn’t “marginal” in practice, but something is up with those figures, since internet-hosted services overwhelmingly run on top of Linux. Internal servers may be a bit different, but I’d expect “servers” to count internet servers…
Most servers aren’t Internet-facing.
There are a ton of Internet-facing servers running Linux: the vast majority of cloud instances, and every cloud provider’s own infrastructure except Microsoft’s (and their in-house “Windows” for Azure hosting is somehow different, though they aren’t public about it).
In terms of on-premise servers, I’d even say HPC groups may outnumber internal Windows servers. While there are relatively fewer HPC installations, each represents racks and racks of servers, and that market is 100% Linux.
I know a couple of retailers and at least two game studios are keeping at-scale Windows a thing, but Linux mostly dominates my experience of large-scale, on-premise scale-out deployments.
Linux is just so likely in scenarios with lots of horizontal scaling that it’s hard to imagine Windows still holding a majority share of the market when all is said and done, given that it’s usually deployed in smaller quantities in any given place.
It’s stated in the synopsis, below where it says you need to pay for the article. Anyway, it might be true, as the hosting servers themselves often host up to hundreds of Windows machines. But it really depends on what is measured and the method used, which we don’t know, because who the hell has a Statista account anyway.
>internet-hosted services overwhelmingly run on top of Linux
This is a common misconception. Most internet hosted services are behind a Linux box, but that doesn’t mean those services actually run on Linux.
This is a CrowdStrike issue specifically related to the Falcon sensor. It happens to affect only Windows hosts.
Well, I’ve seen some, but they usually don’t have automatic updates and generally do not have access to the Internet.
It’s only marginal for running custom code. Every large organization has at least a few of them running important out-of-the-box services.
Not too long ago, a lot of Customer Relationship Management (CRM) software ran on MS SQL Server. Businesses made significant investments in software and training, and some of them don’t have the technical, financial, or logistical resources to adapt - momentum keeps them using Windows Server.
For example, small businesses that are physically located in rural areas can’t use cloud-based services because rural internet is too slow and unreliable. It’s not quite the case that there’s no amount of money you can pay for a good internet connection in rural America, but the last time I looked into it, Verizon wanted to charge me $20,000 per mile to run a fiber optic cable from the nearest town to my client’s farm.
Almost everyone, because the Windows server market share isn’t marginal at all.
My current company does and I hate it so much. Who even got that idea in the first place? Linux always dominated server-side stuff, no?
You should read the saga of when MS bought Hotmail. The work they had to do to be able to run it on Windows was incredible. It actually helped MS improve their server OS, and it still wasn’t as performant when they switched over.
Yes, but the developers learned on Windows, so they wrote software for Windows.
In university computer science, in the States, MS server was the main server OS they taught my class during our education.
Microsoft takes a loss to let universities and students use and learn MS server for free, or at least they did at the time. This had the effect of making a lot of fresh-grad developers more comfortable with MS server, and I’m sure it led to MS server being used in cases where there were better options.
No, Linux doesn’t now nor has it ever dominated the server space.
How many coffee cups have you drank in the last 12 hours?
I work in a data center
I lost count
What was Dracula doing in your data centre?
Because he’s Dracula. He’s twelve million years old.
THE WORMS
Surely Dracula doesn’t use windows.
I work in a datacenter, but no Windows. I slept so well.
Though a couple years back some ransomware that also impacted Linux ran through, but I got to sleep well because it only bit people with easily guessed root passwords. It bit a lot of other departments at the company though.
This time even the Windows folks were spared, because CrowdStrike wasn’t the solution they infested themselves with (they use other providers, who I fully expect to screw up the same way one day).
There was a point where words lost all meaning and I think my heart was one continuous beat for a good hour.
Did you feel a great disturbance in the force?
Oh yeah I felt a great disturbance (900 alarms) in the force (Opsgenie)
How’s it going, Obi-Wan?
Here’s the fix (or rather workaround, released by CrowdStrike):
1) Boot to safe mode/recovery
2) Go to C:\Windows\System32\drivers\CrowdStrike
3) Delete the file matching “C-00000291*.sys”
4) Boot the system normally
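If you end up scripting step 3 across a pile of machines (say from a recovery environment that has Python available, or via remote management once you can get a shell), the core of it is just deleting that one channel file. Rough sketch below; it’s not an official CrowdStrike tool, and it assumes the default Falcon install path from the advisory plus admin rights.

```python
# Minimal sketch of step 3: remove the faulty channel file.
# Assumes the default Falcon install path from the advisory and admin rights,
# run from safe mode / a recovery shell that has Python available.
import glob
import os

DRIVER_DIR = r"C:\Windows\System32\drivers\CrowdStrike"

def remove_bad_channel_file():
    matches = glob.glob(os.path.join(DRIVER_DIR, "C-00000291*.sys"))
    if not matches:
        print("No matching channel file found; nothing to do.")
        return
    for path in matches:
        os.remove(path)
        print(f"Deleted {path}")

if __name__ == "__main__":
    remove_bad_channel_file()
    # Then reboot normally (step 4).
```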
This is going to be a Big Deal for a whole lot of people. I don’t know all the companies and industries that use Crowdstrike but I might guess it will result in airline delays, banking outages, and hospital computer systems failing. Hopefully nobody gets hurt because of it.
CrowdStrike: It’s Friday, let’s throw it over the wall to production. See you all on Monday!
^^so ^^hard ^^picking ^^which ^^meme ^^to ^^use
Good choice, tho. Is the image AI?
It’s a real photograph from this morning.
Not sure, I didn’t make it. Just part of my collection.
Fair enough!
We did it guys! We moved fast AND broke things!
When your push to prod on Friday causes a small but measurable drop in global GDP.
Actually, it may have helped slow climate change a little
The earth is healing 🙏
For part of today
With all the aircraft on the ground, it was probably a noticeable change. Unfortunately, those people are still going to end up flying at some point, so the reduction in CO2 output on Friday will just be made up for over the next few days.
Definitely not small: our website is down, so we can’t do any business, and we’re a huge company. Multiply that by all the companies that are down, plus lost time on projects and time to get caught up once it’s fixed, and it’ll be a huge number in the end.
GDP is typically stated by the year. One or two days lost, even if it was 100% of the GDP for those days, would still be less than 1% of GDP for the year.
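Back-of-envelope: two fully lost days is 2 / 365 ≈ 0.55% of the year, and nowhere near 100% of economic activity actually stopped, so the hit to annual GDP is some fraction of that.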
I know people who work at major corporations who said they were down for a bit, it’s pretty huge.
Does your web server run Windows? Or is it dependent on some systems that run Windows? I would hope nobody’s actually running a web server on Windows these days.
I have absolutely no idea. Not my area of expertise.
They did it on Thursday. All of SFO was BSODed for me when I got off a plane at SFO Thursday night.
Was it actually pushed on Friday, or was it a Thursday night (US central / pacific time) push? The fact that this comment is from 9 hours ago suggests that the problem existed by the time work started on Friday, so I wouldn’t count it as a Friday push. (Still, too many pushes happen at a time that’s still technically Thursday on the US west coast, but is already mid-day Friday in Asia).
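(For a concrete sense of that gap, taking US Central and Sydney as the two ends: 11 PM Thursday CDT is 4 AM Friday UTC, which is already 2 PM Friday AEST.)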
I’m in Australia so def Friday. Fu crowdstrike.
Seems like you should be more mad at the International Date Line.
Wow, I didn’t realize CrowdStrike was widespread enough to be a single point of failure for so much infrastructure. Lot of airports and hospitals offline.
The Federal Aviation Administration (FAA) imposed the global ground stop for airlines including United, Delta, American, and Frontier.
Flights grounded in the US.
Ironic. They did what they are there to protect against. Fucking up everyone’s shit
Maybe centralizing everything onto one company’s shoulders wasn’t such a great idea after all…
Wait, monopolies are bad? This is the first I’ve ever heard of this concept. So much so that I actually coined the term “monopoly” just now to describe it.
Crowdstrike is not a monopoly. The problem here was having a single point of failure, using a piece of software that can access the kernel and autoupdate running on every machine in the organization.
At the very least, you should stagger updates. Any change done to a business critical server should be validated first. Automatic updates are a bad idea.
Obviously, crowdstrike messed up, but so did IT departments in every organization that allowed this to happen.
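To be concrete about “stagger updates”: the usual shape is ring-based deployment with a validation gate between rings. A rough sketch, with the host lists, health check, and soak times invented for illustration; in practice you’d drive this from your endpoint-management tooling rather than a standalone script.

```python
# Rough sketch of a ring-based (staggered) rollout with a validation gate
# between rings. Host lists, the health check, and soak times are invented
# for illustration only.
import time

RINGS = [
    ("canary", ["lab-vm-01", "lab-vm-02"], 60),     # throwaway test boxes, 60 min soak
    ("pilot",  ["it-desk-01", "it-desk-02"], 240),  # IT's own workstations, 4 h soak
    ("broad",  ["everything-else"], 0),             # the rest of the fleet
]

def deploy_to(hosts):
    # Stand-in for the real deployment call (SCCM/Intune/whatever you use).
    print(f"Deploying update to: {hosts}")

def ring_is_healthy(hosts):
    # Stand-in health check: in reality, query monitoring for boot loops,
    # BSOD counts, missing agent heartbeats, etc.
    print(f"Checking health of: {hosts}")
    return True

def staged_rollout():
    for name, hosts, soak_minutes in RINGS:
        deploy_to(hosts)
        time.sleep(soak_minutes * 60)   # let problems surface before widening
        if not ring_is_healthy(hosts):
            print(f"Ring '{name}' is unhealthy; halting rollout here.")
            return
    print("Rollout complete.")

if __name__ == "__main__":
    staged_rollout()
```

The point isn’t the script, it’s the gate: nothing reaches the broad ring until the smaller rings have soaked without incident.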
You wildly underestimate most corporate IT security’s obsession with pushing updates to products like this as soon as they release. They also often have the power to make such nonsense the law of the land, regardless of what best practices dictate. Maybe this incident will shed some light on how bad an idea auto-updates are and get C-levels to do something about it, but even if they do, it’ll only last until the next time someone gets compromised by a flaw that was fixed in a dot-release.
Monopolies aren’t absolute, ever, but having nearly 25% market share is a problem, and is a sign of an oligopoly. Crowdstrike has outsized power and has posted article after article boasting of its dominant market position for many years running.
I think monopoly-like conditions have become so normalised that people don’t even recognise them for what they are.
Someone should invent a game, that while playing demonstrates how much monopolies suck for everyone involved (except the monopolist)
And make it so you lose friends and family over the course of the 4+ hour game. Also make a thimble to fight over, that would be dope.
Get your filthy fucking paws off my thimble!
I’m sure a game that’s so on the nose with its message could never become a commercialised marketing gimmick that perversely promotes existing monopolies. Capitalists wouldn’t dare.
I mean, I’m sure the companies that have them don’t think so, at least when they aren’t the cause of multi-industry collapses.
Yes, it’s almost as if there should be laws to prevent that sort of thing. Hmm
Well now that I’ve invented the concept for the first time, we should invent laws about it. We’ll get in early, develop a monopoly on monopoly legislation and steer it so it benefits us.
Wow, monopolies rule!
The too big to fail philosophy at its finest.
Since when has any antivirus ever had the intent of actually protecting against viruses? The entire antivirus market is a scam.
CrowdStrike has a new meaning… literally Crowd Strike.
They virtually blew up airports
An offline server is a secure server!
Honestly my philosophy these days, when it comes to anything proprietary. They just can’t keep their grubby little fingers off of working software.
At least this time it was an accident.
There is nothing less safe than a local network.
AV/XDR is not optional even in offline networks. If you don’t have visibility on your network, you are totally screwed.
The thought of a local computer being unable to boot because some remote server somewhere is unavailable makes me laugh and sad at the same time.
Clownstrike
Crowdshite haha gotem
CrowdCollapse
https://www.theregister.com/ has a series of articles on what’s going on technically.
Latest advice…
There is a faulty channel file, so not quite an update. There is a workaround…
- Boot Windows into Safe Mode or WRE.
- Go to C:\Windows\System32\drivers\CrowdStrike
- Locate and delete the file matching “C-00000291*.sys”
- Boot normally.
For a second I thought this was going to say “go to C:\Windows\System32 and delete it.”
I’ve been on the internet too long lol.
Yeah I had to do this with all my machines this morning. Worked.
Working on our units, but it only works if we are able to launch a command prompt from the recovery menu. Otherwise we get an F8 prompt and cannot start.
Bootable USB stick?
Yeah, some 95% of our end-user devices (>8000) have the F8 prompt. Logistics is losing their minds about the prospect of sending recovery USBs to roughly a thousand locations across NA.