ClownStrike
Maximizing shareholder value by using tried-and-true industry-standard systems and services is going just great.
Let's see if "nobody ever got fired for choosing Windows" still holds a week from now.
so, microsoft windows crowd is on strike, huh
"All software has bugs " is the "boys will be boys
" of the IT industry.
The CrowdStrike thing is basically the "Ever Given stuck in the Suez Canal" of the IT industry.
All the techies losing hair, sleep, and family time trying to get this un-stuck are the excavator operators trying to get things un-fucked.
Cannot wait for the first tech media galaxy-brained piece that finds a way to blame this on "hackers", somehow.
Because obviously: computer go bad? Hackers!
This kind of failure is *systemic*, but of course it will get blamed on some lowly techie somewhere whose name is on the commit message.
> It was all Steve.
> We have now fired Steve, thus solving the problem once and for all.
> Bonuses to all management for a job well done!
Yet another example of why techies might want to consider unionizing.
Crowdstrike:
> The fix is to delete C-00000291*.sys
Google:
> quick, we need to call ICANN and get .sys gTLD registered, stat!
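Joking aside, for anyone actually stuck in the boot loop: the widely-circulated workaround boils down to booting the affected box into Safe Mode or the recovery environment and deleting the offending channel files. A rough sketch of the idea in Python, assuming the default install path; treat it as illustration, not gospel:

```python
# Rough sketch of the widely-circulated workaround; the affected machine has
# to be booted into Safe Mode or the Windows Recovery Environment first,
# because it blue-screens before a normal login. Default install path assumed.
from pathlib import Path

driver_dir = Path(r"C:\Windows\System32\drivers\CrowdStrike")

for f in driver_dir.glob("C-00000291*.sys"):
    print(f"deleting {f}")
    f.unlink()
```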
"The buck stops with me!" – tech CEO says, head held high, pocketing the profit while firing Steve the intern, whose name happened to be on the commit message.
A gentle reminder that very recently the broader FLOSS community avoided a potentially massive security problem, the sneaky attempted xz backdoor affecting OpenSSH, only because a Microsoft developer was curious about a weird slowdown.
I mean, screw Big Tech and that whole ecosystem, but we might want to take this opportunity to consider our own potential failure modes and near-misses.
Wait, CrowdStrike's CEO's name is Kurtz?
Is he, perchance, a colonel?
CrowdStrike is a small, local, struggling, resource-constrained mom-and-pop infosec shop which should not be regulated because that would kill it, and also is a globally-recognized security vendor of advanced AI-based EDR tools that you should definitely use because the company is massive and has all the resources in the world that they can put to making their tools top-of-the-line, not like those FLOSS amateurs.
Obviously.
huge oops → #hugops
So, does anyone have any reasonably reliable info on what actually happened?
Hearing things from "the CDN CrowdStrike uses done goofed" to "someone at CrowdStrike decided to push a quickfix outside of the standard testing-staging-QA pipeline."
I'm sure I am not the only one who would really appreciate something solid on this.
Also, how did this get installed on the end systems? Aren't CrowdStrike's updates signed?
Does anybody have a better understanding of how such updates are signed (or not) on Windows?
This was a kernel driver update, right? Apparently it was not signed:
https://cyberplace.social/@GossiTheDog/112812317243841396
Wired claims that as a kernel driver update, it should have been signed by Microsoft:
https://www.wired.com/story/crowdstrike-outage-update-windows/
> [T]hey require that Microsoft also vet the code and cryptographically sign it, suggesting that Microsoft, too, may well have missed whatever bug in #CrowdStrike’s Falcon driver triggered this outage.
@rysiek huh, what happened?
@ParadeGrotesque @rysiek yeah. And I’m seeing all kinds of interesting reactions to something I didn’t know. But web+ap://toot.mirbsd.org/@osnews_rss/statuses/01J354WG7ENFWTKD9K6N20T46D seems to have a good explanation (OSnews).
@mirabilos @rysiek CrowdStrike shipped a Windows kernel module update that is just garbage and BSODs on boot, to all its customers
@rysiek hackers as in the hacker news move-fast-and-break-things sense
@rysiek: “hackers issued faulty crowdstrike update”
@rysiek NEWAG press release any moment
@rysiek IIUC, the problem is related to the kernel driver crowdstrike installs. if it was a linux kernel driver, it'd also likely generate an explosion.
@pnathan this is not a technical failure, this is a systemic failure. The underlying affected technology is much less important than the ecosystem that created that problem.
I do find it interesting you brought up Linux out of the blue. Nowhere in this whole thread have I mentioned Linux, FLOSS, or anything of the sort.
@rysiek
> Let's see if "nobody ever got fired for choosing Windows" still holds a week from now.
Sorry, I flipped the bit on Windows and found a penguin.
I'm not sure I precisely grasp what you mean by systemic ecosystem. The systemic ways of thinking here to my mind all yield things like "don't use workstations" - because the systemic needs for DLP, defense, and various regulatory compliances won't be going away.
@pnathan systemic failure in management on CrowdStrike side, let's start here. Clearly there needed to be better testing of updates, for example.
If we scratch the surface, I bet we find corner-cutting on QA and testing, for example.
I bet we also find a few near-misses over the last couple of years.
And I bet there's an ignored engineering/security memo, sitting in some middle-manager's mailbox, that outlines this scenario and ways to avoid it.
@pnathan the broader systemic problem is that incentives for people who do make decisions in tech organizations simply do not align with fixing such issues.
If I'm a middle manager, getting my bonuses based on near-term company stock and financial performance, why would I invest in additional testing and QA? Likelihood of failure is low, right?
Especially since when shit really hits the fan I can easily blame this on Steve from Engineering, whose name happens to be on the commit message?
@rysiek It's also a question of e&o insurance and the fact that quality is not something biz buyers care that much about. Compliance is about checkboxes. And CS sells, IIUC, compliance.
I feel bad for Steve. I hope he has a lot of savings.
(I would note that I interviewed at CS and their practices were definitely regressive in the department I was speaking with... regressive vs other corps XD).
@pnathan sure. But all these are also broader systemic issues.
Regressive you say? Well there you have it.
@rysiek
So many questions, so little popcorn.
@dzwiedziu right? I tapped into my strategic popcorn reserve!
@rysiek to add: they get installed automatically, as crowdstrike have agents which are installed on each endpoint
@rysiek @billy And apparently, normally customers can control the timing, but this time, that was overridden.
via https://news.ycombinator.com/item?id=41003390
@BenAveling @rysiek i didn’t catch onto that bit actually, interesting - i was wondering how my organisation got this update, for we have some specific settings (that bit’s not down to me though)
looks like i’m being invited to some big talks on monday
@BenAveling @billy never waste a good crisis!
@BenAveling @rysiek saw a message saying “this could’ve been friday afternoon” as soon as i logged on hahahaha, yeah i’ve got a load of questions lined up - in my company i’m slowly taking over ownership of some security tools, and a couple months ago it was supposed to be crowdstrike until the security lead decided cyberark was a higher priority
good fucking lord i’ve been complaining for the last couple weeks about cyberark epm, but now i couldn’t be more grateful
@billy @BenAveling I have seen this as well.
But here's my question:
> What happened here was they pushed a new kernel driver out to every client
If it were a kernel driver update, Microsoft would have had to sign it. Which would put some of the responsibility on MS here.
But! It does not seem like it was the driver file itself. It seems, from what I gather, that the update contained a signatures file or some other data file that was then parsed by the driver.
So, that doesn't track fully.
@rysiek @billy Something like that. The file was misformatted, so that caused bootup to barf, reboot, wash, rinse, repeat.
And those files are different for every customer (because DRM), so they must be autogenerated, so something broke the autogeneration or something slipped past the autogeneration, or IDK, is there a 3rd option?
@BenAveling @billy yeah, but my point here is that if the poster calls it "kernel driver update", then the poster clearly doesn't have the full understanding of what was going on.
I am not saying it's all wrong. I am saying: that's a "huh, need to be a bit careful here" moment.
@rysiek @BenAveling it won’t be a kernel driver update because that’s not exactly what crowdstrike does - but it’ll be a file that works closely with the kernel, which is an easy takeaway anyway. it’s hard to go into details on what the specifics are, and i’m hoping for a PIR from crowdstrike to conclude this (i’ll be pressing them for one if we don’t have any info)
but until then, enough has happened today - i’m going to enjoy my moretti and forget things for at least a couple of hours
@BenAveling @billy a .sys file is any file whose name ends in ".sys". It seems that in this case the .sys files that were the root issue – the ones CrowdStrike tells people to delete – are not loadable kernel modules but rather data for the module to process (malware signatures or some such).
@BenAveling @rysiek @billy They're not signed, and they're not a PE executable.
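Easy enough to check for yourself if you have one of the files handy: actual Windows drivers are PE executables and start with the DOS "MZ" magic bytes, so an opaque data blob shows itself immediately. A quick sketch:

```python
# Quick check whether a .sys file is actually a PE executable (real Windows
# drivers are) or just an opaque data blob: PE files start with the DOS
# header magic "MZ".
from pathlib import Path
import sys

path = Path(sys.argv[1])
magic = path.read_bytes()[:2]
print(f"{path}: {'looks like a PE executable' if magic == b'MZ' else 'not a PE executable'}")
```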
@rysiek Pretty sure everyone is in triage mode, doubt we’ll see accurate info for a week or two.
@rysiek From what I've read, this has actually happened in a similar fashion to Linux users before with CrowdStrike. Ultimately, I really do believe it boils down to poor production processes. I highly doubt that this company will even learn from this, but I hope they do.
@rysiek I haven't followed closely but what I heard early on was that the bug was an existing parser crash bug on bad input from data files (virus definition files or whatever) and an update to the data, not a change in the code, triggered it.
@dalias right, that seems to be the case indeed.
But do you know if such a definition update is usually distributed signed? I cannot imagine the answer is "no", but…
@rysiek @dalias If that was the case, and pure data updates had less rigid testing procedures, it would explain a lot.
Signing wouldn't have prevented it then; what would have is a more holistic test approach with every data update, or finding the existing bug, or a form of error handling that doesn't break a whole operating system.
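To make that last point concrete: the difference between a parser that crashes on garbage and one that rejects garbage and keeps running on known-good data is conceptually tiny. A toy sketch, with a completely made-up file format, and obviously nothing like the actual driver code:

```python
# Toy illustration of the difference between a parser that assumes
# well-formed input and one that rejects garbage and keeps the previous,
# known-good data. The file format here is entirely made up; this is the
# shape of the idea, not kernel code.
def parse_channel_file(raw: bytes) -> dict:
    count = int.from_bytes(raw[:4], "little")      # 4-byte record count
    records = {}
    for i in range(count):
        chunk = raw[4 + i * 8 : 4 + (i + 1) * 8]   # fixed 8-byte records
        if len(chunk) < 8:
            raise ValueError("truncated record")    # garbage / truncated input
        records[i] = chunk
    return records

def load_update(raw: bytes, current: dict) -> dict:
    try:
        return parse_channel_file(raw)
    except (ValueError, IndexError):
        # Malformed update: log it and keep running on the previous data
        # instead of taking the whole system down with it.
        print("update rejected: malformed channel file")
        return current
```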
@kmetz @dalias yeah. But if @GossiTheDog is right, and I have no reason not to believe him, the signature files from different b0rked customers were all garbage, but *different garbage*:
https://cyberplace.social/@GossiTheDog/112812260542179660
Which would suggest that either:
1. data files were not signed, relying perhaps on HTTPS for integrity;
2. the signature verification code is what crapped out on garbage data.
Gossi also claims that "the files aren't signed":
https://cyberplace.social/@GossiTheDog/112812317243841396
Which would mean 1.
@rysiek @dalias @GossiTheDog Ok, you meant signed for end-to-end transport integrity (not e.g. by Microsoft saying „we approve this“). That could have obviously prevented it.
To add a 3. to your points, data files could be supposed to look different locally (maybe locally encrypted) for some reason, but I guess that would have already been noticed by @GossiTheDog
@rysiek @dalias @GossiTheDog So… then an even more holistic test approach of phased rollouts, starting with own machines in different regions behind different CDNs and such, would have prevented it.
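The logic of such a phased rollout is simple enough to sketch; the ring names, sizes, and crash-rate threshold below are all made up for illustration:

```python
# Hand-wavy sketch of a staged rollout: push to small rings first, watch a
# health signal, and halt automatically if it degrades. Ring names, sizes
# and the crash-rate threshold are made up for illustration.
ROLLOUT_RINGS = [
    ("internal", 0.001),   # vendor's own machines first
    ("canary",   0.01),    # opted-in early-adopter customers
    ("early",    0.10),
    ("general",  1.00),    # everyone else
]

MAX_CRASH_RATE = 0.001     # halt if >0.1% of updated hosts start crashing

def rollout(push_update, crash_rate_for) -> bool:
    """push_update(ring, fraction) deploys; crash_rate_for(ring) reports back."""
    for ring, fraction in ROLLOUT_RINGS:
        push_update(ring, fraction)
        rate = crash_rate_for(ring)
        if rate > MAX_CRASH_RATE:
            print(f"halting rollout: crash rate {rate:.2%} in ring '{ring}'")
            return False
    return True
```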
@rysiek From what I've gathered so far from the ever-reliable Social Media Feeds™ it appears that it wasn't a driver update, but rather an update data file that then caused the unchanged driver to crash when it tried to parse and apply it, which would effectively bypass the signing requirement.
@dos right. But one would assume that the update was at least signed by CrowdStrike?
@rysiek Well, you linked to someone who claimed it wasn't. How trustworthy is that report? You tell me!
@dos I don't know, which is why I am asking for further confirmation.
> wouldn't make any difference if they were as that just validates what you sent is what you got.
I have seen things that suggest the CDN that CrowdStrike uses b0rked the files. I have also seen people mentioning that on different b0rked systems the files that were installed as part of the b0rked update were all *different* garbage.
If these are true, signing and verifying these signatures could have prevented both of these.
Hence me asking about signing.
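To be concrete about what signing buys you here: if the agent verified a detached signature over each data file before handing it to the driver, a file mangled by a CDN or corrupted on disk simply wouldn't verify and could be discarded. A sketch using Ed25519 via the Python cryptography package; the key handling and update layout are my assumptions, not how CrowdStrike actually does it:

```python
# Sketch of verifying a detached signature over an update data file before
# it is handed to whatever parses it: a file mangled by a CDN or corrupted
# on disk fails verification and never gets applied. Uses Ed25519 from the
# 'cryptography' package; key distribution and file layout are assumptions.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_update(data: bytes, signature: bytes, vendor_pubkey: bytes) -> bool:
    pubkey = Ed25519PublicKey.from_public_bytes(vendor_pubkey)
    try:
        pubkey.verify(signature, data)   # raises InvalidSignature on mismatch
        return True
    except InvalidSignature:
        return False

# Usage sketch: only hand the file to the driver if the signature checks out.
# if verify_update(data, sig, pinned_vendor_key):
#     apply_update(data)
# else:
#     print("update rejected: bad or missing signature")
```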