President Trump’s obsession with the cybersecurity firm CrowdStrike has brought renewed attention to the company’s role in probing the 2016 hack of the Democratic National Committee and its part in Trump’s first impeachment. CrowdStrike, a US-based cybersecurity firm, was hired by the DNC to investigate the hacking operation. The company has been at the center of false conspiracy theories since 2016, theories that have been prominent on right-wing blogs and news websites.
CrowdStrike is best known for its investigation of Russia’s hack of the Democratic National Committee in the run-up to the 2016 election. Unsubstantiated conspiracy theories about a purported Ukrainian link to the DNC hack began circulating almost immediately after the breach was disclosed, and Trump pressed the Ukrainian president to investigate the debunked claim.
American intelligence officials have assessed that Russia engaged in a years-long campaign to frame Ukraine for the 2016 election interference, and that the Kremlin, not Kyiv, remains the prime suspect in the hack itself. According to testimony by DNC IT contractor Yared Tamene Wolde-Yohannes, the FBI attributed the breach to the Russian government as early as September 2015. The conspiracy theory alleges that Ukraine, not Russia, was behind the 2016 US election hacks, relying on a raft of rumors, vague accusations, and debunked claims.
📹 Debunking The Crowdstrike Conspiracy Theory | NBC News Now
📹 Real men test in production… The truth about the CrowdStrike disaster
An analysis of CrowdStrike’s official explanation of the code that resulted in the largest IT disaster in history, which crashed 8.5 million …
I was once an employee at Carbon Black, a competitor to CrowdStrike, working in automated testing. Its software development practices were competitive with the worst of any organization I’ve ever been exposed to. The devs were fairly smart, but the assumption was that the purpose of testing was to bless the code they had written. I agreed to step up and manually test one dev’s code, and I reported back that every time I tried to run it, it killed the process without leaving any diagnostics. The dev asked how he could troubleshoot the problem without any good data. I looked at his code and identified that he wasn’t checking for null pointers, just dereferencing them anyway. This was an important step in getting myself terminated for not being a team player.
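As a rough illustration of the missing guard described above, here is a minimal C++ sketch (the Event type and handler are hypothetical, not the actual Carbon Black code): a null input is rejected and logged instead of being dereferenced.

```cpp
#include <cstdio>

struct Event {
    const char* name;  // hypothetical event payload
};

// Hypothetical handler: the point is only that a null input must be
// rejected (and ideally logged) before it is dereferenced.
void handle_event(const Event* ev) {
    if (ev == nullptr) {
        std::fprintf(stderr, "handle_event: null event, dropping\n");
        return;  // leave a diagnostic instead of killing the process
    }
    std::printf("processing event: %s\n", ev->name);
}

int main() {
    Event e{"logon"};
    handle_event(&e);       // normal path
    handle_event(nullptr);  // logged and dropped rather than crashing
    return 0;
}
```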
As a web developer, I can confirm testing in production is the best way to go: the added pressure focuses you, and it saves having to push things to production. The way I like to do it is over ftp with notepad. Or if I’m on the toilet I’ll use my phone and edit the files directly using the cpanel file manager. If my software was running on all the most vital computers in the world, I imagine that pressure would make me sharp as a knife and I’d never make a mistake.
Programmers are generally terrified about missing deadlines and will do whatever you command them to. It’s up to the project manager to track delays and ensure the boss is notified in advance that deadlines will be missed. It’s up to the boss to ensure they have good project managers and QA testing practices. Yes, this is indeed an organizational failure.
In my experience here’s how problem solving with code works. “I want to solve this problem with code. Here is my plan” “Let’s write the code now.” “Testing the code. Oh no, there are bugs.” “I fixed the bugs.” “Oh wait, what’s this?” “This problem has to do with stuff I can’t just fix, guess I’ll work around it.” “I hate my life, this is really hard.” “This works, but it shouldn’t. It looks ugly and I hate it.” “Whatever, it’s working.”
The company I work for has under 200 employees and under 30 devs, and we devs are writing education software. But even we have 5 levels of test environments before any change hits production. That’s on top of automated tests written by the API devs, automated tests written by the front-end devs, and automated end-to-end testing by the QA team. Then there are required peer reviews of all code, and manual testing by the QA devs. It’s scary that a software company with such a critical product is releasing code without at least these guardrails.
Let’s also mention that not only does the driver run in kernel mode, it’s flagged to load at boot. That is why this outage was so bad: Bluescreen because of driver -> Reboot -> ah, this driver is marked as an essential part of the system that we can’t boot without -> Bluescreen. Meaning that rolling out a fix will not repair machines automatically; an IT tech has to go to every single machine and manually boot it into safe mode to get the fix applied.
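For context, a boot-start driver is one whose service entry in the registry has Start = 0, which is why the machine hits the bad driver again on every reboot. A minimal Win32 sketch that reads that flag (the service name CSAgent is an assumption used purely for illustration; link against Advapi32):

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    // Start values: 0 = boot, 1 = system, 2 = auto, 3 = demand, 4 = disabled.
    // "CSAgent" is a placeholder; substitute any driver service name.
    DWORD start = 0;
    DWORD size = sizeof(start);
    LONG rc = RegGetValueA(HKEY_LOCAL_MACHINE,
                           "SYSTEM\\CurrentControlSet\\Services\\CSAgent",
                           "Start", RRF_RT_REG_DWORD, nullptr, &start, &size);
    if (rc != ERROR_SUCCESS) {
        std::fprintf(stderr, "could not read Start value (error %ld)\n", rc);
        return 1;
    }
    std::printf("Start = %lu%s\n", start,
                start == 0 ? " (boot-start: loaded very early in the boot sequence)" : "");
    return 0;
}
```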
I worked at a small dot-com in the early 2000s. We had a QA process for pushing changes to the production web sites. After the QA department had tested a new release, the QA manager manually signed a form printed on a sheet of paper, and that sheet of paper was handed to the sysadmin responsible for deploying changes to production. Seems like a foolproof process? Nope. After I’d worked there a few months, the QA manager told me that the producers (product owners) were printing out those forms and forging the QA manager’s signature. We had no idea we were pushing untested code to production, yet until we found out about this we were the ones being blamed for the production web sites being unreliable.
“Failing upwards” seems to equal “They ~sure~ look great in a suit, let’s promote them!”. I’ve seen this over, and over, and over, over the last 30+ years, and it never ends well. It usually goes one of two ways: 1. The person in charge of a thing ends up being so bad at or disinterested in their job that some really important thing ends up spectacularly failing even though they avoid blame (i.e. today’s example), and they stick around to screw up the next thing they’re put in charge of. Occasionally they suffer the consequences of their ignorance, but by then the organizational and reputational damage is done. 2. They muck around for a few years, cluelessly rising on the org chart until they shuffle off to some new employer who’s even more impressed with their fashion sense, usually leaving behind a two-comma morass of overdue projects, impossible deadlines, expensive and inappropriate software subscriptions, disgruntled technical staff, and the like.
I’m a retired IT guy, part of a team that did global pushes quite regularly. While a flaw in one of our pushes might “only” take down our presence on the web, there were layers upon layers of pre-push testing, staged releases, and so forth. I remember the pucker factor each and every time we did a “for real” push. I empathize when I hear of D’oh!!! misadventures.
So apparently CrowdStrike Falcon broke a Debian image about three months ago, but because Linux doesn’t actually force software updates, it only fucked the VMs of a few dozen nerds, who reported the issue and rolled back to the previous image, instead of taking down the entire global ecosystem. Seems like there are a few lessons to be learned here.
What’s worse is that Crowdstrike updates bypass staging policies. So even the smart companies that run critical software updates in their own test systems first to make sure they don’t break anything before updating all computers still got the CS update forced upon them. So not only did they ignore their own staging and testing policies, they also ignored everyone else’s staging and testing policies.
The school of “Hey, it compiled, it must work.” I’ve been coding for almost 40 years. Yeah, I’m old. It drives me nuts that we do not learn lessons. A company hiring a guy who thinks that delivering software and letting users run it counts as testing should have its entire C-suite fired. What happened to the concepts of continuous integration and automated testing? Bosses are always too cheap, arrogant, impatient, whatever, to put money into testing. And clients, to be fair, are also disinclined to plan and budget for testing.
This boggles my mind as an IT professional. I was part of a team that deployed patches and software for years. This included OS deployment, patch deployment, software deployment, the whole thing, on both workstations and servers. We tested our patches extensively before pushing them out to the entire population of the environment. This started with a sandbox environment, then a select user/system environment, and then we would stage our patches out over several hours so that if something happened we could back out before catastrophe struck. And honestly, sometimes we would find problems with the patches, and we would be able to immediately stop, suspend, and even back out. Yes, we would use 3rd-party vendor solutions to help with this, and any time we changed ANYTHING we would follow our testing procedures and matrix, normal business. We would never shirk our procedure of test first, then deploy. To me this is a total failure of IT governance and a failure to maintain standards. (IT governance is the setting and maintaining of standards and policies for the IT infrastructure.)
Extra little detail: there wasn’t so much a logic problem in the offending file… the file was null. Not zero size, but full of nothing but null bytes. And their kernel module apparently does ZERO validity checking before trying to work with such files. That should be criminal negligence, but that is literally legally impossible, since zero enforceable software standards of any kind exist.
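A sketch of the kind of validity check being described, assuming a made-up header format (CrowdStrike’s real channel-file layout is not public): reject anything too short, all zeroes, or missing the expected magic number before kernel code ever touches it.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

// Hypothetical content-file magic number, purely for illustration.
constexpr std::uint32_t kExpectedMagic = 0xAAC0FFEE;

bool validate_content(const std::vector<std::uint8_t>& buf) {
    if (buf.size() < 8) return false;                    // too short to hold a header
    bool all_zero = std::all_of(buf.begin(), buf.end(),
                                [](std::uint8_t b) { return b == 0; });
    if (all_zero) return false;                          // the "all null bytes" case
    std::uint32_t magic = buf[0] | (buf[1] << 8) | (buf[2] << 16)
                        | (static_cast<std::uint32_t>(buf[3]) << 24);
    return magic == kExpectedMagic;                      // reject anything unexpected
}

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s <file>\n", argv[0]); return 2; }
    std::ifstream in(argv[1], std::ios::binary);
    std::vector<std::uint8_t> buf((std::istreambuf_iterator<char>(in)),
                                  std::istreambuf_iterator<char>());
    std::puts(validate_content(buf) ? "content accepted" : "content rejected");
    return 0;
}
```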
One of the big issues is that CrowdStrike tries to do two things differently than their competitors: 1. They want to be the fastest to protect machines around the world from novel malware techniques. 2. They want their sensors to be extremely lightweight. There are two types of antivirus updates: agent/sensor updates and content updates. Agent updates are rolled out slowly by an IT organization. (This allows IT to test on, and maybe brick, say, 10 machines before they go and brick 10,000.) Content updates (definition updates) are pushed to all machines, because what the bad guys are doing is constantly changing. Most EDR software vendors make major changes to kernel-level detection logic with agent updates. Because of CrowdStrike’s goals, however, they push most of that logic into content updates. That philosophy and design choice has come back to haunt them.
Any IT organization worth its salt would not give a third party direct access to update their critical servers and workstations simultaneously, and on a completely unknown and ad-hoc schedule. They should have demanded pilots, staggered roll-outs, and well-documented AUTOMATIC ROLLBACK and RECOVERY processes. These are not “new lessons”. Those paradigms were drilled into my head in 1998 and had probably been around for at least a decade before that. An outside vendor having Ring 0 access to a company’s servers is also a huge security risk. Heads should roll – not just at CrowdStrike but also for the IT director of any company that implemented CrowdStrike without control mechanisms in place.
It probably was an organisational failure… Like Boeing, the manifestation is doors blowing off etc… but the real cause is unwise organisational changes… to boost profits, personal and corporate… at the expense of quality… I don’t know if Crowdstrike has shareholders, but there will be pressure to increase profitability… … outsource IT to India… employ under qualified, cheaper, staff etc… put pressure on managers to deliver… who put pressure on staff to deliver… … and the manifestation is… global computer failure…
As someone who has been interested in security nightmares for some time, I am not surprised that such a catastrophe has occurred. After all, it was foreseeable that it would happen sooner or later. But I can’t help but wonder how it is that still nobody is demanding that software manufacturers be held liable for damage caused by their junk software. As long as IT companies like CrowdStrike get away with a slap on the wrist, nothing will change. Then it’s only a matter of time until the next security nightmare.
What everyone seems to be overlooking is the very concept of “pushing” automatic updates. Just a few decades ago, if you owned a computer, you decided which updates to apply and when to apply them. That way, you could test the updates first, have a recovery plan ready if needed, and perform the update at a time of your choosing. This is an example of what happens when we cede control of our responsibilities to someone we assume has our best interests at heart. They don’t, not any of them.
Real men test in production… incrementally. Why anyone would deploy any change to their entire production audience at once boggles my mind. It is so easy to update a small percentage of production, wait to see if it causes a problem, and then either fix the problem or push on. I own a tiny little software company, but we follow this staggered roll-out plan religiously. As a result, few users ever see bugs in new releases.
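A minimal sketch of that kind of staggered (canary) rollout; the wave percentages and host names are made up, but the idea is to deterministically assign each machine to a cohort and only widen the audience once a wave looks healthy.

```cpp
#include <cstdio>
#include <functional>
#include <string>

// Deterministically map a machine ID to a bucket in [0, 100).
// Hash-based assignment keeps each machine in the same cohort across waves.
int rollout_bucket(const std::string& machine_id) {
    return static_cast<int>(std::hash<std::string>{}(machine_id) % 100);
}

// Waves are illustrative: 1% canary, then 10%, then everyone.
bool should_update(const std::string& machine_id, int wave) {
    const int wave_percent[] = {1, 10, 100};
    if (wave < 0 || wave > 2) return false;
    return rollout_bucket(machine_id) < wave_percent[wave];
}

int main() {
    const char* machines[] = {"host-0001", "host-0002", "host-0003", "host-0004"};
    for (int wave = 0; wave < 3; ++wave) {
        std::printf("wave %d:", wave);
        for (const char* m : machines)
            if (should_update(m, wave)) std::printf(" %s", m);
        std::printf("\n");
    }
    return 0;
}
```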
I’m glad this CrowdStrike error didn’t affect me; thankfully I don’t have CrowdStrike on my Windows computer! Sad that so many computers were affected by it, though. Airports and hospitals running CrowdStrike couldn’t use their machines because they kept crashing from this bug, and had to revert to manual, paper-based processes. So many flights delayed, so many surgeries cancelled; one airport even had to write flights on a whiteboard! I find it funny how computers leave so much room for error, and every time we end up reverting to paper and 1980s technology. The past was more stable in terms of technology, I guess. Computers can crash, but paper can’t.
From a QA perspective, this could have been caught by static analysis tools, manual testing, automated testing, fuzzing, or by a gradual release instead of pushing to all machines at once. Now, this problem might slip through individual steps, but the fact that it slipped through all of them means QA was just woefully insufficient for a kernel driver.
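On the fuzzing point specifically, a minimal libFuzzer-style harness looks like the sketch below; the toy parse_content_update stands in for whatever the real content parser is, and a harness like this would have exercised the all-zero and truncated cases automatically.

```cpp
// Build (assumes clang): clang++ -g -fsanitize=fuzzer,address fuzz_parser.cpp
#include <cstddef>
#include <cstdint>

// Stand-in parser: any function that consumes an untrusted update blob.
void parse_content_update(const uint8_t* data, size_t size) {
    if (size < 4) return;                      // reject anything too small
    uint32_t record_count = data[0];           // toy "header" field
    // Bounds-check every offset derived from the file itself.
    for (uint32_t i = 0; i < record_count; ++i) {
        size_t offset = 4 + static_cast<size_t>(i) * 8;
        if (offset + 8 > size) return;         // malformed: stop instead of reading past the end
        // ... interpret record ...
    }
}

extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
    parse_content_update(data, size);          // libFuzzer calls this with random inputs
    return 0;
}
```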
CrowdStrike were deeply incompetent in an area where rushing is less important than testing. So they rushed anyway, and probably, because they push that shit out sans testing quite a lot, just thought, “Phhhhpt, it’ll be fine. It was before.” I would be chasing their pro-grade whacky McAfee off my PCs if it were my coin. You can’t trust a company that specializes in preventing downtime yet negligently delivers a big mountain of it.
Most of my recent employers require all manner of automated test levels. The devs mostly govern the unit and integration testing, while I’ve implemented API testing, contract tests with PACT for communicating SaaS services, e2e testing (including front-end) using either Cypress or Playwright, and small subsets of those e2e tests as synthetic tests automatically run on every merge to main. All of this is orchestrated through a series of CI pipelines, each with its own quality gate of requirements specified by SRE, triggered on each branch and merge. Each merge automatically spins up an ephemeral test environment for those tests to run against, and then a final production-like staging environment before release to production. All this is just for some medium-sized enterprise web platforms, so imagine what Clownstrike should have.
From what I have read and seen, the CrowdStrike config files actually contain a type of code, so this allows CrowdStrike to basically change the functionality and logic of their driver even though, at one point in the past, it was certified by Microsoft. This mechanism basically means the WHQL certification means absolutely nothing. CrowdStrike should be sued for this, as it’s a way of circumventing WHQL certification.
I read that the bug was actually in the deployment pipeline: after the package was tested, it was copied to prod, but the script that copied the definitions file was buggy and ended up copying a null file. So the testing was there, just for the app package and not for the deploy pipeline itself.
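A sketch of the post-copy integrity check that would catch exactly this failure mode: the artifact that ships must be byte-identical to the artifact that was tested (the checksum scheme here is illustrative; a real pipeline would use SHA-256).

```cpp
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Simple FNV-1a checksum, used here only to illustrate the idea.
std::uint64_t fnv1a(const std::vector<std::uint8_t>& buf) {
    std::uint64_t h = 0xcbf29ce484222325ull;
    for (std::uint8_t b : buf) { h ^= b; h *= 0x100000001b3ull; }
    return h;
}

std::vector<std::uint8_t> read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::vector<std::uint8_t>((std::istreambuf_iterator<char>(in)),
                                     std::istreambuf_iterator<char>());
}

int main(int argc, char** argv) {
    if (argc < 3) {
        std::fprintf(stderr, "usage: %s <tested-file> <deployed-file>\n", argv[0]);
        return 2;
    }
    auto tested = read_file(argv[1]);
    auto deployed = read_file(argv[2]);
    if (deployed.empty() || fnv1a(tested) != fnv1a(deployed)) {
        std::fprintf(stderr, "ABORT: deployed artifact does not match what was tested\n");
        return 1;
    }
    std::puts("deployed artifact matches tested artifact");
    return 0;
}
```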
Rust programmers do tend to pop up to promote the language frequently. And it’s kinda suspect that despite their supposed expertise they couldn’t even look at the “professional C++ developer” “stack trace dump” with a bit of scepticism. I’m just saying, it’s clearly not beyond them to manufacture something like this. Remember when they said Rust shouldn’t replace anything? They’re an organized and deceptive bunch.
I have worked for some of the largest software companies in the world and turned down most of the rest that I haven’t worked for. As soon as I see them doing weird things, like managers dictating WHEN the software will be delivered, I start looking for a new job. Vote with your feet and choose which companies are good.
Being a quality engineer at an aerospace company, I see this issue with quality control seeping out from everywhere. Take a look at Boeing! They clearly wanted to get sales up instead of making quality aircraft. The same applies to CrowdStrike. As a quality engineer, I constantly battled with management, saying that if they pushed quality aside, they would have a problem. Boeing’s problems with quality have alerted other aerospace companies to strengthen quality control, and companies are hiring experienced quality personnel at a rapid rate!
The file that CrowdStrike released causing this problem was apparently all zeroes. This is NOT the first time CrowdStrike has released updates that crashed large numbers of machines, though; it is actually the THIRD time they’ve done it. This time, however, it was an update sent to one of the world’s largest installed bases and therefore had a far larger impact. Obviously they failed to learn from the previous problems, which places this event in the category of corporate malfeasance. Note, though, that the companies who use this software ALSO failed to keep up with the news and learn that CrowdStrike was playing fast and loose, so their failure to put mitigations in place (i.e. testing before installing updates) means CrowdStrike isn’t completely to blame, just mostly. In fact this would have been an even larger outage, except that some companies did have policies of testing before allowing updates.
As someone who worked at CrowdStrike, I can assure you they have a testing environment for sensor updates, and they also heavily preach about it during training, where they instruct you to test on some machines before releasing into production. In light of that, it shouldn’t come as a surprise that I call bullshit on this being an accident. The engineer or analyst who pushed this update through did not test it, and that’s about it. He was probably fired too, as CS fires employees on a weekly basis for much less, let alone for making every major news outlet speak poorly of you and your stock fall by 30%.
To be fair, I’ve done a lot of deployments on Fridays… but I always have it in mind that I might have to stay up over the weekend as a result. Critical releases were only done when I knew I could work the weekend 😅. The only way this can happen is if the team doesn’t have any QA and the devs don’t bother testing for whatever reason, like having no easy way to test, etc.
Man, showing Stockton Rush in that picture is brutal. But yes, it’s an organizational problem. Theory one is the only one that makes any sense. I’ve seen it dozens of times in my career. When companies aren’t doing testing or deployments right, it’s not because they lack individuals with the skills to do those things. It’s that the organization is broken in some fundamental way, and actively blocks or prevents any attempts to implement better testing or fancy staggered-rollout deployments. I’d like to be optimistic and hope that this failure fixes those problems at the company, but it’s clear that the broken culture goes all the way up to the CEO himself.
As somebody who works in QA: what the heck. I initially thought some tests had failed or not run and somebody signed off on the push to production anyway, or that the test env didn’t completely match the production env due to lack of funding, so the automated update tests passed but there was a problem anyway, but it seems much worse than that. Guh.
My question is how the file became malformed or null in the first place. Also, the lack of an if statement might explain why previous updates to other systems resulted in similar issues, but it still leaves some room for other possibilities. Definitely seems like many things went wrong leading up to this event. They definitely lacked a lot of basic best practices; whether they had policies laid out to enforce them I don’t know, but relying on an automated testing system without monitoring it, and without a manual check on top, isn’t smart. Forcing updates in real time on top of that, especially when their program has kernel access, is crazy. I too wondered if maybe some of this was deliberate, considering everything happening and the fact that so much went wrong in this case that could have been avoided.
The company I work for does consultancy for a lot of high-profile government agencies and a couple of Fortune 500 companies. What really amazes me about this CrowdStrike incident is that it didn’t happen sooner and doesn’t happen much more frequently. Seriously, in my career I’ve seen a lot of “critical” pieces of software that are held together by hopes and dreams. You look at those things and wonder, “how the fuck is this software, coded by some dude in the 90s using Fortran and running on a Windows XP machine, holding up the entire global finance sector?”
I’m not a computer geek, but I do follow my instincts and read up on stuff like this. I’ve never understood why apps need constant updates, and when I have updated, it usually winds up screwing up some other function. Basically, I do as few updates as possible. My living comes from working online. The fact that this debacle was caused by a company that should have had higher testing standards reinforces in my mind the BS and fk-ups behind many updates.
Why is no one talking about the rather irresponsible practice of automatically applying patches to critical systems? All these companies are not just blindly trusting CrowdStrike and other vendors to test their stuff properly, but also assuming that those patches won’t interfere with something else in their particular ecosystem. If you just install patches without testing, the buck stops with you.
I only hope that we don’t see massive regulation about this. Regulation would not solve the problem; it creates a bad and rigid “solution” that doesn’t fix anything while making life harder for competitors by forcing them to comply with arbitrary rules that supposedly solve everything forever. The real solution is letting CrowdStrike fail or earn back customer trust. The market is an automatic, self-correcting problem-solving system that doesn’t need a gazillion layers of bureaucracy to come up with a rubber-stamp program that doesn’t actually solve anything while raising the barrier to entry for actual solutions. Even when the market comes up with a rubber-stamp program that fails, there are consequences for the people holding the damn stamp machine. For government, there are rarely if ever consequences.
Explain this to me like I’m a 4th grader: what organization tolerates the risk of vendor updates to code running in the OS kernel without running a basic sanity test in a sandbox or some other QA environment? I get that CrowdStrike is going to take a lot of blowback, but frankly, the organizations that dodged this bullet and were running Falcon should also be of interest.
Hackers attack production mainly, so some controlled testing in production is inevitable, for security at least. One hopes it is a controlled test, but such high privileges now clearly need a rethink. Sure, a hacker who has gotten into the internal network may try to get useful stuff out of internal test environments, so they will poke around there too. But the treasure is in production, so in the security world, offensive testing in production is a reality. That’s the clash to solve here: development, and any testing for that matter, will always be somewhat at odds with security, and security testing even more so. How much security is too much security, and how much tying of security’s hands will let a hacker win? The problem is that fear rules the discussion around such questions. Better rationalisation is required via maths, science, logic, and philosophy/core domain knowledge.
It’s strange to me that people keep blaming C++ and/or memory unsafe languages when the actual root cause was a failure to validate external data before executing it. Data validation has to occur regardless of how memory safe a language is if you don’t want bad things to happen. The file was all zeroes, which should be an obvious validation failure in any programming language.
I’m just amazed that this security guard did more worldwide damage than any thief ever could 😅 This was a dumpster fire… do they not have any staging or dev environments? I’ve worked on teams without QAs… we always managed to stop the disasters in dev environments. You really gotta be in “I’m feeling lucky!” mode to manage to do this.
Real IT professionals develop & debug in a dev environment. Real IT professionals deploy into a test environment that mirrors the production environment for thorough testing. Real IT professionals only roll out into production environment once the testing in the test environment has been successful. Real IT professionals don’t segment their code to bypass WHQL certification of their “config files”. I bet Kaspersky is shaking its head…
This just proves that config files suck. The idea is to have a writable file holding information that the code uses, so the code doesn’t have to be changed for updates. But it’s impossible to test every possible combination of every config file, even with automation. Ring 0 should not allow any config files; everything should be hardcoded.
I’ve worked in software support for over a decade now. The spread of Agile development and the pressure to continuously push updates mean almost everything only gets tested in production these days. These sorts of things are only going to get worse unless Microsoft makes Windows more resilient to this kind of software error. And don’t get me wrong – CrowdStrike is just as much, if not more, to blame for losing perspective on the impact of its product.