Six Questions About Big Data Cyber Risk Answered

One of the hottest topics for both Dell EMC and Hortonworks today is how to protect big data repositories, or data lakes, from the emerging breed of cyber-attacks. We sat down to address some of the common questions we've faced on this topic, and we'd love to hear your thoughts and contributions. Our thanks also to Simon Elliston Ball for his contributions to the discussion.

Photo by Markus Spiske on Unsplash

What new threats pose a specific risk to big data environments?

The threats to big data environments come in a few broad areas.

First, these are ‘target rich’ environments. Years of consolidating data, in order to simplify management and deliver value to data science and analytics, make for an appealing destination for cyber attackers. These environments will be subject to many ‘advanced persistent threats’: cyber attackers and organisations using highly focused and targeted techniques, ranging from spear phishing to DDoS attacks, to gain access to or exploit your big data platforms in some way.

Second, they are powerful computational environments, so threats like encryption attacks, if ever unleashed on big data operating environments, could spread very rapidly.

Third, big data repositories are often accessible to many employees internally. In general, this is a good thing: how else could organisations tap into the potential value of big data? But a comprehensive framework to monitor and manage data access and security is required to protect against possible abuse or exploits.

What is it about big data environments that makes them more or less vulnerable to threats like WannaCry and other ransomware?

The good news is that WannaCry and other ransomware variants currently in the field don’t really target the operating systems on which big data platforms run. The bad news is that it’s probably just a matter of time before they do. And because these environments are very capable computational resources, these sorts of exploits could spread fast if steps aren’t taken to protect them.

What are some best practices to limit the possible spread of malware like WannaCry?

There’s a lot about the way big data platforms are architected that could potentially protect against these forms of malware, assuming the right steps are taken. Here are some suggestions:

  • First, conduct basic big data hygiene. Many organisations have historically perceived big data environments, such as Apache Hadoop clusters, as internal-only resources protected by the network firewall. This may well be the case (to a point), but the nature of APTs means that if it’s there, people will find a way to reach it. If you’ve left default passwords in place, or haven’t set sensible access restrictions for employees (governed and audited by tools like Apache Ranger), get that all done. Access controls will also limit the spread of any encryptionware to the data sets accessible to each compromised user or set of credentials.
  • Test it! Conduct sensible security procedures to assess the potential vulnerabilities of your data stores through penetration testing and assessment. Deploy whatever countermeasures you deem necessary to reduce the risk at hand to acceptable levels.
  • Deploy behavioural security to protect your environment. The industry guesstimate is that 300 million new viruses and malware variants arrive each year. Signature-based security will fail against ‘day zero’ threats, so behavioural analytics is essential to monitor activity across the environment and to detect, as well as protect against, potential infections. If a system notices large-scale read/write activity typical of an encryption attack (but very unusual for a normal data lake), it can shut that activity down dynamically by policy; see the sketch after this list.
  • Set a sensible snapshot policy to allow for ‘rollbacks’ at the levels that meet the recovery point objectives and recovery time objectives set for key data sets. This won’t necessarily mean creating daily snapshots of a multi-petabyte data lake, but it might mean that certain critical data sets are snapshotted more routinely than less critical ones. You can of course set these tiers in policy, given the right resources, and the native snapshot capability of the Hadoop Distributed File System (HDFS) is a massive boon here.
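
To make the behavioural-analytics point concrete, here is a minimal sketch of the kind of check such a system performs: it scans an HDFS-style audit log and flags users whose write activity looks more like an encryption attack than normal data lake usage. The log path, record layout and threshold below are illustrative assumptions; a real deployment would use a dedicated behavioural analytics platform rather than a script.

```python
from collections import Counter

# Hypothetical audit log format, one record per line, e.g.
#   2023-11-01T10:15:02 user=jsmith op=create path=/data/finance/x.parquet
AUDIT_LOG = "/var/log/hadoop/hdfs-audit.log"   # assumed path
WRITE_OPS = {"create", "append", "rename", "delete", "setPermission"}
WRITE_THRESHOLD = 10_000  # writes per window that would be unusual for this lake

def parse(line):
    """Pull the user and operation out of a single audit record."""
    fields = dict(part.split("=", 1) for part in line.split() if "=" in part)
    return fields.get("user"), fields.get("op")

def flag_suspicious(log_path=AUDIT_LOG, threshold=WRITE_THRESHOLD):
    """Count write-type operations per user and flag anyone over the threshold."""
    writes = Counter()
    with open(log_path) as log:
        for line in log:
            user, op = parse(line)
            if user and op in WRITE_OPS:
                writes[user] += 1
    return [user for user, count in writes.items() if count > threshold]

if __name__ == "__main__":
    for user in flag_suspicious():
        print(f"ALERT: unusually heavy write activity from {user}")
```

The point is not the script itself but the policy behind it: once write activity deviates far enough from the baseline for a given user or credential set, access can be suspended automatically.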

Do IT organisations know how to set Recovery Point Objectives (RPOs) and Recovery Time Objectives (RTOs) for big data environments?

One of the most common misunderstandings in deploying big data environments is that you can still set RTOs and RPOs for the infrastructure as a whole. You can’t: it’s too large, and you’d have to build in such a vast amount of redundancy as to make the whole thing commercially impossible. Rather, you need to set RTOs and RPOs for individual data sets or storage tiers within the environment. In this context, you need to allow sufficient slack in your resources for the right number of snapshots to be in place for key data sets to insulate you from risk. This might be anything from 30 to 50 percent unused capacity in a given storage tier made available for snapshots, though the upper end would be verging on overkill in most cases. A simple sketch of a tiered snapshot policy follows.
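
As a minimal illustration of setting policy per data set rather than per environment, the sketch below drives a tiered HDFS snapshot schedule from Python. The directory names, frequencies and retention counts are assumptions for illustration; `hdfs dfsadmin -allowSnapshot` and `hdfs dfs -createSnapshot` are the standard HDFS snapshot commands, and in practice this would be driven by cron or an orchestration tool rather than run by hand.

```python
import subprocess
from datetime import datetime, timezone

# Hypothetical tiers: critical data gets frequent snapshots, colder data fewer.
SNAPSHOT_TIERS = {
    "/data/critical":  {"frequency": "hourly", "retain": 24},
    "/data/important": {"frequency": "daily",  "retain": 7},
    "/data/archive":   {"frequency": "weekly", "retain": 4},
}

def enable_snapshots(path):
    """Mark a directory as snapshottable (admin operation, needed once per path)."""
    subprocess.run(["hdfs", "dfsadmin", "-allowSnapshot", path], check=True)

def take_snapshot(path):
    """Create a named snapshot such as s20231101T1015 and return its name."""
    name = "s" + datetime.now(timezone.utc).strftime("%Y%m%dT%H%M")
    subprocess.run(["hdfs", "dfs", "-createSnapshot", path, name], check=True)
    return name

if __name__ == "__main__":
    for path, policy in SNAPSHOT_TIERS.items():
        enable_snapshots(path)
        name = take_snapshot(path)
        print(f"{path}: created {name} ({policy['frequency']}, retain {policy['retain']})")
```

Recovery after an encryption attack is then a matter of copying data back from the read-only `.snapshot` directory of the affected path, which is what makes the per-data-set RPO and RTO conversation tractable.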

What about tackling the employee challenge to big data security?

Educating employees is a critical part of protecting any environment, as people are a more likely first entry point into an organisation than anything else. That means raising employee awareness of the dangers of spear phishing, modern malware attacks and beyond. The standard tricks of redirecting people to websites and downloads, via dubious email attachments and the like, have become much more sophisticated.

The people who attempt to hack a Hadoop cluster might start by hitting a system administrator with a ServiceNow helpdesk request… This camouflage makes it difficult to spot. It’s important to remember that the people coming after these resources are good: not script kiddies or mass market ransomware opportunists, but people intent on causing serious damage, for either ideological or commercial reasons.

Even with training, people will remain a weak link. Given another guesstimate that the “per event” reputational and regulatory impact of a breach can cost up to two percent of market cap, having good remediation policies, processes and technologies in place for the eventual, inevitable breach is key.

How do these security practices tie into wider security, risk and compliance objectives for a business?

The critical component here is the audit piece, given the need to know exactly where your data is being stored, controlled and processed, and what it’s being used for, in an evolving regulatory context. This is something you apply to your own use of big data, but it is also something big data enables you to achieve for other systems. The audit and exfiltration monitoring tools you build in as part of your hygiene planning around big data are useful, for example, but these logs are no use without analytics, and without the ability to cross-reference and cross-check other data resources: if a piece of personal information has been accessed on one system, does it also exist on others? And should it therefore have been deleted from all of them?
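
As a small illustration of that cross-referencing idea, the sketch below joins audit extracts from several systems by data-subject identifier and reports subjects that were deleted in one system but still exist elsewhere. The system names and record format are hypothetical; in practice this is exactly the kind of join a big data platform performs at scale over real audit logs.

```python
# Hypothetical audit extracts: for each system, the last recorded action
# against each data-subject identifier.
audit_extracts = {
    "crm":       {"subject-123": "deleted", "subject-456": "read"},
    "billing":   {"subject-123": "read",    "subject-456": "read"},
    "data_lake": {"subject-123": "read",    "subject-789": "updated"},
}

def deletion_gaps(extracts):
    """Find subjects deleted in at least one system but still live in another."""
    gaps = {}
    for system, records in extracts.items():
        for subject, action in records.items():
            if action != "deleted":
                continue
            still_present = [
                other for other, other_records in extracts.items()
                if other != system
                and other_records.get(subject) not in (None, "deleted")
            ]
            if still_present:
                gaps[subject] = still_present
    return gaps

if __name__ == "__main__":
    for subject, systems in deletion_gaps(audit_extracts).items():
        print(f"{subject} was deleted in one system but still exists in: {systems}")
```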

The rise in the volumes of unstructured data represents a huge number of unknowns. As such, we are going to see a major opportunity around digital transformation. Organisations are going to be forced to assess how they handle data and make big improvements in the structure of their environments, their ability to run those analytics, and their ability to pull back information in a short amount of time. Otherwise, they may be exposed to regulatory enforcement or investigation for failing to embed appropriate data governance and data security.


For those interested in practical ways to tackle these problems, Dell EMC Isilon has built-in tools that aid in recovery from a ransomware attack; however, detection and prevention are a much better alternative. Fortunately, Dell EMC partners with Superna and Varonis to offer ideal solutions.

If you’re interested in how Dell EMC Isilon and Hortonworks customers tackle other challenges around gaining value from their big data, join our upcoming webinar on “Batch + real-time analytics convergence” in late November. Register here.

About the Author: Ross Porter

Ross leads the Dell EMC EMEA Isilon Unstructured Data and Analytics Presales team. He is focused on delivering innovative scale-out, object, analytics and big data solutions to customers across multiple industries and geographies. He has held previous leadership positions in systems engineering management and global systems integrator technical alliances, as well as advisory and consultant roles. Ross has spoken on technology and change leadership at various global events, partner forums, and leadership conferences.