There's been a lot of focus on MongoDB, Elastic and Redis when it comes to data exposure on the Internet, largely due to their popularity in the developer community. However, in terms of sheer data volume it turns out that HDFS is the real juggernaut. To give you a better idea, here's a quick comparison between MongoDB and HDFS:
| | MongoDB | HDFS |
|---|---|---|
| Number of Servers | 47,820 | 4,487 |
| Data Exposed | 25 TB | 5,120 TB |
Even though many more MongoDB databases are connected to the Internet without authentication, in terms of data exposed they are dwarfed by HDFS clusters (25 TB vs. 5 PB). Where are all these instances located?
Most of the HDFS NameNodes are located in the US (1,900) and China (1,426). And nearly all of the HDFS instances are hosted in the cloud, with Amazon leading the charge (1,059) followed by Alibaba (507).
The ransomware attacks on databases that were widely publicized earlier in the year are still happening, and they're impacting both MongoDB and HDFS deployments. For HDFS, Shodan has discovered roughly 207 clusters carrying a message that warns of their public exposure. And a quick glance at the search results in Shodan reveals that most of the public MongoDB instances appear to be compromised. I've previously written about the reasons behind these exposures, but note that both products nowadays have extensive documentation on secure deployment.
Technical Details
If you'd like to replicate the above findings or perform your own investigation into data exposure, here is how I measured the numbers above.
Download data using the Shodan command-line interface:
shodan download --limit -1 hdfs-servers product:namenode
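As a side note, the country and hosting-provider breakdowns mentioned earlier can be reproduced without downloading the full result set by asking Shodan for facets. Here is a minimal sketch using the official shodan Python library; the SHODAN_API_KEY environment variable is my own convention, not something the original measurement relies on:

import os

import shodan

# Aggregate exposed NameNodes by country and organization using facets.
# Assumes the API key is exported as SHODAN_API_KEY (my convention).
api = shodan.Shodan(os.environ['SHODAN_API_KEY'])

# count() only returns the total and the facet summaries, not individual banners
results = api.count('product:namenode', facets=[('country', 5), ('org', 5)])

print('Total NameNodes: {}'.format(results['total']))
for facet in ('country', 'org'):
    print('Top values for {}:'.format(facet))
    for entry in results['facets'][facet]:
        print('  {}: {}'.format(entry['value'], entry['count']))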
Write a Python script to measure the amount of exposed data (hdfs-exposure.py):
from shodan.helpers import iterate_files, humanize_bytes
from sys import argv, exit

if len(argv) <= 1:
    print('Usage: {} <file1.json.gz> ...'.format(argv[0]))
    exit(1)

datasize = 0
clusters = {}

# Loop over all the banners in the provided files
for banner in iterate_files(argv[1:]):
    try:
        # Grab the HDFS information that Shodan gathers
        info = banner['opts']['hdfs-namenode']
        cid = info['ClusterId']

        # Skip clusters we've already counted
        if cid in clusters:
            continue

        datasize += info['Used']
        clusters[cid] = True
    except:
        pass

print(humanize_bytes(datasize))
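Note that the script sums exposure per ClusterId rather than per banner: the same HDFS cluster can show up in the results more than once (for example via multiple NameNodes or IP addresses), and counting every banner would inflate the total.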
Run the Python script to get the amount of data exposed:
$ python hdfs-exposure.py hdfs-servers.json.gz
5.0 PB
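The MongoDB side of the comparison can be estimated in much the same way. Below is a rough sketch of an equivalent mongodb-exposure.py; the script name, the product:MongoDB query and, in particular, the assumption that Shodan stores the MongoDB listDatabases response (with its totalSize value in bytes) under opts.mongodb are mine, so check the structure of a sample banner before relying on the output:

from shodan.helpers import iterate_files, humanize_bytes
from sys import argv, exit

if len(argv) <= 1:
    print('Usage: {} <file1.json.gz> ...'.format(argv[0]))
    exit(1)

datasize = 0

# Loop over all the banners in the provided files
for banner in iterate_files(argv[1:]):
    try:
        # Assumption: open MongoDB banners carry the listDatabases response,
        # whose totalSize field is the storage used in bytes
        info = banner['opts']['mongodb']['listDatabases']
        datasize += info['totalSize']
    except (KeyError, TypeError):
        continue

print(humanize_bytes(datasize))

It would be run against a download created the same way as before, e.g. shodan download --limit -1 mongodb-servers product:MongoDB followed by python mongodb-exposure.py mongodb-servers.json.gz.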