Data Pollution

Data Pollution is a real phenomenon, and somehow we don’t think about it, or care about it very much.

What is it? It is data that is redundant or doesn’t serve any real purpose. Its only purpose is to exist. We have forgotten about it. It is the shoes that you will never wear. It is the spices that you will never use. It is the frost in your freezer.

Every byte of data that has been created over and above the needs of humans or other programs must reside in memory somewhere. If it exists in volatile memory (such as your computer’s RAM) it is not a problem, since it is cleared anyway when you turn off your computer. But when it lands in permanent storage, such as your hard drive, it may turn into data pollution.

Data that is not on your computer and has been transferred to a data center (such as the website you are reading right now) is stored in one of 9 million data centers around the globe. Combined, these contribute 17% of the global carbon footprint caused by technology. On top of that, these data centers waste 90% of the energy they use (source). We need data centers because our data storage needs are immense. 90% of the world’s data was created in the last two years alone. We produce 2.5 quintillion bytes of data every day (source).

An example of data pollution is two photos that are very similar. People generally take multiple selfies. They choose the one they like the most and share it on social media, but the others stay on their phone’s disk and therefore join the vast silos of pollution. With the help of image processing algorithms that use a very basic level of machine learning, this data can be found and deleted. A burst photo set can be 10 photographs or more, and if you select one of them and discard the others, you are cleaning up the cloud just a little bit.
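To give a rough idea of how similar photos can be found, here is a minimal sketch using an average hash, which is a simple perceptual hashing technique rather than machine learning. It assumes each photo has already been shrunk to a tiny grayscale thumbnail (a grid of brightness values from 0 to 255), which in practice you would do with an imaging library. The thumbnails, function names and the threshold of 2 are made up for illustration.

```python
def average_hash(pixels):
    """Return one bit per pixel: 1 if the pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions where two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def looks_similar(pixels_a, pixels_b, max_distance=2):
    """Treat two thumbnails as near-duplicates if their hashes barely differ.

    max_distance is an arbitrary threshold chosen for this example.
    """
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance


# Hypothetical 4x4 grayscale thumbnails (0 = black, 255 = white).
selfie_1 = [
    [200, 200, 50, 50],
    [200, 200, 50, 50],
    [50, 50, 200, 200],
    [50, 50, 200, 200],
]
# Same scene with slightly different lighting.
selfie_2 = [
    [205, 198, 55, 48],
    [202, 199, 52, 47],
    [53, 49, 203, 201],
    [51, 50, 198, 204],
]
# A completely different picture.
landscape = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [240, 240, 240, 240],
    [240, 240, 240, 240],
]

print(looks_similar(selfie_1, selfie_2))   # the two selfies hash identically
print(looks_similar(selfie_1, landscape))  # the landscape differs in many bits
```

Photos flagged as similar would still need a human to pick the keeper. The hash only narrows down the candidates.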

Spot the differences

Even redundant data (meant for restoration in the case of a system failure) is pollution. Most of it is not necessary, since the failure rate of a given system may be low (modern cloud systems guarantee 99.99% uptime). So only 0.01% of redundant data is useful. But we keep it anyway because we have to keep our shareholders happy so that we can deliver that beautiful 99.99% number to our customers.
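To put that 99.99% figure in perspective, here is a small back-of-the-envelope sketch that converts an uptime percentage into the downtime it allows per year. It assumes a 365-day year, and the function name is my own.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(uptime_percent):
    """Minutes per year a system may be down while still meeting the uptime figure."""
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100


# 99.99% uptime leaves roughly 52.56 minutes of downtime per year.
print(round(allowed_downtime_minutes(99.99), 2))
```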

I used to burn all the anime that I watched onto CDs and then DVDs. I had more than 50 DVDs for Inuyasha alone. Combined, I estimate more than 400 CDs and DVDs (and I am not even talking about my games and movies). I know now that this was completely unnecessary.

It is possible to clean data pollution at a very early stage. But as time passes, the pollution becomes harder and harder to clean. Automated backups will move your hard drive’s contents to the cloud, at which point you have to manually delete the data there, which is often not very easy. If your data is being versioned, then old data is being archived (using tech like Amazon’s Glacier), and it is harder (read ‘boring’) to access it and get rid of it. What are you going to do with personal videos that you are never going to watch? There could be gigabytes of content like this. There is data on your old computers that you are never going to need. And you may move to another country, leaving all your old drives behind.

It is time to recycle all the devices that you are not using. Old hard drives that have thousands of movies you saw: you don’t need them anymore. Simply make a list of them. Use a simple terminal command to list them and save them in a spreadsheet. If you want to watch them again later, rent them on iTunes. Only keep one drive with your most important data. Do not keep multiple copies of your data. One is enough. Delete all your selfies. They serve no purpose. Selfies are the worst kind of pollution. Instagram thrives on such pollution.
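If you would rather not fiddle with terminal commands, the listing step could look something like the sketch below. It walks a folder and writes every file’s path and size to a CSV file that any spreadsheet program can open. The function name and column names are my own choices, not a standard tool.

```python
import csv
import os


def write_inventory(root_dir, csv_path):
    """Write one CSV row per file under root_dir: relative path and size in bytes.

    Returns the number of files listed.
    """
    rows = []
    for folder, _subfolders, filenames in os.walk(root_dir):
        for name in filenames:
            full_path = os.path.join(folder, name)
            relative_path = os.path.relpath(full_path, root_dir)
            rows.append((relative_path, os.path.getsize(full_path)))
    rows.sort()  # alphabetical order makes the spreadsheet easier to scan

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["path", "size_bytes"])
        writer.writerows(rows)
    return len(rows)
```

You would call it with the drive’s mount point and an output file, for example write_inventory("/Volumes/OldDrive", "old_drive_inventory.csv"), where both paths are placeholders for your own.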

Ultimately data pollution (in the form of software) becomes tangible pollution (hard disks that are no longer needed, or plastic USB drives – millions are produced but not useful, and after they get damaged, they must be thrown away).

Please clean your data before it becomes a problem to yourself and others.
