Rime Of The Ancient Data Engineer


I had to make a decision when writing this post. Do I make it an extract from my upcoming book 'Data: A guide to humans' as a fairly blatant promotion in the form – Copy / Paste, Promote – spoilers and everything, or do I write something new? I have decided on the second, to write something new. So, no spoilers here. If you want to read the content of the book, please follow the link above and make a pledge. Then we will send you your very own copy!

From the fiends, that plague data thus!
Why look'st thou so? With my cross-bow,
I shot the birds.

The promotion now being done, what new thing do I have to say? Data is important.

That statement is easy to ignore in the morass of data-related messaging. I want to explain why it matters. 'Data' is everywhere in massive volumes. Data is in your mobile phone plan, your spreadsheet, dashboard or database. Big companies know things about you; they hold data on you. Data could save all our lives when used in health care. Data is the main concern of this thing called GDPR, which is going to be implemented on 2018-05-25. This is really important too; it upgrades your rights in ways that help you, personally, manage data about you.

Then all averred, I had dropped the data.
And had forgot to write a log,
'Twas right, said they, such data to slay,
But write a log, so it is never mislaid.

I used to embrace the term Big Data, but now it has become synonymous with the Hadoop and Spark technology ecosystem and, as such, it is becoming less useful as a term. It used to separate modern data work from work done in the past... with data. Data technology in the past was SQL databases and the applications which doted on them. Modern data technology, including the Hadoop ecosystem, helps us solve a variety of problems and take a data-first approach.

At the moment, I generally default back to just saying 'data' and leaving it at that. I find this leads to much more interesting conversations and opportunities. There are oceans of data available to humanity and that is deeply, unignorably, important.

Day after day, day after day,
We stuck, nor insight nor action;
As idle as a static dashboard
In a static industry.

Data gives us new things to know and new ways of knowing them. This has become the tag line for my personal mission. It defines me in my work and within it lies the importance of data. Data is now a fundamental part of the human epistemology (the study of what we know and how we know it) whether we like it or not.

This is a deep and fundamental shift, happening right now, that we need to take seriously. It is not a matter just for the Analysts or Data Scientists, nor just for the Data Engineers. They are rightly focused on the methodology of science and engineering respectively, working within tightly defined rules, practices and processes that bring rigor and much needed control to insight generation and the movement of data (this key, valuable, modern asset). The impact of data on the human epistemology is a matter for Data Philosophers.

Data, data, every where,
And all the mysteries did shrink;
Data, data, every where,
Nor any space to think.

"Don't be so pretentious" I hear you cry at your screen. "Data Philosophy indeed! Who do you think you are? Get back behind your keyboard and code some transformations. Less of this nonsense and badly adapted epic poetry."

I am a Data Philosopher, and I embrace this. I find a deep satisfaction in thinking about data and exploring how data impacts all aspects of human life. Yes, I know how to write SQL and transformation code in multiple different languages. I know how to conceptualise and design data processing systems all the way through to dashboards; as do many other people. Many better than me.

That is not what is important any more. What is important now are the philosophical concerns; reasoning about systems and people and how they interact. It is here that we bump into the need for Empathy (the main topic of the book linked above). We find the Ethics of Advanced Analytics (AA) and Artificial Intelligence (AI). We discover bias in the data and in ourselves. It is these topics that put into a human context the most important technological advances. We have the opportunity to discover how to save ourselves from ourselves. Believe me, we need saving.

The very data did rot – Oh Data!
That ever this should be.
Yea, slimy things did crawl with legs,
Upon the slimy web.

We have seen the negative uses of data pop up like unexpected warts on the smooth, unblemished skin of our darling social media. We have seen how rotten data has destroyed the best-intentioned AI. We have seen our very society manipulated at the core by AI trained to trick us into believing it gives us news. 'News', for many people, is how they 'know things'. When I was young, we were told 'not to believe what you read on the internet', but now many plug that pipe directly into their brain without question.

It is not the AI that is at fault here, it is the people and the data. AI still, at the time of writing, does what it is told. The goals and ways of achieving them are tightly constrained; a bucket of coefficients that control the logic flow. The data and the people are joined in an epistemological dance that needs our focus and mental effort.

Ah! Well a-day! What evil looks
Had I from old and young!
Instead of the cross, the data
About my neck was hung.

We must use what makes us human to help us understand what we are doing to ourselves and our planet.

Empathy is an important soft skill when working with data and AI. You can read more about this in 'Data: A guide to humans', where we give you practical models, steps and tools for developing your technical empathy in ways that will make you more successful. Make a pledge to support us to get your copy!

P.S. Upon what is this nonsense based? Well, this link should make it all clear.
