Data is hard. Data being hard, and people accepting that data is hard, is lame. Why is data hard? It’s because of old thinking, old tools and bad data citizens.
In this article I am going to talk about examples of bad data behaviour, the people responsible for it and the old thinking and habits that lead them to make data unnecessarily hard. Then I will move on to how we can all be good data citizens and work together to make data easy.
Data doesn’t have to be hard and every one of us can make it easier.
Historically some people believed the world was flat, and some believed it was carried on elephants riding on a tortoise. There is old thinking in data too. Below I discuss two of the most common types of old thinking when it comes to data.
1. Data should only live in tables
A table is one shape: it is a collection of small rectangles that make up one big rectangle. Rectangles are not the only shape. I don’t fit into a rectangle, you don’t fit into a rectangle and data in its great variety doesn’t only fit into rectangles. The fact that we have used rectangles for handling data in the past doesn’t mean they are the only way.
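As a minimal sketch of data that does not fit one rectangle, consider the hypothetical customer record below. The record, its field names and the `flatten` helper are all invented for illustration. Forcing it into table rows means repeating values and stuffing several emails into one cell.

```python
import json

# Hypothetical record, invented for illustration: one person with a
# variable number of emails and orders, each order with its own items.
customer = {
    "name": "Ada",
    "emails": ["ada@example.com", "ada@work.example.com"],
    "orders": [
        {"id": 1, "items": ["tea", "cake"]},
        {"id": 2, "items": ["coffee"]},
    ],
}


def flatten(record):
    """Force the nested record into rectangular rows, one per order item."""
    rows = []
    for order in record["orders"]:
        for item in order["items"]:
            rows.append(
                {
                    "name": record["name"],
                    # Two emails do not fit one cell, so they get stuffed together.
                    "emails": ";".join(record["emails"]),
                    "order_id": order["id"],
                    "item": item,
                }
            )
    return rows


rows = flatten(customer)
print(json.dumps(customer, indent=2))  # the natural, nested shape
for row in rows:
    print(row)  # the same data squeezed into rectangles
```

The nested form says everything once. The rectangular form repeats the name and emails on every row and hides a list inside a string.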
2. Data is someone else’s problem
This is wrong: data is everyone’s problem. Data represents such a huge opportunity for everyone, individually and collectively, that if you don’t take responsibility for data, someone else will and they will win.
Data only living in rectangles, and data being someone else’s problem, is lame.
People default to the old tools they know. A lot of these tools were developed in the 70s and 80s and have only been developed incrementally since. They are often great for doing one particular thing, but they soon stop being fit for purpose in the face of more modern challenges.
Take relational databases for example; a great tool for managing a reasonable amount of data in rectangles using SQL. They are the backbone of tightly designed systems that exist in a world that doesn’t change. This isn’t the world we live in now. We shouldn’t use these old tools where they are not fit for purpose.
Old tools are lame.
Bad data citizens
This means you. This means all of us. We all do things, accidentally or through laziness or ignorance, that make us bad data citizens. Data is the responsibility of all of us and we can do better. Not taking responsibility for data is the root of much of the behaviour of a bad data citizen. Bad data citizens make data hard.
Below I discuss five bad behaviours of bad data citizens.
I use Ada Lovelace here as an avatar for all programmers. She was the first programmer and worked with Charles Babbage. She was the first person to fully understand the power of giving a structured set of instructions to a machine. She discussed the possibility of writing a symphony in this way. The famous first computer bug, a real insect, came about a century later: a moth found in a relay of the Harvard Mark II by Grace Hopper’s team.
Programmers are always working with data. It is arguable that every piece of software is data software. But many programmers just don’t respect data as they should.
CSV files are often the first time that many programmers meet data. They are simple, tabular and text based; and they also have a standard.
To quote just a small part of that standard: “Fields containing a line-break, double-quote, and/or commas should be quoted. (If they are not, the file will likely be impossible to process correctly).” Note how clearly this states that if you do not follow this part of the standard ‘the file will likely be impossible to process correctly’. This means that instead of data you are just creating junk.
So I would expect this kind of result from a production system:
Why then, from a real-life production system, did I receive this?
It is an easy thing to test when building the code to generate this and it is obviously wrong. This file is now impossible to process correctly. This programmer was just lazy and was not taking responsibility for the data they were working with.
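To show how little effort the quoting rule takes, here is a minimal sketch using Python’s standard `csv` module. The rows are hypothetical and are not taken from the production system mentioned above. The sketch writes fields containing a comma, a double quote and a line break, reads them back, and then shows what happens when the fields are simply joined with commas.

```python
import csv
import io

# Hypothetical rows: the values contain a comma, a double quote and a line break.
rows = [
    ["id", "name", "note"],
    [1, "Smith, John", 'Said "hello"'],
    [2, "Jones", "line one\nline two"],
]

# csv.writer quotes any field that needs it (QUOTE_MINIMAL is the default).
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)

# Round trip: a reader gets back exactly the fields that went in (as strings).
parsed = list(csv.reader(io.StringIO(text, newline="")))

# The lazy approach: join on commas with no quoting at all.
naive_line = ",".join(["1", "Smith, John", 'Said "hello"'])
naive_parsed = next(csv.reader([naive_line]))
print(naive_parsed)  # the name has been split into two fields
```

The quoted file survives the round trip. The naive line comes back with four fields instead of three, and nothing in the file says which comma was meant as a separator.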
ISO 8601 is another standard worth mentioning. It has also been covered by XKCD (image below). Numerical dates are written differently around the world. This was clear in the 80s and is still true now, but so many production data systems ignore it that programmers obviously need to be reminded.
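A short sketch of the problem, using only Python’s standard `datetime` module. The date string is a made-up example. Read with a day-first convention and a month-first convention, it gives two different days, while the ISO 8601 form has a single reading.

```python
from datetime import date, datetime

# One hypothetical string, read under two regional conventions.
ambiguous = "03/04/2015"
as_day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()    # 3 April 2015
as_month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()  # 4 March 2015

# ISO 8601 calendar dates are always year-month-day, so there is one reading.
iso_text = as_day_first.isoformat()
round_trip = date.fromisoformat(iso_text)

print(as_day_first, as_month_first)  # two different days from one string
print(iso_text)                      # 2015-04-03

# A side benefit: ISO 8601 date strings sort in chronological order as plain text.
print(sorted(["2015-12-01", "2015-02-10", "2014-07-30"]))
```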
4. Overly constrained data systems
The image below is from a local café. They had a refit and the designer specified a counter tightly coupled to the card reader system in use at the time. It had a separate contactless and chip-and-pin reader. Soon after the new counter was implemented the shop decided to change the card reader to use one unit for both contactless and chip-and-pin.
This means the counter now has a null and ‘column stuffing’.
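The same thing happens in data systems. The sketch below is a loose analogy with invented names (`old_counter`, `new_counter`, `flexible_counter` and `units_supporting` are all hypothetical): a schema with one fixed slot per reader type ends up with a null and a stuffed column when the hardware changes, while a less constrained shape absorbs the change.

```python
# Hypothetical fixed schema mirroring the old counter: one slot per reader type.
old_counter = {"contactless_reader": "unit A", "chip_and_pin_reader": "unit B"}

# The combined reader does not fit the schema: one slot becomes a null and the
# other is stuffed with a value that now means two things at once.
new_counter = {"contactless_reader": None, "chip_and_pin_reader": "unit C (both)"}

# A less constrained shape: any number of readers, each listing what it supports.
flexible_counter = {
    "readers": [{"unit": "unit C", "supports": ["contactless", "chip_and_pin"]}]
}


def units_supporting(counter, method):
    """Return the units in a flexible counter that accept a payment method."""
    return [r["unit"] for r in counter["readers"] if method in r["supports"]]


print(units_supporting(flexible_counter, "contactless"))
print(units_supporting(flexible_counter, "chip_and_pin"))
```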
5. Data Illiterates
Data is everyone’s responsibility so I am next going to discuss the behaviour of people who are bad data citizens through apparent data illiteracy.
I have pulled together, in the image below, a whole collection of examples of bad behaviour. I showed this to some colleagues and they asked me who had made it. They had seen things like this before and didn’t realise that it was fake. That is justification enough for this example.
There are many different things wrong with this sheet, but let’s focus in on one: what does orange mean? That’s right, orange means Tuesday! I am sure that you can see the internal logic here if you work hard enough. But this really is the behaviour of a bad data citizen. They have not considered this sheet to be data and have no empathy for the next person (or machine) who might have to use it.
No data empathy
This data illiterate, however, is being friendly and is offering to give you some data. While, in principle, this may be a good thing, let’s consider what is actually going to happen.
The diagram above is called ‘the data journey’ and I will explain it in more detail shortly. You can see, marked in red, that Data and Information are two very separate things. Information is a carefully selected and presented portion of data designed to encourage learning and knowledge in the recipient. Data, on the other hand, is a valuable resource that is everyone’s responsibility.
So what does this person give you? A PDF. Oh.
PDF is not a data format. It is an information format. It can be very useful when you want to carefully curate and control the information you deliver. It’s closed and fixed and designed not to change. Putting data in a PDF makes data hard. It is the behaviour of a bad data citizen.
The Five Lame Things:
- Disrespectful Programmers.
- CSV: ignoring the standard and making data unusable.
- ISO 8601: ignoring it and reinventing the wheel time and time again.
- Overly constrained systems: nulls and column stuffing.
- Data Illiterates:
  - Disrespecting data;
  - No data empathy.
Now we have seen a catalogue of issues and the behaviour of bad data citizens that makes data hard. Let’s talk about…
Making data easy
If we want to make data easy, we need to be open to new thinking, put data in a new context, be open to new tools and become good data citizens. In this section I lay out some new ways of thinking about data, suggest a frame for working out where to start and show how new tools can fit in. Finally I give some examples of good data citizens.
We need to reject old thinking and be open to new ideas. We need to think in new ways and allow the development of new techniques and paradigms.
The data journey
Below is the cycle we saw earlier when discussing the behaviour of data illiterate people who don’t know the difference between data and information.
Taking each element in turn.
Action is the things you do, things you want others to do or the things that are happening that you wish to monitor. With the explosion of sensors, both physical and digital, we are able to measure so much!
Data is the exhaust of action. All of these sensors produce data in some form, from binary interpretations of analogue signals from a physical sensor to JSON documents in social media. Data is a valuable resource.
Information is a selected portion of data designed to encourage learning. Information is often aggregated, manipulated and rendered in very specific ways.
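A minimal sketch of the step from data to information, assuming some hypothetical temperature readings (the sensor names, values and the `average_by_sensor` helper are invented for illustration). The raw readings are the data. The per-sensor average is one selected, aggregated view of them, shaped for a person to read.

```python
from statistics import mean

# Hypothetical raw data: (sensor, hour of day, temperature in Celsius).
readings = [
    ("kitchen", 9, 19.0),
    ("kitchen", 10, 21.0),
    ("office", 9, 18.0),
    ("office", 10, 18.5),
    ("office", 11, 19.0),
]


def average_by_sensor(data):
    """Select and aggregate: one average temperature per sensor."""
    grouped = {}
    for sensor, _hour, temperature in data:
        grouped.setdefault(sensor, []).append(temperature)
    return {sensor: mean(values) for sensor, values in grouped.items()}


information = average_by_sensor(readings)
for sensor, average in sorted(information.items()):
    print(f"{sensor}: {average:.1f} C on average")
```

The summary can always be rebuilt from the readings, but the readings cannot be rebuilt from the summary. That is why the data itself needs looking after.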
Knowledge and learning are now a strategic imperative for all organisations. Organisations that don’t learn will cease to exist. Data, carefully curated as information, is a powerful source of knowledge. Why do we need knowledge and learning? To make better decisions! Decisions lead to action and the cycle can continue again. The faster you can make this cycle turn, the faster you will learn and the faster your decisions and actions will get better.
Whatever action you want to take in the world, if you want to make the world a happier place, make some money or find new ways to feed the hungry, there is data that can help you. There are tools to make this data into information to give new opportunities for learning and new knowledge.
It is likely that if you have been exposed to the world of Big Data, you will have met Gartner’s 3Vs of big data: volume, velocity and variety. If not, then they are worth looking up.
In a previous article I offered a different frame for thinking about data: the 3Cs of Chaordic, Connectionist and Consilient data.
- A Chaordic data system has the right balance of chaos and order. There is neither ‘too much’ order nor ‘too much’ chaos.
- A Connectionist data system focuses on individuals, their differences and connections, not artificially defined sets.
- A Consilient data system works with different data sets from different sources flowing seamlessly together like water.
I was recently on holiday in Japan with my wife and son. We all enjoy a bit of history and were visiting a Ryukyu castle in Okinawa. This castle had a boundary wall.
If you look at the wall as it appears in this picture, it presents order. It flows with the landscape and presents seemingly straight lines to the eye. If you look more closely, however, you can see the structure is extremely chaotic.
If we look again, we can see that the connection of every stone has been considered. The wall is a connectionist system. Also, all the stones flow together, regardless of source, to make a robust structure. The wall struck me as a great way of considering how a good data system should be.
If we come back to the data journey and break it down further, we can consider all of these things to belong to you. Your sources, your data, your queries, your information and your learning; all leading to your knowledge, your decisions and your action.
All of these things are your responsibility. Regardless of who helps you along the way or what tools you use, it’s up to you.
And if you don’t take responsibility for this data, someone else will and they will win!
For me, one of the best things about the Big Data industry is that it has given people working with data permission to create and try new tools. Almost every week new tools become available to tackle some aspect of the data journey. All of these tools are, in some way, trying to make data easier. At least, they should be.
There are new tools in all areas:
This could be considered an overwhelming jungle. So we need a frame to help us put these tools into perspective.
While the data journey is a cycle, let’s flatten it out and show explicitly where people, and by people I mean YOU, fit. As you might expect, right in the middle. People, individually or in organisations, are the ones who take actions in the world. They do this, as we have seen, by gathering knowledge through learning and making decisions. It is very important that people know where their knowledge comes from.
The study of ‘what you know and how you know it’ is called Epistemology. Within business this is sometimes hidden behind ‘knowledge management’ but it is always there. All of the new tools for working with data exist to help us know more and understand how we know it: everything from tools for collecting data, through visualising and analysing it, to decision support and automated action tools.
When approaching the jungle of tools available, always ask how they can help you as you are taking responsibility for the data in your organisation. You will soon find out which tools only exist for their own sake.
There is a technique called Data Landscaping, which is very useful for finding a place to start in understanding and improving your organisational epistemology.
Good Data Citizens
There is a very simple split between good data citizens and bad data citizens. Good data citizens make data easier and bad data citizens make data harder.
Let’s consider some examples of good data citizens. None of these examples are perfect, and I am sure you can find bad things they do, but they are all trying to make the world of data a better place. And for that, they get my thumbs up.
These first four represent the government open data movement: governments making data available for anyone to use.
This next set represents people working hard to make data available or use it for the benefit of us all.
- datacoup.com: this example deserves special attention. They are empowering individuals to sell their data to companies that want it. Real money for individuals. Great work!
This penultimate set is working to make data more accessible, either by cleaning up public data or by liberating data for the individual from within their giant and popular systems.
And finally, the best data citizen of them all. DataKind are working to bring data skills from enterprise to the third sector, helping charities to use data to do good in the world.
Let’s now look at how bad data citizens make data hard and good data citizens make data easy.
A bad data citizen:
- Doesn’t care about others
- Thinks data is someone else’s problem
- Uses bad formats for data
A good data citizen:
- Has empathy for others
- Takes responsibility for data
- Uses good data formats
A bad data citizen:
- Ignores standards
- Ignores everything but the immediate
- Defaults to old tools
A good data citizen:
- Respects standards
- Thinks about the bigger picture
- Seeks out suitable tools
A bad data citizen:
- Demands strict order
- Only considers rectangles
- Closed thinking
A good data citizen:
- Embraces useful chaos
- Respects variety
- Open thinking
No matter who you are, from event organisers in Scandinavia, to scientists in the desert, to enterprise software engineers in Redmond, to someone breaking cars in Dorset…
So don’t be lame, be a good data citizen!