Everyone tends to focus on the flashy things you can do with data. But data onboarding and storage are fundamental to the headline-grabbing stuff. Trust in data is paramount. Unless you know your data has been ingested and stored correctly, you can’t trust the insights it provides. That’s why you need to start with the right processes to onboard and store it.
Data onboarding and storage are a critical part of getting AI-ready. AI models need to be trained on data. If the data is poor quality, you cannot trust the outputs of the AI. Build a robust platform that allows you to carry out complex activities such as training an AI. A clear pipeline of trustworthy data that the AI can read will set you up for long-term success. So, don’t rush into the clever stuff without doing the basics first.
The term ‘data lake’ has become as ubiquitous as data itself. Go to any data-focussed conference and you’re sure to see a few data lake solutions on show. Everyone has a different interpretation of a data lake, and it can often come with a lot of baggage.
I prefer the term ‘data platform’. What most people want is a platform that enables them to not only store a wide range of data but also access and use it in a wide variety of different ways. Data platforms encompass this.
As for data warehouses, there used to be a couple of design approaches in common use, Kimball and Inmon among them. In some circles, whichever approach you chose became something of a religion (there were Kimball followers and Inmon fans).
In the past, there was a great need to be efficient in your data storage design. Now, with the increased computing power available, that need isn’t as critical. Some things that were complete no-nos in the past, like holding the same data in more than one place, are sometimes acceptable. If the use case calls for it, and you have enough storage, then why not?
Storage-wise, we’ve never had more variety in off-the-shelf options. With so many different systems available, there isn’t a one-size-fits-all solution for organisations. Instead, you can choose your system based on your use cases.
In fact, we recommend this approach. Look at your use cases first, then build your storage solutions from that.
Most organisations are likely to be well served by a core, tabular-style relational database. The majority of organisational data is in this format. Plus, the skills needed for dealing with this database technology are widely available, so you won’t have to invest too much in hiring specialist team members. Consider this as a starting point. As your use cases get more advanced, begin to explore other storage solutions.
One example where a different style of storage is needed is mapping out relationships between people. Take an organisation that wishes to map out its internal talent: which departments speak to each other, who in a team solves the most problems, and where common failure points occur. A graph database, which stores these connections directly and makes them easy to query and visualise, is the best bet in this scenario.
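To make the idea of relationship data concrete, here is a minimal sketch in Python. It is not a graph database, only an in-memory illustration of the same shape of data, and the department names and the helpers build_graph and most_connected are made up for the example.

```python
from collections import defaultdict


def build_graph(interactions):
    """Turn (a, b) pairs into an undirected graph: node -> set of neighbours."""
    graph = defaultdict(set)
    for a, b in interactions:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def most_connected(graph):
    """Return the node with the most distinct neighbours.

    Nodes are visited in alphabetical order, so ties go to the
    alphabetically first name.
    """
    return max(sorted(graph), key=lambda node: len(graph[node]))


# Hypothetical data: which departments speak to each other.
interactions = [
    ("Finance", "Sales"),
    ("Finance", "IT"),
    ("IT", "Sales"),
    ("IT", "HR"),
]

graph = build_graph(interactions)
hub = most_connected(graph)
print(hub, sorted(graph[hub]))
```

In a relational database, questions like "who is two steps away from HR?" tend to need repeated self-joins. A graph store keeps the connections themselves as first-class data, which is why it suits this kind of use case.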
As the complexity of use cases and their required storage increases, the pool of people you can recruit to deal with it shrinks. This makes certain projects more costly than the ones you first begin with. When you reach this point, you must recognise which projects are worth investing in, and which should be put on hold. Do a cost/benefit analysis of each and every use case.
As well as considering storage solutions, you must build efficient data onboarding routines. If these are wrong, your delivery speed will suffer along with the trust in your data. As a start you should:
It’s also important to note that the old sequence of extract, transform, then load is no longer the most efficient approach. From a trust and quality standpoint, you should not transform your data as you ingest it. Keep it in its raw format and transform it afterwards. This way you can always go back and see exactly what it looked like when you loaded it.
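The load-first, transform-later pattern can be sketched as follows. This is a minimal illustration using Python’s built-in sqlite3 module, not a recommendation of any particular tool. The table names (raw_events, customers_clean), the record fields and the cleaning rules are all assumptions made up for the example.

```python
import json
import sqlite3


def load_raw(conn, source, records):
    """Store each record exactly as received, with no cleaning applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_events ("
        "id INTEGER PRIMARY KEY, "
        "source TEXT, "
        "loaded_at TEXT DEFAULT CURRENT_TIMESTAMP, "
        "payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_events (source, payload) VALUES (?, ?)",
        [(source, json.dumps(record)) for record in records],
    )
    conn.commit()


def transform(conn):
    """Rebuild the cleaned table from the raw one; raw rows are never modified."""
    conn.execute("DROP TABLE IF EXISTS customers_clean")
    conn.execute(
        "CREATE TABLE customers_clean (raw_id INTEGER, name TEXT, country TEXT)"
    )
    rows = conn.execute("SELECT id, payload FROM raw_events").fetchall()
    for raw_id, payload in rows:
        record = json.loads(payload)
        conn.execute(
            "INSERT INTO customers_clean (raw_id, name, country) VALUES (?, ?, ?)",
            (raw_id, record["name"].strip().title(), record["country"].strip().upper()),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_raw(conn, "crm", [{"name": "  ada lovelace ", "country": "gb"}])
transform(conn)
```

Because the cleaned table keeps a raw_id pointing back at the untouched row, you can change the cleaning rules later and rebuild, or check any cleaned value against what was originally loaded.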
There are a few options for different ingestion tools:
Without investment in data onboarding and storage, your data projects will falter. You want to be able to trust the data quality so that you can rely on the findings from using it. Before you start ingesting and storing data, consider your use cases.
Everything should stem from your use cases. This will tell you what data you need to collect, and the best storage solution for it. Developing good ingestion routines and data storage sets your organisation up for the future. If you want to use data, you need to trust it. That starts with data onboarding and storage.