Lessons data scientists can learn from extract, transform and load of yore – and apply to tomorrow's machine learning
Intel walks us through its AI-tuned technology and how it accelerates enterprise workloads
Sponsored Extract, transform and load (ETL) is, according to some, no longer the de facto database process it once was. In the age of artificial intelligence (AI) and machine learning (ML), it’s argued this industry staple of enterprise data should be replaced by more modern approaches. Some believe fast and efficient in-memory databases are the answer.
AI and ML are growing in adoption: according to IDC, 75 per cent of enterprises are planning to embed some sort of AI or ML in their activities, with AI becoming integral to every part of the business by 2024. That means enterprises are going to need an awful lot of data to train the machines, which is why ETL is needed now more than ever. It’s a proven process, it provides structure, and AI and ML are a natural home for it.
A typical ML project can be split into multiple steps: collect the data, prepare the data, split the data into training and validation sets, train the model, test and validate the model, and finally deploy the software. Anyone who has had any experience of ETL in conventional data applications, such as RDBMSes, will recognise and be familiar with steps one and two. The final parts – train, test, validate, and deploy the model – are new areas.
Some phases of the ETL process can be time-consuming, especially in the world of AI and ML thanks to the huge volumes of data involved. That is why enterprises should embrace hardware and tools that can help reduce the crushing amounts of time spent by data scientists preparing their data.
So what exactly is ETL? It’s a three-step process to collect data from sources, convert the data into a format that’s suitable for the job at hand, and finally load the data into the required storage layer.
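The three steps can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: the CSV content, the `payments` table, and the conversion of pounds to pence are all assumptions made for the example, and an in-memory SQLite database stands in for the storage layer.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from files, APIs, or databases.
RAW_CSV = """customer,amount,currency
alice,10.50,GBP
bob,7.25,GBP
"""

def extract(text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy the name and convert the amount from pounds to whole pence."""
    return [(row["customer"].title(), round(float(row["amount"]) * 100)) for row in rows]

def load(records, connection):
    """Load: write the prepared records into the storage layer."""
    connection.execute("CREATE TABLE IF NOT EXISTS payments (customer TEXT, pence INTEGER)")
    connection.executemany("INSERT INTO payments VALUES (?, ?)", records)
    connection.commit()

connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
stored = connection.execute(
    "SELECT customer, pence FROM payments ORDER BY customer"
).fetchall()
print(stored)
```

Keeping each step as its own function means a source or a target can be swapped without touching the other two.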
Extraction represents around a fifth of a data project’s workload, and involves finding and importing data from multiple sources. These can be structured or unstructured, and housed on-premises or held in as-a-service applications in the cloud. Unstructured data today includes online video, audio from call centres, and social network chatter. The data may contain elements that make it possible to identify individuals, which is a problem that needs to be tackled if the destination application is not allowed to handle personally identifiable information, as might be the case in the fields of financial and medical research.
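One common way to tackle the identification problem is to pseudonymise the sensitive fields as they are extracted. The sketch below assumes hypothetical field names (`name`, `email`) and a placeholder secret key, and it replaces each sensitive value with a keyed hash from Python's standard `hmac` module.

```python
import hashlib
import hmac

# Hypothetical field names and key; a real key would be managed outside the code.
PII_FIELDS = {"name", "email"}
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymise(record):
    """Replace each PII field with a keyed hash; leave all other fields untouched."""
    cleaned = {}
    for field, value in record.items():
        if field in PII_FIELDS:
            digest = hmac.new(SECRET_KEY, str(value).encode("utf-8"), hashlib.sha256)
            cleaned[field] = digest.hexdigest()[:16]  # shortened token for readability
        else:
            cleaned[field] = value
    return cleaned

patient = {"name": "Jane Doe", "email": "jane@example.com", "blood_pressure": 120}
masked = pseudonymise(patient)
print(masked)
```

The same input always produces the same token, so records can still be joined across sources. Pseudonymised data may still count as personal data under some regulations, so treat this as one building block rather than a complete answer.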
Crucially, during the extraction phase, you can begin to ascertain whether data is missing, wrong, or duplicated.
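A simple audit pass at extraction time might look like the following sketch. The `audit` function, the sample rows, and the rule for a plausible age are all hypothetical; the point is only to show missing, wrong, and duplicated records being counted before any transformation starts.

```python
def audit(rows, required, validators):
    """Count rows that repeat an earlier row, lack a required field, or fail a validity check."""
    report = {"missing": 0, "invalid": 0, "duplicate": 0}
    seen = set()
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicate"] += 1
            continue
        seen.add(key)
        if any(row.get(field) in (None, "") for field in required):
            report["missing"] += 1
        elif any(not check(row[field]) for field, check in validators.items()):
            report["invalid"] += 1
    return report

def plausible_age(value):
    """Hypothetical rule: an age is a whole number between 1 and 129."""
    return value.isdigit() and 0 < int(value) < 130

rows = [
    {"id": "1", "age": "34"},
    {"id": "2", "age": ""},      # missing value
    {"id": "3", "age": "-5"},    # implausible value
    {"id": "1", "age": "34"},    # exact duplicate of the first row
]
report = audit(rows, required=["id", "age"], validators={"age": plausible_age})
print(report)
```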
Transformation takes the most time of the ETL data triumvirate, typically 60 per cent of the project. The reason? Again, the data will either be in an unstructured format that needs structure added, or it will be dirty and must be cleaned. A survey of Fortune 500 companies by the Irish Management Institute found that just three per cent of data meets basic quality standards: records are typically missing, and the data is duplicated, incomplete, out-of-date, misspelled, or in the wrong fields. That, in a nutshell, is dirty data.
Cleaning this data is important, as the adage of “garbage in, garbage out” couldn’t be truer when it comes to AI and ML. The data you use to train the application must be free of dirt, otherwise your machines will learn incorrectly. That will either delay your pilot’s successful completion, as you seek to correct the outputs, or carry errors into production should they go undetected.
Transforming requires the data to be cleaned. What does cleaning look like? It may involve replacing specific figures with mean values, eliminating inconsistencies such as incorrectly spelled names, and removing duplicates. The data may also need more complex transformations, including weighting, aggregation, filtering, or sampling, and processing through algorithms. After that lot, it can be loaded.
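As a rough illustration of those cleaning steps, the sketch below normalises names, removes duplicates, and fills gaps with the mean. It assumes that the figures being replaced are missing ones (stored as `None`), and it uses a hypothetical lookup table for known misspellings; real projects often use fuzzy matching instead.

```python
from statistics import mean

# Hypothetical lookup of known misspellings.
SPELLING_FIXES = {"jonh smith": "john smith"}

def clean(rows):
    """Normalise names, drop duplicate records, then fill missing spend with the mean."""
    deduped, seen = [], set()
    for row in rows:
        name = " ".join(row["name"].lower().split())  # lower-case, collapse whitespace
        name = SPELLING_FIXES.get(name, name)
        key = (name, row["spend"])
        if key not in seen:
            seen.add(key)
            deduped.append({"name": name, "spend": row["spend"]})
    # Compute the mean after deduplication so repeated rows do not skew it.
    known = [row["spend"] for row in deduped if row["spend"] is not None]
    fill_value = mean(known)
    for row in deduped:
        if row["spend"] is None:
            row["spend"] = fill_value
    return deduped

raw = [
    {"name": "John Smith", "spend": 100.0},
    {"name": "Jonh  Smith", "spend": 100.0},  # misspelled duplicate
    {"name": "Ann Lee", "spend": None},       # missing figure
    {"name": "Raj Patel", "spend": 200.0},
]
cleaned = clean(raw)
print(cleaned)
```

The order matters here: deduplicating before imputing keeps the repeated record from being counted twice in the mean.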
The transformation process for unstructured social network traffic involves a considerable amount of work. Social media data is about the dirtiest data you can get: things can be misspelled, contain irony and sarcasm – something machines are not good at detecting – and be punctuated by emojis, animations, and so on.
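A first pass over social media text might look like the sketch below. The rules are assumptions chosen for illustration: links are dropped, account handles are masked, characters in the Unicode "So" (other symbol) category, which covers most emoji, are removed, and stretched letters are shortened. It does nothing about irony or sarcasm.

```python
import re
import unicodedata

def normalise_post(text):
    """Apply a few illustrative clean-up rules to one social media post."""
    text = re.sub(r"https?://\S+", " ", text)    # drop links
    text = re.sub(r"@\w+", "@user", text)        # mask account handles
    text = "".join(ch for ch in text if unicodedata.category(ch) != "So")
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # runs of 3+ identical characters become 2
    return " ".join(text.lower().split())        # lower-case and collapse whitespace

# Hypothetical post; \U0001F602 is the "face with tears of joy" emoji.
sample = "Sooooo good \U0001F602\U0001F602 thx @Intel_Fan!!! https://example.com/x"
print(normalise_post(sample))
```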
Loading data into your system is where you can really embrace new tools and get some advantages by choosing hardware that’s optimised for AI and ML software, languages, libraries, and frameworks.
For years the main tools for machine learning were R (the statistical analysis part of the GNU open-source project) and Python with the Theano, NumPy, SciPy, Scikit-learn, Matplotlib, and Seaborn libraries. However, a 2019 TDWI Research report found that there now exist a number of other tools and frameworks that are growing in popularity. These include native toolsets from the various cloud service providers offering AI or ML as a service.
Google’s popular TensorFlow has also been optimised for Intel®'s Math Kernel Library for Deep Neural Networks primitives – a performance library used in deep learning. Like Theano, TensorFlow has been tuned to Intel® Xeon® processors. Intel® also brings a number of other frameworks to developers through the Intel® nGraph Compiler. Additionally, the BigDL deep learning library for Apache Spark was created by Intel® for deep-learning applications written in Python. These, and more, are available to those getting started through the Intel® DevCloud.
It’s here that you can gain a real edge if your software model has been optimised for its host system, including its microprocessors. Intel® has continued to optimise its processor technology to accelerate standard AI processes: its Xeon® Scalable Processors now deliver up to 30 times faster image inference, along with increasing computation speeds on more general-purpose AI workloads.
In addition, Intel® has worked on optimisations of key software toolkits – including the Theano Python library, and the Intel® Math Kernel Library and BigDL – that can be utilised by your existing AI assets to speed up execution. China UnionPay has recently used BigDL for a new neural network risk-control application based on Cloudera CDH and Apache Spark, and was able to significantly reduce development time and boost accuracy on its compute clusters by up to 60 per cent.
Training and testing form the final part of the ML process. As mentioned, this is outside the scope of traditional ETL. So how do you proceed? It's a good idea to test on a subset of your data, but which part? The standard rule for splitting data is 80 per cent for training and 20 per cent for testing, but that 20 per cent ideally needs to be representative of all of the data.
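One way to keep the held-out 20 per cent representative is a stratified split, which takes the same fraction from each label group. The sketch below is a minimal standard-library version under stated assumptions: the `stratified_split` helper and the fraud data set are hypothetical, and libraries such as Scikit-learn offer their own equivalents.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Hold out test_fraction of each label group so the test set mirrors the label mix."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(groups):
        members = groups[label]
        rng.shuffle(members)
        cut = round(len(members) * test_fraction)
        test.extend(members[:cut])
        train.extend(members[cut:])
    return train, test

# Hypothetical data set: 20 fraudulent and 80 legitimate transactions.
rows = [{"id": i, "label": "fraud" if i < 20 else "ok"} for i in range(100)]
train, test = stratified_split(rows, label_of=lambda row: row["label"])
print(len(train), len(test))
```

A purely random 80/20 cut could, by chance, leave very few fraud cases in the test set; splitting within each group avoids that.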
However you choose to make the cut, there’s no getting away from the fact that training and testing are processor intensive. You build your test application, run it, examine the results to see if they match your expectations, and if they don’t then you tweak and start the process again.
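That run, examine, and tweak cycle can be shown with a deliberately tiny example. Everything here is hypothetical: the "model" is a single threshold on a transaction amount, the held-out samples are invented, and each loop iteration stands in for one round of adjusting and rerunning.

```python
def accuracy(threshold, samples):
    """Share of samples where 'amount >= threshold' matches the true label."""
    hits = sum((amount >= threshold) == is_fraud for amount, is_fraud in samples)
    return hits / len(samples)

# Hypothetical held-out samples: (transaction amount, is_fraud).
held_out = [(5, False), (20, False), (35, False), (60, True), (80, True), (45, True)]

target = 0.95
history = []
for threshold in (10, 30, 50, 40):  # each pass is one tweak and rerun
    score = accuracy(threshold, held_out)
    history.append((threshold, score))
    if score >= target:
        break  # results match expectations, so stop iterating

print(history[-1])
```

With a real model each pass can take minutes or hours rather than microseconds, which is why the speed of the underlying hardware matters.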
So, the faster you can complete the process and the processing the better. This is where processors such as the Intel® Nervana™ Neural Network Processor (NNP), which has been designed from the ground up for AI, can save time and reduce costs in your AI application. When it comes to storage, you also have Intel® Optane™ SSDs deployed in the data centre that provide the infrastructure to manipulate bigger data sets faster and more cost effectively.
Run smarter and faster
You will hear plenty about the death of ETL. AI and ML are new representations of the data-heavy applications of past eras. Today’s applications are faster and the data bigger. Those differences aside, the underlying principles of sourcing and preparing data remain unchanged – enter ETL.
What is new compared to ETL’s past is the choice of tools and technologies available to support this process. Optimisation of the AI and ML frameworks and tools to the hardware will help ensure the benefits of ETL are realised faster for these big and demanding applications.
Sponsored by Intel®.