The pipeline starts with the project planning phase, proceeds to the data processing phase, and culminates in model deployment and productionization. Fullstack AI supports the pipeline by enabling the seamless integration of each component.
This is why Fullstack AI is a must for organizations that aspire to run a similar pipeline.
In this phase the organization has to identify the problem to be solved. Normally this is determined based on the client’s pain points. Apart from that, this is the phase where the resources and timeline have to be set. The definition of the scope of the project is critical because it is from there that decision-makers can decide whether or not a project is technically feasible.
The first stop in the pipeline is data collection. This is relevant in business because virtually any kind of data, regardless of format, can be fed into the pipeline. For instance, for a ride-sharing company such as Uber, the raw data would normally include the trip path, trip type, trip length, amount paid, and so on.
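To make the ride-sharing example concrete, here is a minimal sketch of how one raw trip event might be turned into a typed record at collection time. The field names, the units, and the `parse_trip` helper are assumptions made for illustration, not a real Uber schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TripRecord:
    """One raw ride-sharing trip, mirroring the fields named in the text."""
    trip_path: List[Tuple[float, float]]  # ordered (latitude, longitude) points
    trip_type: str                        # hypothetical values such as "standard" or "pool"
    trip_length_km: float                 # unit is an assumption
    amount_paid: float


def parse_trip(raw: dict) -> TripRecord:
    """Convert a loosely typed raw payload into a validated TripRecord."""
    record = TripRecord(
        trip_path=[(float(lat), float(lon)) for lat, lon in raw["trip_path"]],
        trip_type=str(raw["trip_type"]).strip().lower(),
        trip_length_km=float(raw["trip_length_km"]),
        amount_paid=float(raw["amount_paid"]),
    )
    if record.trip_length_km < 0 or record.amount_paid < 0:
        raise ValueError("trip length and amount paid must be non-negative")
    return record


# A hypothetical raw event, with everything arriving as strings.
raw_event = {
    "trip_path": [["14.5995", "120.9842"], ["14.6091", "121.0223"]],
    "trip_type": " Standard ",
    "trip_length_km": "5.2",
    "amount_paid": "3.75",
}
trip = parse_trip(raw_event)
```

Normalizing types this early means every later stage of the pipeline can rely on the same record shape.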
Data labelling involves annotating the data so that it can be used for training and testing purposes. This is commonly needed for supervised ML, where the model learns by looking at labelled examples and is later tested against unseen data.
Traditionally, data labelling is a resource-intensive endeavour. However, a company called Snorkel specializes in automating data labelling at an industrial scale. They use Snorkel DryBell, a weak supervision management system that is built on the Snorkel framework. Weak supervision generates labels programmatically from noisy heuristics rather than by hand. In just tens of minutes, Snorkel DryBell is able to label millions of data points.
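The idea behind weak supervision can be sketched in a few lines. The example below does not use the Snorkel API. It is a simplified stand-in in which several hand-written heuristics ("labeling functions") vote on each example and a plain majority vote picks the label, whereas Snorkel itself combines the votes with a learned label model. The function names, keywords, and label values are all hypothetical.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

ABSTAIN = None  # a labeling function returns None when it has no opinion

LabelingFunction = Callable[[str], Optional[str]]


def lf_mentions_refund(text: str) -> Optional[str]:
    return "complaint" if "refund" in text.lower() else ABSTAIN


def lf_mentions_late(text: str) -> Optional[str]:
    return "complaint" if "late" in text.lower() else ABSTAIN


def lf_mentions_thanks(text: str) -> Optional[str]:
    return "praise" if "thank" in text.lower() else ABSTAIN


def weak_label(text: str, lfs: Sequence[LabelingFunction]) -> Optional[str]:
    """Majority vote over the labeling functions; abstain on no votes or a tie."""
    votes = [vote for vote in (lf(text) for lf in lfs) if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics, leave unlabeled
    return ranked[0][0]


LFS = [lf_mentions_refund, lf_mentions_late, lf_mentions_thanks]
labels = [
    weak_label(text, LFS)
    for text in [
        "The driver was late and I want a refund",
        "Thanks for the smooth ride",
        "Trip completed",
    ]
]
```

Because each heuristic is just a function, the same set can be applied to millions of unlabeled rows without any manual annotation, at the cost of noisier labels.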
From there, the raw data can be stored in the company’s designated data lake. Common data lake solutions in GCP include Google Cloud Storage and Cloud Bigtable. This step is very important because data access should ideally have low latency. This is especially important if the data is streaming into the pipeline, that is, if every new piece of data is processed on the fly.
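Processing data "on the fly" means updating results one event at a time instead of waiting for a full batch. A minimal sketch, assuming each incoming event is a dictionary with a hypothetical `amount_paid` field:

```python
from typing import Dict, Iterable, Iterator


def running_average_fare(events: Iterable[Dict[str, float]]) -> Iterator[float]:
    """Yield the updated average fare after each incoming event."""
    count = 0
    total = 0.0
    for event in events:
        count += 1
        total += event["amount_paid"]
        yield total / count


# Stand-in for a live stream; in production this would be a message queue consumer.
stream = iter([{"amount_paid": 4.0}, {"amount_paid": 6.0}, {"amount_paid": 8.0}])
averages = list(running_average_fare(stream))
```

The generator keeps only a count and a running total in memory, which is why this style of processing depends more on access latency than on storage size.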
Most cloud platforms have their own storage solutions. Google Cloud Platform offers Google Cloud Storage, one of the most commonly used cloud storage solutions to date. It is designed for 99.999999999% (eleven nines) annual durability and features multi-region support, easy integration with other cloud services, and robust Identity and Access Management (IAM).
Its AWS counterpart is Amazon S3, whereas its Microsoft Azure counterpart is Azure Blob Storage. They have more or less the same features but with slightly different pricing.
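To get a feel for what a durability figure means, it can be converted into an expected number of objects lost per year. The sketch below assumes the published design target of 99.999999999% (eleven nines) annual durability and, as a simplification, treats it as an independent per-object survival probability. The helper name is hypothetical.

```python
from fractions import Fraction


def expected_annual_losses(object_count: int, durability_percent: str) -> Fraction:
    """Expected objects lost per year, using exact rational arithmetic."""
    loss_probability = 1 - Fraction(durability_percent) / 100
    return object_count * loss_probability


# Ten million stored objects at eleven nines of durability.
losses = expected_annual_losses(10_000_000, "99.999999999")
years_per_lost_object = 1 / losses
```

Under these assumptions, ten million objects correspond to one expected loss roughly every ten thousand years. `Fraction` is used instead of floats because subtracting numbers this close to 1 loses precision in floating point.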
Next, once the data has been stored, it has to be processed and analyzed by data scientists and analysts. This step can be done in various ways. Data scientists and analysts commonly leverage cloud services such as Google Compute Engine, GCP’s AI Platform Notebooks, BigQuery, etc.
Their findings and models can easily be housed in those services. However, if they prefer to do it on their own and in a completely customized fashion, that is also possible. They can leverage open-source technologies such as Python, Jupyter Notebook, and TensorFlow.
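As a small illustration of the customized route, the following sketch computes one simple insight, the average fare per kilometre for each trip type, using only the Python standard library. The sample rows and field names are made up for the example. In practice the same aggregation would more likely be a BigQuery query or a notebook cell.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Hypothetical sample of cleaned trip records.
trips = [
    {"trip_type": "standard", "trip_length_km": 5.0, "amount_paid": 10.0},
    {"trip_type": "standard", "trip_length_km": 10.0, "amount_paid": 15.0},
    {"trip_type": "pool", "trip_length_km": 4.0, "amount_paid": 4.0},
]


def fare_per_km_by_type(rows: List[dict]) -> Dict[str, float]:
    """Average fare per kilometre, grouped by trip type."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for row in rows:
        if row["trip_length_km"] > 0:  # skip zero-length trips to avoid dividing by zero
            grouped[row["trip_type"]].append(row["amount_paid"] / row["trip_length_km"])
    return {trip_type: mean(values) for trip_type, values in grouped.items()}


summary = fare_per_km_by_type(trips)
```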
Once the key insights and analyses have been extracted, it is time for the general public to see and experience what the company has worked on. Data output and reporting is the part where the solutions that have been developed are implemented. They can either be deployed to improve existing systems or presented directly to users (e.g. a chat assistant).
For purposes of Data Visualization, GCP’s Data Studio is an excellent tool to effectively communicate key insights extracted from the data. It is an interactive dashboard that can be connected with most Google services. The image above is a sample Data Studio template which describes Anthony Bourdain’s travel frequency across continents and countries throughout the seasons of his show.
At this stage, data scientists can develop prototypes to test the hypotheses they formed after extracting insights from the data. There are many models to choose from, and the right one depends on the type of problem being solved. For NLP, a commonly used model is Google’s BERT, a language model designed to understand context.
For computer vision, regardless of whether the problem subtype is image classification, object detection, pose estimation, etc., neural networks and their many variants are almost always the go-to model type.
After the data scientists have created a satisfactory prototype, it is up to the ML engineers to bring that prototype into production. They work together with the data engineers, who are in charge of infrastructure, and factors such as latency now come into play.
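One way to make latency visible before going to production is to time the prototype’s prediction call and look at percentiles rather than the average. This is a minimal sketch. `prototype_predict` is a hypothetical stand-in for a real model, and the choice of p50 and p95 is an assumption about what the team cares about.

```python
import time
from statistics import quantiles
from typing import Callable, Dict, Sequence


def measure_latency_ms(predict: Callable[[float], float],
                       inputs: Sequence[float]) -> Dict[str, float]:
    """Time each call to `predict` and summarize latency in milliseconds."""
    samples = []
    for item in inputs:
        start = time.perf_counter()
        predict(item)
        samples.append((time.perf_counter() - start) * 1000.0)
    cuts = quantiles(samples, n=100)  # 99 cut points: cuts[49] is p50, cuts[94] is p95
    return {"p50": cuts[49], "p95": cuts[94], "max": max(samples)}


def prototype_predict(x: float) -> float:
    """Hypothetical stand-in for a trained model's predict call."""
    return 2.0 * x + 1.0


report = measure_latency_ms(prototype_predict, [float(i) for i in range(200)])
```

Tail percentiles such as p95 matter because a small share of slow requests can dominate how users experience the service.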
Again, AWS, GCP, and Azure are the key players when it comes to deployment to the cloud. All of them offer services built for scalability. Two related features make this possible: “load balancing” spreads incoming traffic across the available instances, while “autoscaling” adds or removes instances as demand changes. Together they ensure that resources are properly allocated for optimal performance and savings.
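An autoscaling decision can be sketched as a simple proportional rule: scale the number of instances by the ratio of observed load to target load. The formula below is the one documented for the Kubernetes Horizontal Pod Autoscaler and is used here only as an illustration. The autoscalers in each cloud have their own policies, and the function name and default limits are assumptions.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization_pct: int,
                     target_utilization_pct: int,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Proportional scaling rule, clamped to the allowed replica range."""
    if target_utilization_pct <= 0:
        raise ValueError("target utilization must be positive")
    raw = math.ceil(current_replicas * current_utilization_pct / target_utilization_pct)
    return max(min_replicas, min(max_replicas, raw))


scale_out = desired_replicas(4, 90, 60)   # overloaded: add instances
scale_in = desired_replicas(4, 30, 60)    # underused: remove instances
capped = desired_replicas(8, 95, 50)      # demand exceeds the configured maximum
```

The minimum and maximum bounds are what tie autoscaling back to savings: they keep the service available at low load and cap spending at high load.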
For instance, one might want to use a VM instance and run an Nginx web server on it so as to have a dedicated server in the cloud for a web application. Other cloud services can then easily be connected to the VM instance housing the web server. This enables the user to implement a plethora of auxiliary functions that augment the user experience.
This is a simple example of how an application can be transformed from a prototype or POC into a production-grade application.
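For a Python web application, a common arrangement is for Nginx to receive public traffic and forward it to an application process running on the same VM. The sketch below shows only the application side, as a minimal WSGI app with a health-check route. The `/health` path, the port, and the reverse-proxy arrangement itself are assumptions for illustration, and the Nginx configuration is not shown.

```python
import json
from typing import Callable, Iterable


def application(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """Minimal WSGI app that a front-end web server could proxy requests to."""
    if environ.get("PATH_INFO") == "/health":
        status, payload = "200 OK", {"status": "ok"}
    else:
        status, payload = "404 Not Found", {"error": "not found"}
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def run_dev_server(port: int = 8000) -> None:
    """Serve the app locally with the standard library (development only)."""
    from wsgiref.simple_server import make_server
    with make_server("127.0.0.1", port, application) as server:
        server.serve_forever()
```

Calling `run_dev_server()` starts the app on localhost for testing. A production setup would typically replace the development server with a dedicated WSGI server behind Nginx.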
At this point we are mainly interested in managing resources. Again, the three big cloud platforms offer built-in automated scaling for many of their services. By relying on these services, automating and managing deployed instances becomes much simpler.
Here at codvo.ai, our business and technology experts are passionate about building end-to-end Fullstack AI pipelines. In particular, if you are interested in a Machine Learning Data Pipeline, we have all the tools and expertise to jumpstart your journey. We have experts in MLOps, cloud engineering, and many more areas. With that said, we are capable of equipping you with the necessary tools to transform your organization into a modernized and data-driven machine.
If you are interested in building an AI pipeline of your own, contact us today at firstname.lastname@example.org