Overview

General information about goals and usage of the Science Toolkit.

Background and Goal

Today, data-driven projects have a critical impact on almost any business. The amount of historical information available, the variety of data sources that enrich it, and the ability to compute and execute advanced techniques are some of the indicators that show why this branch of technology has become one of the main actors on the stage. Over roughly the last four years, the evolution of technological frameworks and algorithms has been overwhelming. New approaches and improvements to state-of-the-art solutions and methods have helped push the field to become more controlled, established and mature. Much of the credit goes to software engineering, whose methodologies and patterns have been adopted when designing artificial intelligence solutions.

However, in order to achieve full technological maturity, there are still many gaps in the methodologies inherited from software engineering that need to be filled. Although full technological maturity is a very ambitious target, there are improvements and changes that can be adopted in the medium term. With that goal in mind, the Science Toolkit has been developed to move the field to the next level of maturity.

The Science Toolkit goal can be defined in one sentence as:

“To offer a standard environment that covers the technological and methodological needs of a data team in an artificial intelligence project.”

What is the Science Toolkit?

As mentioned in the previous section, the Science Toolkit is a solution that offers a standard environment covering the technological and methodological needs of a data team working on any artificial intelligence project, allowing data scientists to focus on extracting value and new insights from the data.

Regarding the technological needs, the Science Toolkit is composed of a suite of the latest frameworks and libraries in a single environment, as well as its own engineering approaches such as data versioning, model versioning, continuous integration and dashboards for data visualization. In terms of methodology, the Science Toolkit is designed to develop experiments guided by hypotheses, so that control is greater and collaboration between team members is easier and more effective.

Science Toolkit Workflow

Applying the proper methodology in artificial intelligence projects is a challenging task. There is a lot of literature covering the main phases and levels of a data project, and nowadays the main organizations working in this area define the right workflow for real data projects in a fairly similar way. The Science Toolkit is designed to provide the right tools and the right workflow to help A.I. researchers and practitioners apply these methodology standards. For any data project, the first step is to design a set of initial hypotheses that will guide the work. Following the hypothesis definition, an exploratory data analysis is performed, in some cases together with the development of preliminary models, to gain data knowledge and start getting insights about the hypotheses.
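
As a purely illustrative sketch of this first step, the snippet below explores a hypothetical churn.csv dataset; the file, its columns and the hypothesis are assumptions made for the example, not part of the toolkit itself.

```python
# Minimal sketch of a hypothesis-guided exploratory analysis.
# Assumes a hypothetical churn.csv with "tenure_months" and "churned" columns.
import pandas as pd

df = pd.read_csv("churn.csv")

# Hypothesis: customers with a shorter tenure churn more often.
summary = df.groupby("churned")["tenure_months"].describe()
print(summary)

# A quick correlation gives a first, rough signal for or against the hypothesis.
print(df["tenure_months"].corr(df["churned"].astype(int)))
```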

With this objective in mind, the Science Toolkit offers a data studio based on open source technologies to perform these tasks easily. It is important to take into account that, throughout the entire project, a volume to persist temporary data and artifacts is necessary for good project development.

Once the hypotheses are proven and the data analysis returns a promising output, it is advisable to adopt software engineering methodologies and patterns to develop the solution in the right way. Good practices such as ATDD, code decoupling or unit tests help ensure that the project evolves along the right path. At this point, a code editor is mandatory to facilitate development. Of course, all of the generated code is maintained with a version control system (e.g. Git).
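
As a minimal, hypothetical illustration of these practices, the following pytest-style unit test checks a small, decoupled preprocessing function; both the function and the test are invented for the example and are not part of the Science Toolkit.

```python
# Sketch of a unit test for a decoupled preprocessing step (pytest style).
import pandas as pd


def fill_missing_ages(df: pd.DataFrame) -> pd.DataFrame:
    """Replace missing ages with the column median."""
    out = df.copy()
    out["age"] = out["age"].fillna(out["age"].median())
    return out


def test_fill_missing_ages_removes_nulls():
    df = pd.DataFrame({"age": [20.0, None, 40.0]})
    result = fill_missing_ages(df)
    assert result["age"].isna().sum() == 0
    assert result["age"].iloc[1] == 30.0  # median of 20 and 40
```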

Another useful adoption is continuous integration and execution. Having the experiments fully automated is a win for the project because it allows testing different configurations of the executions in pursuit of the best possible solution. Some of the benefits of applying this approach are automated execution, saved metrics, chart generation and artifact serialization.
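
A minimal sketch of what such an automated experiment run could look like is shown below; the dataset, the file names and the use of scikit-learn and joblib are assumptions made for the example, not Science Toolkit conventions.

```python
# Illustrative automated experiment run that records a metric and serializes
# its model artifact so the CI execution leaves a reproducible record.
import json

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Persist the metrics and the trained model produced by this run.
metrics = {"accuracy": float(accuracy_score(y_test, model.predict(X_test)))}
with open("metrics.json", "w") as f:
    json.dump(metrics, f)
joblib.dump(model, "model.joblib")
```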

Since it is important not to lose sight of the useful information produced by the experiments, a solution for managing this information as metrics is needed, as well as a database to store it.
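
As an illustration only, the snippet below stores one experiment metric in a local SQLite database; the schema and the choice of SQLite are assumptions made for the example, not the toolkit's actual backing store.

```python
# Sketch of persisting an experiment metric to a database (SQLite for brevity).
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("experiments.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS metrics (
           experiment TEXT, name TEXT, value REAL, recorded_at TEXT)"""
)
conn.execute(
    "INSERT INTO metrics VALUES (?, ?, ?, ?)",
    ("baseline", "accuracy", 0.93, datetime.now(timezone.utc).isoformat()),
)
conn.commit()
conn.close()
```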

Last but not least, a dashboard is mandatory to support the analysis during the exploratory data analysis phase and to check the results of the executed experiments.

The Science Toolkit covers, with open source technologies, all the needs of the different phases highlighted in the previous paragraphs. The aim is to provide a suite of tools and technological frameworks for working on data projects, as well as the right workflow and methodology to help artificial intelligence researchers and practitioners obtain the maximum value from each phase of the project.

