First impressions on the development of shared ML projects running on Google Colaboratory
This year I decided to apply for an ML specialization course to improve my knowledge in the area. The course of choice was Mineração de Dados Complexos @ IC-UNICAMP (which I highly recommend), taught at the institute where I had taken my Master’s course years before.
After six months using some free tools for the R and Python languages, I decided to write about my experience working in groups and using those tools to collaborate, hoping to help others avoid some pitfalls when dealing with (possibly limited) free environments.
The first tool I’ll write about is the Google Colaboratory online ML environment, which we used most of the time during the course to develop Python notebooks.
When working by yourself or in a small group, Google Drive + Google Colab turned out to be a great alternative for running ML projects, although some points should be considered from the beginning of the project to avoid edit conflicts or sharing problems.
Data files and Storage
Disk space is one of the most important things to take into consideration. ML projects usually demand a huge amount of disk space for data, and the free tier of Google Drive may not be enough. Pay attention to your available space and to the size of the files you’ll need to upload to, or generate in, the drive.
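A quick way to keep an eye on how much space a data folder is taking is to walk it with Python’s standard library (the folder name below is just an example):

```python
import os

def folder_size_mb(path):
    """Total size of all files under `path`, in megabytes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total / 1e6

# Example: check a shared data folder before generating more files into it
# print(f"data/ uses {folder_size_mb('data'):.1f} MB")
```

Running this periodically from a notebook cell helps spot which intermediate files are eating the group’s quota.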
If one of the members of the group has more disk space available (a paid Google Drive account, for example), one possibility is to create a shared folder in that account and share it with the other members.
To avoid problems with shared data, the folders containing the original data, which should never be rewritten or erased, must be shared with Reader (view-only) permission. This setting prevents headaches if anyone writes a wrong piece of code that tries to mutate the original data.
Mounting Google Drive into Colab
Each member of the group will have to mount their personal Google Drive inside the notebook. A good solution here is to create path variables (relative to a ROOT_PATH pointing to the shared folder) in a cell at the beginning of the notebook. Another good practice is to place a shortcut to the shared folder at the root of each member’s Google Drive, avoiding the need to edit the ROOT_PATH variable.
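A minimal sketch of such a setup cell, assuming the shared folder (here hypothetically named `ml-project`) has been added as a shortcut at the root of each member’s Drive:

```python
import os

try:
    # Only available inside Colab; mounting prompts each member for their
    # own Google account authorization
    from google.colab import drive
    drive.mount("/content/drive")
    ROOT_PATH = "/content/drive/MyDrive/ml-project"
except ImportError:
    # Fallback for running the notebook locally (e.g. JupyterLab)
    ROOT_PATH = os.path.expanduser("~/ml-project")

# Derive every other path from ROOT_PATH so no one hard-codes locations
DATA_PATH = os.path.join(ROOT_PATH, "data")        # read-only originals
RESULTS_PATH = os.path.join(ROOT_PATH, "results")  # generated output
```

Because all downstream cells reference `DATA_PATH` and `RESULTS_PATH`, the only line that could ever differ between members is the `ROOT_PATH` definition, and the shortcut trick makes even that unnecessary.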
Shared notebooks and collaborative editing
Editing the same notebook poses some challenges. Each time a member saves the notebook, it will possibly create a conflict with the versions being edited by the other members. One solution to this problem was for each member to share an individual notebook with Reader permission with the others. This doesn’t allow multiple editors, but everybody can read each other’s notebooks, allowing at least real-time code sharing within the group.
Data produced by the algorithms
Keep in mind that the person executing the notebook will be the owner of the generated files, even if they are generated from a shared notebook, and the disk space used will be taken from his/her personal account. The solution is to prepare the notebooks to be run by the owner of the account with the most free disk space.
Saving and reloading data between executions
As all free environments are limited in some way, a problem I personally had to deal with from the beginning was how to keep data between executions. In a local JupyterLab environment, the running Python kernel will not be stopped automatically, but in free cloud environments this will certainly happen, probably already during your first project execution.
I have collected some personal best practices on this topic that I find interesting to share and discuss:
- Separate the cells where variables are declared from the cells where they are mutated, trying to keep each cell idempotent and self-contained with respect to partial re-execution during development. Most of the time it won’t be practical to restart the whole notebook and execute it from the beginning, so having self-contained cells that can be re-executed with minimal side effects is a must-have.
- Save pandas DataFrames and partially fitted models to disk after each major state change, so that execution can be restarted by loading the DataFrame or model from disk, avoiding the re-execution of heavy processes. As free environments are limited, saving partial state can save a lot of time by letting you execute smaller parts of the notebook and restart after the backing VM has been stopped by a timeout, for example;
- Create a folder structure (with the corresponding variable defined at the beginning of the script) where each member of the group has a folder to save temporary or partial data, for example by using their name as a sub-folder of a results folder. As the path variable pointing to this folder will be the same for all developers, changing the final version of the code to produce the results in a default folder will be a matter of changing a single variable.
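The checkpointing and per-member folder ideas above can be sketched together like this (the member name, folder layout and file name are purely illustrative):

```python
import os
import pandas as pd

MEMBER = "alice"  # hypothetical: one sub-folder per group member
RESULTS_PATH = os.path.join("results", MEMBER)
os.makedirs(RESULTS_PATH, exist_ok=True)

checkpoint = os.path.join(RESULTS_PATH, "features.pkl")

if os.path.exists(checkpoint):
    # The VM was restarted: resume from the last saved state instead of
    # re-running the heavy computation
    df = pd.read_pickle(checkpoint)
else:
    # Stand-in for an expensive preprocessing step
    df = pd.DataFrame({"x": [1, 2, 3]})
    df["x_squared"] = df["x"] ** 2
    df.to_pickle(checkpoint)
```

For fitted scikit-learn models, `joblib.dump` and `joblib.load` serve the same purpose. When the group is ready to produce the final results, only the `RESULTS_PATH` definition needs to change.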
Other free ML environments
Google Colaboratory is not the only free option available for executing ML projects. Some other options should also be evaluated and may be considered more adequate by the group, depending on individual backgrounds. The other tool I’ve been using is Kaggle Notebooks, which has the advantage of not killing the virtual machine after some time, but limits the total time available for notebook executions. This limitation may be easier to deal with depending on the type and size of your project.
I am sure there are other great options out there for the free development of ML models, and these collected impressions are just a small contribution from a novice in the field.
Can you also share your impressions or best practices? I’m sure this is a great way to evolve our technical knowledge.