Blog: Geo-Distributed Machine Learning — An Overview
The following post gives a bird’s-eye view of geo-distributed machine learning (GDML).
We are living in an era where data is produced at an immense scale. Organizations generate data about users and systems across branches all over the globe. Currently, prediction algorithms and machine learning models are built after accumulating all relevant data at a single location. This is the prevailing approach everywhere, including at leading organizations. It comes at a large cost, however.
Is there a way to perform model building without moving data?
Consider a situation where you have hospitals spread across the country, and patients with different diseases and ailments visit each of them. Suppose some of these patients have diabetic retinopathy (DR), a disease that causes vision problems in people with diabetes. If you want to develop a CNN model to identify whether a particular patient has DR, the data of all the patients across these hospitals needs to be centralized at a single destination before a model can be trained. This current approach has the following problems:
- Privacy: Hospitals, and especially patients, do not want their information to be used or transferred elsewhere.
- Latency: The data (in our case, high-resolution images) takes a long time to transfer.
- Transfer cost: The cost involved is proportional to the amount of data transferred.
- Security: It must be ensured that the data is not leaked or hacked during transfer.
Fig. 1 shows the current trend of training machine learning models.
- Data is first collected from various sources distributed across regions and stored on a centralized server (a.k.a. a data center, or DC).
- After data collection, different models are built and trained at the DC.
Enter GDML (Geo-Distributed Machine Learning)
The following figure shows how a general GDML environment can be created.
- The raw data is kept in their respective data centers (DCs).
- A single model with fixed hyperparameters is set up on every DC.
- During training, the DCs communicate with each other periodically so that updates to their model parameters can be shared.
In this way, predictive models can be trained without transferring any raw data.
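The scheme above can be sketched in a few lines of Python. This is only an illustration of the general idea (each DC takes local gradient steps on its private data, and only parameters cross DC borders via periodic averaging); the class and function names here are my own, not from the paper.

```python
# Sketch of GDML training: each data center (DC) keeps its raw data and a
# local model copy, and DCs periodically exchange/average parameters only.
# All names (DataCenter, local_step, sync_parameters) are illustrative.

class DataCenter:
    def __init__(self, params, lr=0.1):
        self.params = list(params)  # local copy of the model parameters
        self.lr = lr

    def local_step(self, grads):
        # One local SGD step computed on this DC's private data.
        self.params = [p - self.lr * g for p, g in zip(self.params, grads)]

def sync_parameters(dcs):
    # Periodic synchronization: every DC receives the element-wise average
    # of all DCs' parameters. Raw training data never leaves its DC.
    avg = [sum(ps) / len(dcs) for ps in zip(*(dc.params for dc in dcs))]
    for dc in dcs:
        dc.params = list(avg)

# Usage: two DCs train locally, then synchronize.
dc1, dc2 = DataCenter([1.0, 2.0]), DataCenter([3.0, 4.0])
sync_parameters([dc1, dc2])
print(dc1.params)  # both DCs now hold the averaged parameters [2.0, 3.0]
```

Real systems average gradients or model deltas rather than raw parameters, and do so over expensive wide-area links, which is exactly where the intricacies below come in.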
There are however some intricacies involved when sharing parameters across different DCs:
- Communication between DCs across geographies is expensive.
- The weights and biases of complex models are large; sharing them frequently between DCs is costly.
- Only significant updates should be shared between DCs, i.e., updates that actually affect learning. Care must therefore be taken to update models intelligently at low communication cost.
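One common way to realize the last point is to transmit only parameter updates whose magnitude crosses a threshold, accumulating the smaller ones locally until they become significant. The sketch below is a hedged illustration of that idea under my own assumptions (function name and threshold are illustrative, not taken from the paper):

```python
def significant_updates(updates, residual, threshold=0.01):
    # Fold this round's updates into the locally accumulated residual;
    # send only entries whose magnitude reaches the threshold, and keep
    # the rest locally for a later round so no information is lost.
    to_send = {}
    for i, u in enumerate(updates):
        residual[i] += u
        if abs(residual[i]) >= threshold:
            to_send[i] = residual[i]
            residual[i] = 0.0
    return to_send, residual

# Usage: only index 1 is significant this round; the rest stay local.
sent, res = significant_updates([0.005, 0.02, -0.001], [0.0, 0.0, 0.0])
print(sent)  # {1: 0.02}
```

The residual accumulation matters: small updates are deferred, not dropped, so the scheme trades communication for a bounded delay in when an update takes effect.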
There will be a follow-up post with details on the above-mentioned intricacies and possible solutions. So stay tuned.
I used this paper as reference.
Do you have any thoughts you would like to share? Please leave them, I would be glad to read.
You can also connect with me on LinkedIn.
And clap as much as you like!