
Exploring Google's Federated Learning Initiatives with Formal Differential Privacy


In 2017, Google introduced federated learning (FL), an approach that enables mobile devices to collaboratively train machine learning (ML) models while keeping the raw training data on each user's device, decoupling the ability to do ML from the need to store data in the cloud. Since then, Google has been actively engaged in FL research and has used FL in several Gboard features, including next-word prediction, emoji suggestion, and out-of-vocabulary word discovery. Federated learning has also been used to improve "Hey Google" detection models in Assistant, suggest replies in Google Messages, and predict text selections, among other applications.

While FL enables ML without raw data collection, differential privacy (DP) provides a quantifiable standard for data anonymization. When applied to ML, DP addresses concerns about models memorizing sensitive user data. Google has made DP a top research priority, developing several key tools along the way, including the RAPPOR analytics system (2014), Google's open-source differential privacy library, Pipeline DP, and TensorFlow Privacy.

After years of concerted effort spanning fundamental research and product integration, Google has announced the deployment of a production ML model trained using federated learning with a formal differential privacy guarantee. For this proof-of-concept deployment, Google used the DP-FTRL algorithm to train a recurrent neural network that improves next-word prediction for Spanish-language Gboard users. This marks a significant milestone: it is the first production neural network trained directly on user data with a formal DP guarantee (specifically, ρ = 0.81 zero-concentrated differential privacy, or zCDP).
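
For readers unfamiliar with the notation, zCDP is a standard relaxation of differential privacy defined in terms of Rényi divergence; the definition below follows Bun and Steinke's general formulation rather than anything specific to this deployment:

```latex
\text{A mechanism } M \text{ satisfies } \rho\text{-zCDP if, for all neighboring datasets } D, D'
\text{ and all orders } \alpha > 1:\quad
D_{\alpha}\!\left(M(D)\,\big\|\,M(D')\right) \;\le\; \rho\,\alpha.
```

Here the left-hand side is the Rényi divergence of order α between the mechanism's output distributions on the two datasets. A smaller ρ means those distributions are harder to distinguish, i.e., stronger privacy.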

Federated learning systems structurally incorporate the principle of data minimization: FL transmits only focused updates for a specific model training task, limits access to data at all stages, aggregates individuals' data as early as possible, and discards collected and processed data promptly. However, FL alone does not directly address anonymization. Differential privacy is a mathematical framework that quantifies anonymization by ensuring the final model does not memorize information unique to any individual's data; DP training algorithms achieve this by adding carefully calibrated random noise during training.
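
As a concrete illustration of that last point, the sketch below shows the core ingredient of DP training in plain NumPy: each user's model update is clipped to bound any individual's influence, and Gaussian noise scaled to the clipping norm is added to the aggregate. This is a minimal toy version under stated assumptions, not Google's production pipeline, and the function and parameter names are illustrative.

```python
import numpy as np

def dp_aggregate(user_updates, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Toy differentially private aggregation: clip each user's update to
    bound any individual's influence, sum the clipped updates, then add
    Gaussian noise calibrated to the clipping norm. Illustrative only."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for update in user_updates:
        norm = np.linalg.norm(update)
        scale = min(1.0, clip_norm / (norm + 1e-12))  # shrink updates exceeding the clip norm
        clipped.append(update * scale)
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(user_updates)

# Example: 100 simulated user updates for a 10-parameter model.
updates = [np.random.randn(10) for _ in range(100)]
private_mean_update = dp_aggregate(updates, clip_norm=1.0, noise_multiplier=1.0)
```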

Achieving differential privacy in a federated learning setting presents challenges, but Google has made significant progress. In 2018, the DP-FedAvg algorithm extended the DP-SGD approach to the federated setting with user-level DP guarantees, and in 2020 Google deployed it to mobile devices. However, providing formal privacy guarantees in a real-world cross-device FL system introduces additional complexities. The DP-FTRL algorithm addresses these challenges by adding negatively correlated noise, which lets it efficiently produce accurate estimates of cumulative sums of gradients while maintaining strong differential privacy guarantees.
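
To make the cumulative-sum idea concrete, here is a simplified scalar sketch in the spirit of the tree-aggregation technique that DP-FTRL builds on: each dyadic interval of rounds receives a single noise draw, and any prefix sum reuses at most about log2(T) of those noisy partial sums, so the noise in the running totals grows only polylogarithmically with the number of rounds. The real algorithm operates on gradient vectors and handles device participation constraints; everything below is an illustrative toy with made-up names.

```python
import numpy as np

def private_prefix_sums(values, noise_std, seed=0):
    """Toy tree-aggregation sketch: cache one Gaussian noise draw per dyadic
    interval of rounds and assemble each prefix sum from at most ~log2(T)
    noisy intervals. Scalar version for illustration only."""
    rng = np.random.default_rng(seed)
    noisy_interval = {}  # (start, length) -> cached noisy partial sum

    def interval(start, length):
        key = (start, length)
        if key not in noisy_interval:
            exact = float(np.sum(values[start:start + length]))
            noisy_interval[key] = exact + rng.normal(0.0, noise_std)
        return noisy_interval[key]

    prefix_estimates = []
    for t in range(1, len(values) + 1):
        # Decompose [0, t) into dyadic intervals via the binary expansion of t.
        total, start, remaining = 0.0, 0, t
        while remaining > 0:
            length = 1 << (remaining.bit_length() - 1)  # largest power of two <= remaining
            total += interval(start, length)
            start += length
            remaining -= length
        prefix_estimates.append(total)
    return prefix_estimates

# Example: noisy running totals of 16 per-round quantities.
estimates = private_prefix_sums(np.ones(16), noise_std=0.5)
```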

For the production deployment of DP-FTRL described above, each eligible device maintains a local training cache of user keyboard input. When participating in a round, the device computes an update that makes the model more likely to suggest the words the user actually typed. DP-FTRL was used to train a recurrent neural network with approximately 1.3 million parameters over 2000 rounds spanning six days, with 6500 devices participating per round. To satisfy the DP guarantee, each device was limited to one training participation every 24 hours. Notably, the resulting model outperformed the previous DP-FedAvg-trained model, which offered empirically tested privacy advantages but lacked a formal DP guarantee.
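
The reported deployment parameters can be summarized in a small configuration sketch with a helper for the 24-hour participation cap. The field and function names below are hypothetical, not Google's actual configuration schema.

```python
from datetime import datetime, timedelta

# Illustrative configuration mirroring the reported deployment numbers
# (field names are made up for this sketch).
TRAINING_CONFIG = {
    "model_parameters": 1_300_000,                  # ~1.3M-parameter recurrent neural network
    "total_rounds": 2000,                           # rounds spread over roughly six days
    "devices_per_round": 6500,
    "min_participation_gap": timedelta(hours=24),   # one participation per device per 24 hours
}

def is_eligible(last_participation: datetime, now: datetime) -> bool:
    """A device may join a round only if its previous participation was
    at least 24 hours earlier, as required by the DP accounting."""
    return now - last_participation >= TRAINING_CONFIG["min_participation_gap"]
```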

The training mechanism is available in open source through TensorFlow Federated and TensorFlow Privacy, and with the parameters used in the production deployment it provides a meaningfully strong privacy guarantee. Google's analysis shows a ρ = 0.81 zero-concentrated differential privacy (zCDP) guarantee at the user level, where lower values correspond to stronger privacy in a mathematically precise sense. For perspective, this is stronger than the ρ = 2.63 zCDP guarantee adopted for the 2020 US Census.
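
To relate ρ to the more familiar (ε, δ) formulation of DP, the standard zCDP-to-DP conversion of Bun and Steinke can be applied; the δ below is an illustrative choice, not a value stated in the announcement, and tighter conversions exist.

```latex
\varepsilon \;\le\; \rho + 2\sqrt{\rho \,\ln(1/\delta)}
\quad\Longrightarrow\quad
\varepsilon \;\lesssim\; 0.81 + 2\sqrt{0.81 \cdot \ln\!\left(10^{10}\right)} \;\approx\; 9.4
\qquad (\rho = 0.81,\ \delta = 10^{-10}).
```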

Moving forward, achieving a production FL model with a meaningful zCDP guarantee is a significant milestone, but the research continues. This approach may not yet be applicable or practical for every ML model or product application, and other private ML techniques, such as membership inference tests and other empirical privacy auditing methods, can offer complementary safeguards against data leakage. Still, training models with user-level DP, even with a relatively large zCDP, is meaningful progress, because it requires training with a mechanism that bounds the model's sensitivity to any individual user's data. It also paves the way for stronger privacy guarantees as better algorithms or larger datasets become available. Google states that it remains committed to advancing the use of ML while mitigating potential privacy risks for those who contribute data.
