An Overview of Transfer Learning Without Formulas
By Masaya Mori, Global Head of Rakuten Institute of Technology
Transfer Learning – Learning How to Learn
Transfer Learning has recently been getting a lot of attention, so I wanted to write a bit about it. At the New Economy Summit held last year, it came up alongside Generative Adversarial Networks (GANs) and other techniques in a discussion of which technologies to keep an eye on, together with Professor Yutaka Matsuo of the University of Tokyo, Yosuke Okada of ABEJA, and Koichiro Yoshida of CrowdWorks. It is expected to attract more and more attention as an extremely important technique going forward.
Transfer Learning can be seen as a special benefit obtained by using Deep Learning.
Explaining recent AI technology like Deep Learning in writing is very difficult, and trying to write about it in an easily understandable way can often make it even harder to understand. From the perspective of someone who understands the subject, criticisms such as “the explanation is too rough and sacrifices accuracy” or “at this level the risk of misunderstanding is too high” are understandable. Meanwhile, someone unfamiliar with the subject might come away thinking “in the end I didn’t really understand” or “I only got a rough image of the subject.”
Transfer Learning is an extreme case of this. In one sentence, it lets you take a model trained in one area and apply it to a different area. For example, you could take an autonomous driving model trained on a toy car and apply it to a real-life autonomous car, but this one-sentence version doesn’t explain how or why that is possible. When I try to explain it to someone outside the field of computer science, it easily turns into an incomprehensible mess. This is the challenge.
Transfer Learning is useful when you want to build a Deep Learning model for a problem but don’t have enough data. Usually, training an elaborate model from scratch requires a huge amount of data and computational resources. However, for many problems and applications, reusing a model that someone has already created can be sufficient.
This means utilizing a so-called pre-trained model. For example, if you want to build a model that recognizes the faces of a certain company’s employees, you can first create a pre-trained model by gathering face data of people in general, and then use only the face data of the company’s employees to fine-tune it. In this example, the pre-trained model learns with a high level of generality, and fine-tuning adjusts its focus to the more specific problem.
You can imagine several ways of applying Transfer Learning in the image field. Applications of Deep Learning to images mainly use Convolutional Neural Networks (CNNs). In a CNN, each layer’s inputs and outputs have a similar form, so layers can be stacked like blocks. The first layers tend to pick up low-level features of the image, such as edges and textures, while the later layers capture higher-level structure such as composition and subjects.
The easiest method of Transfer Learning that exploits this takes the finished model, cuts off the later layers, and replaces them with suitably initialized ones. You can then train on the final target’s data while changing only the parameters of the later layers. With this method, the later layers behave like a regular neural network, while the first layers perform feature extraction. Transfer Learning works effectively when the data and task pairs resemble each other: for example, between driving a car and driving a truck, because the relationships between the data and the tasks are similar in both cases. Another upside is that, because the trained parameters are limited to the later layers, the model is less prone to overfitting.
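As a minimal sketch of this frozen-features approach (the data, dimensions, and names here are all hypothetical, and a fixed random projection stands in for the frozen early layers of a real CNN), the idea can be illustrated in plain NumPy: keep a “pre-trained” feature extractor fixed and train only a freshly initialized head on the target data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen first layers: a fixed "pre-trained" feature extractor.
# In a real application this would be the convolutional layers of a trained CNN.
W_pre = rng.normal(size=(64, 16))  # maps 64-dim inputs to 16-dim features

def extract_features(x):
    """Frozen feature extraction — these parameters are never updated."""
    return np.tanh(x @ W_pre)

# Tiny synthetic dataset for the *target* task (hypothetical).
X = rng.normal(size=(200, 64))
true_w = rng.normal(size=16)
y = (extract_features(X) @ true_w > 0).astype(float)  # labels expressible in feature space

# Replace the "later layers" with a freshly initialized logistic-regression head
# and train only its parameters by gradient descent on the cross-entropy loss.
w, b = np.zeros(16), 0.0
for _ in range(500):
    f = extract_features(X)
    p = 1.0 / (1.0 + np.exp(-(f @ w + b)))  # sigmoid output of the new head
    grad_w = f.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

pred = 1.0 / (1.0 + np.exp(-(extract_features(X) @ w + b))) > 0.5
accuracy = np.mean(pred == y)
print(f"training accuracy: {accuracy:.2f}")
```

Because only the small head is trained, the number of free parameters stays tiny, which is exactly why this setup resists overfitting even with limited target data.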
Explaining this in detail is difficult, but it is worth noting that the decision of whether to cut away the first layers or the later layers can change depending on which part of the two problems is more similar, the data or the task. If the task is similar but the data is considerably different, there are cases where cutting away the first layers is preferred.
Also, if large amounts of data for the target task are available, the parameters of all layers can be retrained. In this case, the pre-trained model acts as an initialization, and the training can be expected to converge faster. Here, Transfer Learning helps lower the cost of training.
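The convergence benefit can be sketched with a deliberately simplified toy (all data and numbers here are synthetic, and a linear model stands in for a deep network): gradient descent started from weights “pre-trained” on a closely related task reaches a low loss in fewer steps than the same descent started from random weights.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic target-task data (hypothetical): a linear model stands in for a
# deep network so the effect of the starting point is easy to see.
X = rng.normal(size=(300, 20))
w_true = rng.normal(size=20)
y = X @ w_true

def steps_to_converge(w0, lr=0.01, tol=1e-3, max_iter=10_000):
    """Plain gradient descent on mean-squared error; returns steps until loss < tol."""
    w = w0.copy()
    for step in range(max_iter):
        err = X @ w - y
        if np.mean(err ** 2) < tol:
            return step
        w -= lr * (X.T @ err) / len(y)
    return max_iter

# "Pre-trained" initialization: weights from a closely related task, modeled
# here as the true weights plus a small perturbation.
w_pretrained = w_true + 0.1 * rng.normal(size=20)
w_random = rng.normal(size=20)

steps_pre = steps_to_converge(w_pretrained)
steps_rand = steps_to_converge(w_random)
print(f"pre-trained init: {steps_pre} steps, random init: {steps_rand} steps")
```

The mechanics are the same as full fine-tuning of a deep network: every parameter is updated, but starting near a good solution shortens the path the optimizer has to travel.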
It might seem counterintuitive, but even when there are large amounts of data for the target task, and a pre-trained model’s task is significantly different, there are cases where utilizing the pre-trained model as an initialization state gives better results. This indicates that sometimes it is worth trying to apply Transfer Learning to a problem, even if it doesn’t seem like it would help at first.
However, the opposite is also true, for example when the pre-trained model contains a lot of bias. In the case of facial recognition of a company’s employees, a biased pre-trained model can inject unexpected side effects into the final model, making it preferable to create a new model from scratch. This topic needs more careful discussion, so I will cover it separately on a different occasion.
Furthermore, there are cases where the networks are too different, and the pre-trained model doesn’t fit the problem, making Transfer Learning inapplicable. The previous examples mainly covered image processing, but in Natural Language Processing (NLP), which network type is appropriate depends on the specific application, so a pre-training-based approach has been harder to apply. That said, Google presented a new method called Bidirectional Encoder Representations from Transformers (BERT) last year that completely changes this, but I would like to cover it in a separate future post.
As explained above, Transfer Learning is a method that focuses on the relationship between data and tasks, and makes use of specific features of Deep Learning and pre-trained models to enable applying a model to a different area. Put differently, it is an attempt to deepen our understanding of how generalized learning can be applied across multiple concepts. Transfer Learning can therefore be seen as “learning to learn,” pointing the way toward a more universal artificial intelligence, and many future applications are expected.
It might sound like Transfer Learning uses abstraction of knowledge to enable learning at a meta level, but this is not exactly true. There are cases where it is true, but there are also cases where it is not, and it has a vagueness to it that makes it difficult to explain. I think the following analogies are helpful.
For example, in high school you learn algebraic geometry. You understand the situations where it applies, learn formulas and rules, and solve exercises to acquire the knowledge. The same approach can also be used in learning, say, probability and statistics, so there is a transfer between the fields. However, the abstract knowledge of algebraic geometry itself is not transferred. Sometimes specific skills, like reading and understanding a problem, do transfer, but not all knowledge does. In other words, whether what transfers is abstraction or representation, the essence of the knowledge or superficial technique, is a separate question.
It is also possible that this way of learning algebraic geometry can be applied, to a certain degree, to learning English grammar. In other words, the habit of learning formulas and rules and solving exercises to acquire knowledge is transferable between subjects. However, it does not transfer as well to subjects like history.
So, in a way, Transfer Learning gives us a deeper understanding of what it means to apply experience from one area to another, and, furthermore, of what efficient or effective learning actually means.
You can read the original post and an article from the discussion that sparked the post, both in Japanese, at the links below.