Stars top one thousand, 300 lines of code: Tesla AI director Karpathy writes a GPT training library in PyTorch


Machine Heart Report

Editors: Demon King, Zhang Qian

If the GPT model is an invincible battleship, then minGPT is perhaps a small yacht that can still ride the wind and waves.

Recently, GPT-3, "the largest AI model in history," swept the world.


The GPT series can be called a representative of brute-force "violent aesthetics" in artificial intelligence: GPT, born in 2018, had 117 million parameters; GPT-2 in 2019 had 1.5 billion; GPT-3 in 2020 has 175 billion. In just two years, the parameter count of the GPT models has grown exponentially.

Shortly after the release of GPT-3, OpenAI opened a commercial API to the community, encouraging everyone to use GPT-3 to try more experiments. However, using the API requires an application, and your application may well sink like a stone into the sea. So, besides the official API, is there any other way to get hands-on with this "biggest model"?

Recently, Andrej Karpathy, head of AI at Tesla and a former OpenAI research scientist, made an attempt.

Based on PyTorch, he wrote a small GPT training library of only about 300 lines and named it minGPT.

Karpathy says that minGPT can do both addition and character-level language modeling with decent accuracy. After running the demo, however, he noticed an interesting phenomenon: in two-digit addition, a GPT with 2 layers, 4 attention heads, and 128-dimensional embeddings computed the result of 55 + 45 as 90, while it got other addition problems right.
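The addition demo works by framing arithmetic as next-token prediction over digit sequences. Below is a minimal sketch of that idea; the exact digit layout used by minGPT's addition notebook may differ (some implementations reverse the answer digits, for instance), so treat this encoding as illustrative:

```python
def encode_addition(a, b, ndigit=2):
    """Encode a two-digit addition problem as a flat digit sequence that a
    language model can learn via next-token prediction.

    Layout (an illustrative choice, not necessarily minGPT's exact one):
    digits of a, digits of b, then digits of a + b padded to ndigit + 1 places.
    """
    s = a + b
    seq = (
        [int(c) for c in f"{a:0{ndigit}d}"]
        + [int(c) for c in f"{b:0{ndigit}d}"]
        + [int(c) for c in f"{s:0{ndigit + 1}d}"]
    )
    # the model is trained to predict each next digit; at test time it is
    # given the first 2 * ndigit digits and must generate the answer digits
    prompt, answer = seq[: 2 * ndigit], seq[2 * ndigit:]
    return prompt, answer

prompt, answer = encode_addition(55, 45)
print(prompt, answer)  # → [5, 5, 4, 5] [1, 0, 0]
```

Seen this way, the model's "55 + 45 = 90" failure is just a wrong digit sequence being assigned high probability, not a bug in any arithmetic circuitry.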

At the time of writing, the project had been on GitHub for less than 24 hours, but its star count had already passed one thousand.

minGPT project address: https://

minGPT: GPT training in just 300 lines of code

If the GPT model is an invincible battleship, then minGPT is perhaps a small yacht that can still ride the wind and waves.

On the project page, Karpathy explains that because the existing available GPT implementations are all somewhat sprawling, he tried to follow the principles of being small, clean, interpretable, and educational while creating minGPT.

GPT is not a complicated model. minGPT is only about 300 lines of code, including boilerplate and a totally unnecessary custom causal self-attention module. All that happens is that a sequence of indices goes into a sequence of Transformer blocks, and a probability distribution over the next index comes out. The remaining complexity lies in handling the batching cleverly so that training is efficient.
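The phrase "causal self-attention" is the heart of that description: each position may only attend to positions at or before it, which is what lets the model emit a distribution over the *next* index. Here is a dependency-free sketch of a single head with identity Q/K/V projections; minGPT's real module uses learned PyTorch linear layers, so this is purely illustrative:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [v / total for v in exps]

def causal_attention(x):
    """Single-head causal self-attention with identity Q/K/V projections
    (illustrative only; minGPT uses learned linear projections).

    x is a list of d-dimensional vectors. Position t attends only to
    positions <= t, which is what makes the model autoregressive.
    """
    d = len(x[0])
    out = []
    for t in range(len(x)):
        # similarity of position t with every position up to t; the future is masked out
        scores = [sum(q * k for q, k in zip(x[t], x[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        out.append([sum(weights[s] * x[s][i] for s in range(t + 1))
                    for i in range(d)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(causal_attention(seq)[0])  # → [1.0, 0.0]: position 0 only sees itself
```

The key property — editing a later input never changes an earlier output — is exactly what allows next-token training over whole sequences in one pass.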

The core minGPT library consists of two files: mingpt/ and mingpt/. The former contains the actual Transformer model definition; the latter is GPT-independent PyTorch boilerplate that trains the model. The accompanying Jupyter notebooks show how the library can be used to train sequence models:

  • play_math.ipynb trains a GPT focused on addition;

  • play_char.ipynb trains GPT as a character-level language model on arbitrary text, similar to the earlier char-rnn but with a Transformer in place of the RNN;

  • play_words.ipynb, a BPE (byte-pair encoding) version, is not yet finished.

With a BPE encoder, distributed training, and fp16, this implementation may be able to reproduce GPT-1/GPT-2 results, though Karpathy has not tried. GPT-3 is likely beyond minGPT's reach, since GPT-3 probably does not fit in GPU memory and requires a more careful model-parallel treatment.
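For reference, byte-pair encoding — the missing piece behind play_words.ipynb — works by repeatedly merging the most frequent adjacent symbol pair into a new symbol. A toy sketch (GPT-2's actual tokenizer operates on raw bytes with a fixed, pre-learned merge table; this version only illustrates the merge loop):

```python
from collections import Counter

def bpe_merges(tokens, num_merges):
    """Learn byte-pair-encoding merges from a token sequence.

    Each step fuses the most frequent adjacent pair into one symbol.
    Toy version for illustration; real GPT-2 BPE works on bytes with a
    fixed merge table learned from a large corpus.
    """
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing worth merging
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)  # fuse the pair into a single symbol
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_merges(list("low lower lowest"), num_merges=3)
print(merges)  # frequent fragments like "lo", then "low", get fused first
```

Subword vocabularies like this are what let GPT-2/GPT-3 cover arbitrary text without a word-level vocabulary explosion.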

Usage examples

Karpathy provides some usage examples in the minGPT project.

The code is simple enough to hack inline rather than "use". The current API looks as follows:
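As an illustration of that API style, here is a self-contained toy that mirrors minGPT's config-object-plus-Trainer pattern. The class names echo minGPT's GPTConfig/TrainerConfig/Trainer, but the bodies below are stand-ins written for this article, not minGPT's actual code:

```python
class Config:
    """minGPT-style config object: defaults on the class, overrides via kwargs."""
    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)

class GPTConfig(Config):
    n_layer, n_head, n_embd = 12, 12, 768

class TrainerConfig(Config):
    max_epochs, batch_size = 10, 64

class ToyGPT:
    """Stand-in for the model class; it only records its config here."""
    def __init__(self, config):
        self.config = config

class Trainer:
    """Stand-in for the trainer; a single train() call drives training."""
    def __init__(self, model, train_dataset, config):
        self.model, self.train_dataset, self.config = model, train_dataset, config
    def train(self):
        return f"{self.config.max_epochs} epochs over {len(self.train_dataset)} examples"

# usage mirrors the shape of the minGPT README: config -> model -> trainer -> train()
mconf = GPTConfig(n_layer=2, n_head=4, n_embd=128)  # the 2-layer / 4-head demo size
model = ToyGPT(mconf)
tconf = TrainerConfig(max_epochs=2, batch_size=32)
trainer = Trainer(model, train_dataset=list(range(1000)), config=tconf)
print(trainer.train())  # → "2 epochs over 1000 examples"
```

The design choice — plain config objects plus one Trainer class — is what makes the library easy to hack inline: there is no hidden framework machinery between the user and the training loop.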

How is minGPT implemented?

During implementation, Karpathy referred to the official OpenAI GPT project as well as examples from other organizations:


  • The OpenAI gpt-2 project provides the model definition but not the training code;

  • OpenAI's image-gpt repository contains some GPT-3-like modifications in its code and is a good reference;

  • Hugging Face's transformers project provides a language-modeling example. It is fully featured, but as a result somewhat hard to trace.

Papers + implementation notes

In addition, the project author walks through the relevant papers and some implementation details.

1. GPT-1: "Improving Language Understanding by Generative Pre-Training"

  • Paper address: https://s3-


The GPT-1 model largely follows the original Transformer: it trains a 12-layer decoder-only Transformer with masked self-attention heads (768-dimensional states and 12 attention heads). See the figure below for details:
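Those sizes can be sanity-checked against GPT-1's reported ~117M parameters with a back-of-the-envelope count. The vocabulary and context sizes below are GPT-1's published values (roughly 40k BPE tokens, 512-token context); biases and layer-norm gains are ignored, so this is an estimate, not an exact count:

```python
def gpt_param_estimate(n_layer=12, n_embd=768, vocab_size=40478, block_size=512):
    """Rough decoder-only Transformer parameter count (weight matrices only)."""
    embeddings = vocab_size * n_embd + block_size * n_embd   # token + position tables
    attention = 4 * n_embd * n_embd                          # Q, K, V, output projection
    mlp = 2 * n_embd * (4 * n_embd)                          # 768 -> 3072 -> 768
    return embeddings + n_layer * (attention + mlp)

print(f"{gpt_param_estimate() / 1e6:.0f}M parameters")  # → "116M parameters"
```

About 116M from weights alone, which biases and layer norms nudge up to the ~117M figure quoted for GPT-1.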

2. GPT-2: "Language Models are Unsupervised Multitask Learners"

  • Paper address:

GPT-2 moves layer normalization to the input of each sub-block, similar to a pre-activation residual network, and adds an additional layer normalization after the final self-attention block. The model also changes the initialization (including scaling the initial weights of residual layers), expands the vocabulary, increases the context size from 512 to 1024 tokens, and uses a larger batch size. See the figure below for details:
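The pre-norm change is easiest to see as a reordering of the residual block. The sketch below uses a stub layer norm (no learned gain/bias) and an arbitrary sublayer; it is illustrative, not GPT-2's actual code. The point is that in the pre-LN form the residual path carries x through untouched, which stabilizes training in deep stacks:

```python
def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean, unit variance (gain/bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def post_ln_block(x, sublayer):
    """Original Transformer / GPT-1 ordering: LayerNorm after the residual add."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    """GPT-2 ordering: LayerNorm moved to the sub-block's input, so the
    residual path is an identity shortcut, as in pre-activation ResNets."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

# with a zeroed-out sublayer, the pre-LN block passes x through unchanged,
# while the post-LN block still renormalizes it
zero = lambda v: [0.0] * len(v)
print(pre_ln_block([3.0, 1.0, 2.0], zero))   # → [3.0, 1.0, 2.0]
print(post_ln_block([3.0, 1.0, 2.0], zero))  # normalized, not the identity
```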

3. GPT-3: "Language Models are Few-Shot Learners"

  • Paper address:

GPT-3 uses the same model and architecture as GPT-2, with the difference that GPT-3 uses alternating dense and locally banded sparse attention patterns across the layers of the Transformer, similar to the Sparse Transformer. See the figure below for details:
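The alternation can be pictured through the attention masks involved. The sketch below (an illustration of the pattern, not GPT-3's actual implementation) gives even layers a fully dense causal mask and odd layers a locally banded causal mask with a fixed window:

```python
def dense_causal_mask(n):
    """Each position may attend to every earlier position (and itself)."""
    return [[1 if s <= t else 0 for s in range(n)] for t in range(n)]

def banded_causal_mask(n, window):
    """Each position may attend only to the last `window` positions."""
    return [[1 if t - window < s <= t else 0 for s in range(n)] for t in range(n)]

def gpt3_style_masks(n_layer, n, window):
    """Alternate dense and locally banded sparse attention across layers,
    in the spirit of GPT-3 / the Sparse Transformer (layout is illustrative)."""
    return [dense_causal_mask(n) if layer % 2 == 0 else banded_causal_mask(n, window)
            for layer in range(n_layer)]

masks = gpt3_style_masks(n_layer=4, n=5, window=2)
print(masks[1][4])  # → [0, 0, 0, 1, 1]: a banded layer sees only the last 2 positions
```

Banded layers cost O(n × window) instead of O(n²), and stacking them between dense layers lets information still propagate across the full context.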

Andrej Karpathy

Andrej Karpathy's research spans computer vision, generative models, and reinforcement learning. During his Ph.D. at Stanford, he was advised by Li Fei-Fei, a professor of computer science there. He interned twice at Google, working on large-scale feature learning over YouTube videos. He also co-designed and taught Stanford's classic course CS231n with Li Fei-Fei and others.

In 2016, Karpathy joined OpenAI as a research scientist. In 2017, he joined Tesla as director of AI and Autopilot Vision, and he has since been promoted to Senior Director of Tesla AI. His team is responsible for all the neural networks of Tesla's Autopilot driver-assistance system, including data collection, neural network training, and deployment on Tesla's custom chips.

Like char-rnn and CS231n before it, Karpathy hopes that minGPT, written in his spare time, can also serve an educational purpose, and the project has won praise from many community members:

Beyond the discussion of minGPT itself, some asked: could GPT-3 be trained with the community's combined compute? That is, if thousands of developers contributed their GPUs while idle (at night, say), could a 175-billion-parameter GPT-3 eventually be trained, with participants only having to cover the electricity bill?

Others pointed out that while this distributed-training idea is interesting, it would likely hit bottlenecks, in gradient synchronization among other places.

Some quipped: wouldn't it be easier to spend the electricity money on cloud services?

Reference link: id=24189497
