Source | Open Source Data Science Tools
Translation | Cheng Haoyuan
Google Colaboratory, or "Colab" for short, is a free cloud platform for Jupyter Notebooks. Colab not only provides a runtime environment for Python and R notebooks, it also gives users free access to shared GPU and TPU resources.
For data scientists who work in Jupyter Notebooks, Colab has already become the default environment. However, using Colab's computing power for anything other than Jupyter Notebooks is extremely difficult.
For machine learning engineers who want to productionize models and move them beyond the notebook stage, this problem is particularly painful. Notebooks are well suited to exploration, but they work poorly with the more advanced MLOps tools that turn training into a formal pipeline.
After running into this problem myself, I decided not to let Colab's limitations change my workflow. Instead, I tried to bend Colab around my workflow!
So today we will explore the internals of Google Colab and try to bend its built-in rules a little. To be clear up front: we only want to explore Colab, and nothing here will affect Colab itself or its users.
1
Revealing what happens behind the scenes
Colab's secret lies in its back end: Google's servers provide the infrastructure that lets users run code with so little effort. So the first step is to analyze the Colab API, and the easiest way to do that is to inspect the API calls Colab makes during normal operation.
Open Chrome DevTools, select the Network tab, and run a code cell. DevTools starts recording every request Colab makes, and some interesting things turn up.
The URL /tun/m/<ID>/socket.io looks like a proxy to the Jupyter socket running on the remote machine.
If we open the Files pane from the left sidebar of the Colab interface (it shows the /content directory by default), we find another interesting request:
This time the response is JSON that enumerates the files on the remote host. The URL /tun/m/<ID>/api/contents/ appears to point to a service that provides file metadata.
Double-clicking a file in the Files pane makes Colab download it and display its details. Clicking /content/sample_data/README.md, for example, sends a request to /tun/m/<ID>/files/ that returns the contents of the file.
Clearly, https://colab.research.google.com/tun/m/<ID>/ is a reverse proxy to the server that runs our Colab instance, and /socket.io, /api/contents/ and /files/ are endpoints it exposes.
Let's see what services are running inside the Colab container. Colab ships with lsof, so we can run lsof -iTCP -sTCP:LISTEN to list every process listening for network requests on a TCP port.
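Where lsof is not available, roughly the same information can be read from /proc/net/tcp. The sketch below is a stand-in for the lsof call, not something Colab itself runs: listening_sockets is a hypothetical helper, it covers IPv4 only, it assumes a little-endian host, and it reports addresses and ports rather than process names.

```python
import socket
import struct

LISTEN_STATE = "0A"  # TCP_LISTEN in the state column of /proc/net/tcp


def listening_sockets(proc_net_tcp: str) -> list[tuple[str, int]]:
    """Return (ip, port) pairs for IPv4 sockets in the LISTEN state.

    `proc_net_tcp` is the text content of /proc/net/tcp.
    Assumes a little-endian host, where the address is a byte-swapped hex word.
    """
    result = []
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4 or fields[3] != LISTEN_STATE:
            continue
        hex_ip, hex_port = fields[1].split(":")
        ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
        result.append((ip, int(hex_port, 16)))
    return result
```

In a notebook cell, listening_sockets(open("/proc/net/tcp").read()) would list the listening addresses, which can then be matched against the lsof output.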
Look! colab-fileshim, node and jupyter-notebook all look worth investigating. Since we have already used the Files pane, let's start with colab-fileshim. It has PID 28, so we can read /proc/28/cmdline in the /proc file system to see the complete command line of the process.
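The cmdline file stores the arguments separated by NUL bytes, so printing it raw runs them together. A minimal sketch of reading it from Python follows; parse_cmdline and read_cmdline are hypothetical helper names, and the PID has to be whatever lsof reported on your own instance.

```python
from pathlib import Path


def parse_cmdline(raw: bytes) -> list[str]:
    """Split the NUL-separated content of /proc/<pid>/cmdline into arguments.

    Empty fields are dropped, so a genuinely empty argument is lost as well.
    """
    return [part.decode(errors="replace") for part in raw.split(b"\0") if part]


def read_cmdline(pid: int) -> list[str]:
    """Read the command line of a running process (Linux only)."""
    return parse_cmdline(Path(f"/proc/{pid}/cmdline").read_bytes())
```

Calling read_cmdline(28) in a cell would return the argument list for the process discussed above, assuming the PID is the same on your instance.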
Next, let's study /usr/local/bin/colab-fileshim.py. Amusingly, we can open it right from the Files pane. The program turns out to be a rather boring file server. It does little beyond running the server itself, which responds to localhost:3453/files (with file contents) and localhost:3453/api/contents (with JSON metadata). This means Colab forwards these requests from the tunnel URL to port 3453 on the instance itself.
In the Network tab of Chrome DevTools, we can right-click a request and copy it as a curl command to reproduce it. The request that fetches README.md is a good one to start with.
Running the copied command in a terminal on a local computer prints the contents of the README file. By trial and error, we can strip away most of the headers until only two remain: x-colab-tunnel and Cookie.
The x-colab-tunnel header is ostensibly there to prevent XSS attacks. In practice, it stops us, or attackers, from issuing these requests from a regular browser tab.
The Cookie header authenticates us to Google and proves that we are allowed to access the notebook instance. Because the cookie is long and unwieldy, for the rest of this article we store it in the shell variable $COLAB_COOKIE.
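The same two-header request can be sketched in Python instead of curl. This is an illustration under assumptions rather than the exact command from the browser: tunnel_request is a hypothetical helper, the instance ID and cookie must come from your own session, and the example path in the usage comment is a guess at how the /files/ endpoint maps to /content.

```python
import urllib.request

TUNNEL_BASE = "https://colab.research.google.com/tun/m"


def tunnel_request(instance_id: str, path: str, cookie: str) -> urllib.request.Request:
    """Build (but do not send) a GET request through the Colab tunnel."""
    url = f"{TUNNEL_BASE}/{instance_id}/{path.lstrip('/')}"
    headers = {
        "X-Colab-Tunnel": "Google",  # the proxy rejects requests without it
        "Cookie": cookie,            # Google session cookie, e.g. $COLAB_COOKIE
    }
    return urllib.request.Request(url, headers=headers)


# Usage (needs a live instance and a valid cookie; the path is an assumption):
# import os
# req = tunnel_request("<ID>", "files/content/sample_data/README.md",
#                      os.environ["COLAB_COOKIE"])
# print(urllib.request.urlopen(req).read().decode())
```

Building the request separately from sending it keeps the cookie handling in one place and makes the header set easy to inspect.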
2
Victory 1: Exposing our own server
Now that we have discovered Colab's reverse proxy, let's see whether we can use it to tunnel our own requests.
Rather than modifying the existing colab-fileshim.py server, we can simply replace the process with a server of our own! Running pkill -f colab-fileshim terminates the existing server so that we can start ours on the same port.
As a brief demonstration, we start Python's default HTTP server and serve our own files at localhost:3453/files.
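One way to do this from a notebook cell is sketched below with the standard library's http.server. The helper name serve_directory, the loopback bind address and the directory layout are assumptions for illustration, and the original colab-fileshim process must already be stopped so that port 3453 is free.

```python
import functools
import http.server
import threading


def serve_directory(directory: str, port: int = 3453) -> http.server.ThreadingHTTPServer:
    """Serve `directory` over HTTP in a background thread and return the server.

    A file stored at <directory>/files/foo.txt is reachable at /files/foo.txt.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Assumption: the reverse proxy reaches the port via localhost.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Running the server in a daemon thread keeps the cell from blocking; call shutdown() on the returned server to stop it. Each handled request is logged to the cell's stderr.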
Look! We can now change the curl command to download our own files!
Note the log line in the Colab cell, which proves that our server handled the request.
Unfortunately, because the x-colab-tunnel: Google header is required, we cannot access the server directly from a browser.
3
Further research
Let's continue with another interesting process we spotted earlier: node. If we check /proc/7/cmdline, we see that the process is running /datalab/web/app.js.
Once we dive in and read the code, we find that /datalab/web contains a fairly standard Node.js application. In addition to the /socketio/ route we saw before, it also exposes a /_proxy/{port}/ route. This should let us access any URL on any port of the Colab instance!
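To make the route concrete, here is a small sketch that composes such a URL. proxy_url is a hypothetical helper, and the URL shape is inferred from the /tun/m/<ID>/ prefix and the /_proxy/{port}/ route described above rather than taken from any documented API.

```python
TUNNEL_BASE = "https://colab.research.google.com/tun/m"


def proxy_url(instance_id: str, port: int, path: str = "") -> str:
    """Build the tunnel URL that should reach `port` on the Colab instance."""
    if not 0 < port < 65536:
        raise ValueError(f"invalid TCP port: {port}")
    return f"{TUNNEL_BASE}/{instance_id}/_proxy/{port}/{path.lstrip('/')}"
```

For example, proxy_url("<ID>", 8080, "index.html") points at a test server on port 8080, where the port number is only an illustration.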
Let's start a quick server and test it.
Unfortunately, we cannot view the HTML page from a browser tab. Colab refuses to proxy any request unless the x-colab-tunnel: Google header is set, so if we try to visit these URLs from the browser we get an HTTP 400 client error page.
4
Victory 2: Exposing entire web pages
Fortunately, we can use a Chrome browser extension to insert HTTP headers into the browser's requests. We configure it to send the x-colab-tunnel: Google header.
Then we can open the tunnel URL in the browser!
5
On to the Jupyter Notebook
Finally, let's look at the third and last interesting process: jupyter-notebook, which listens on port 9000.
Using the proxy and header trick from before, we can try to reach the port directly from the browser by visiting /tun/m/<ID>/_proxy/9000. Unfortunately, we get an HTTP 500 server error page instead of the Jupyter user interface.
Strangely, when we run !curl -i localhost:9000 from the Colab notebook itself to diagnose the problem, we still get an error.
The earlier lsof output gives us a clue: Jupyter listens only on the private IP assigned to the Colab instance rather than on 0.0.0.0/:: (all IPs on all interfaces), presumably to avoid exposing the Jupyter interface to us.
Google did not try very hard to hide the interface, though, so there is a quick fix.
To get around the restricted listening address, we need a process that listens on all interfaces and IPs and forwards all the traffic it receives to the specific IP address Jupyter is listening on. We can install the socket proxy tool socat ("socket cat") to do this, using it to relay traffic back and forth between localhost:9000 and $HOSTNAME:9000.
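Conceptually, socat acts here as a plain TCP relay. The sketch below shows the same idea in Python with asyncio. It is an illustration of the technique, not the command used in the article, and start_forwarder is a hypothetical helper name.

```python
import asyncio


async def _pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


async def start_forwarder(listen_host, listen_port, target_host, target_port):
    """Accept connections on listen_host:listen_port and relay each one to the target."""

    async def handle(client_reader, client_writer):
        try:
            upstream_reader, upstream_writer = await asyncio.open_connection(
                target_host, target_port
            )
        except OSError:
            client_writer.close()  # target unreachable: drop the client
            return
        # Relay both directions until either side closes.
        await asyncio.gather(
            _pipe(client_reader, upstream_writer),
            _pipe(upstream_reader, client_writer),
            return_exceptions=True,
        )

    return await asyncio.start_server(handle, listen_host, listen_port)
```

To mirror the relay described above, one would listen on "127.0.0.1" port 9000 and target socket.gethostname() port 9000, assuming the hostname resolves to the private IP that Jupyter is bound to.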
It's a start! If we reload the URL in the browser, we see parts of the Jupyter user interface, but something is clearly broken.
This is because Jupyter expects to be served from the root of the domain (URL path /), but the path of our Colab tunnel is /tun/m/<ID>/_proxy/9000, which breaks the absolute paths of resources such as CSS and JS files.
There is no simple fix for this. We need a complete (sub)domain to forward traffic to the Jupyter server.
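The effect is easy to reproduce with urljoin from the standard library, which resolves references the way a browser does. The instance ID and the asset path below are made-up placeholders.

```python
from urllib.parse import urljoin

# Placeholder instance ID; the asset path is a made-up example.
page_url = "https://colab.research.google.com/tun/m/EXAMPLE_ID/_proxy/9000/tree"

# Root-relative reference, as emitted by a server that believes it lives at "/".
resolved_absolute = urljoin(page_url, "/static/style.min.css")

# Document-relative reference, which would have kept the tunnel prefix.
resolved_relative = urljoin(page_url, "static/style.min.css")

print(resolved_absolute)
print(resolved_relative)
```

The first URL resolves to https://colab.research.google.com/static/style.min.css, which drops the tunnel prefix and never reaches the instance. The second keeps /tun/m/EXAMPLE_ID/_proxy/9000/ in front of the asset path.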
6
Victory 3: Exposing the Jupyter user interface
Fortunately, Colab has a hidden but official port forwarding solution that provides a complete subdomain! It is hidden so well that it took longer to find than the internal reverse proxy did!
How do we forward traffic through Colab's official port forwarding? Open the Code Snippets tab in the left sidebar and find the Output Handling snippet.
Click "View Source Notebook" and you will see advanced_outputs.ipynb. This notebook collects Colab's power-user snippets and shows off little-known features of the platform. The snippet we need sits under the heading about browsing to servers that run on the kernel.
We can use this snippet to expose the Jupyter user interface on a subdomain.
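As far as I can tell, the snippet boils down to asking the Colab frontend for a proxied URL through google.colab.kernel.proxyPort. The sketch below wraps that call. The helper names are hypothetical, the exact snippet in the notebook may differ, and kernel_proxy_url only works inside a Colab runtime.

```python
def proxy_port_expression(port: int) -> str:
    """JavaScript expression that asks the Colab frontend for a proxied URL."""
    return f"google.colab.kernel.proxyPort({int(port)})"


def kernel_proxy_url(port: int) -> str:
    """Return the subdomain URL for `port`. Only works inside a Colab runtime."""
    from google.colab.output import eval_js  # available only on Colab

    return eval_js(proxy_port_expression(port))
```

In a Colab cell, print(kernel_proxy_url(9000)) should print a link on its own subdomain that forwards to port 9000.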
Now we can click the link (appending /tree to the URL to placate Jupyter) and see the fully functional Jupyter UI!
Well, almost fully functional. Google appears to have restricted the official proxy to GET requests, so we can view notebooks but cannot run them.
7
Conclusion
Congratulations on making it to the end. I hope this semi-structured walkthrough has taught you something you did not know about how Colab works, and that the reverse engineering techniques prove valuable. I also hope it inspires you to look into the internals of the tools and products you use every day!
(This article was compiled and published with authorization. Original text:
https://dagshub.com/blog/reverse-engineering-google-colab/)
Header image by wir_sind_klein, Pixabay