POSTMAN asserts the explanation of the corresponding script

2023-01-23   ES  

The previous post mainly covered matching strings with the string find method. That works for simple strings in a web page, but finding something more complicated, such as an IP address, with find is very troublesome. This is where regular expressions come in handy: a regular expression describes complex matching rules.
Since this blog series is only about learning crawlers, we learn each piece of knowledge as it is needed, so regular expressions are covered only briefly here. Straight to the essentials.
In Python, regular expressions are provided by the re module. We start with the re module's search method.

The search() method finds the first position in a string where the regular expression pattern matches. To avoid unnecessary trouble with backslashes, we add r before the pattern string, which makes it a raw string.
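As a minimal sketch of what search() gives back and why the r prefix helps (the sample strings and variable names here are chosen only for illustration):

```python
import re

# search() returns a Match object for the first hit, or None if nothing matches
m = re.search(r'FishC', 'I love FishC.com!')
first_span = m.span()      # (start, end) position of the first match
first_text = m.group()     # the matched text itself
missing = re.search(r'Python', 'I love FishC.com!')

# Without the r prefix, Python turns '\b' into a backspace character
# before the re module ever sees it; the raw string keeps the backslash.
plain_len = len('\b')      # 1 character (backspace)
raw_len = len(r'\b')       # 2 characters (backslash and b)

print(first_span, first_text, missing, plain_len, raw_len)
```

Printing a Match object directly shows its span and matched text, which is what the test samples later in this post rely on.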

Wildcard: the English period (.) matches any single character (except a newline by default)

To match a literal period, escape it by adding \ before it: \.

\d matches any digit

Character class: [] specifies a set of characters; matching any one character in the class counts as a match

Do not put extra symbols (such as commas or spaces) between the elements of a character class, because they would be treated as characters to match too

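Regular expressions are case-sensitive by default. There are two ways to match regardless of case:
One is to turn off case-sensitive mode (the re.I flag)
The other is to write out both cases in the character class, e.g. [Ff]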
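A small sketch of the two approaches, assuming the "write it out" option means listing both cases in a character class (the sample text is illustrative):

```python
import re

text = 'I love FishC.com!'

# Default: case-sensitive, so lowercase 'fishc' does not match 'FishC'
sensitive = re.search(r'fishc', text)

# Option 1: turn off case sensitivity with the re.IGNORECASE (re.I) flag
ignore_case = re.search(r'fishc', text, re.I)

# Option 2: spell out both cases inside character classes
spelled_out = re.search(r'[Ff]ish[Cc]', text)

print(sensitive, ignore_case.group(), spelled_out.group())
```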

In a character class, '-' indicates a range: [a-z] means the 26 lowercase English letters from a to z

Repetition: {} solves the problem of matching a number of times; the number inside {} is the number of repetitions
Only the single character (or group) directly before { is repeated

The number in the braces is the total number of occurrences, so b{3} matches bbb

A range of repetitions: {n,m} repeats n to m times (written without a space after the comma)
{n,m} is a closed range, so both n and m are included
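The repetition rules can be checked with a short sketch like the following. The test strings are made up, and fullmatch() is used here only so that the whole string has to fit the pattern:

```python
import re

# {3}: exactly three b's between a and c
exact_ok = re.search(r'ab{3}c', 'abbbc')
exact_fail = re.search(r'ab{3}c', 'abbc')       # only two b's

# {3,5}: closed range, so 3, 4 and 5 repetitions all match
counts = [n for n in range(1, 8)
          if re.fullmatch(r'ab{3,5}c', 'a' + 'b' * n + 'c')]

# The quantifier applies only to the item right before it;
# parentheses make a whole group repeat instead of a single character
group_repeat = re.search(r'(ab){2}', 'ababab').group()

print(exact_ok.group(), exact_fail, counts, group_repeat)
```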

Test samples for the points above:

import re
print(re.search(r'FishC','I love FishC.com!'))

print(re.search(r'.','I love FishC.com!'))

print(re.search(r'\.','I love FishC.com!'))

print(re.search(r'\d','I love 123FishC.com!'))

print(re.search(r'\d\d\d','I love 123FishC.com!'))

print(re.search(r'\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d','192.111.123.233'))

print(re.search(r'\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d','1.111.123.233'))

print(re.search(r'[aeiou]','I love FishC.com'))

print(re.search(r'[a-z]','I love FishC.com'))

print(re.search(r'ab{3}c','abbbc'))

print(re.search(r'ab{3,5}','abbbbbc'))

With the knowledge above we can almost write a regular expression for an IP address, but one difficulty remains: how do we match a number from 0 to 255? Keep in mind that a regular expression matches strings and has no notion of numeric value; the only digits it can match are the characters 0-9. So we have to handle the cases separately:

If the first digit is 0 or 1, the remaining two can be any digit
If the first digit is 2 and the second digit is between 0 and 4, the last one can be any digit
If the first digit is 2 and the second digit is 5, the third digit can only be between 0 and 5
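One way to convince yourself that the three cases cover exactly 0 to 255 is to test a single-octet pattern against every number from 0 to 999. This is only a sketch: the name octet and the use of fullmatch() are choices made for this check, not part of the original code.

```python
import re

# One octet, following the three cases above (25x is tried first)
octet = re.compile(r'25[0-5]|2[0-4][0-9]|[01]{0,1}[0-9]{0,1}[0-9]')

# fullmatch() requires the whole string to match, so '256' cannot pass as '25'
accepted = [n for n in range(1000) if octet.fullmatch(str(n))]

print(accepted == list(range(256)))
```

The {0,1} quantifiers make the leading digits optional, which is how one- and two-digit numbers such as 7 or 42 are accepted as well.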

With these restrictions, we can use a regular expression to match an IP address

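# The last alternative must be 25[0-5] (not [0-4]); 25x is tried first in the final group so 255 is not cut to 25
print(re.search(r'(([01]{0,1}[0-9]{0,1}[0-9]|[2][0-4][0-9]|[2][5][0-5])\.){3}([2][5][0-5]|[2][0-4][0-9]|[01]{0,1}[0-9]{0,1}[0-9])','1.238.144.208'))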
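Note that search() is satisfied by a match anywhere in the string, so an invalid address can still produce a hit on part of it. If the goal is to validate a whole string, a sketch along these lines may help; is_ipv4, OCTET and IPV4 are hypothetical names, and (?:...) is a non-capturing group used only to keep the pattern tidy.

```python
import re

# One octet: 250-255, 200-249, or 0-199 with optional leading digits
OCTET = r'(?:25[0-5]|2[0-4][0-9]|[01]{0,1}[0-9]{0,1}[0-9])'
IPV4 = re.compile(OCTET + r'(?:\.' + OCTET + r'){3}')

def is_ipv4(text):
    """Return True only if the whole string is a dotted IPv4 address."""
    return IPV4.fullmatch(text) is not None

# search() accepts a match anywhere inside the string
partial = IPV4.search('256.1.1.1').group()

print(partial)
print(is_ipv4('1.238.144.208'), is_ipv4('256.1.1.1'), is_ipv4('192.168.0'))
```

For crawling a page, search() is still the right tool, since the addresses are embedded in surrounding HTML.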

After the preparation above, we can crawl proxy IPs in real time from the Kuaidaili ("Quick Proxy") website. With the practice from the earlier picture-grabbing post, this code is a bit simpler to write.

import urllib.request
import re

url='http://www.kuaidaili.com/'
req=urllib.request.Request(url)
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36')
response=urllib.request.urlopen(req)
html=response.read().decode('utf-8')
# Compile the IP pattern once (25x is tried first in the last group so 255 is not cut to 25)
pattern=re.compile(r'(([01]{0,1}\d{0,1}\d|[2][0-4]\d|[2][5][0-5])\.){3}([2][5][0-5]|[2][0-4]\d|[01]{0,1}\d{0,1}\d)')
iplist=[]
# Here just extract the IP address: group(0) is the full text of each match
for ip in pattern.finditer(html):
     if ip.group(0) not in iplist:
          iplist.append(ip.group(0))
print(iplist)

In this way, we get the list of IP addresses captured from the page.
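The extraction step can also be tried without any network access. The sketch below runs on a made-up HTML fragment (the real page layout may differ) and uses re.finditer to walk over every match. The (?<!\d) and (?!\d) guards are an addition not present in the code above; they are assumed here to be useful for keeping a match from starting or ending in the middle of a longer number.

```python
import re

# Hypothetical page fragment; the real page layout may differ
sample_html = '''
<tr><td>121.232.146.13</td><td>9000</td></tr>
<tr><td>60.178.12.5</td><td>8081</td></tr>
<tr><td>121.232.146.13</td><td>9000</td></tr>
'''

octet = r'(?:25[0-5]|2[0-4]\d|[01]{0,1}\d{0,1}\d)'
# (?<!\d) and (?!\d) stop the match from starting or ending inside a longer number
ip_pattern = re.compile(r'(?<!\d)' + octet + r'(?:\.' + octet + r'){3}(?!\d)')

found = []
for m in ip_pattern.finditer(sample_html):
    if m.group() not in found:      # keep each address once, in page order
        found.append(m.group())

print(found)
```

Reading the address with m.group() avoids slicing the printed form of the match object, which is fragile because that representation is meant for display.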

This completes a simple Python script for grabbing proxy IPs. It still needs improvement, which will come in a later post.
