When using Python to write a crawler, we need to disguise the request headers to deal with anti-crawling strategies. The third-party Python module fake_useragent solves this problem nicely: it returns a randomly generated User-Agent header that we can use directly.
fake_useragent installation
pip install fake_useragent
fake_useragent usage
from fake_useragent import UserAgent

# generate a random User-Agent string and attach it to an existing
# request object's headers (for example, a Scrapy Request)
ua = UserAgent().random
request.headers['User-Agent'] = ua
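For context, here is a slightly fuller sketch of how the random header is typically used with the requests library (the test URL is just an example endpoint that echoes back the User-Agent it receives):

import requests
from fake_useragent import UserAgent

# build a random User-Agent and send it with an ordinary GET request
headers = {'User-Agent': UserAgent().random}
resp = requests.get('https://httpbin.org/user-agent', headers=headers, timeout=10)
print(resp.text)  # the server echoes back the User-Agent it saw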
Errors encountered when using fake_useragent
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "d:\programdata\anaconda3\lib\site-packages\fake_useragent\utils.py", line 166, in load
    verify_ssl=verify_ssl,
  File "d:\programdata\anaconda3\lib\site-packages\fake_useragent\utils.py", line 122, in get_browser_versions
    verify_ssl=verify_ssl,
  File "d:\programdata\anaconda3\lib\site-packages\fake_useragent\utils.py", line 84, in get
    raise FakeUserAgentError('Maximum amount of retries reached')
fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached
Judging from the error message, the failure appears to be caused by a network timeout. A quick search reveals that this library fetches its data from online resources; the relevant configuration in its source file fake_useragent/settings.py is shown below:
# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals

import os
import tempfile

__version__ = '0.1.11'

DB = os.path.join(
    tempfile.gettempdir(),
    'fake_useragent_{version}.json'.format(
        version=__version__,
    ),
)

CACHE_SERVER = 'https://fake-useragent.herokuapp.com/browsers/{version}'.format(
    version=__version__,
)

BROWSERS_STATS_PAGE = 'https://www.w3schools.com/browsers/default.asp'

BROWSER_BASE_PAGE = 'http://useragentstring.com/pages/useragentstring.php?name={browser}'  # noqa

BROWSERS_COUNT_LIMIT = 50

REPLACEMENTS = {
    ' ': '',
    '_': '',
}

SHORTCUTS = {
    'internet explorer': 'internetexplorer',
    'ie': 'internetexplorer',
    'msie': 'internetexplorer',
    'edge': 'internetexplorer',
    'google': 'chrome',
    'googlechrome': 'chrome',
    'ff': 'firefox',
}

OVERRIDES = {
    'Edge/IE': 'Internet Explorer',
    'IE/Edge': 'Internet Explorer',
}

HTTP_TIMEOUT = 5

HTTP_RETRIES = 2

HTTP_DELAY = 0.1
After testing, the cause turned out to be that the page configured as BROWSERS_STATS_PAGE = 'https://www.w3schools.com/browsers/default.asp' could not be opened.
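You can confirm this kind of network problem yourself by requesting the remote resources the library depends on with the same 5-second timeout it uses (HTTP_TIMEOUT above). This is only a diagnostic sketch using the requests library:

import requests

# the URLs fake_useragent 0.1.11 tries to reach (taken from settings.py above)
urls = [
    'https://fake-useragent.herokuapp.com/browsers/0.1.11',
    'https://www.w3schools.com/browsers/default.asp',
]

for url in urls:
    try:
        # same 5-second timeout as HTTP_TIMEOUT in settings.py
        resp = requests.get(url, timeout=5)
        print(url, '->', resp.status_code)
    except Exception as exc:
        # a timeout or connection error here reproduces the library's failure
        print(url, '->', exc)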
Solution: download the file to the local machine and place it in the corresponding folder.
Open the cache server address https://fake-useragent.herokuapp.com/browsers/0.1.11 (the CACHE_SERVER URL from the configuration above) in a browser, then press Ctrl+S and save the file as fake_useragent_0.1.11.json. Note that the name cannot be changed; it must match the name configured in the source file, otherwise the library will not be able to find it. As for where to put the saved file, check the configuration source code:
DB = os.path.join(
    tempfile.gettempdir(),
    'fake_useragent_{version}.json'.format(
        version=__version__,
    ),
)
The full DB path is built by joining tempfile.gettempdir() with the file name, so tempfile.gettempdir() is the directory where fake_useragent_0.1.11.json should be stored. Put the saved JSON file into that directory and the library can load it locally; the timeout problem no longer occurs!
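If you would rather script this step than save the file by hand, the following sketch does the same thing, assuming the requests library is installed and the CACHE_SERVER above is reachable from your machine at least once:

import os
import tempfile
import requests

# the file name and directory fake_useragent 0.1.11 expects (see DB in settings.py)
db_path = os.path.join(tempfile.gettempdir(), 'fake_useragent_0.1.11.json')
print('cache file location:', db_path)

# download the browser data from the cache server and store it locally
resp = requests.get('https://fake-useragent.herokuapp.com/browsers/0.1.11', timeout=30)
resp.raise_for_status()
with open(db_path, 'w', encoding='utf-8') as f:
    f.write(resp.text)

print('saved:', os.path.exists(db_path))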
Note: if CACHE_SERVER is not https://fake-useragent.herokuapp.com/browsers/0.1.11, please upgrade the library:
pip install --upgrade fake_useragent
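Once the JSON file is in place (or the library has been upgraded), a quick sanity check is to request a few random User-Agent strings; if the local cache file is found, this no longer needs any network access:

from fake_useragent import UserAgent

# with fake_useragent_0.1.11.json in tempfile.gettempdir(), this loads locally
ua = UserAgent()
for _ in range(3):
    print(ua.random)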