Flink combined with HDFS




Algorithms are one of the most important cornerstones of computer science, yet some programmers in China neglect them. Many students see that companies list particular programming languages in their job requirements and draw the wrong conclusion: that studying computing means learning various programming languages, or that learning the latest languages, technologies, and standards is the best way to pave their career path. In fact, these companies have misled everyone.

Programming languages should certainly be learned, but computer algorithms and theory matter more. Languages and development platforms change with each passing day, while the underlying algorithms and theory stay essentially the same: data structures, algorithms, compiler principles, computer architecture, relational database principles, and so on.
On the “Refining Student Network”, one student vividly compared these foundational courses to “internal skills” and new languages, technologies, and standards to “external skills”. Those who chase fashion all day only learn the moves; without the inner skill, they can never become masters.

Algorithms and me

When I transferred to the Department of Computer Science in 1980, not many people majored in computer science. Many people in other departments laughed at us and said, “Do you know why only your department has to add the word ‘science’, when there is no ‘Department of Physical Science’ or ‘Department of Chemical Science’? Because those fields are real science and have no need to gild the lily. You have a guilty conscience and are afraid of not being ‘scientific’, so you try to cover it up like this.” In fact, they were completely wrong.
Those who truly understand computing (not just “programmers”) are well grounded in mathematics. They can apply the rigorous thinking of a scientist to prove a result and the pragmatic methods of an engineer to solve a problem, and the best expression of this way of thinking and working is the “algorithm”.

I remember that the Othello program I wrote won the world championship. At the time, the runner-up thought I had beaten him by luck and asked how many moves my program could search on average. Only when he found that my software searched more than 60 times faster than his did he fully concede defeat. Why could I do 60 times more work on the same machine? Because I used a recent algorithm that converts an exponential function into four approximate lookup tables, so that an approximate answer can be obtained in constant time. In this example, choosing the right algorithm was the key to winning the world championship.

I remember that in 1988 a vice president of Bell Labs visited my school. The purpose was to understand why their speech recognition system was dozens, even hundreds, of times slower than the one I had developed. Although they had bought several supercomputers and could barely get the system to run, such expensive computing resources put off their product department, because an “expensive” technology has no application prospects. While discussing this with them, I was surprised to find that they had turned an O(N*M) dynamic programming algorithm into an O(N*N*M) one. Even more surprising, they had published many papers on it, given their algorithm its own name, and nominated it for a prize at a scientific conference. The researchers at Bell Labs were of course very smart, but they all came from mathematics, physics, or electrical engineering; they had never studied computer science or algorithms, and so they made such a basic mistake. I doubt those people will ever laugh at computer science students again!

Algorithms in the Internet era

Some people may ask, “Computers are so fast today; do algorithms still matter?” In fact, no computer will ever be too fast, because we will always come up with new applications. Under Moore’s law, computing power grows rapidly every year and prices keep falling. But do not forget that the amount of information to be processed is growing exponentially. Everyone now creates a great deal of data (photos, videos, voice, text, and so on). Increasingly advanced recording and storage methods have made each person’s volume of information explode. Internet traffic and log volumes are also growing rapidly. In scientific research, as methods improve, data volumes have reached unprecedented levels. Three-dimensional graphics, massive data processing, machine learning, and speech recognition all demand enormous amounts of computation. In the Internet era, more and more challenges can only be solved with excellent algorithms.

Here is another example from the Internet era: search on the web and on mobile phones. Suppose you want to find a nearby cafe. How should the search engine handle the request? The simplest way is to find every cafe in the city, compute the distance from each one to your location, sort them, and return the nearest results. But how should the distance be computed? There are many algorithms for computing distances on a map.
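The brute-force approach described above can be sketched as follows. This is only an illustration under stated assumptions: it uses straight-line (great-circle) distance via the haversine formula rather than road distance, and the function names and sample coordinates are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_cafes(user, cafes, k=3):
    """Brute force: measure every cafe, sort all of them, keep the k nearest.

    user  -- (latitude, longitude)
    cafes -- list of (name, latitude, longitude)
    """
    ranked = sorted(
        cafes,
        key=lambda c: haversine_km(user[0], user[1], c[1], c[2]),
    )
    return ranked[:k]

if __name__ == "__main__":
    # Hypothetical sample data, not taken from the article.
    cafes = [
        ("A", 39.9100, 116.4000),
        ("B", 39.9500, 116.4500),
        ("C", 39.9050, 116.3950),
        ("D", 40.0000, 116.3000),
    ]
    for name, lat, lon in nearest_cafes((39.9042, 116.4074), cafes, k=2):
        print(name, lat, lon)
```

Every query touches every cafe and sorts the whole list, so the cost per request grows with the size of the city, not with the size of the neighborhood the user cares about.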

This approach may be the most intuitive, but it is certainly not the fastest. If a city has only a few cafes, it works fine, since the amount of computation is small. But if a city has many cafes and many users run similar searches, the servers come under heavy load. How should we optimize the algorithm in that case?

First, we can “preprocess” all the cafes in the city. For example, divide the city into a number of “grid cells”, place the user in a cell according to their location, and sort only the cafes in that cell.
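A minimal sketch of this preprocessing step, assuming flat x/y coordinates in kilometers and a fixed cell size; `CELL`, `build_grid`, and `nearest_in_cell` are hypothetical names chosen for illustration.

```python
import math
from collections import defaultdict

CELL = 1.0  # assumed cell side length, in km

def cell_of(x, y, cell=CELL):
    """Map a position to the integer index of the grid cell containing it."""
    return (math.floor(x / cell), math.floor(y / cell))

def build_grid(cafes, cell=CELL):
    """Preprocessing, done once: bucket every cafe by its grid cell."""
    grid = defaultdict(list)
    for name, x, y in cafes:
        grid[cell_of(x, y, cell)].append((name, x, y))
    return grid

def nearest_in_cell(grid, ux, uy, cell=CELL):
    """Per query: sort only the cafes that share the user's cell."""
    candidates = grid.get(cell_of(ux, uy, cell), [])
    return sorted(candidates, key=lambda c: math.hypot(c[1] - ux, c[2] - uy))

if __name__ == "__main__":
    # Hypothetical sample data.
    cafes = [("A", 0.2, 0.3), ("B", 0.9, 0.8), ("C", 1.5, 0.5), ("D", 3.2, 3.1)]
    grid = build_grid(cafes)
    print(nearest_in_cell(grid, 0.8, 0.7))
```

The sketch looks only at the user's own cell, exactly as the paragraph describes, so a cafe just across a cell boundary is missed; a practical version would also check neighboring cells.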


But a new problem arises. If all grid cells are the same size, most of the results may fall into a single cell in the city center, while cells in the suburbs hold very few. In that case, the city center should be divided into more, smaller cells. Going further, the grid should form a “tree structure”: the top level is one large cell covering the entire city, and the cells get smaller level by level. When a cell holds too few search results, the user can move up level by level to widen the search range.
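One common way to realize such a tree of shrinking cells is a quadtree, sketched below. The article does not name a specific structure, so treat the quadtree, the `capacity` threshold, and the `MAX_DEPTH` guard as assumptions made for illustration.

```python
import math

MAX_DEPTH = 8  # assumed guard so identical points cannot split forever

class QuadNode:
    """A square cell that splits into four smaller cells when it gets crowded."""

    def __init__(self, x0, y0, x1, y1, capacity=2, depth=0):
        self.box = (x0, y0, x1, y1)
        self.capacity = capacity
        self.depth = depth
        self.points = []      # cafes stored here while this node is a leaf
        self.children = None  # four sub-cells once the node has split

    def child_for(self, x, y):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(1 if x >= mx else 0) + (2 if y >= my else 0)]

    def insert(self, name, x, y):
        if self.children is not None:
            self.child_for(x, y).insert(name, x, y)
            return
        self.points.append((name, x, y))
        if len(self.points) > self.capacity and self.depth < MAX_DEPTH:
            self._split()

    def _split(self):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        d, cap = self.depth + 1, self.capacity
        self.children = [
            QuadNode(x0, y0, mx, my, cap, d), QuadNode(mx, y0, x1, my, cap, d),
            QuadNode(x0, my, mx, y1, cap, d), QuadNode(mx, my, x1, y1, cap, d),
        ]
        pending, self.points = self.points, []
        for name, x, y in pending:
            self.child_for(x, y).insert(name, x, y)

    def all_points(self):
        if self.children is None:
            return list(self.points)
        return [p for child in self.children for p in child.all_points()]

def search(root, ux, uy, want=2):
    """Start in the smallest cell around the user; move up while results are scarce."""
    path, node = [root], root
    while node.children is not None:
        node = node.child_for(ux, uy)
        path.append(node)
    for node in reversed(path):
        found = node.all_points()
        if len(found) >= want or node is root:
            found.sort(key=lambda p: math.hypot(p[1] - ux, p[2] - uy))
            return found[:want]

if __name__ == "__main__":
    city = QuadNode(0, 0, 8, 8)
    # Hypothetical data: four cafes downtown, one in the suburbs.
    for cafe in [("A", 1, 1), ("B", 1.5, 1.2), ("C", 0.5, 1.8), ("D", 1.2, 0.4), ("E", 7, 7)]:
        city.insert(*cafe)
    print(search(city, 1.4, 1.3))  # answered from a small downtown cell
    print(search(city, 7.5, 6.5))  # suburban cell is too sparse, so the search widens
```

Dense areas end up with deep, small cells while sparse areas keep large ones, which is the uneven subdivision the paragraph argues for.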

The algorithm above works well for the cafe example, but is it universal? The answer is no. In the abstract, a cafe is a “point”. What if you want to search for an “area”? For example, a user wants to go to a reservoir that has several entrances; which one is closest to the user? In this case, the “tree structure” above must be changed to an “R-Tree”, because each internal node of the tree is a range, a bounding region (Reference:

This small example shows that application requirements vary endlessly. In many cases, a complex problem needs to be decomposed into several simpler ones, and then the appropriate algorithms and data structures must be chosen.
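To make the bounding-range idea behind the R-Tree concrete, here is a simplified, single-level sketch. It is not a full R-Tree: each area (such as the reservoir with several entrances) is just wrapped in one bounding box, and the box is used to skip areas that cannot contain a closer entrance. All names and coordinates are hypothetical.

```python
import math

def bbox(points):
    """Smallest axis-aligned rectangle that contains all the given points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def min_dist_to_box(ux, uy, box):
    """Lower bound on the distance from the user to anything inside the box."""
    x0, y0, x1, y1 = box
    dx = max(x0 - ux, 0.0, ux - x1)
    dy = max(y0 - uy, 0.0, uy - y1)
    return math.hypot(dx, dy)

def nearest_entrance(ux, uy, areas):
    """areas maps an area name to its list of (x, y) entrances.

    Returns (distance, area name, entrance) for the closest entrance.
    """
    # Visit areas in order of how close their bounding box could possibly be.
    ordered = sorted(
        (min_dist_to_box(ux, uy, bbox(pts)), name) for name, pts in areas.items()
    )
    best = (math.inf, None, None)
    for lower_bound, name in ordered:
        if lower_bound >= best[0]:
            break  # no remaining box can hold a closer entrance
        for ex, ey in areas[name]:
            d = math.hypot(ex - ux, ey - uy)
            if d < best[0]:
                best = (d, name, (ex, ey))
    return best

if __name__ == "__main__":
    areas = {
        "reservoir": [(2, 2), (6, 2), (4, 5)],
        "lake": [(10, 10), (12, 11)],
    }
    print(nearest_entrance(5, 0, areas))
```

A real R-Tree nests such boxes inside larger boxes, so whole groups of areas can be skipped with a single comparison.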

Parallel algorithms: Google’s core advantage

The example above is a small case at Google! Every day, Google’s website handles more than one billion searches, Gmail stores 2 GB mailboxes for tens of millions of users, and Google Earth lets hundreds of thousands of users travel around the globe at the same time, delivering the right images over the Internet to each one. Without good algorithms, none of these applications could become a reality.

In these applications, even the most basic problems pose great challenges to traditional computing. For example, more than one billion users visit Google’s website every day, and their use of Google’s services generates a large volume of logs. Because logs grow every second, we need clever ways to process them. In interviews I have asked candidates how they would analyze and process logs. Many of their answers were logically correct but barely feasible in practice. With their algorithms, even tens of thousands of machines could not keep up with the rate at which data is generated.

So how does Google solve these problems?

First of all, in the Internet era even the best algorithm must run in a parallel computing environment. In Google’s data centers, we use very large parallel computers. But with traditional parallel algorithms, efficiency drops quickly as machines are added. That is, if ten machines deliver a fivefold speedup, a thousand machines may deliver only a speedup of a few dozen times. No company can afford to pay so much for so little. Moreover, in many parallel algorithms, if a single node fails, all the computation done so far is wasted.

So how did Google develop parallel computing that is both efficient and fault-tolerant?

Jeff Dean, Google’s most senior computer scientist, recognized that most of the data processing Google needs can be reduced to one simple parallel algorithm: Map and Reduce. This algorithm achieves quite high efficiency in many kinds of computation, and it is scalable (that is, even if a thousand machines cannot deliver a 1,000-fold speedup, they can deliver at least a few hundred-fold). Another major feature of Map and Reduce is that it can combine a large number of cheap machines into a powerful server farm. Finally, its fault tolerance is excellent: even if half of a server farm goes down, the whole farm can keep running. It was this stroke of genius that gave rise to the Map and Reduce algorithm. With it, Google can scale its computing capacity almost without limit and keep pace with rapidly changing Internet applications.
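The shape of a Map and Reduce computation can be sketched on a single machine. The example below is not Google’s implementation: it counts log lines by their first field, uses a thread pool from the Python standard library as a stand-in for many machines, and all function names and log lines are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(chunk):
    """Map: turn each non-empty log line of one chunk into a (key, 1) pair."""
    return [(line.split()[0], 1) for line in chunk if line.strip()]

def shuffle(mapped_chunks):
    """Group the values emitted by all map tasks by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def count_by_first_field(chunks, workers=4):
    # Each chunk is mapped independently, so the map tasks can run in parallel
    # and a failed chunk could be rerun without redoing the others.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())

if __name__ == "__main__":
    # Hypothetical log lines split into two chunks.
    chunks = [
        ["/search q=a", "/mail inbox"],
        ["/search q=b", "/search q=c", ""],
    ]
    print(count_by_first_field(chunks))
```

Because map tasks share no state and reduce tasks only see the values for their own key, the work can be spread over many cheap machines, which is the property the paragraph above attributes to the approach.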

Algorithms are not limited to computers and networks

Take an example from outside computing: in high-energy physics research, many experiments produce several terabytes of data per second. But for lack of processing and storage capacity, scientists have to discard most of the unprocessed data. Yet information about new elements may well be hidden in the data we did not have time to process. Likewise, algorithms can change human life in any other field. For example, research on human genes may lead to new medical treatments because of algorithms. In national security, effective algorithms may prevent the next 9/11. In meteorology, algorithms can better predict future natural disasters and save lives.

So, if you view the development of computing against the backdrop of rapidly growing applications and data, you will find that the importance of algorithms is not decreasing but steadily increasing.

