Big Data Skills: A Reality Check

Since Argsen works in the AI field, we are sometimes asked to help set up "Big Data" infrastructure on premises. In one case, an Australian university client asked us to set up a cloud server to run pracs (practical sessions) for their latest online Big Data course, designed for future data scientists. I suspect nobody will be surprised that the term "Hadoop" appeared in the requirements. Due to cost, it was done on a single server rather than a cluster. In addition to the Hadoop framework and HDFS, the course required MongoDB, Hive, Storm, Spark, Python, and Java to be installed.

I personally doubt whether students can learn all of these technologies within a few months, and I question whether they really understand the underlying differences. For example, in the HDFS tutorial, the list-directory command looks almost identical to the standard Unix "ls" command. Being able to copy and paste commands from a tutorial into a terminal is a long way from understanding HDFS as a cluster technology; on the surface, it is not so different from using any Unix file system.
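To make the comparison concrete, here is a minimal sketch (the directory path is made up for illustration; the commands themselves are standard):

```
# Standard Unix: list a directory
ls /user/alice/data

# HDFS shell: nearly identical on the surface
hdfs dfs -ls /user/alice/data

# Copying a local file into HDFS reads much like cp
hdfs dfs -put report.csv /user/alice/data/
```

Nothing in these commands reveals that the files behind them are split into blocks and replicated across DataNodes.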

From an IT education point of view, the process of setting up HDFS itself could be far more instructive. At the same time, though, that process has little value to a data scientist: the tools and frameworks built on top of Hadoop deliberately hide its complexity from data scientists.
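For contrast, here is a rough sketch of what bringing up even a single-node HDFS involves, assuming the standard Hadoop distribution with core-site.xml and hdfs-site.xml already configured:

```
# Format the NameNode's metadata store (run once; wipes existing metadata)
hdfs namenode -format

# Start the NameNode and DataNode daemons
start-dfs.sh

# Confirm the daemons are up and report filesystem health
hdfs dfsadmin -report
```

Each step forces you to think about NameNodes, DataNodes, and replication, none of which surface when a framework simply reads from an hdfs:// path.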

At the end of the day, knowing the name of a concept is not the same as knowing the concept. I personally believe that universities should pay more attention to the real skill requirements of data scientists (such as those listed in https://www.cio.com/article/3263790/the-essential-skills-and-traits-of-an-expert-data-scientist.html) instead of focusing on technology buzzwords.