Blog: Shell Basics every Data Scientist should know
An interactive post on how to deal with big files on the Command line
Shell commands are powerful. Life without the shell would be hell, as I like to say (and that is probably the reason I dislike Windows).
Consider a case when you have a 6 GB pipe-delimited file sitting on your laptop, and you want to find out the count of distinct values in one particular column. You can probably do this in more than one way. You could put that file in a database and run SQL Commands, or you could write a Python/Perl script.
But whatever you do, it probably won’t be simpler or less time-consuming than this:
cat data.txt | cut -d "|" -f 1 | sort -u | wc -l
And it will usually run faster than a quick Perl/Python script would.
Now, this command says:
- Use the cat command to print/stream the contents of the file to stdout.
- Pipe the streaming contents from our cat command to the next command cut.
- The cut command takes the delimiter “|” via the -d argument (notice the quotes around the pipe, since we don’t want the shell to treat it as a pipe operator), selects the 1st column via the -f argument, and streams the output to stdout.
- Pipe the streaming content to the sort command, which sorts the input. The -u argument tells it to stream only the unique values to stdout.
- Pipe the output to the wc -l command which counts the number of lines in the input.
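The steps above can be tried end to end on a toy file. This is only a minimal sketch: the five rows written to data.txt are made up, and the scratch directory is there just so the example does not touch any real files.

```shell
# Work in a scratch directory so nothing real is overwritten
cd "$(mktemp -d)"

# Made-up pipe-delimited rows standing in for the real data.txt
printf 'a|1\nb|2\na|3\nc|4\nb|5\n' > data.txt

# Stream the file, keep column 1, sort and de-duplicate, then count the lines
distinct_count=$(cat data.txt | cut -d "|" -f 1 | sort -u | wc -l)
echo "Distinct values in column 1: $distinct_count"
```

Column 1 holds a, b, a, c, b, so the pipeline should report three distinct values.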
A lot is going on here, and I will try my best to make sure you understand most of it by the end of this blog post. I will also try to explain some concepts more advanced than the above command.
Now, I use shell commands extensively at my job. I will explain the usage of each command based on use cases that I encounter nearly daily at my day job as a data scientist. I am using the Salaries.csv data from the Lahman Baseball Database to illustrate different shell functions. You might want to download the data to follow along with the post yourself.
Some Basic Commands in Shell:
1. cat:
There are a lot of times when you need to know a little bit about the data. You may want to see a couple of lines to inspect a file. One way of doing this is to open the txt/csv file in Notepad, and that is probably the best way for small files. But big files are a problem. You could also do it in the shell using the cat command, for example cat Salaries.csv, which prints the whole file to the terminal.
Sometimes a file will be so big that you won’t be able to open it in Sublime Text or any other editor, and that is where the cat command shines.
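Here is a small sketch of cat in action. The rows below are made up; they only mimic the column layout of Salaries.csv (yearID, teamID, lgID, playerID, salary), so swap in the real file if you have downloaded it.

```shell
# Work in a scratch directory so nothing real is overwritten
cd "$(mktemp -d)"

# Made-up rows in the Salaries.csv column layout
cat > Salaries.csv <<'EOF'
yearID,teamID,lgID,playerID,salary
1999,BAL,AL,player01,300000
2000,BAL,AL,player01,500000
2000,BAL,AL,player02,1200000
2000,NYA,AL,player03,9000000
2001,BOS,AL,player04,750000
EOF

# Print the whole file to the terminal
cat Salaries.csv
```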
2. Head and Tail:
Now you might ask why you would print the whole file in the terminal at all. Generally, I wouldn’t; I just wanted to tell you about the cat command. When you want only the top/bottom n lines of your data, you will generally use the head/tail commands, for example head -n 3 Salaries.csv for the first three lines or tail -n 3 Salaries.csv for the last three.
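A minimal sketch of head and tail, again on made-up rows that only imitate the Salaries.csv layout. The line counts 3 and 2 are arbitrary choices for illustration.

```shell
# Work in a scratch directory so nothing real is overwritten
cd "$(mktemp -d)"

# Made-up rows in the Salaries.csv column layout
cat > Salaries.csv <<'EOF'
yearID,teamID,lgID,playerID,salary
1999,BAL,AL,player01,300000
2000,BAL,AL,player01,500000
2000,BAL,AL,player02,1200000
2000,NYA,AL,player03,9000000
2001,BOS,AL,player04,750000
EOF

# First 3 lines (the header plus two rows)
first_rows=$(head -n 3 Salaries.csv)
echo "$first_rows"

# Last 2 lines
last_rows=$(tail -n 2 Salaries.csv)
echo "$last_rows"
```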
Notice the structure of the shell command here.
CommandName [-arg1name] [arg1value] [-arg2name] [arg2value] filename
Now we could have also written the same head command as cat Salaries.csv | head -n 3.
This brings me to one of the essential concepts of Shell usage — piping. You won’t be able to utilize the full power the shell provides without using this concept. And the idea is simple.
Just read the “|” in the command as “pass the data on to”
So I would understand the above command as:
cat(print) the whole data to stream, pass the data on to head so that it can just give me the first few lines only.
So did you understand what piping did? It gives us a way to use our basic commands consecutively. A lot of commands are relatively basic on their own, and piping lets us use them in sequence to do some fairly non-trivial things.
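To make the idea concrete, here is a small sketch showing that giving head the file directly and piping the file into head produce the same lines. The data is made up and the variable names are only for illustration.

```shell
# Work in a scratch directory so nothing real is overwritten
cd "$(mktemp -d)"

# Made-up rows in the Salaries.csv column layout
cat > Salaries.csv <<'EOF'
yearID,teamID,lgID,playerID,salary
1999,BAL,AL,player01,300000
2000,BAL,AL,player01,500000
2000,BAL,AL,player02,1200000
EOF

# head reads the file itself
direct=$(head -n 3 Salaries.csv)

# cat streams the file, and the pipe passes that stream on to head
piped=$(cat Salaries.csv | head -n 3)

[ "$direct" = "$piped" ] && echo "Both forms print the same three lines"
```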
Now let me tell you about a couple more commands before I show you how we can chain them to do reasonably advanced tasks.
3. wc:
wc is a fairly useful shell utility/command that lets us count the number of lines (-l), words (-w) or bytes (-c) in a given file. Use -m instead of -c if you need characters rather than bytes.
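A quick sketch of the wc options on made-up data. Reading the file through “<” is just one way to get the bare number without the file name attached.

```shell
# Work in a scratch directory so nothing real is overwritten
cd "$(mktemp -d)"

# Made-up rows in the Salaries.csv column layout
cat > Salaries.csv <<'EOF'
yearID,teamID,lgID,playerID,salary
1999,BAL,AL,player01,300000
2000,BAL,AL,player01,500000
2000,BAL,AL,player02,1200000
2000,NYA,AL,player03,9000000
2001,BOS,AL,player04,750000
EOF

wc -l Salaries.csv   # number of lines
wc -w Salaries.csv   # number of whitespace-separated words
wc -c Salaries.csv   # number of bytes

# Reading from stdin prints only the number, without the file name
line_count=$(wc -l < Salaries.csv)
echo "Lines: $line_count"
```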
4. grep:
You may want to print all the lines in your file that contain a particular word. Say you would like to see the salaries for the team BAL in 2000. grep is your friend: grep "2000,BAL" Salaries.csv prints all the lines in the file that contain “2000,BAL”.
You could also use regular expressions with grep.
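Here is a small sketch of both uses on made-up rows. The -E flag (extended regular expressions) and the -c flag (count matches) are standard grep options that I am adding for illustration.

```shell
# Work in a scratch directory so nothing real is overwritten
cd "$(mktemp -d)"

# Made-up rows in the Salaries.csv column layout
cat > Salaries.csv <<'EOF'
yearID,teamID,lgID,playerID,salary
1999,BAL,AL,player01,300000
2000,BAL,AL,player01,500000
2000,BAL,AL,player02,1200000
2000,NYA,AL,player03,9000000
2001,BOS,AL,player04,750000
EOF

# Fixed text: every line containing 2000,BAL
grep "2000,BAL" Salaries.csv

# Extended regular expression: BAL rows from 1999 or 2000
grep -E "^(1999|2000),BAL," Salaries.csv

# -c counts the matching lines instead of printing them
bal_2000=$(grep -c "2000,BAL" Salaries.csv)
echo "BAL rows in 2000: $bal_2000"
```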
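5. sort:
You may want to sort your dataset on a particular column. sort is your friend. Say you want to find the top 10 salaries paid to any player in your dataset. With salary as the fifth column of Salaries.csv, you could run sort -t "," -k 5 -r -n Salaries.csv | head -10.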
So there are indeed a lot of options in this command. Let’s go through them one by one.
- -t: Which delimiter to use?
- -k: Which column to sort on?
- -n: Sort numerically. Don’t use this option if you want lexicographical sorting.
- -r: Sort in descending order. sort is ascending by default.
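The four options above can be sketched together like this. The rows are made up, and the example assumes that salary is the fifth comma-separated column, as it is in this sample.

```shell
# Work in a scratch directory so nothing real is overwritten
cd "$(mktemp -d)"

# Made-up rows in the Salaries.csv column layout
cat > Salaries.csv <<'EOF'
yearID,teamID,lgID,playerID,salary
1999,BAL,AL,player01,300000
2000,BAL,AL,player01,500000
2000,BAL,AL,player02,1200000
2000,NYA,AL,player03,9000000
2001,BOS,AL,player04,750000
EOF

# Split on commas, sort on column 5 numerically, largest first, keep the top 10
sort -t "," -k 5 -n -r Salaries.csv | head -n 10

# The single highest-paid row
top_row=$(sort -t "," -k 5 -n -r Salaries.csv | head -n 1)
echo "Top row: $top_row"
```

The header row has no number in the salary column, so a numeric sort treats it as 0 and it ends up last in descending order.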
6. cut:
The cut command lets you select specific columns from your data. Sometimes you may want to look at just some of the columns. For instance, you may want only the year, team and salary and not the other columns: cut -d "," -f 1,2,5 Salaries.csv does that.
The options are:
- -d: Which delimiter to use?
- -f: Which column/columns to cut?
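A minimal sketch of cut on made-up rows, assuming the year, team and salary sit in columns 1, 2 and 5 as they do in this sample.

```shell
# Work in a scratch directory so nothing real is overwritten
cd "$(mktemp -d)"

# Made-up rows in the Salaries.csv column layout
cat > Salaries.csv <<'EOF'
yearID,teamID,lgID,playerID,salary
1999,BAL,AL,player01,300000
2000,BAL,AL,player02,1200000
2000,NYA,AL,player03,9000000
EOF

# Keep only columns 1, 2 and 5 of the comma-delimited file
cut -d "," -f 1,2,5 Salaries.csv

# The header line after cutting
header=$(cut -d "," -f 1,2,5 Salaries.csv | head -n 1)
echo "Remaining columns: $header"
```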
7. uniq:
uniq is a little bit tricky, as you will want to use it in sequence with sort. It removes only adjacent duplicates, so in conjunction with sort it can be used to get the distinct values in the data. For example, to find ten distinct teams in the data, I would cut out the team column, pipe it through sort and uniq, and finish with head -n 10.
uniq can also be used with the -c argument to count the occurrences of each distinct value, something akin to a count with GROUP BY in SQL.
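Both uses of uniq can be sketched as follows. The data is made up, and the tail -n +2 step that skips the header line is my addition so that the column name is not counted as a team.

```shell
# Work in a scratch directory so nothing real is overwritten
cd "$(mktemp -d)"

# Made-up rows in the Salaries.csv column layout
cat > Salaries.csv <<'EOF'
yearID,teamID,lgID,playerID,salary
1999,BAL,AL,player01,300000
2000,BAL,AL,player01,500000
2000,BAL,AL,player02,1200000
2000,NYA,AL,player03,9000000
2001,BOS,AL,player04,750000
EOF

# Skip the header, keep column 2, sort so duplicates are adjacent, de-duplicate, take up to 10
teams=$(tail -n +2 Salaries.csv | cut -d "," -f 2 | sort | uniq | head -n 10)
echo "$teams"

# Same pipeline with -c: each distinct team is prefixed by its number of rows
team_counts=$(tail -n +2 Salaries.csv | cut -d "," -f 2 | sort | uniq -c)
echo "$team_counts"
```

Without the sort step, uniq would only collapse teams that happen to sit on neighbouring lines.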
Some Other Utility Commands for Other Operations
Here are some other command-line tools that you can use without going into the specifics, as the specifics are pretty hard.
1. Change delimiter in a file:
Find and replace magic: you may want to replace certain characters in the file with something else. The tr command does this. For example, cat data.txt | tr ',' '|' turns every comma into a pipe.
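A tiny sketch of the delimiter swap, using a made-up two-line data.txt.

```shell
# Work in a scratch directory so nothing real is overwritten
cd "$(mktemp -d)"

# Made-up comma-delimited rows
printf 'a,1\nb,2\n' > data.txt

# tr translates single characters: every comma becomes a pipe
converted=$(cat data.txt | tr ',' '|')
echo "$converted"
```

Keep in mind that tr works character by character, so it also replaces commas that sit inside quoted fields.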
2. Sum of a column in a file:
Using the awk command, you could find the sum of a column in a file. Divide it by the number of lines, and you can get the mean.
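One way this could look, as a sketch on made-up rows. It assumes the salary is in column 5 and that the first line is a header, which is why the mean divides by NR - 1 rather than by the full line count.

```shell
# Work in a scratch directory so nothing real is overwritten
cd "$(mktemp -d)"

# Made-up rows in the Salaries.csv column layout
cat > Salaries.csv <<'EOF'
yearID,teamID,lgID,playerID,salary
1999,BAL,AL,player01,300000
2000,BAL,AL,player01,500000
2000,BAL,AL,player02,1200000
2000,NYA,AL,player03,9000000
2001,BOS,AL,player04,750000
EOF

# -F sets the field separator; NR > 1 skips the header line.
# The END block runs after the last line, when NR holds the total line count.
stats=$(awk -F "," 'NR > 1 { sum += $5 } END { printf "sum=%d mean=%.2f\n", sum, sum / (NR - 1) }' Salaries.csv)
echo "$stats"
```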
awk is a powerful command that is a whole language in itself. Do see the wiki page for awk for a lot of good use cases. I also wrote a post on awk as the second part of this series.
3. Find the files in a directory that satisfy a specific condition:
You can do this by using the find command. Let’s say you want to find all the .txt files in the current working directory that start with A: find . -name "A*.txt" does that.
To find all .txt files starting with A or B, we could use a regex with the -regex option.
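Both searches can be sketched on a few made-up file names. Note that -regex is supported by GNU and BSD find but is not part of POSIX, and that it matches against the whole path (such as ./A1.txt), not just the file name. The pattern below is my own illustration.

```shell
# Work in a scratch directory that contains only our made-up files
cd "$(mktemp -d)"
touch A1.txt B2.txt C3.txt A4.csv

# Glob pattern on the file name: .txt files starting with A
starts_with_a=$(find . -name "A*.txt")
echo "$starts_with_a"

# Regex on the whole path: .txt files whose name starts with A or B
starts_with_a_or_b=$(find . -regex ".*/[AB][^/]*\.txt" | sort)
echo "$starts_with_a_or_b"
```

For this particular case the glob form find . -name "[AB]*.txt" would work as well, since shell-style patterns also support character classes.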
Other Cool Tricks:
Sometimes you want the output of some command-line utility (shell commands/Python scripts) not to be shown on stdout but stored in a text file. You can use the “>” operator for that. For example, you could have stored the output of the delimiter replacement from the previous example in another file called newdata.txt as follows:
cat data.txt | tr ',' '|' > newdata.txt
I confused the “|” (piping) and “>” (to file) operations a lot in the beginning. One way to remember is that you should only use “>” when you want to write something to a file; “|” cannot be used to write to a file. Another operation you should know about is “>>”. It is analogous to “>”, but it appends to an existing file rather than overwriting it.
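The difference between the two operators can be seen in a short sketch. The rows in data.txt and the appended line are made up.

```shell
# Work in a scratch directory so nothing real is overwritten
cd "$(mktemp -d)"

# Made-up comma-delimited rows
printf 'a,1\nb,2\n' > data.txt

# > creates newdata.txt, or overwrites it if it already exists
cat data.txt | tr ',' '|' > newdata.txt

# >> appends to the end of the existing file instead of replacing it
echo 'c|3' >> newdata.txt

cat newdata.txt
```

After both commands newdata.txt holds three lines: the two converted rows followed by the appended one.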
PS: If you would like to know more about the command line, which I guess you would, there is The UNIX workbench course on Coursera which you can try out.
So, this is just the tip of the iceberg. Although I am not an expert in shell usage, these commands have reduced my workload to a large extent. If there are shell commands that you use regularly, or a shell command that you find cool, do tell me in the comments. I would love to include it in the blog post.
I wrote a blog post on awk as the second part of this series.