<h1 id="tocheading">Guide to Common Unix Commands</h1>
Jen Wisecaver - January 20, 2019


## Alphabetically ordered command list
You will be learning each of these UNIX commands in class today.
* [cat](#cat)
* [cd](#cd)
* [chmod](#chmod)
* [cp](#cp) 
* [cut](#cut) 
* [echo](#echo)
* [grep](#grep) 
* [gzip/gunzip](#gzip)
* [head](#head)
* [ls](#ls)
* [man](#man)
* [mkdir](#mkdir) 
* [mv](#mv)
* [pwd](#pwd)
* [rm](#rm)
* [sort](#sort) 
* [tail](#tail)
* [tar](#tar)
* [touch](#touch)
* [wc](#wc) 
* [uniq](#uniq) 
* [wget](#wget)

## Other unix syntax
* [pipe `|`](#pipe)
* [redirecting output `>`](#redirect)
* [wildcard `*`](#wildcard)

# Learn the most common UNIX commands
Work through this Jupyter notebook to learn the common UNIX commands that we will be using in this class. Select a code block and press **Shift + Enter** to execute the code.

### Create a checkpoint
You can edit any code block to try running commands with different options or arguments. However, it's a good idea to create a checkpoint first. This will let you reset the Notebook and undo any changes if you start getting errors.

On the top menu bar, click **File > Save and Checkpoint**. 

### Revert to a saved checkpoint
On the top menu bar, click **File > Revert to Checkpoint**. Select the checkpoint and click **Revert**. 

##### Now on to the tutorial!

# echo <a class="anchor" id="echo"></a>
Display a line of text

You've already learned this command! Execute the cell block as is, then change it to have it echo something else. 

In [3]:
echo Hello World!

Hello World!


# pwd <a class="anchor" id="pwd"></a>
Prints the name and path of the current working directory

##### Print the current working directory

In [6]:
pwd

/home/jwisecav/BCHM495/classes/class03


### You can capture the output of a command and store it in a variable using \`backticks\`
##### For example, save the output of pwd and store it in the variable WORKING_DIRECTORY

In [8]:
WORKING_DIRECTORY=`pwd`

##### echo the variable WORKING_DIRECTORY to confirm that the output of pwd was stored

In [9]:
echo $WORKING_DIRECTORY

/home/jwisecav/BCHM495/classes/class03
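
Backticks are the older way to write what bash calls command substitution. The sketch below shows the equivalent `$(...)` form, which many people find easier to read. The variable names `WITH_BACKTICKS` and `WITH_PARENS` are made up for this example.

```shell
# Capture the output of pwd with backticks (the form used above)
WITH_BACKTICKS=`pwd`

# Capture the same output with $(...), the equivalent newer form
WITH_PARENS=$(pwd)

# Both variables hold the same path
echo "backticks:   $WITH_BACKTICKS"
echo "parentheses: $WITH_PARENS"
```

Either form works in the cells of this notebook; pick one and use it consistently.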


# cd <a class="anchor" id="cd"></a>
Change the current directory. In other words, move from the current directory into the directory specified in the command. 

`USAGE: cd [DIRECTORY]`

**Common shortcuts**
* `cd ..` : go up one directory
* `cd /`  : go to root directory
* `cd -`  : go to the last directory you were just in
* `cd $HOME`  : go to your home directory
* `cd ~`      : go to your home directory 
* `cd `       : go to your home directory (There are lots of shortcuts for getting back to your home directory!)

Try executing the following commands

##### move up one directory

In [10]:
cd ..

##### confirm that the current directory is now different with pwd

In [15]:
echo Current working directory: `pwd`

Current working directory: /home/jwisecav/BCHM495/classes/class03


##### if no directory is specified, cd will take you back to your home directory

In [12]:
cd 
echo Current working directory: `pwd`

Current working directory: /home/jwisecav


##### cd into the directory stored in the variable `WORKING_DIRECTORY`

In [22]:
cd $WORKING_DIRECTORY
echo Current working directory: `pwd`

Current working directory: /home/jwisecav/BCHM495/classes/class03
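
The `cd -` shortcut from the list above is not used in the cells, so here is a small sketch of it. It assumes the `mktemp` command is available to create a throwaway directory; `START`, `SCRATCH`, and `BACK_IN` are hypothetical names used only for this example.

```shell
# Remember where we are and create a throwaway directory
START=$(pwd)
SCRATCH=$(mktemp -d)

cd "$SCRATCH"      # move into the scratch directory
cd "$START"        # move back to where we started
cd -               # jump to the previous directory (the scratch directory)
BACK_IN=$(pwd)     # record where cd - took us
echo "cd - moved me to: $BACK_IN"

cd "$START"        # return to the starting directory
rmdir "$SCRATCH"   # remove the (empty) scratch directory
```

Note that `cd -` also prints the directory it switches to, which is why the path appears in the output before the `echo` line.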


# ls <a class="anchor" id="ls"></a>
List the contents of a directory 

`USAGE: ls [OPTION/S] [FILE/DIRECTORY]`

**Common options**
* `-l` : long list format with more information
* `-a` : lists all contents, even hidden files
* `-h` : print sizes in human readable format (e.g., 1K 234M 2G)
* `-t` : sort by modification time, newest first
* `-S` : sort by file size, largest first
* `-r` : reverse order while sorting

Try executing the following `ls` commands

##### if no file/directory is specified, `ls` lists the contents of the current directory

In [23]:
ls

Class03_my_unix_guide.ipynb  flu_genome  flu_genome.tar  my_file.txt


##### List the contents of a specific directory

In [24]:
ls /depot/jwisecav/darwin/class_material/class03/genomes

Caenorhabditis_elegans_genomic.fna   Populus_trichocarpa_genomic.fna
Caenorhabditis_elegans_genomic.gff   Populus_trichocarpa_genomic.gff
Caenorhabditis_elegans_protein.faa   Populus_trichocarpa_protein.faa
Drosophila_melanogaster_genomic.fna  Saccharomyces_cerevisiae_genomic.fna
Drosophila_melanogaster_genomic.gff  Saccharomyces_cerevisiae_genomic.gff
Drosophila_melanogaster_protein.faa  Saccharomyces_cerevisiae_protein.faa
Influenza_A_virus_genomic.fna	     Streptomyces_coelicolor_genomic.fna
Influenza_A_virus_genomic.gff	     Streptomyces_coelicolor_genomic.gff
Influenza_A_virus_protein.faa	     Streptomyces_coelicolor_protein.faa
Pan_troglodytes_genomic.fna	     Symbiodinium_microadriaticum_genomic.fna
Pan_troglodytes_genomic.gff	     Symbiodinium_microadriaticum_genomic.gff
Pan_troglodytes_protein.faa	     Symbiodinium_microadriaticum_protein.faa


##### The `ls` option `-l` lists the contents of a directory in long format, which includes more information

In [25]:
ls -l /depot/jwisecav/darwin/class_material/class03/genomes

total 13255424
-rw-rw---- 1 jwisecav jwisecav-data  101540352 Dec  9 19:25 Caenorhabditis_elegans_genomic.fna
-rw-rw---- 1 jwisecav jwisecav-data  187872185 Dec  9 19:25 Caenorhabditis_elegans_genomic.gff
-rw-rw---- 1 jwisecav jwisecav-data   15565427 Dec  9 19:25 Caenorhabditis_elegans_protein.faa
-rw-rw---- 1 jwisecav jwisecav-data  145657746 Dec  9 19:20 Drosophila_melanogaster_genomic.fna
-rw-rw---- 1 jwisecav jwisecav-data  162998066 Dec  9 18:05 Drosophila_melanogaster_genomic.gff
-rw-rw---- 1 jwisecav jwisecav-data   22752241 Dec  9 19:19 Drosophila_melanogaster_protein.faa
-rw-rw---- 1 jwisecav jwisecav-data      14506 Dec 10 07:57 Influenza_A_virus_genomic.fna
-rw-rw---- 1 jwisecav jwisecav-data       8226 Dec 10 07:57 Influenza_A_virus_genomic.gff
-rw-rw---- 1 jwisecav jwisecav-data       5812 Dec 10 07:57 Influenza_A_virus_protein.faa
-rw-rw---- 1 jwisecav jwisecav-data 3089233887 Dec  9 19:27 Pan_troglodytes_genomic.fna
-rw-rw---- 1 jwisecav jwisecav-data  706486269 Dec  9 

### Description of the `ls -l` columns (also described [here](https://linuxconfig.org/understanding-of-ls-command-with-a-long-listing-format-output-with-permission-bits))
* column 1 : permissions (learn more about permissions [below](#chmod))
* column 2 : number of hard links (the info in this column doesn't matter for our purposes very often)
* column 3 : file owner (in this case jwisecav [me] because I downloaded these files)
* column 4 : file group 
* column 5 : size
* column 6 : date and time of last modification
* column 7 : name

##### The `ls` option `-h` makes the size of the file easier to read. Notice how you can combine options below. 

In [26]:
ls -lh /depot/jwisecav/darwin/class_material/class03/genomes

total 13G
-rw-rw---- 1 jwisecav jwisecav-data  97M Dec  9 19:25 Caenorhabditis_elegans_genomic.fna
-rw-rw---- 1 jwisecav jwisecav-data 180M Dec  9 19:25 Caenorhabditis_elegans_genomic.gff
-rw-rw---- 1 jwisecav jwisecav-data  15M Dec  9 19:25 Caenorhabditis_elegans_protein.faa
-rw-rw---- 1 jwisecav jwisecav-data 139M Dec  9 19:20 Drosophila_melanogaster_genomic.fna
-rw-rw---- 1 jwisecav jwisecav-data 156M Dec  9 18:05 Drosophila_melanogaster_genomic.gff
-rw-rw---- 1 jwisecav jwisecav-data  22M Dec  9 19:19 Drosophila_melanogaster_protein.faa
-rw-rw---- 1 jwisecav jwisecav-data  15K Dec 10 07:57 Influenza_A_virus_genomic.fna
-rw-rw---- 1 jwisecav jwisecav-data 8.1K Dec 10 07:57 Influenza_A_virus_genomic.gff
-rw-rw---- 1 jwisecav jwisecav-data 5.7K Dec 10 07:57 Influenza_A_virus_protein.faa
-rw-rw---- 1 jwisecav jwisecav-data 2.9G Dec  9 19:27 Pan_troglodytes_genomic.fna
-rw-rw---- 1 jwisecav jwisecav-data 674M Dec  9 19:27 Pan_troglodytes_genomic.gff
-rw-rw---- 1 jwisecav jwisecav-data  

##### The `ls` option `-S` sorts the files based on size.

In [27]:
ls -lhS /depot/jwisecav/darwin/class_material/class03/genomes

total 13G
-rw-rw---- 1 jwisecav jwisecav-data 2.9G Dec  9 19:27 Pan_troglodytes_genomic.fna
-rw-rw---- 1 jwisecav jwisecav-data 782M Dec  9 19:16 Symbiodinium_microadriaticum_genomic.fna
-rw-rw---- 1 jwisecav jwisecav-data 674M Dec  9 19:27 Pan_troglodytes_genomic.gff
-rw-rw---- 1 jwisecav jwisecav-data 671M Dec  9 19:12 Symbiodinium_microadriaticum_genomic.gff
-rw-rw---- 1 jwisecav jwisecav-data 420M Dec  9 19:22 Populus_trichocarpa_genomic.fna
-rw-rw---- 1 jwisecav jwisecav-data 205M Dec  9 19:22 Populus_trichocarpa_genomic.gff
-rw-rw---- 1 jwisecav jwisecav-data 180M Dec  9 19:25 Caenorhabditis_elegans_genomic.gff
-rw-rw---- 1 jwisecav jwisecav-data 156M Dec  9 18:05 Drosophila_melanogaster_genomic.gff
-rw-rw---- 1 jwisecav jwisecav-data 139M Dec  9 19:20 Drosophila_melanogaster_genomic.fna
-rw-rw---- 1 jwisecav jwisecav-data  97M Dec  9 19:25 Caenorhabditis_elegans_genomic.fna
-rw-rw---- 1 jwisecav jwisecav-data  63M Dec  9 19:27 Pan_troglodytes_protein.faa
-rw-rw---- 1 jwisecav jw

### QUESTION: What file is the largest in the `genomes` directory? What file is the smallest?

##### List all contents of your home directory. The tilde '~' is one way to refer to your home directory. (Using `echo` without an argument prints a blank line, which can make the output easier to read.)

In [28]:
echo Using '~' lists the contents of my home directory : ~
echo 
ls ~

Using ~ lists the contents of my home directory : /home/jwisecav

BCHM495  Library  comp_gen_repo   ondemand.scholar  usr
Desktop  R	  ondemand.brown  perl5


### Did you notice how putting text in single 'quotes' prints it exactly as written? When you don't include the quotes, the bash shell interprets special characters and variables (such as `~` and `$HOME`) instead.
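
The effect of quoting can be seen by storing the same text three ways. This is a minimal sketch; the variable names `SINGLE`, `DOUBLE`, and `NONE` are made up for the example. Double quotes are not covered elsewhere in this guide, but they are shown here for comparison.

```shell
# Single quotes: the text is kept exactly as typed
SINGLE='$HOME'

# Double quotes: variables inside are still replaced by their values
DOUBLE="$HOME"

# No quotes: variables are replaced by their values as well
NONE=$HOME

echo "single quotes: $SINGLE"
echo "double quotes: $DOUBLE"
echo "no quotes:     $NONE"
```

Only the single-quoted version prints the literal text `$HOME`; the other two print the path to your home directory.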

##### Using the variable `$HOME` is another way to refer to your home directory

In [29]:
echo Using '$HOME' ALSO lists the contents of my home directory : $HOME
echo
ls $HOME

Using $HOME ALSO lists the contents of my home directory : /home/jwisecav

BCHM495  Library  comp_gen_repo   ondemand.scholar  usr
Desktop  R	  ondemand.brown  perl5


##### The `ls` option `-a` lists all contents of a directory, including hidden files. Hidden files start with a period '.'

In [30]:
ls -a ~

.				 .ipython
..				 .irods
.DS_Store			 .java
.ICEauthority			 .jupyter
.RData				 .kobasrc
.RepeatMaskerCache		 .lesshst
.Rhistory			 .lmod.d
.Rlibs				 .local
.Rprofile			 .mozilla
.Xauthority			 .ncbi
._.DS_Store			 .oracle_jre_usage
._AlienIndexCalculator_preview1  .parallel
.aspera				 .pki
.bash_history			 .private
.bash_profile			 .pulse
.bashrc				 .pulse-cookie
.cache				 .putty
.conda				 .python_history
.condarc			 .rstudio
.config				 .rstudio-desktop
.cpan				 .seaviewrc
.cpanm				 .sequenceserver.conf
.dbus				 .ssh
.emacs.d			 .subversion
.empty				 .t_coffee
.esd_auth			 .thumbnails
.etetoolkit			 .vim
.fltk				 .viminfo
.fontconfig			 .vnc
.gconf				 .xalt.d
.gconfd				 .xfce4-session.verbose-log
.gitconfig			 .xfce4-session.verbose-log.last
.globus				 BCHM495
.globus.cfg			 Desktop
.gm_key				 Library
.gnome2				 R
.gnome2_private			 comp_gen_repo
.gnupg				 ondemand.brown
.gs				 ondemand.scholar
.gvfs				 perl5
.interproscan-5			 usr
.ipynb_checkpoints


##### List the contents of the directory you created during the first day of class

In [31]:
ls $HOME/BCHM495/classes/class01

GCF_000146045.2_R64_protein.faa


##### Using `..` will list the contents of the directory one above your current directory (also called the parent directory)

In [33]:
echo This is my current directory: `pwd`
echo 
echo These are the contents of the parent directory: 
ls  ..

This is my current directory: /home/jwisecav/BCHM495/classes/class03

These are the contents of the parent directory:
class01  class02  class03
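
The `-t` and `-r` options from the list at the top of this section are not used in the cells above. The sketch below tries them on two throwaway files. It assumes `mktemp` is available, and it borrows `touch`, `head`, `rm`, and the pipe `|`, which are covered later in this guide. `SCRATCH`, `NEWEST_FIRST`, and `OLDEST_FIRST` are hypothetical names.

```shell
# Work in a throwaway directory
SCRATCH=$(mktemp -d)

# Create two empty files with different modification times
# (touch -t takes a timestamp in the form YYYYMMDDhhmm)
touch -t 202001010000 "$SCRATCH/older.txt"
touch -t 202101010000 "$SCRATCH/newer.txt"

# -t sorts by modification time, newest first
ls -lt "$SCRATCH"

# Adding -r reverses the order, so the oldest file comes first
ls -ltr "$SCRATCH"

# Keep only the first name from each listing for comparison
NEWEST_FIRST=$(ls -t "$SCRATCH" | head -n 1)
OLDEST_FIRST=$(ls -tr "$SCRATCH" | head -n 1)
echo "ls -t lists $NEWEST_FIRST first; ls -tr lists $OLDEST_FIRST first"

# Clean up the throwaway directory
rm -r "$SCRATCH"
```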


# man <a class="anchor" id="man"></a>
print a reference manual for a given unix command

`USAGE: man [UNIX COMMAND]`

### At this point all these options may be getting a little overwhelming! It can be hard to remember which options do what. This is where the `man` command comes in handy!

#### print the manual for `ls`

In [35]:
man ls

LS(1)                            User Commands                           LS(1)



NAME
       ls - list directory contents

SYNOPSIS
       ls [OPTION]... [FILE]...

DESCRIPTION
       List  information  about  the FILEs (the current directory by default).
       Sort entries alphabetically if none of -cftuvSUX nor --sort  is  speci-
       fied.

       Mandatory  arguments  to  long  options are mandatory for short options
       too.

       -a, --all
              do not ignore entries starting with .

       -A, --almost-all
              do not list implied . and ..

       --author
              with -l, print the author of each file

       -b, --escape
              print C-style escapes for nongraphic characters

       --block-size=SIZE
              scale sizes by SIZE before printing them; e.g., '--block-size=M'
              prints sizes in units of 1,048,576 bytes; see SIZE format below

       -B, --ignore-backups
              do not list implied entries ending with ~


##### If you ever get confused by a unix command, stop and create a new code block to print the manual page for whatever command is giving you trouble. You can even print a manual for the `man` command itself!

In [34]:
man man

MAN(1)                        Manual pager utils                        MAN(1)



NAME
       man - an interface to the on-line reference manuals

SYNOPSIS
       locale] [-m system[,...]] [-M path] [-S list]  [-e  extension]  [-i|-I]
       [--regex|--wildcard]   [--names-only]  [-a]  [-u]  [--no-subpages]  [-P
       pager] [-r prompt] [-7] [-E encoding] [--no-hyphenation] [--no-justifi-
       cation]  [-p  string]  [-t]  [-T[device]]  [-H[browser]] [-X[dpi]] [-Z]
       [[section] page ...] ...
       man -k [apropos options] regexp ...
       man -K [-w|-W] [-S list] [-i|-I] [--regex] [section] term ...
       man -f [whatis options] page ...
       locale]  [-P  pager]  [-r  prompt]  [-7] [-E encoding] [-p string] [-t]
       [-T[device]] [-H[browser]] [-X[dpi]] [-Z] file ...
       man -w|-W [-C file] [-d] [-D] page ...
       man -c [-C file] [-d] [-D] page ...
       man [-?V]

DESCRIPTION
       man is the system's manual pager. Each page argument given  to  man  is
       no

       either $LC_MESSAGES, $LANG  or  another  system  dependent  environment
       variable to your language locale, usually specified in the POSIX 1003.1
       based format:

       <language>[_<territory>[.<character-set>[,<version>]]]

       If the desired page is available in your locale, it will  be  displayed
       in lieu of the standard (usually American English) page.

       Support  for  international message catalogues is also featured in this
       package and can be activated in the same way, again if  available.   If
       you  find  that  the  manual pages and message catalogues supplied with
       this package are not available in your native language  and  you  would
       like  to supply them, please contact the maintainer who will be coordi-
       nating such activity.

       For information regarding other features and extensions available  with
       this manual pager, please read the documents supplied with the package.

DEFAULTS
       man  will sea

              pages,  they can be accessed using this option.  To search for a
              manual page from NewOS's manual page collection, use the  option
              -m NewOS.

              The  system  specified  can  be a combination of comma delimited
              operating system names.  To include a search of the native oper-
              ating  system's manual pages, include the system name man in the
              argument string.  This option will override the $SYSTEM environ-
              ment variable.

       -M path, --manpath=path
              Specify  an alternate manpath to use.  By default, man uses man-
              path derived code to determine the path to search.  This  option
              overrides the $MANPATH environment variable and causes option -m
              to be ignored.

              A path specified as a manpath must be the root of a manual  page
              hierarchy  structured  into  sections as described in the man-db
              m

              text.  The following table  shows  the  translations  performed:
              some  parts  of it may only be displayed properly when using GNU
              nroff's latin1(7) device.


              Description        Octal   latin1   ascii
              ------------------------------------------
              continuation        255      -        -
              hyphen
              bullet   (middle    267      o        o
              dot)
              acute accent        264      '        '
              multiplication      327      x        x
              sign

              If  the  latin1  column displays correctly, your terminal may be
              set up for latin1 characters and this option is  not  necessary.
              If  the  latin1 and ascii columns are identical, you are reading
              this page using this option or man  did  not  format  this  page
              using  the  latin1  device description.  If the latin1 column is
              mi

              and  is expected to be in a similar format.  As all of the other
              man specific environment variables can be expressed  as  command
              line  options,  and  are  thus  candidates for being included in
              $MANOPT it is expected that they will become obsolete.  N.B. All
              spaces  that  should be interpreted as part of an option's argu-
              ment must be escaped.

       MANWIDTH
              If $MANWIDTH is set, its value is used as the  line  length  for
              which  manual pages should be formatted.  If it is not set, man-
              ual pages will be formatted with a line  length  appropriate  to
              the  current terminal (using an ioctl(2) if available, the value
              of $COLUMNS, or falling back to  80  characters  if  neither  is
              available).   Cat pages will only be saved when the default for-
              matting can be used, that is when the terminal  line  length  is

# mkdir <a class="anchor" id="mkdir"></a>
make a new directory

`USAGE: mkdir [DIRECTORY]`


##### First, move into the directory for class today.

In [36]:
cd $HOME/BCHM495/classes/class03

##### Create a new directory called `genomes`

In [37]:
mkdir genomes

In [40]:
echo Current working directory: `pwd`
ls -l

Current working directory: /home/jwisecav/BCHM495/classes/class03
total 64
-rw-r----- 1 jwisecav student 22743 Dec  9 16:30 Class03_my_unix_guide.ipynb
drwxr-xr-x 2 jwisecav student  4096 Dec 13 07:57 genomes
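
`mkdir` can also create several nested directories at once with the `-p` option, which is not covered above. A minimal sketch, assuming `mktemp` is available; the directory names and the `SCRATCH` and `TREE` variables are made up for the example.

```shell
# Work in a throwaway directory
SCRATCH=$(mktemp -d)

# Without -p, mkdir fails if the parent directories do not exist yet.
# With -p, the missing parents are created along the way.
mkdir -p "$SCRATCH/project/data/raw"

# ls -R lists directories recursively, so the whole tree is visible
TREE=$(ls -R "$SCRATCH")
echo "$TREE"

# Clean up the throwaway directory
rm -r "$SCRATCH"
```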


# cp <a class="anchor" id="cp"></a>
copy files and directories 

`USAGE: cp [OPTION/S] [SOURCE] [DESTINATION]`

**Common option**
* `-r` : copy directories recursively (copy the directory and all its contents)
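
As a sketch of the `-r` option, the cell below copies a small directory together with the file inside it. It assumes `mktemp` is available and uses `>` and `rm`, which are described later in this guide. All file, directory, and variable names are made up for the example.

```shell
# Work in a throwaway directory
SCRATCH=$(mktemp -d)

# Create a directory holding one small file
mkdir "$SCRATCH/original"
echo "ACGT" > "$SCRATCH/original/seq.txt"

# Copy the whole directory, including the file inside it
cp -r "$SCRATCH/original" "$SCRATCH/backup"

# Both directories now hold a seq.txt file
ls "$SCRATCH/original" "$SCRATCH/backup"
COPIED=$(cat "$SCRATCH/backup/seq.txt")
echo "contents of the copy: $COPIED"

# Clean up the throwaway directory
rm -r "$SCRATCH"
```

Without `-r`, `cp` refuses to copy a directory and prints a message instead.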


##### Copy the files for Influenza A virus from the data depot into your `BCHM495/classes/class03/genomes/` directory

In [41]:
cp /depot/jwisecav/darwin/class_material/class03/genomes/Influenza* genomes

##### Confirm the copy.

In [42]:
ls genomes

Influenza_A_virus_genomic.fna  Influenza_A_virus_protein.faa
Influenza_A_virus_genomic.gff


## Did you notice how you only need one command to copy three files? That's because the asterisk `*` acts as a wildcard character. So `Influenza*` will match any file whose name starts with 'Influenza'. <a class="anchor" id="wildcard"></a>
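
A safe way to see what a wildcard pattern matches is to hand it to `echo` before using it with `cp` or `rm`. The sketch below assumes `mktemp` is available; the file names and the `START`, `SCRATCH`, `STARTS_WITH`, and `ENDS_WITH` variables are made up for the example. Matching is case-sensitive, so the file that starts with a lowercase `i` is not matched by `Influenza*`.

```shell
# Work in a throwaway directory containing four empty files
START=$(pwd)
SCRATCH=$(mktemp -d)
cd "$SCRATCH"
touch Influenza_A.fna Influenza_A.gff influenza_lower.fna Pan.fna

# The shell replaces each pattern with the matching file names
# before the command runs, so echo shows what the pattern matches
echo Influenza*    # names that start with Influenza (capital I)
echo *.fna         # names that end with .fna

# Store the matches so they can be inspected later
STARTS_WITH=$(echo Influenza*)
ENDS_WITH=$(echo *.fna)

# Return to the starting directory and clean up
cd "$START"
rm -r "$SCRATCH"
```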

# rm <a class="anchor" id="rm"></a>
remove (delete) files and directories 

`USAGE: rm [OPTION/S] [FILE/DIRECTORY]`

**Common option**
* `-r` : remove directories recursively (delete the directory and all its contents)


##### delete `Influenza_A_virus_genomic.gff` from the `genomes` directory

In [43]:
rm genomes/Influenza_A_virus_genomic.gff

##### confirm the deletion

In [44]:
ls genomes

Influenza_A_virus_genomic.fna  Influenza_A_virus_protein.faa


##### ...and copy the file back over again.



In [45]:
cp /depot/jwisecav/darwin/class_material/class03/genomes/Influenza_A_virus_genomic.gff genomes
ls genomes

Influenza_A_virus_genomic.fna  Influenza_A_virus_protein.faa
Influenza_A_virus_genomic.gff
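
The `-r` option is not used in the cells above, so here is a minimal sketch on a throwaway directory. It assumes `mktemp` is available; `SCRATCH`, `REMAINING`, and the file names are hypothetical. Keep in mind that `rm` deletes permanently: there is no trash folder to recover files from.

```shell
# Work in a throwaway directory
SCRATCH=$(mktemp -d)

# Create a directory with two empty files inside it
mkdir "$SCRATCH/old_results"
touch "$SCRATCH/old_results/run1.txt" "$SCRATCH/old_results/run2.txt"

# rm without -r refuses to delete a directory; with -r the directory
# and everything inside it are removed
rm -r "$SCRATCH/old_results"

# The scratch directory is now empty, so ls prints nothing
REMAINING=$(ls "$SCRATCH")
echo "left over: $REMAINING"

# rmdir removes a directory only if it is empty
rmdir "$SCRATCH"
```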


# mv <a class="anchor" id="mv"></a>
move or rename files and directories 

`USAGE: mv [SOURCE] [DESTINATION]`


##### Move `Influenza_A_virus_genomic.gff` from `genomes` into the current directory. The `.` is a unix shortcut that refers to your current directory.

In [51]:
mv genomes/Influenza_A_virus_genomic.gff .

##### Confirm the move.

In [52]:
ls -l

total 128
-rw-r----- 1 jwisecav student 22743 Dec  9 16:30 Class03_my_unix_guide.ipynb
-rw-r----- 1 jwisecav student  8226 Dec 13 07:59 Influenza_A_virus_genomic.gff
drwxr-xr-x 2 jwisecav student  4096 Dec 13 08:02 genomes


##### Move `Influenza_A_virus_genomic.gff` back into the `genomes` directory

In [53]:
mv Influenza_A_virus_genomic.gff genomes
ls -l

total 64
-rw-r----- 1 jwisecav student 22743 Dec  9 16:30 Class03_my_unix_guide.ipynb
drwxr-xr-x 2 jwisecav student  4096 Dec 13 08:02 genomes


##### Rename the `genomes` directory `flu_genome`

In [54]:
mv genomes flu_genome
ls -l

total 64
-rw-r----- 1 jwisecav student 22743 Dec  9 16:30 Class03_my_unix_guide.ipynb
drwxr-xr-x 2 jwisecav student  4096 Dec 13 08:02 flu_genome
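
Whether `mv` moves or renames depends on the destination: if it is an existing directory, the source is moved into it; otherwise the source is given that new name. A small sketch of both cases, assuming `mktemp` is available; all file, directory, and variable names are made up for the example.

```shell
# Work in a throwaway directory
START=$(pwd)
SCRATCH=$(mktemp -d)
cd "$SCRATCH"
mkdir archive
touch notes.txt

# archive exists and is a directory, so notes.txt is moved INTO it
mv notes.txt archive

# renamed.txt does not exist yet, so the file is moved back here
# under that new name
mv archive/notes.txt renamed.txt

# List what is left in the scratch directory
LISTING=$(ls)
echo "$LISTING"

# Return to the starting directory and clean up
cd "$START"
rm -r "$SCRATCH"
```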


# chmod <a class="anchor" id="chmod"></a>
change permissions on a file or directory

`USAGE: chmod [ugo][+-][rwx] [FILE]` (written without spaces, e.g. `chmod g+w FILE`)

**where:**
* `u` stands for user
* `g` stands for group
* `o` stands for other
* `r` stands for read
* `w` stands for write
* `x` stands for execute

**Common option**
* `-R` : apply the change recursively on a directory and all its contents

##### First, let's take a look at the structure of the `ls -l` output.

In [87]:
cd $HOME/BCHM495/classes/class03/flu_genome
ls -l

total 128
-rw-r----- 1 jwisecav student 14506 Dec 13 07:58 Influenza_A_virus_genomic.fna
-rw-r----- 1 jwisecav student  8226 Dec 13 07:59 Influenza_A_virus_genomic.gff
-rw-r----- 1 jwisecav student  3163 Dec 13 07:58 Influenza_A_virus_protein.faa.gz


### The first 10 characters of each line of the `ls -l` output tell you how the permissions are set for that file. It's written in shorthand, and you can see a breakdown of the general structure here: 

![](https://www.comentum.com/images/permissions.jpg)

##### Refer back to the [ls](#ls) section for a description of the other `ls -l` columns

##### Change the permissions on `Influenza_A_virus_genomic.fna` so that your group has permission to write to that file

In [88]:
chmod g+w Influenza_A_virus_genomic.fna
ls -l

total 128
-rw-rw---- 1 jwisecav student 14506 Dec 13 07:58 Influenza_A_virus_genomic.fna
-rw-r----- 1 jwisecav student  8226 Dec 13 07:59 Influenza_A_virus_genomic.gff
-rw-r----- 1 jwisecav student  3163 Dec 13 07:58 Influenza_A_virus_protein.faa.gz


##### Change the permissions on `Influenza_A_virus_genomic.fna` so that all other users have permission to read and write to that file

In [89]:
chmod o+rw Influenza_A_virus_genomic.fna
ls -l

total 128
-rw-rw-rw- 1 jwisecav student 14506 Dec 13 07:58 Influenza_A_virus_genomic.fna
-rw-r----- 1 jwisecav student  8226 Dec 13 07:59 Influenza_A_virus_genomic.gff
-rw-r----- 1 jwisecav student  3163 Dec 13 07:58 Influenza_A_virus_protein.faa.gz


##### Change the permissions on `Influenza_A_virus_genomic.fna` so that your group and all other users DO NOT have permission to read or write to that file

In [90]:
chmod go-rw Influenza_A_virus_genomic.fna
ls -l

total 128
-rw------- 1 jwisecav student 14506 Dec 13 07:58 Influenza_A_virus_genomic.fna
-rw-r----- 1 jwisecav student  8226 Dec 13 07:59 Influenza_A_virus_genomic.gff
-rw-r----- 1 jwisecav student  3163 Dec 13 07:58 Influenza_A_virus_protein.faa.gz


##### Change the permissions on `Influenza_A_virus_genomic.fna` so that you (the user) can execute that file. We will be talking more about what it means to execute a file later on in the class. 

In [91]:
chmod u+x Influenza_A_virus_genomic.fna
ls -l

total 128
-rwx------ 1 jwisecav student 14506 Dec 13 07:58 Influenza_A_virus_genomic.fna
-rw-r----- 1 jwisecav student  8226 Dec 13 07:59 Influenza_A_virus_genomic.gff
-rw-r----- 1 jwisecav student  3163 Dec 13 07:58 Influenza_A_virus_protein.faa.gz


##### Change the permissions on `Influenza_A_virus_genomic.fna` so its permissions match the other two files.

In [92]:
chmod u-x Influenza_A_virus_genomic.fna
chmod g+r Influenza_A_virus_genomic.fna
ls -l

total 128
-rw-r----- 1 jwisecav student 14506 Dec 13 07:58 Influenza_A_virus_genomic.fna
-rw-r----- 1 jwisecav student  8226 Dec 13 07:59 Influenza_A_virus_genomic.gff
-rw-r----- 1 jwisecav student  3163 Dec 13 07:58 Influenza_A_virus_protein.faa.gz
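
Several permission changes can be given to a single `chmod` command by separating them with commas. The sketch below repeats the last step in one command on a throwaway file. It assumes `mktemp` is available and borrows `cut` and the pipe `|` from later in this guide; `SCRATCH`, `PERMS`, and `example.fna` are hypothetical names.

```shell
# Work on a throwaway file
SCRATCH=$(mktemp -d)
touch "$SCRATCH/example.fna"

# Start from a known state: the user can read, write, and execute;
# group and other users have no permissions
chmod u+x,go-rwx "$SCRATCH/example.fna"

# Remove execute from the user and add read for the group in one step
chmod u-x,g+r "$SCRATCH/example.fna"

# cut -c1-10 keeps only the first 10 characters (the permission string)
PERMS=$(ls -l "$SCRATCH/example.fna" | cut -c1-10)
echo "$PERMS"

# Clean up the throwaway directory
rm -r "$SCRATCH"
```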


# touch <a class="anchor" id="touch"></a>
Change file timestamps. If the FILE argument does not exist, the file is created empty

`USAGE: touch [FILE]`

##### Create an empty file called `my_file.txt`

In [93]:
cd $HOME/BCHM495/classes/class03
touch my_file.txt

##### confirm creation of the new file

In [94]:
ls -l 

total 64
-rw-r----- 1 jwisecav student 22743 Dec  9 16:30 Class03_my_unix_guide.ipynb
drwxr-xr-x 2 jwisecav student  4096 Dec 13 08:18 flu_genome
-rw-r--r-- 1 jwisecav student     0 Dec 13 08:18 my_file.txt
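
When the file already exists, `touch` only updates its timestamp and leaves the contents alone. A minimal sketch, assuming `mktemp` is available; `SCRATCH`, `CONTENTS`, and `notes.txt` are made-up names, and `>` is described later in this guide.

```shell
# Work in a throwaway directory
SCRATCH=$(mktemp -d)
echo "important data" > "$SCRATCH/notes.txt"

# Set an old modification time on purpose (format: YYYYMMDDhhmm)
touch -t 202001010000 "$SCRATCH/notes.txt"
ls -l "$SCRATCH/notes.txt"

# touch with no options updates the timestamp to the current time
touch "$SCRATCH/notes.txt"
ls -l "$SCRATCH/notes.txt"

# The contents are unchanged
CONTENTS=$(cat "$SCRATCH/notes.txt")
echo "$CONTENTS"

# Clean up the throwaway directory
rm -r "$SCRATCH"
```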


# cat <a class="anchor" id="cat"></a>
print the contents of the listed file(s)

`USAGE: cat [FILE1] [FILE2] ... [FILEN]`

##### Print the contents of the `Influenza_A_virus_protein.faa` file. This command works best for small files with only a few lines.

In [98]:
cd $HOME/BCHM495/classes/class03/flu_genome
cat Influenza_A_virus_protein.faa

>YP_006575868.1 PA-X protein [Influenza A virus (A/New York/392/2004(H3N2))]
MEDFVRQCFNPMIVELAEKAMKEYGEDLKIETNKFAAICTHLEVCFMYSDFHFINEQGESIVVELDDPNALLKHRFEIIE
GRDRTMAWTVVNSICNTTGAEKPKFLPDLYDYKENRFIEIGVTRREVHIYYLEKANKIKSENTHIHIFSFTGEEIATKAD
YTLDEESRARIKTRLFTIRQEMANRGLWDSFVSPKEAKKQLKKNLKSQELCVGLPTKVSHRNSPALRILEPMWMDSNRTA
ALRASFLKCPKK
>YP_308839.1 hemagglutinin [Influenza A virus (A/New York/392/2004(H3N2))]
MKTIIALSYILCLVFAQKLPGNDNSTATLCLGHHAVPNGTIVKTITNDQIEVTNATELVQSSSTGGICDSPHQILDGENC
TLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACKRRSN
NSFFSRLNWLTHLKFKYPALNVTMPNNEKFDKLYIWGVHHPGTDNDQISLYAQASGRITVSTKRSQQTVIPSIGSRPRIR
DVPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCNSECITPNGSIPNDKPFQNVNRITYGA
CPRYVKQNTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGTGQAADLKSTQAAINQINGKLNRL
IGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFERTKKQLRENAEDMGN
GCFKIYHKCDNACIGSIRNGTYDHDVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNI
RCNICI
>YP_308840.1 matri

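The usage line above accepts more than one file, which is where `cat` (short for concatenate) gets its name. A small sketch with two made-up files, assuming `mktemp` is available; it also uses `>` to save the combined output, which is listed under "Other unix syntax".

```shell
# Work in a throwaway directory with two one-line files
SCRATCH=$(mktemp -d)
echo ">seq1" > "$SCRATCH/first.faa"
echo ">seq2" > "$SCRATCH/second.faa"

# cat prints the files one after another, in the order given
cat "$SCRATCH/first.faa" "$SCRATCH/second.faa"

# Redirecting that output with > saves the combined text in a new file
cat "$SCRATCH/first.faa" "$SCRATCH/second.faa" > "$SCRATCH/combined.faa"
COMBINED=$(cat "$SCRATCH/combined.faa")

# Clean up the throwaway directory
rm -r "$SCRATCH"
```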
# head <a class="anchor" id="head"></a>
Print the first 10 lines of a FILE to standard output

`USAGE: head [FILE]`

**Common option**
* `-n K` : print the first K lines instead of the default 10

In [99]:
head -n 2 Influenza_A_virus_protein.faa

>YP_006575868.1 PA-X protein [Influenza A virus (A/New York/392/2004(H3N2))]
MEDFVRQCFNPMIVELAEKAMKEYGEDLKIETNKFAAICTHLEVCFMYSDFHFINEQGESIVVELDDPNALLKHRFEIIE


# tail <a class="anchor" id="tail"></a>
Print the last 10 lines of a FILE to standard output

`USAGE: tail [FILE]`

**Common option**
* `-n K` : print the last K lines instead of the default 10

In [100]:
tail -n 14 Influenza_A_virus_protein.faa

>YP_308848.1 PB1-F2 protein [Influenza A virus (A/New York/392/2004(H3N2))]
MEQEQDTPWTQSTEHTNIQRRGSGRQIQKLGHPNSTQLMDHYLRIMSQVDMHKQTVSWRLWPSLKNPTQVSLRTHALKQW
KSFNKQGWTN
>YP_308849.1 polymerase PB2 [Influenza A virus (A/New York/392/2004(H3N2))]
MERIKELRNLMSQSRTREILTKTTVDHMAIIKKYTSGRQEKNPSLRMKWMMAMKYPITADKRITEMVPERNEQGQTLWSK
MSDAGSDRVMVSPLAVTWWNRNGPVASTVHYPKVYKTYFDKVERLKHGTFGPVHFRNQVKIRRRVDINPGHADLSAKEAQ
DVIMEVVFPNEVGARILTSESQLTITKEKKEELRDCKISPLMVAYMLERELVRKTRFLPVAGGTSSIYIEVLHLTQGTCW
EQMYTPGGEVRNDDVDQSLIIAARNIVRRAAVSADPLASLLEMCHSTQIGGTRMVDILRQNPTEEQAVDICKAAMGLRIS
SSFSFGGFTFKRTSGSSVKKEEEVLTGNLQTLKIRVHEGYEEFTMVGKRATAILRKATRRLVQLIVSGRDEQSIAEAIIV
AMVFSQEDCMIKAVRGDLNFVNRANQRLNPMHQLLRHFQKDAKVLFQNWGIEHIDSVMGMVGVLPDMTPSTEMSMRGIRV
SKMGVDEYSSTERVVVSIDRFLRVRDQRGNVLLSPEEVSETQGTERLTITYSSSMMWEINGPESVLVNTYQWIIRNWEAV
KIQWSQNPAMLYNKMEFEPFQSLVPKAIRSQYSGFVRTLFQQMRDVLGTFDTTQIIKLLPFAAAPPKQSRMQFSSLTVNV
RGSGMRILVRGNSPVFNYNKTTKRLTILGKDAGTLIEDPDESTSGVESAVLRGFLIIGKEDRRYGPALSINELSNLAKGE
KANVLIGQGDVVLVMKRKRDSSILTDS
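
`head` and `tail` can be chained with a pipe `|` to pull out a line from the middle of a file. The sketch below assumes `mktemp` is available and uses `printf` to write a five-line practice file; `SCRATCH`, `THIRD`, and `lines.txt` are hypothetical names.

```shell
# Write a small five-line file (printf turns each \n into a line break)
SCRATCH=$(mktemp -d)
printf 'line1\nline2\nline3\nline4\nline5\n' > "$SCRATCH/lines.txt"

# head keeps the first 3 lines, then tail keeps the last 1 of those,
# which is line 3 of the original file
THIRD=$(head -n 3 "$SCRATCH/lines.txt" | tail -n 1)
echo "$THIRD"

# Clean up the throwaway directory
rm -r "$SCRATCH"
```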

# wc <a class="anchor" id="wc"></a>
Print the number of lines, words, and bytes (in that order) for each file. For plain ASCII text such as these sequence files, the byte count equals the character count.

`USAGE: wc [FILE]`

**Common options**
* `-l` : print just the number of lines in a file
* `-w` : print just the number of words in a file
* `-m` : print just the number of characters in a file

In [8]:
wc Influenza_A_virus_protein.faa

  79  165 5812 Influenza_A_virus_protein.faa


In [10]:
wc -l Influenza_A_virus_protein.faa

79 Influenza_A_virus_protein.faa
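
A line count is often more useful when stored in a variable. The sketch below shows one way to get just the number, and uses `grep -c` to count the sequences in a tiny made-up FASTA file. It assumes `mktemp` is available. The input redirect `<` is not covered elsewhere in this guide, and `SCRATCH`, `NUM_LINES`, `NUM_SEQS`, and `tiny.faa` are hypothetical names.

```shell
# Write a tiny FASTA file with two sequences
SCRATCH=$(mktemp -d)
printf '>seq1\nMKT\n>seq2\nMEQ\n' > "$SCRATCH/tiny.faa"

# With a file name argument, wc prints the count followed by the name
wc -l "$SCRATCH/tiny.faa"

# Feeding the file in with < leaves only the number
NUM_LINES=$(wc -l < "$SCRATCH/tiny.faa")
echo "lines: $NUM_LINES"

# grep -c counts matching lines; FASTA header lines start with >
NUM_SEQS=$(grep -c '^>' "$SCRATCH/tiny.faa")
echo "sequences: $NUM_SEQS"

# Clean up the throwaway directory
rm -r "$SCRATCH"
```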


# gzip/gunzip <a class="anchor" id="gzip"></a>
compress or expand files

`USAGE to compress : gzip [FILE]`

`USAGE to expand : gunzip [FILE]`


##### First, note the size of `Influenza_A_virus_genomic.gff` prior to compression



In [101]:
cd $HOME/BCHM495/classes/class03
ls -lh flu_genome

total 192K
-rw-r----- 1 jwisecav student  15K Dec 13 07:58 Influenza_A_virus_genomic.fna
-rw-r----- 1 jwisecav student 8.1K Dec 13 07:59 Influenza_A_virus_genomic.gff
-rw-r----- 1 jwisecav student 5.7K Dec 13 07:58 Influenza_A_virus_protein.faa


##### Compress the `Influenza_A_virus_genomic.gff` file

In [102]:
gzip flu_genome/Influenza_A_virus_genomic.gff

##### Check the size after compression

In [103]:
ls -lh flu_genome

total 128K
-rw-r----- 1 jwisecav student  15K Dec 13 07:58 Influenza_A_virus_genomic.fna
-rw-r----- 1 jwisecav student 1.5K Dec 13 07:59 Influenza_A_virus_genomic.gff.gz
-rw-r----- 1 jwisecav student 5.7K Dec 13 07:58 Influenza_A_virus_protein.faa


##### Expand the compressed file

In [104]:
gunzip flu_genome/Influenza_A_virus_genomic.gff.gz

##### Check the size after expansion

In [105]:
ls -lh flu_genome

total 192K
-rw-r----- 1 jwisecav student  15K Dec 13 07:58 Influenza_A_virus_genomic.fna
-rw-r----- 1 jwisecav student 8.1K Dec 13 07:59 Influenza_A_virus_genomic.gff
-rw-r----- 1 jwisecav student 5.7K Dec 13 07:58 Influenza_A_virus_protein.faa
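
As the listings above show, `gzip` replaces the original file with the `.gz` version. If you want to keep both, one option is `-c`, which writes to standard output so the result can be redirected into a new file. A minimal sketch, assuming `mktemp` is available; `SCRATCH`, `FILES`, `RESTORED`, and `seq.fna` are made-up names.

```shell
# Write a small practice file
SCRATCH=$(mktemp -d)
printf 'ACGTACGTACGT\n' > "$SCRATCH/seq.fna"

# -c sends the compressed data to standard output instead of replacing
# the file, so redirecting it creates seq.fna.gz and leaves seq.fna alone
gzip -c "$SCRATCH/seq.fna" > "$SCRATCH/seq.fna.gz"
FILES=$(ls "$SCRATCH")
echo "$FILES"

# gunzip -c prints the expanded contents without changing the .gz file
RESTORED=$(gunzip -c "$SCRATCH/seq.fna.gz")
echo "$RESTORED"

# Clean up the throwaway directory
rm -r "$SCRATCH"
```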


# tar <a class="anchor" id="tar"></a>
Save many files together into a single archive file, or restore individual files from an archive

There are [many ways to run tar](https://xkcd.com/1168/), but here are two common methods

**Create an archive from a directory**

`USAGE : tar -cvf [ARCHIVE] [DIRECTORY]`

**List contents of an archive**
    
`USAGE : tar -tvf [ARCHIVE]`

**Extract files from an archive**

`USAGE : tar -xvf [ARCHIVE]`

##### Create an archive of the `flu_genome` directory and its contents

In [106]:
cd $HOME/BCHM495/classes/class03
tar -cvf flu_genome.tar flu_genome

flu_genome/
flu_genome/Influenza_A_virus_protein.faa
flu_genome/Influenza_A_virus_genomic.gff
flu_genome/Influenza_A_virus_genomic.fna


##### Confirm the existence of the new archive

In [107]:
ls -lh

total 192K
-rw-r----- 1 jwisecav student  23K Dec  9 16:30 Class03_my_unix_guide.ipynb
drwxr-xr-x 2 jwisecav student 4.0K Dec 13 08:19 flu_genome
-rw-r--r-- 1 jwisecav student  40K Dec 13 08:20 flu_genome.tar
-rw-r--r-- 1 jwisecav student    0 Dec 13 08:18 my_file.txt


##### List the contents of the new tar archive

In [108]:
tar -tvf flu_genome.tar

drwxr-xr-x jwisecav/student  0 2019-12-13 08:19 flu_genome/
-rw-r----- jwisecav/student 5812 2019-12-13 07:58 flu_genome/Influenza_A_virus_protein.faa
-rw-r----- jwisecav/student 8226 2019-12-13 07:59 flu_genome/Influenza_A_virus_genomic.gff
-rw-r----- jwisecav/student 14506 2019-12-13 07:58 flu_genome/Influenza_A_virus_genomic.fna
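
The extract usage (`tar -xvf`) is listed above but not run in a cell. Here is a minimal sketch of a full round trip on a throwaway directory, so it does not touch your `flu_genome` files. It assumes `mktemp` is available; `START`, `SCRATCH`, `RESTORED`, and `demo_genome` are hypothetical names.

```shell
# Work in a throwaway directory
START=$(pwd)
SCRATCH=$(mktemp -d)
cd "$SCRATCH"

# Build a small directory and archive it
mkdir demo_genome
echo ">seq1" > demo_genome/seq.fna
tar -cvf demo_genome.tar demo_genome

# Delete the original directory, then restore it from the archive
rm -r demo_genome
tar -xvf demo_genome.tar

# The file is back with its original contents
RESTORED=$(cat demo_genome/seq.fna)
echo "$RESTORED"

# Return to the starting directory and clean up
cd "$START"
rm -r "$SCRATCH"
```

Extracting writes the files into the current directory, so check where you are with `pwd` before running `tar -xvf`.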


# wget <a class="anchor" id="wget"></a>
Download files from the Web

`USAGE : wget [URL]`

##### Let's download data for a second Influenza genome (type C) from NCBI: https://www.ncbi.nlm.nih.gov/genome/5193
##### You can right click to copy the link to the files for 'genome', 'protein', and 'GFF' towards the top of the page. That's how I got the ftp addresses in the commands below.

In [2]:
cd $HOME/BCHM495/classes/class03/flu_genome
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/856/665/GCF_000856665.10_ViralMultiSegProj15055/GCF_000856665.10_ViralMultiSegProj15055_genomic.fna.gz

--2019-12-13 08:53:01--  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/856/665/GCF_000856665.10_ViralMultiSegProj15055/GCF_000856665.10_ViralMultiSegProj15055_genomic.fna.gz
           => 'GCF_000856665.10_ViralMultiSegProj15055_genomic.fna.gz'
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.10, 2607:f220:41e:250::12
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.10|:21... 


In [None]:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/856/665/GCF_000856665.10_ViralMultiSegProj15055/GCF_000856665.10_ViralMultiSegProj15055_protein.faa.gz

In [None]:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/856/665/GCF_000856665.10_ViralMultiSegProj15055/GCF_000856665.10_ViralMultiSegProj15055_genomic.gff.gz

##### Check to see if the files downloaded

In [None]:
ls -lh

##### Uncompress the three new files. Notice the use of the wildcard character again.

In [None]:
gunzip *gz

In [None]:
ls -lh

##### Rename the new files to match the existing format

In [None]:
mv GCF_000856665.10_ViralMultiSegProj15055_genomic.fna Influenza_C_virus_genomic.fna
mv GCF_000856665.10_ViralMultiSegProj15055_genomic.gff Influenza_C_virus_genomic.gff
mv GCF_000856665.10_ViralMultiSegProj15055_protein.faa Influenza_C_virus_protein.faa

In [50]:
ls -lh

total 384K
-rw-r----- 1 jwisecav student  15K Dec 13 07:58 Influenza_A_virus_genomic.fna
-rw-r----- 1 jwisecav student 8.1K Dec 13 07:59 Influenza_A_virus_genomic.gff
-rw-r----- 1 jwisecav student 5.7K Dec 13 07:58 Influenza_A_virus_protein.faa
-rw-r--r-- 1 jwisecav student  14K Dec 13 17:46 Influenza_C_virus_genomic.fna
-rw-r--r-- 1 jwisecav student 6.1K Dec 13 17:46 Influenza_C_virus_genomic.gff
-rw-r--r-- 1 jwisecav student 4.9K Dec 13 17:46 Influenza_C_virus_protein.faa


# grep <a class="anchor" id="grep"></a>
print lines that contain a pattern

`USAGE: grep [OPTIONS] [PATTERN] [FILE]`

**Common options**
* `-i` : ignore upper/lower case in both the pattern and the file
* `-v` : invert match to return lines that do NOT contain the pattern
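
To see what the two options do, here is a small sketch on a made-up three-line file (the file and its contents are assumptions for illustration, not part of the class data):

```shell
# Made-up three-line file for this illustration
demo=$(mktemp)
printf 'Influenza A\ninfluenza c\nRhinovirus\n' > "$demo"

# -i: match "influenza" regardless of case (prints the first two lines)
grep -i 'INFLUENZA' "$demo"

# -v: print only the lines that do NOT contain "nfluenza" (prints Rhinovirus)
grep -v 'nfluenza' "$demo"
```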


### grep is an incredibly useful command that you will use over and over again this semester!

##### For example, you can use `grep` to pull out the header lines of a fasta file, because each header starts with a `>`

##### Print all header lines in `Influenza_C_virus_genomic.fna`

In [55]:
grep '>' Influenza_C_virus_genomic.fna

>NC_006307.2 Influenza C virus (C/Ann Arbor/1/50) PB2 gene for polymerase 2, complete cds
>NC_006308.2 Influenza C virus (C/Ann Arbor/1/50) PB1 gene for polymerase 1, complete cds
>NC_006309.2 Influenza C virus (C/Ann Arbor/1/50) P3 gene for polymerase 3, complete cds
>NC_006310.2 Influenza C virus (C/Ann Arbor/1/50) HEF gene for hemagglutinin-esterase-fusion, complete cds
>NC_006311.1 Influenza C virus (C/Ann Arbor/1/50) segment 5, complete sequence
>NC_006312.2 Influenza C virus (C/Ann Arbor/1/50) M1, CM2 genes for matrix protein, CM2 protein, complete cds
>NC_006306.2 Influenza C virus (C/Ann Arbor/1/50) segment 7, complete sequence


##### Let's select the first header's accession (NC_006307.2) and search for this sequence in `Influenza_C_virus_genomic.gff`. You can review the column format for gff files [here](https://www.ensembl.org/info/website/upload/gff.html). 

In [60]:
grep NC_006307.2 Influenza_C_virus_genomic.gff

##sequence-region NC_006307.2 1 2365
NC_006307.2	RefSeq	region	1	2365	.	+	.	ID=NC_006307.2:1..2365;Dbxref=taxon:11553;gbkey=Src;genome=genomic;mol_type=genomic RNA;segment=1;strain=C/Ann Arbor/1/50
NC_006307.2	RefSeq	gene	22	2346	.	+	.	ID=gene-FLUCVs1gp1;Dbxref=GeneID:3077363;Name=PB2;gbkey=Gene;gene=PB2;gene_biotype=protein_coding;locus_tag=FLUCVs1gp1
NC_006307.2	RefSeq	CDS	22	2346	.	+	0	ID=cds-YP_089652.1;Parent=gene-FLUCVs1gp1;Dbxref=Genbank:YP_089652.1,GeneID:3077363;Name=YP_089652.1;gbkey=CDS;gene=PB2;locus_tag=FLUCVs1gp1;product=polymerase 2;protein_id=YP_089652.1


## The pipe `|` character allows you to take the output from one unix command and pass it as input into a second command. This allows you to create more complex commands. <a class="anchor" id="pipe"></a>

##### For example, you can count the number of sequences in a fasta file by piping the output from grep to the [wc](#wc) command to count the number of lines. 

In [53]:
grep '>' Influenza_A_virus_genomic.fna | wc -l

8
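
For this particular task, grep can also do the counting itself: the `-c` option prints the number of matching lines instead of the lines. A small sketch on a made-up two-sequence fasta file (an assumption for illustration) shows that both forms give the same count:

```shell
# Made-up fasta file with two sequences
demo=$(mktemp)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$demo"

# Pipe the matching lines to wc -l ...
grep '>' "$demo" | wc -l

# ... or let grep count the matching lines itself
grep -c '>' "$demo"
```

The pipe version is still the more general pattern, because `wc -l` can count the output of any command.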


# cut <a class="anchor" id="cut"></a>
remove sections from each line of files

`USAGE: cut [OPTIONS] [FILE]`

**Common options**
* `-d` : used to specify a delimiter. Default is TAB
* `-f` : used to specify the fields (columns) to print

##### Use cut to extract only some columns from a larger text file. By default the command cuts on tabs, which works perfectly for gff files. 
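
For files that are not tab-delimited, `-d` sets the delimiter. A minimal sketch on a made-up comma-separated file (the file contents are an assumption for illustration):

```shell
# Made-up comma-separated file
demo=$(mktemp)
printf 'NC_1,gene,22,2346\nNC_1,CDS,22,2346\n' > "$demo"

# -d ',' sets the delimiter to a comma; -f 1,3 keeps fields 1 and 3
cut -d ',' -f 1,3 "$demo"
```

Each output line keeps the comma between the selected fields, so the first line printed is `NC_1,22`.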

##### Print the first, third, fourth, and fifth columns of `Influenza_C_virus_genomic.gff`

In [61]:
cut -f 1,3,4,5 Influenza_C_virus_genomic.gff

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ViralMultiSegProj15055
#!genome-build-accession NCBI_Assembly:GCF_000856665.10
##sequence-region NC_006307.2 1 2365
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=11553
NC_006307.2	region	1	2365
NC_006307.2	gene	22	2346
NC_006307.2	CDS	22	2346
##sequence-region NC_006308.2 1 2363
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=11553
NC_006308.2	region	1	2363
NC_006308.2	gene	18	2282
NC_006308.2	CDS	18	2282
##sequence-region NC_006309.2 1 2183
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=11553
NC_006309.2	region	1	2183
NC_006309.2	gene	22	2151
NC_006309.2	CDS	22	2151
##sequence-region NC_006310.2 1 2073
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=11553
NC_006310.2	region	1	2073
NC_006310.2	gene	22	1989
NC_006310.2	CDS	22	1989
##sequence-region NC_006311.1 1 1807
##species https://www.ncbi.nlm.nih.gov/Taxonomy/

##### Let's use grep to exclude the comment lines that start with a `#`, this time on `Influenza_A_virus_genomic.gff`.

In [73]:
grep -v '#' Influenza_A_virus_genomic.gff | cut -f 1,3,4,5

NC_007373.1	region	1	2341
NC_007373.1	gene	28	2307
NC_007373.1	CDS	28	2307
NC_007372.1	region	1	2341
NC_007372.1	gene	25	2298
NC_007372.1	CDS	25	2298
NC_007372.1	gene	119	391
NC_007372.1	CDS	119	391
NC_007371.1	region	1	2233
NC_007371.1	gene	25	2175
NC_007371.1	CDS	25	2175
NC_007371.1	gene	25	784
NC_007371.1	CDS	25	597
NC_007371.1	CDS	599	784
NC_007366.1	region	1	1762
NC_007366.1	gene	30	1730
NC_007366.1	CDS	30	1730
NC_007369.1	region	1	1566
NC_007369.1	gene	46	1542
NC_007369.1	CDS	46	1542
NC_007368.1	region	1	1467
NC_007368.1	gene	20	1429
NC_007368.1	CDS	20	1429
NC_007367.1	region	1	1027
NC_007367.1	gene	26	1007
NC_007367.1	CDS	26	51
NC_007367.1	CDS	740	1007
NC_007367.1	gene	26	784
NC_007367.1	CDS	26	784
NC_007370.1	region	1	890
NC_007370.1	gene	27	864
NC_007370.1	CDS	27	56
NC_007370.1	CDS	529	864
NC_007370.1	gene	27	719
NC_007370.1	CDS	27	719
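
One caveat about `grep -v '#'`: it drops every line that contains a `#` anywhere, not only comment lines. That is fine as long as no feature line contains a `#`. Anchoring the pattern with `^` restricts the match to lines that start with `#`. A sketch on a made-up file (contents are an assumption for illustration):

```shell
# Made-up file: one header line, one feature line that happens to contain '#'
demo=$(mktemp)
printf '##gff-version 3\nNC_1\tgene\tnote=exon#2\nNC_1\tCDS\tnote=none\n' > "$demo"

# Drops the header AND the gene line, because both contain a '#'
grep -v '#' "$demo" | cut -f 2

# Drops only the header line, which starts with '#'
grep -v '^#' "$demo" | cut -f 2
```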


# uniq <a class="anchor" id="uniq"></a>
report or omit repeated lines

`USAGE: uniq [FILE]`

##### List the unique contig accessions in `Influenza_C_virus_genomic.gff` 

In [74]:
grep -v '#' Influenza_C_virus_genomic.gff | cut -f 1 | uniq

NC_006307.2
NC_006308.2
NC_006309.2
NC_006310.2
NC_006311.1
NC_006312.2
NC_006306.2


## You can save the output of a command by redirecting the output to a file using `>` <a class="anchor" id="redirect"></a>

In [81]:
grep -v '#' Influenza_C_virus_genomic.gff | cut -f 1 | uniq > contig_list.txt

#### Confirm the new file

In [83]:
ls -l

total 384
-rw-r----- 1 jwisecav student 14506 Dec 13 07:58 Influenza_A_virus_genomic.fna
-rw-r----- 1 jwisecav student  8226 Dec 13 07:59 Influenza_A_virus_genomic.gff
-rw-r----- 1 jwisecav student  5812 Dec 13 07:58 Influenza_A_virus_protein.faa
-rw-r--r-- 1 jwisecav student 13714 Dec 13 17:46 Influenza_C_virus_genomic.fna
-rw-r--r-- 1 jwisecav student  6178 Dec 13 17:46 Influenza_C_virus_genomic.gff
-rw-r--r-- 1 jwisecav student  4957 Dec 13 17:46 Influenza_C_virus_protein.faa
-rw-r--r-- 1 jwisecav student    84 Dec 13 21:29 contig_list.txt


In [84]:
cat contig_list.txt

NC_006307.2
NC_006308.2
NC_006309.2
NC_006310.2
NC_006311.1
NC_006312.2
NC_006306.2


## If the output file already exists, a single `>` will overwrite the file. You can use two `>>` to append to an existing file instead.  
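
A short sketch of the difference, using a temporary file created with `mktemp` (the file and its contents are made up for illustration):

```shell
# Temporary file for this illustration
demo=$(mktemp)

echo "first" > "$demo"     # creates (or overwrites) the file
echo "second" > "$demo"    # overwrites: the file now holds only "second"
echo "third" >> "$demo"    # appends: the file now holds "second" and "third"

cat "$demo"
```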

# sort <a class="anchor" id="sort"></a>
sort lines of text files

`USAGE: sort [OPTIONS] [FILE]`

##### uniq is a handy command, but it only collapses duplicate lines that are adjacent to each other. This is where sort can help: sorting first places identical lines next to one another.
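
A minimal sketch of the problem on a made-up four-line file (an assumption for illustration), where the repeated values are never next to each other:

```shell
# Made-up file with repeated values that are not adjacent
demo=$(mktemp)
printf 'gene\nCDS\ngene\nCDS\n' > "$demo"

# No two identical lines are adjacent, so uniq removes nothing (4 lines)
uniq "$demo"

# Sorting first puts identical lines next to each other (2 lines: CDS, gene)
sort "$demo" | uniq
```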


#### Get a list of unique feature types (column 3) in `Influenza_C_virus_genomic.gff`. 

In [80]:
grep -v '#' Influenza_C_virus_genomic.gff | cut -f 3 | sort | uniq

CDS
gene
intron
region
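
To see how many times each value occurs, `uniq -c` prefixes each line with its count. A sketch on a made-up list of feature types (not the real gff file):

```shell
# Made-up list of feature types
demo=$(mktemp)
printf 'region\ngene\nCDS\nregion\ngene\nCDS\nCDS\n' > "$demo"

# Count occurrences of each distinct line
sort "$demo" | uniq -c

# sort -u is a shortcut for `sort | uniq` when counts are not needed
sort -u "$demo"
```

In this made-up list, `CDS` is reported with a count of 3 and the other two feature types with a count of 2.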
