Nov 10, 2018

Linux and Unix cut command for text processing

About cut

Extract ("cut out") selected sections of each line of one or more files and write them to standard output. The input files themselves are not modified.

Syntax

cut OPTION... [FILE]...

Options

-b, --bytes=LIST
    Select only the bytes from each line as specified in LIST. LIST specifies a byte, a set of bytes, or a range of bytes; see Specifying LIST below.
-c, --characters=LIST
    Select only the characters from each line as specified in LIST. LIST specifies a character, a set of characters, or a range of characters; see Specifying LIST below.
-d, --delimiter=DELIM
    Use character DELIM instead of a tab for the field delimiter.
-f, --fields=LIST
    Select only these fields on each line; also print any line that contains no delimiter character, unless the -s option is specified. LIST specifies a field, a set of fields, or a range of fields; see Specifying LIST below.
-n
    This option is ignored, but is included for compatibility reasons.
--complement
    Complement the set of selected bytes, characters, or fields.
-s, --only-delimited
    Do not print lines not containing delimiters.
--output-delimiter=STRING
    Use STRING as the output delimiter string. The default is to use the input delimiter.
--help
    Display a help message and exit.
--version
    Output version information and exit.
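
The interaction between -f, -s, and --complement is easier to see on a small input. The following is a minimal sketch, assuming GNU cut (--complement is not available in every implementation); the sample text is made up for illustration.

```shell
# Hypothetical sample data: two colon-delimited lines and one line with no delimiter.
sample=$(printf 'a:b:c\nno delimiter here\nx:y:z\n')

# Without -s, the line that lacks a delimiter is printed unchanged.
with_plain=$(printf '%s\n' "$sample" | cut -d ':' -f 2)

# With -s, lines that lack the delimiter are suppressed.
only_delimited=$(printf '%s\n' "$sample" | cut -d ':' -f 2 -s)

# --complement selects every field except field 2 (GNU cut assumed).
complement=$(printf '%s\n' "$sample" | cut -d ':' -f 2 -s --complement)

printf '%s\n---\n%s\n---\n%s\n' "$with_plain" "$only_delimited" "$complement"
```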

Usage Notes

When invoking cut, use the -b, -c, or -f option, but only one of them.

If no FILE is specified, cut reads from the standard input.
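
As a small illustration of reading from standard input, the sketch below pipes made-up text into cut with no FILE operand, and then again with - as the operand, which also refers to standard input.

```shell
# With no FILE operand, cut reads the piped text from standard input.
first_three=$(printf 'alphabet\nnumbers\n' | cut -c 1-3)

# A FILE operand of - also means standard input.
first_three_dash=$(printf 'alphabet\nnumbers\n' | cut -c 1-3 -)

printf '%s\n' "$first_three"
```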

Specifying LIST

Each LIST is made up of an integer, a range of integers, or multiple integer ranges separated by commas. Selected input is written in the same order that it is read, and is written to output exactly once. A range consists of:
N       the Nth byte, character, or field, counted from 1.
N-      from the Nth byte, character, or field, to the end of the line.
N-M     from the Nth to the Mth byte, character, or field (inclusive).
-M      from the first to the Mth byte, character, or field.
The tab character is the default delimiter of cut, so it will by default consider a field to be anything delimited by a tab.
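
The range forms can be tried on a single made-up line. This is only a sketch using -c; the same LIST syntax applies to -b and -f. The last two commands illustrate that selected input is written in input order and only once, regardless of how the LIST is ordered or whether its ranges overlap.

```shell
# Hypothetical one-line input used for every command below.
line='abcdefghij'

nth=$(printf '%s\n' "$line" | cut -c 3)          # N:   the third character
n_to_end=$(printf '%s\n' "$line" | cut -c 3-)    # N-:  third character to end of line
n_to_m=$(printf '%s\n' "$line" | cut -c 3-5)     # N-M: third through fifth character
up_to_m=$(printf '%s\n' "$line" | cut -c -5)     # -M:  first through fifth character

# The order of the LIST does not reorder output, and overlaps are not repeated.
unordered=$(printf '%s\n' "$line" | cut -c 5,1,3)
overlapping=$(printf '%s\n' "$line" | cut -c 1-3,2-4)

printf '%s\n' "$nth" "$n_to_end" "$n_to_m" "$up_to_m" "$unordered" "$overlapping"
```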

Counting bytes instead of characters produces the same output for an ASCII-encoded text file, because each ASCII character is represented by a single byte (eight bits) of data. So the command:
cut -b 3-12 data.txt
is equivalent to:
cut -c 3-12 data.txt
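
A quick way to check this equivalence is to run both forms on the same ASCII input. The sketch below uses a made-up line in place of data.txt.

```shell
# Hypothetical ASCII-only input standing in for a line of data.txt.
text='The quick brown fox'

by_bytes=$(printf '%s\n' "$text" | cut -b 3-12)
by_chars=$(printf '%s\n' "$text" | cut -c 3-12)

# For ASCII input the two selections are identical.
printf '%s\n%s\n' "$by_bytes" "$by_chars"
```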

Specifying A Delimiter Other Than Tab

The tab character is the default delimiter that cut uses to determine what constitutes a field. So, if your file's fields are already delimited by tabs, you don't need to specify a different delimiter character.

You can specify any character as the delimiter, however. For instance, the file /etc/passwd contains information about each user on the system, one user per line, and each information field is delimited by a colon (":"). For example, the line of /etc/passwd for the root user may look like this:
root:x:0:0:root:/root:/bin/bash
These fields contain the following information, in the following order, separated by a colon character:
  1. Username
  2. Password (shown as x when the hashed password is stored in /etc/shadow)
  3. User ID number (UID)
  4. Group ID number (GID)
  5. Comment field (used by the finger command)
  6. Home Directory
  7. Shell
The username is the first field on the line, so to display each username on the system, use the command:
cut -f 1 -d ':' /etc/passwd
...which will output, for example:
root
i88ca
goyun

The third field of each line in the /etc/passwd file is the UID (user ID number), so to display each username and user ID number, use the command:
cut -f 1,3 -d ':' /etc/passwd
...which will output the following, for example:
root:0
i88ca:88
goyun:1000
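
Because the contents of /etc/passwd differ from system to system, a reproducible way to try these commands is to run them against a small stand-in file. The sketch below writes three passwd-style lines to a temporary file; apart from the root line shown earlier, the entries are hypothetical.

```shell
# Build a throwaway passwd-style file (the non-root entries are made up).
passwd_sample=$(mktemp)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'i88ca:x:88:88:i88ca:/home/i88ca:/bin/bash' \
  'goyun:x:1000:1000:goyun:/home/goyun:/bin/sh' > "$passwd_sample"

# Field 1 is the username and field 3 is the UID.
names_and_uids=$(cut -f 1,3 -d ':' "$passwd_sample")
printf '%s\n' "$names_and_uids"

rm -f "$passwd_sample"
```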

As you can see, the output will be delimited, by default, using the same delimiter character specified for the input. In this case, that's the colon character (":"). You can specify a different delimiter for the input and output, however. So, if you wanted to run the previous command, but have the output delimited by a '#', you could use the command:
cut -f 1,3 -d ':' --output-delimiter='#' /etc/passwd
...which will output the following, for example:
root#0
i88ca#88
goyun#1000
But what if you want the output to be delimited by a tab? Specifying a tab character on the command line is a bit more complicated, because the shell normally treats a literal tab as whitespace between arguments. To pass it to cut, you must "protect" it from the shell. How this is done depends on which shell you're using, but in bash, the default shell on most Linux distributions, you can write the tab character as $'\t'. So the command:
cut -f 1,3 -d ':' --output-delimiter=$'\t' /etc/passwd
...will output the following, for example:
root	0
i88ca	88
goyun	1000
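
In shells that do not support the $'\t' form, one possible alternative is to let printf produce the tab and store it in a variable. This is a sketch assuming GNU cut for --output-delimiter; the variable name tab and the sample lines are made up.

```shell
# Produce a literal tab character without relying on $'\t'.
tab=$(printf '\t')

# Hypothetical passwd-style input; fields 1 and 3 are joined with a tab on output.
tabbed=$(printf 'root:x:0:0\ngoyun:x:1000:1000\n' |
  cut -f 1,3 -d ':' --output-delimiter="$tab")

printf '%s\n' "$tabbed"
```

The quotes around "$tab" matter: unquoted, the shell would split the tab away as whitespace and cut would receive an empty delimiter.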

More Examples

cut -c 3 file.txt
Outputs the third character of every line of the file file.txt, omitting the others.
cut -c 1-3 file.txt
Outputs the first three characters of every line of the file file.txt, omitting the rest.
cut -c 3- file.txt
Outputs the third through the last characters of each line of the file file.txt, omitting the first two characters.
cut -d ':' -f 1 /etc/passwd
Outputs the first field of each line of the file /etc/passwd, where fields are delimited by a colon (':'). The first field of /etc/passwd is the username, so this command will output every username in the passwd file.
grep '/bin/bash' /etc/passwd | cut -d ':' -f 1,6
Outputs the first and sixth fields, delimited by a colon, of every line in the /etc/passwd file that contains the string /bin/bash. In practice, this is the username and home directory of each user whose login shell is /bin/bash.
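
Note that the pattern '/bin/bash' matches anywhere on the line, not only in the shell field. The sketch below, run on made-up passwd-style lines, contrasts it with a pattern anchored to the end of the line; the svc entry is a hypothetical account whose comment field happens to mention /bin/bash.

```shell
# Hypothetical entries: only root actually has /bin/bash as its login shell.
entries=$(printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'svc:x:500:500:runs /bin/bash jobs:/srv/svc:/usr/sbin/nologin' \
  'goyun:x:1000:1000:goyun:/home/goyun:/bin/sh')

# Unanchored: also matches svc, because its comment field contains /bin/bash.
unanchored=$(printf '%s\n' "$entries" | grep '/bin/bash' | cut -d ':' -f 1,6)

# Anchored to the last field: matches only lines that end in :/bin/bash.
anchored=$(printf '%s\n' "$entries" | grep ':/bin/bash$' | cut -d ':' -f 1,6)

printf '%s\n---\n%s\n' "$unanchored" "$anchored"
```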