1
00:00:03,030 --> 00:00:09,700
In this lesson, you'll learn how to use the cut and awk commands. The cut command is used for cutting out

2
00:00:09,700 --> 00:00:15,850
sections from each line of input it receives and displaying those sections to standard output.

3
00:00:15,850 --> 00:00:22,570
You can use cut to extract pieces of a line by byte position, by character position, or by a delimiter.

4
00:00:22,570 --> 00:00:27,970
This makes cut ideal for extracting columns from a CSV file, for example.

5
00:00:28,370 --> 00:00:32,090
It is not a shell builtin; it's a standalone utility.

6
00:00:33,380 --> 00:00:40,840
So we can use the man command to get some information on this command-line utility. You can perform cuts

7
00:00:40,840 --> 00:00:47,860
by bytes by using the -b option, by characters with the -c option, and by fields by using the -f

8
00:00:47,950 --> 00:00:53,120
option. For each one of these options, you'll need to supply a range.

9
00:00:53,410 --> 00:00:57,330
Ranges are pretty simple and I'll demonstrate them with a couple of examples.

10
00:00:57,340 --> 00:01:02,970
Also, you'll probably use the -d option to specify a delimiter when using the -f option, unless you

11
00:01:02,970 --> 00:01:09,230
are working with tab-delimited data. OK, I'm just going to use the text that already exists in the /etc/passwd

12
00:01:09,230 --> 00:01:14,390
file and then use the cut utility to cut it up or slice it up in a different way.

13
00:01:14,390 --> 00:01:17,270
So let's just look at the contents of that file now.

14
00:01:19,520 --> 00:01:25,840
Let's start out by cutting the passwd file by character. To print the first character of each line,

15
00:01:25,850 --> 00:01:28,610
we'll use -c 1. The 1

16
00:01:28,610 --> 00:01:31,020
here is the range that we're specifying.

17
00:01:31,340 --> 00:01:32,640
So we'll do: cut

18
00:01:32,640 --> 00:01:35,860
-c 1 /etc/passwd.

19
00:01:36,050 --> 00:01:39,420
And if you notice here at the bottom, we have a vagrant user

20
00:01:39,500 --> 00:01:42,850
and a vbox user at the bottom of our screen.

21
00:01:42,890 --> 00:01:48,470
So when we execute this command the last two lines should begin with the letter V for example.

22
00:01:48,580 --> 00:01:52,210
Sure enough, it's v on the last two lines, just like we anticipated.
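A quick sketch of what was just demonstrated, using a couple of made-up passwd-style lines so the output is predictable:

```shell
# -c 1 keeps only the first character of each input line.
printf 'vagrant:x:1000:1000::/home/vagrant:/bin/bash\nvboxadd:x:999:1::/var/run/vboxadd:/bin/false\n' | cut -c 1
# Both lines print "v".
```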

23
00:01:53,010 --> 00:01:57,740
If you supply a single number, then only the character at that position is displayed.

24
00:01:57,750 --> 00:02:03,940
So to display the seventh character of each line, use -c 7.

25
00:02:05,310 --> 00:02:06,870
OK that's not too useful.

26
00:02:06,870 --> 00:02:08,980
This next one isn't too useful either.

27
00:02:09,060 --> 00:02:10,460
But hang in there with me.

28
00:02:10,470 --> 00:02:13,140
We'll do some actual work in just a minute.

29
00:02:13,140 --> 00:02:19,300
You can specify a starting position and an ending position by connecting them with a hyphen.

30
00:02:19,410 --> 00:02:28,740
So to cut out characters four through seven, we could use -c 4-7.

31
00:02:28,750 --> 00:02:33,410
It's important when you're specifying a range that you do not use spaces.

32
00:02:33,490 --> 00:02:37,790
There is no space between the four, the dash, and the seven in our command.

33
00:02:37,810 --> 00:02:39,370
Just keep that in mind.

34
00:02:39,750 --> 00:02:46,030
OK, let's say you want to display every character on a line, starting with character four. To do that.

35
00:02:46,030 --> 00:02:48,330
Use a range of 4-.

36
00:02:48,340 --> 00:02:53,880
This is useful if you don't know how long each line is or if the lines are of varying lengths.

37
00:02:54,280 --> 00:02:57,870
So: cut -c 4-

38
00:02:58,060 --> 00:02:59,350
/etc/passwd.

39
00:02:59,620 --> 00:03:06,670
You can do the opposite, which is to display every character up to and including a position. So to display

40
00:03:06,670 --> 00:03:09,850
the first four characters, you can use -c -4

41
00:03:13,660 --> 00:03:20,640
By the way, that range is exactly the same as 1-4.

42
00:03:20,960 --> 00:03:23,280
And we get the same output.
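The three range styles covered so far can be sketched on a short sample string (the string is made up for illustration):

```shell
echo 'abcdefgh' | cut -c 4-7   # characters 4 through 7: defg
echo 'abcdefgh' | cut -c 4-    # character 4 through end of line: defgh
echo 'abcdefgh' | cut -c -4    # start through character 4: abcd
echo 'abcdefgh' | cut -c 1-4   # exactly the same as -4: abcd
```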

43
00:03:23,300 --> 00:03:25,180
So here is one last range.

44
00:03:25,220 --> 00:03:31,280
You can pick out multiple individual characters by separating them with a comma. For example, to print

45
00:03:31,280 --> 00:03:40,810
the first, third, and fifth characters, use 1,3,5.

46
00:03:40,840 --> 00:03:46,600
It's important to point out here that cut won't rearrange the order even if you specify a different

47
00:03:46,660 --> 00:03:48,100
order in the range.

48
00:03:48,160 --> 00:03:55,450
For example, this command that I'm about to type generates the exact same output: -c

49
00:03:55,800 --> 00:03:58,830
5,3,1.
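Here's a small sketch of both points, on a made-up string:

```shell
echo 'abcdef' | cut -c 1,3,5   # prints "ace"
echo 'abcdef' | cut -c 5,3,1   # also prints "ace": cut keeps input order
```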

50
00:03:58,880 --> 00:04:03,470
There are a couple of different methods for rearranging the data, but we'll get to that later.

51
00:04:03,470 --> 00:04:08,810
And one more thing if you supply a range that doesn't match anything then you'll get a blank line.

52
00:04:08,810 --> 00:04:14,240
So let's try to print the 999th character in the passwd file.

53
00:04:18,030 --> 00:04:24,750
There isn't a 999th character in any of the lines, so there is nothing to display for the line, and you

54
00:04:24,750 --> 00:04:28,430
end up with a blank line being displayed instead. OK.

55
00:04:28,440 --> 00:04:31,500
So that really covers how to use ranges.

56
00:04:31,500 --> 00:04:37,110
You'll notice that I've been using the -c option to cut on characters, but you can use the -b

57
00:04:37,110 --> 00:04:43,100
option to cut by byte. To display the first byte of every line in the /etc/passwd file.

58
00:04:43,170 --> 00:04:45,480
Well, we would use -b 1

59
00:04:49,540 --> 00:04:52,780
That's the same in this particular case as -c 1

60
00:04:55,460 --> 00:05:00,800
however a byte is not always the same as a character because there are some characters that are made

61
00:05:00,800 --> 00:05:04,980
up of multiple bytes and are thus called multibyte characters.

62
00:05:05,090 --> 00:05:08,720
For example, UTF-8 encoded characters can be multibyte characters.

63
00:05:08,720 --> 00:05:12,180
Let me display a multibyte character on the screen with echo.

64
00:05:18,620 --> 00:05:24,910
To display the first character in that string, use -c 1. We'll just pipe the output of echo

65
00:05:25,190 --> 00:05:30,150
as the standard input to the cut command and do -c 1.

66
00:05:30,620 --> 00:05:35,400
So what we did there with a pipe is pretty standard and you've been doing it a lot throughout this course.

67
00:05:35,420 --> 00:05:41,130
I just wanted to be explicit and say that you don't have to supply a file for cut to operate on the data.

68
00:05:41,150 --> 00:05:43,910
You can use standard input as well.

69
00:05:44,760 --> 00:05:47,510
OK, so back to the difference between a byte and a character.

70
00:05:47,520 --> 00:05:53,580
So -c 1 prints the eñe, and I hope I pronounced that correctly. Apologies to anyone who speaks Spanish.

71
00:05:53,820 --> 00:05:57,030
But look what happens when we only print the first byte.

72
00:05:57,030 --> 00:06:00,310
We'll do -b 1.

73
00:06:00,540 --> 00:06:03,900
Only the first byte of the multibyte eñe character is displayed.

74
00:06:04,050 --> 00:06:06,790
In most cases this is not what you want.

75
00:06:06,870 --> 00:06:08,540
Just something to keep in mind.
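The byte-versus-character point can be sketched at the byte level, which behaves the same everywhere. (Note that how -c handles multibyte characters varies: GNU coreutils cut currently treats -c like -b, while BSD/macOS cut honors the locale.)

```shell
# ñ is encoded as two bytes in UTF-8 (0xC3 0xB1), so cutting the
# first byte splits the character in half.
echo 'ñ' | cut -b 1 | wc -c   # prints 2: the first byte of ñ plus a newline
```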

76
00:06:08,910 --> 00:06:11,060
Let's move on to the -f option.

77
00:06:11,070 --> 00:06:15,150
It allows you to cut lines by field. By default, -f

78
00:06:15,210 --> 00:06:19,860
splits on a tab. Anything before a tab is considered to be the first field.

79
00:06:19,890 --> 00:06:25,180
Anything after the first tab and before the second tab is considered to be the second field and so on.

80
00:06:25,330 --> 00:06:29,800
Cut uses the term field but you can think of these as columns if you wish.

81
00:06:30,120 --> 00:06:35,760
Let me generate some tab-delimited data with the echo command. The -e option to echo allows you to

82
00:06:35,760 --> 00:06:41,040
use some backslash escapes that allow you to do things like generate a tab character, a newline,

83
00:06:41,040 --> 00:06:45,240
and so on. Backslash t represents a tab, so we can do this:

84
00:06:45,270 --> 00:06:50,930
echo -e, the word one, and then backslash t will produce a tab.

85
00:06:51,130 --> 00:06:56,210
We'll use the word two, backslash t is another tab, and we'll use the word three.

86
00:06:56,610 --> 00:06:59,770
So if we want to display just the first field we can do this.

87
00:07:00,010 --> 00:07:07,790
cut -f 1. And to display the second field, cut -f 2. The third field is, of course, cut -f 3.
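The tab-separated field demo just described looks like this (echo -e as in the lesson; the words are the sample data):

```shell
echo -e 'one\ttwo\tthree' | cut -f 1   # one
echo -e 'one\ttwo\tthree' | cut -f 2   # two
echo -e 'one\ttwo\tthree' | cut -f 3   # three
```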

88
00:07:08,010 --> 00:07:11,930
So what happens if you have data that is not tab-separated?

89
00:07:11,970 --> 00:07:16,400
Let's say we're dealing with a CSV, or comma-separated values, file.

90
00:07:20,500 --> 00:07:24,730
In this case you need to tell the cut command what to use as the delimiter.

91
00:07:24,730 --> 00:07:25,750
Here it's a comma

92
00:07:31,050 --> 00:07:34,560
And here are the first field, the second field, and the third field.

93
00:07:34,590 --> 00:07:40,860
Sometimes you'll see people do what I'm about to do here which is to not use quotes around the delimiter

94
00:07:41,040 --> 00:07:44,570
when specifying that delimiter so you might see something like this.

95
00:07:44,650 --> 00:07:49,390
cut -d ',' -f 2, for example.

96
00:07:49,590 --> 00:07:54,470
You may also see people not put a space after the -d and do this.

97
00:07:54,480 --> 00:08:00,480
Either one of those methods works, as long as the delimiter is a character that's not used or interpreted by the shell.

98
00:08:00,510 --> 00:08:11,830
If you try to do this, it won't work, so we'll do this.

99
00:08:11,870 --> 00:08:13,580
Here you have to quote the backslash.

100
00:08:13,610 --> 00:08:17,550
Otherwise the shell interprets it as a line continuation character.

101
00:08:21,470 --> 00:08:26,270
It's just one of those little gotchas that can happen, and that's why I suggest you always quote your

102
00:08:26,270 --> 00:08:27,470
delimiter.
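A sketch of the quoting advice, with made-up sample data on standard input:

```shell
echo 'a,b,c' | cut -d ',' -f 2   # quoted delimiter: prints "b"
echo 'a,b,c' | cut -d, -f 2      # unquoted, no space: also prints "b"
# A character the shell interprets, like a backslash, must be quoted:
printf 'one\\two\\three\n' | cut -d '\' -f 2   # prints "two"
```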

103
00:08:27,710 --> 00:08:35,190
The passwd file is actually made up of a series of columns, or fields, all separated by a colon.

104
00:08:35,450 --> 00:08:41,960
So let's print the username and user ID of every user in the passwd file. So we'll specify the delimiter

105
00:08:41,960 --> 00:08:48,250
as a colon and say: give us fields 1 and 3 from the /etc/passwd file.

106
00:08:48,260 --> 00:08:54,560
Notice that the output is delimited by the original delimiter. To change that, use the --output-

107
00:08:54,590 --> 00:08:58,220
delimiter option. So let's change it to something else here
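The two commands just shown can be sketched on a single sample passwd-style line (note that --output-delimiter is a GNU coreutils option):

```shell
line='root:x:0:0:root:/root:/bin/bash'
echo "$line" | cut -d ':' -f 1,3                          # root:0
echo "$line" | cut -d ':' -f 1,3 --output-delimiter=', '  # root, 0
```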

108
00:09:09,090 --> 00:09:15,090
here's a common situation that you'll face: you'll have a CSV file with a header or some other type

109
00:09:15,090 --> 00:09:16,970
of data that contains a header.

110
00:09:16,980 --> 00:09:43,730
Let me create a CSV file on the fly here.

111
00:09:43,830 --> 00:09:44,760
When you do this

112
00:09:48,570 --> 00:09:50,970
you get the header in the output.

113
00:09:51,210 --> 00:09:57,030
So you have two choices: the first one is to remove the header before you send the data to cut, or remove

114
00:09:57,030 --> 00:09:59,200
it after cut has done its work.

115
00:09:59,430 --> 00:10:03,260
Before we do that, let's review the grep command quickly. By default,

116
00:10:03,300 --> 00:10:07,130
grep displays lines that match a pattern that you supply.

117
00:10:07,140 --> 00:10:12,870
So if we look for the pattern first, it will display the line or lines that match that pattern.

118
00:10:13,020 --> 00:10:15,530
So we'll do grep first on our file.

119
00:10:15,810 --> 00:10:23,220
And here we have three matches and notice that it doesn't display any of the lines that do not match.

120
00:10:23,220 --> 00:10:27,130
Let's narrow down our search such that it only matches the header.

121
00:10:27,150 --> 00:10:31,650
You can do that by supplying more information or a more exact pattern.

122
00:10:34,630 --> 00:10:39,900
If you want to be exact, you can use regular expression anchors. Speaking of regular expressions, what

123
00:10:39,910 --> 00:10:44,780
I'm about to show you are my two most commonly used regular expressions ever.

124
00:10:44,890 --> 00:10:50,920
If you never learn any more about regular expressions you'll have at least these two very important

125
00:10:50,920 --> 00:10:52,590
ones at your disposal.

126
00:10:52,660 --> 00:10:55,460
The first regular expression is the caret symbol.

127
00:10:55,660 --> 00:10:57,830
It matches the beginning of a line.

128
00:10:57,880 --> 00:11:01,500
It matches a position and not a character.

129
00:11:01,720 --> 00:11:07,020
So if we want to match all the lines that start with first, use caret first, like so

130
00:11:10,690 --> 00:11:15,510
notice that the results returned are different when you do not use the caret character.

131
00:11:17,430 --> 00:11:20,050
The second regular expression is the dollar sign.

132
00:11:20,100 --> 00:11:22,270
It matches the end of a line.

133
00:11:22,530 --> 00:11:25,030
It, too, matches a position and not a character.

134
00:11:25,050 --> 00:11:31,620
So if you want to find all the lines that end in t, we can do this: grep t dollar sign.

135
00:11:32,630 --> 00:11:42,360
So to force an exact match, you can start your pattern with a caret and end it with a dollar sign.

136
00:11:42,630 --> 00:11:49,260
Now we have isolated the header of the file, but we want everything except that. Luckily, grep has a handy

137
00:11:49,260 --> 00:11:51,220
option that inverts matching.

138
00:11:51,390 --> 00:11:53,240
That option is -v

139
00:11:58,350 --> 00:12:04,160
The -v option makes grep display any lines that do not match the supplied pattern.

140
00:12:04,200 --> 00:12:11,540
Now that we've removed the header we can send it to cut.
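The grep-then-cut pipeline just described can be sketched like this; the CSV contents and column layout are made up for illustration:

```shell
# Strip the header row, then keep only the second column.
printf 'first,last,id\njohn,doe,10\njane,roe,11\n' \
  | grep -v '^first,last,id$' \
  | cut -d ',' -f 2
# doe
# roe
```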

141
00:12:11,810 --> 00:12:15,430
Another option is to perform the cut first then remove the header.

142
00:12:15,470 --> 00:12:20,750
I don't like this as much because cut alters the output first, making the header change too.

143
00:12:20,750 --> 00:12:22,050
But it does work.

144
00:12:22,130 --> 00:12:27,080
So we can do this.

145
00:12:27,300 --> 00:12:28,130
That's what we get.

146
00:12:28,140 --> 00:12:29,180
That contains the header

147
00:12:34,120 --> 00:12:39,700
and that removes the header. By the way, cut only handles single-character delimiters.

148
00:12:39,710 --> 00:12:45,350
This is fine in most cases but there might be occasions where you would want or need to split on multiple

149
00:12:45,350 --> 00:12:46,060
characters.

150
00:12:46,070 --> 00:12:47,550
Take this example.

151
00:13:06,000 --> 00:13:09,400
At first glance you might think oh I can just split this on the colon.

152
00:13:09,480 --> 00:13:11,320
Let's try that and see what happens.

153
00:13:16,510 --> 00:13:21,020
That leaves the string data which really should be considered part of the delimiter.

154
00:13:21,040 --> 00:13:23,360
It's not part of the actual data itself.

155
00:13:23,410 --> 00:13:25,540
It's a pointer to the real data.

156
00:13:25,540 --> 00:13:32,790
What you would really like to do is this.

157
00:13:32,880 --> 00:13:35,340
But as you can see that doesn't work.

158
00:13:35,520 --> 00:13:36,710
We can do that with awk, however.

159
00:13:36,720 --> 00:13:42,480
Now, I'm not saying that this is the only way to handle this situation, but it is one way. Plus,

160
00:13:42,480 --> 00:13:48,590
it gives me a chance to briefly cover awk, which every good shell scripter should at least be aware of.

161
00:13:48,780 --> 00:13:52,670
Let me just give you the answer first and then I'll explain it to you in just a second.

162
00:14:01,280 --> 00:14:09,190
This is an entire program on a single line. The -F (capital F) option allows you to specify a field separator.

163
00:14:09,320 --> 00:14:13,610
We're telling it to use "data:" as the field separator.

164
00:14:13,610 --> 00:14:20,000
The entire program is contained in the next set of single quotes. The braces in awk denote an action.

165
00:14:20,000 --> 00:14:26,460
This makes awk do things, or take actions. The action we want to take is to print.

166
00:14:26,540 --> 00:14:28,410
As you probably have figured out by now.

167
00:14:28,550 --> 00:14:31,660
$2 represents the contents of the second field.

168
00:14:31,790 --> 00:14:36,770
So $1 is the data in the first field, $2 is the data in the second field, and so

169
00:14:36,770 --> 00:14:38,060
on and so forth.
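A sketch of the multi-character separator idea; the input line is hypothetical, with the string "data:" acting as the delimiter:

```shell
# -F accepts a multi-character field separator, unlike cut's -d.
echo 'sensor1 data:42' | awk -F 'data:' '{print $2}'   # prints "42"
```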

170
00:14:38,060 --> 00:14:41,010
Let's go back to a previous cut example.

171
00:14:47,250 --> 00:14:51,700
So here we are displaying the first and third fields from the /etc/passwd file.

172
00:14:52,140 --> 00:15:03,830
To do something very similar in awk, you can do this.

173
00:15:04,010 --> 00:15:09,250
Here you see that awk separates $1 and $3 with a space.

174
00:15:09,270 --> 00:15:14,630
That's because the comma in the print statement represents the output field separator.

175
00:15:14,670 --> 00:15:19,920
By default, the output field separator in awk is a space. If you leave out that comma,

176
00:05:19,920 --> 00:05:22,650
then the fields just run together, like this.

177
00:15:26,930 --> 00:15:30,510
Let's go back to our print statement here I'll just execute this.

178
00:15:30,620 --> 00:15:38,210
Awk has a special built-in variable named OFS. That's capital O-F-S, and it stands for output field

179
00:15:38,210 --> 00:15:39,180
separator.

180
00:15:39,200 --> 00:15:42,800
You can change the default from a space to anything you would like.

181
00:15:42,860 --> 00:15:49,580
By changing the value of that variable. To change a variable in awk, use the -v option and then perform

182
00:15:49,580 --> 00:15:51,050
the variable assignment.

183
00:15:51,050 --> 00:15:53,870
So to change OFS to a comma, we can do this.

184
00:15:54,200 --> 00:15:57,170
So I'll go here, add a -v

185
00:15:57,170 --> 00:16:03,340
option, set OFS equal to a comma, and hit enter.
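Side by side, the default OFS and the -v override look like this on a sample passwd-style line:

```shell
echo 'root:x:0:0' | awk -F ':' '{print $1, $3}'              # root 0
echo 'root:x:0:0' | awk -F ':' -v OFS=',' '{print $1, $3}'   # root,0
```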

186
00:16:03,500 --> 00:16:08,960
To be clear, it's definitely the OFS variable that controls the output delimiter being displayed, and

187
00:16:08,960 --> 00:16:11,750
not the space used in the print statement.

188
00:16:11,750 --> 00:16:14,780
We get the same result even if we do this.

189
00:16:14,840 --> 00:16:19,180
Let me just add a bunch of spaces like in between here and some here.

190
00:16:19,320 --> 00:16:20,320
Well, when I hit enter,

191
00:16:20,330 --> 00:16:25,460
you're going to see that the data is exactly the same. Instead of setting the OFS variable,

192
00:16:25,460 --> 00:16:27,920
you can just give awk a string to print, like so

193
00:16:34,960 --> 00:16:39,890
The string we're going to print is a comma, and then we'll specify field 3 here.

194
00:16:42,360 --> 00:16:46,770
If you want a space after the comma, for example, just add that space in your string.

195
00:16:46,770 --> 00:16:51,210
So let me go back up here and just put a space in my string and hit enter.

196
00:16:51,250 --> 00:16:55,640
Awk is really lenient with spacing, so this is the exact same command.

197
00:16:56,920 --> 00:17:00,950
I can run $1 right up against the string, and $3 there.

198
00:17:01,150 --> 00:17:07,180
Or I could put a lot of space in between these like this and you'll see that that doesn't really affect

199
00:17:07,450 --> 00:17:09,920
the execution of awk.

200
00:17:09,940 --> 00:17:11,770
That may be clearer to you or it may not.
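The string-in-print technique just described can be sketched like this, on a sample passwd-style line; juxtaposed items are concatenated, and no OFS is involved when the comma is omitted:

```shell
echo 'root:x:0:0' | awk -F ':' '{print $1 "," $3}'    # root,0
echo 'root:x:0:0' | awk -F ':' '{print $1 ", " $3}'   # root, 0
```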

201
00:17:11,770 --> 00:17:15,300
Now let's add some more text to our print statement here.

202
00:17:20,490 --> 00:17:23,760
Let's do this. Let's say we want to print column one

203
00:17:30,580 --> 00:17:34,510
so hopefully that gives you an idea of how you can use strings with the print statements and how the

204
00:17:34,510 --> 00:17:37,150
output field separator works.

205
00:17:37,150 --> 00:17:42,490
So, if you remember, earlier I said you can't control the order of the data being displayed with cut.

206
00:17:42,490 --> 00:17:44,370
So let's take this example.

207
00:17:51,760 --> 00:17:56,740
It displays the fields in the order that they appear in the input. With awk,

208
00:17:56,740 --> 00:17:58,250
you can change that, like so.

209
00:18:03,160 --> 00:18:07,050
We just tell awk to print the third field first and then the first field.
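On a sample passwd-style line, reordering the fields looks like this:

```shell
echo 'root:x:0:0' | awk -F ':' '{print $3, $1}'   # 0 root
```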

210
00:18:09,650 --> 00:18:15,770
Then you can combine it with any other additional strings you want so maybe we want to say this is the

211
00:18:15,860 --> 00:18:16,360
UID

212
00:18:20,130 --> 00:18:25,450
and we'll separate that with a semicolon, then login, like

213
00:18:25,470 --> 00:18:33,720
so. In addition to $1, $2, $3, and so on, awk gives us $NF, which

214
00:18:33,720 --> 00:18:36,390
represents the number of fields found.

215
00:18:36,570 --> 00:18:40,820
So to print the last field for every line in a file, use $NF
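A sketch of $NF on a made-up whitespace-separated line and on a sample passwd-style line:

```shell
echo 'a b c d' | awk '{print $NF}'   # prints the last field: d
echo 'root:x:0:0:root:/root:/bin/bash' | awk -F ':' '{print $NF}'   # /bin/bash
```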

216
00:18:48,620 --> 00:18:54,740
The passwd file is very uniform, so using $NF isn't exactly groundbreaking here.

217
00:18:54,740 --> 00:19:00,260
But if you are dealing with irregular data that doesn't fit nicely into columns, you can often see

218
00:19:00,260 --> 00:19:06,080
something common in each line and then use $NF to shorten the data down or to normalize

219
00:19:06,080 --> 00:19:09,890
that data. Even if the number of fields is consistent in the data,

220
00:19:09,890 --> 00:19:15,590
it might be easier to say print the last field, or print $NF, so you don't have to count the number

221
00:19:15,590 --> 00:19:16,890
of fields first.

222
00:19:17,000 --> 00:19:23,180
If you have a CSV file with 47-odd fields or columns and need the last one, it's a lot quicker to

223
00:19:23,180 --> 00:19:30,140
use $NF instead of counting all those columns. You can do some math with awk; just surround

224
00:19:30,140 --> 00:19:32,370
it in parentheses. So check this out.

225
00:19:40,950 --> 00:19:47,430
What this command does is print NF minus one, which is of course the second-to-last field.

226
00:19:47,460 --> 00:19:53,990
So if there are seven fields, NF is seven; seven minus one is six, so it prints the sixth field, for example.
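That arithmetic can be sketched on a made-up seven-field line:

```shell
# NF is 7 here, so $(NF - 1) is $6, the second-to-last field.
echo 'a b c d e f g' | awk '{print $(NF - 1)}'   # prints "f"
```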

227
00:19:54,270 --> 00:19:56,860
Let's generate some irregular data.

228
00:20:26,730 --> 00:20:32,820
You can see that what we really have here is a file with four lines in it and each line is made up of

229
00:20:32,820 --> 00:20:40,240
two columns separated by varying lengths of whitespace, whitespace being spaces and/or tabs.

230
00:20:40,320 --> 00:20:45,570
It would be really hard to make sense of this data with cut, because it only allows us to split on a single

231
00:20:45,570 --> 00:20:51,870
character. Even if we split on a space, we wouldn't end up with what we wanted, because different lines

232
00:20:51,870 --> 00:20:55,160
have different numbers of spaces separating the columns.

233
00:20:55,170 --> 00:20:58,080
Also it wouldn't handle lines with tabs.

234
00:20:58,300 --> 00:21:02,130
However awk performs really well in this situation.

235
00:21:07,840 --> 00:21:11,420
By default, the field separator for awk is whitespace.

236
00:21:11,590 --> 00:21:15,060
Or, to say it another, maybe even more accurate, way:

237
00:21:15,200 --> 00:21:19,820
Awk considers runs of non-whitespace characters to be fields by default.

238
00:21:20,020 --> 00:21:26,100
Awk easily handles extraneous spaces at the beginning and end of each line for example.
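A small sketch of that behavior, using made-up two-column data with leading whitespace and mixed runs of spaces and tabs:

```shell
# Default FS: leading whitespace is ignored and any run of spaces
# and/or tabs separates fields.
printf '  alpha\t 1\nbeta      2\n' | awk '{print $1 "-" $2}'
# alpha-1
# beta-2
```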

239
00:21:26,110 --> 00:21:29,430
So really those are the two times that I personally use awk.

240
00:21:29,440 --> 00:21:34,750
One is to use a delimiter that's comprised of more than a single character and the other time is to

241
00:21:34,750 --> 00:21:36,950
handle fields separated by whitespace.
