﻿WEBVTT

1
00:00:01.350 --> 00:00:04.650 line:15% 
<v Instructor>My favorite Unix command is the tar pipe,</v>

2
00:00:04.650 --> 00:00:07.100
and it's not just a thing you can type at the command line,

3
00:00:07.100 --> 00:00:09.490
but it's a pattern that combines a couple

4
00:00:09.490 --> 00:00:11.180
features of the shell with a couple

5
00:00:11.180 --> 00:00:13.623
of primitive Unix commands to solve a problem.

6
00:00:15.130 --> 00:00:18.137
The problem that it solves is you have a source directory

7
00:00:18.137 --> 00:00:21.200
and a destination and inside of the source,

8
00:00:21.200 --> 00:00:24.410
you have some file structure potentially nested,

9
00:00:24.410 --> 00:00:26.700
and you want to clone it to the destination

10
00:00:26.700 --> 00:00:29.670
preserving metadata like file permissions

11
00:00:29.670 --> 00:00:33.350
and owners, and groups and things like that.

12
00:00:33.350 --> 00:00:35.530
In modern systems, there are some easy ways

13
00:00:35.530 --> 00:00:38.160
to do this like cp -rp, which means

14
00:00:38.160 --> 00:00:40.670
recursive preserving permissions

15
00:00:40.670 --> 00:00:45.110
and you could also do

16
00:00:45.110 --> 00:00:49.330
an rsync -a to archive one directory to another.

17
00:00:49.330 --> 00:00:52.750
But, in the days before rsync's existence

18
00:00:52.750 --> 00:00:55.520
and the days before all cp implementations

19
00:00:55.520 --> 00:00:59.470
had the metadata preservation flag,

20
00:00:59.470 --> 00:01:03.660
there was a tar pipe, and a tar pipe looks like this,

21
00:01:03.660 --> 00:01:06.670
you cd into one directory and tar it up,

22
00:01:06.670 --> 00:01:09.390
and then you pipe that to another subshell

23
00:01:09.390 --> 00:01:11.760
that cds to the destination directory

24
00:01:11.760 --> 00:01:14.343
and extracts it preserving permissions.
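
NOTE
A minimal sketch of the command shape described in the cues above, wrapped in
Python so it can be run and checked. The on-screen command is not part of this
transcript, so the shell string, the explicit "-f -" arguments, and the name
tar_pipe are assumptions made for illustration. Only the exit status of the
extracting side of the pipeline is checked.
```python
import subprocess
# Assumed shape of the tar pipe: two subshells joined by a pipe.
# "$1" and "$2" are filled in from the arguments that follow the script.
TAR_PIPE = '(cd "$1" && tar -cf - .) | (cd "$2" && tar -xpf -)'
def tar_pipe(src, dst):
    """Copy the tree under src into the existing directory dst."""
    # sh -c SCRIPT NAME ARG1 ARG2: NAME becomes $0, the rest become $1 and $2.
    subprocess.run(["sh", "-c", TAR_PIPE, "sh", src, dst], check=True)
```
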

25
00:01:15.180 --> 00:01:18.740
And once I've done that, we'll have the file copied

26
00:01:18.740 --> 00:01:21.180
into the destination, same content,

27
00:01:21.180 --> 00:01:23.610
same metadata and everything.

28
00:01:23.610 --> 00:01:26.130
Now, why am I telling you about a command

29
00:01:26.130 --> 00:01:28.140
that is effectively obsolete?

30
00:01:28.140 --> 00:01:30.810
Well, it's because it has a lot of really

31
00:01:30.810 --> 00:01:32.780
good Unix stuff in it.

32
00:01:32.780 --> 00:01:35.510
There are a lot of things in here that will teach

33
00:01:35.510 --> 00:01:38.283
you how Unix really works, once you understand them.

34
00:01:39.350 --> 00:01:43.510
So, starting at the outside, we have these two subshells

35
00:01:43.510 --> 00:01:46.190
one here and one here and they're joined by the pipe

36
00:01:46.190 --> 00:01:48.560
and a subshell basically tells bash

37
00:01:48.560 --> 00:01:51.870
or in this case zsh, to fork itself

38
00:01:51.870 --> 00:01:55.400
and do whatever's in the subshell in the new process.

39
00:01:55.400 --> 00:01:58.320
And this effectively isolates the subshell's

40
00:01:58.320 --> 00:02:01.170
current working directory and its variables

41
00:02:01.170 --> 00:02:04.470
and its options like set -e.

42
00:02:04.470 --> 00:02:06.240
So you use it when you wanna do something

43
00:02:06.240 --> 00:02:09.390
in it without polluting the containing

44
00:02:09.390 --> 00:02:10.763
instance of the shell.

45
00:02:11.940 --> 00:02:13.810
Now, like I said, that's done using fork

46
00:02:13.810 --> 00:02:15.500
and I think that a lot of people

47
00:02:15.500 --> 00:02:18.300
have either never actually used fork directly

48
00:02:18.300 --> 00:02:20.990
or haven't really thought about it since their operating

49
00:02:20.990 --> 00:02:23.490
systems class back in undergrad.

50
00:02:23.490 --> 00:02:25.970
So, I wanted to just look at it quickly,

51
00:02:25.970 --> 00:02:28.690
it takes no arguments and it returns a pid_t,

52
00:02:28.690 --> 00:02:30.790
which is an integer.

53
00:02:30.790 --> 00:02:33.260
So, it's a very, very simple function

54
00:02:33.260 --> 00:02:36.250
and the thing that it returns is either the pid

55
00:02:36.250 --> 00:02:38.190
of the child that gets created,

56
00:02:38.190 --> 00:02:40.800
or it's a zero in the case of the child.

57
00:02:40.800 --> 00:02:43.500
So, what happens is you call fork once

58
00:02:43.500 --> 00:02:46.880
and it actually returns twice in two different processes,

59
00:02:46.880 --> 00:02:50.660
and neither process really has any history

60
00:02:50.660 --> 00:02:52.570
that would indicate where it came from other

61
00:02:52.570 --> 00:02:56.770
than this difference in the pid returned by fork

62
00:02:56.770 --> 00:02:58.500
and some other minor stuff like you can see

63
00:02:58.500 --> 00:03:01.240
the child process has a unique process ID.

64
00:03:01.240 --> 00:03:02.150
But for the most part,

65
00:03:02.150 --> 00:03:04.730
you get two identical processes that both think

66
00:03:04.730 --> 00:03:08.063
they've been executing the code for all time.

67
00:03:09.350 --> 00:03:13.680
Now, I wanna do an example of this because you really

68
00:03:13.680 --> 00:03:17.140
don't feel how simple it is, until you see it in action.

69
00:03:17.140 --> 00:03:18.840
So, I'm gonna do this in Python

70
00:03:18.840 --> 00:03:22.710
and I'm going to import fork and getpid.

71
00:03:22.710 --> 00:03:25.190
We're going to get a child_pid by calling fork,

72
00:03:25.190 --> 00:03:28.360
that's all we have to do and then remember

73
00:03:28.360 --> 00:03:31.620
both processes start executing from that point.

74
00:03:31.620 --> 00:03:34.880
So, if there's a child_pid, we're in the parent,

75
00:03:34.880 --> 00:03:37.100
otherwise we're in the child

76
00:03:37.100 --> 00:03:39.390
because the child gets zero back instead of his own pid;

77
00:03:39.390 --> 00:03:40.900
this is the way that you know whether

78
00:03:40.900 --> 00:03:43.200
you're in the parent or the child

79
00:03:43.200 --> 00:03:45.660
and I'm gonna just have them print some stuff out.

80
00:03:45.660 --> 00:03:48.160
So, I'm the parent, that's what the parent

81
00:03:48.160 --> 00:03:51.540
will say and he'll print his child_pid,

82
00:03:51.540 --> 00:03:55.313
he'll print out his pid,

83
00:03:57.130 --> 00:03:59.930
and the child does the same thing,

84
00:03:59.930 --> 00:04:01.720
except he'll say I'm the child

85
00:04:01.720 --> 00:04:04.260
and he'll still print out child_pid and his own pid,

86
00:04:04.260 --> 00:04:05.660
just so we can compare them.

87
00:04:06.570 --> 00:04:09.900
If I run this, we get I'm the parent,

88
00:04:09.900 --> 00:04:13.030
he has the actual child_pid which is 02,

89
00:04:13.030 --> 00:04:17.680
his pid is 01, which unsurprisingly is one lower,

90
00:04:17.680 --> 00:04:19.310
then the child executes:

91
00:04:19.310 --> 00:04:22.900
his child_pid is zero, because he is the child

92
00:04:22.900 --> 00:04:25.310
and his actual pid is 02,

93
00:04:25.310 --> 00:04:28.853
which matches what the parent got for the child_pid.
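
NOTE
A minimal sketch of the fork example narrated above, not the instructor's
exact script. Wrapping it in a function named fork_demo, flushing standard
out, waiting for the child, and ending the child with os._exit are additions
made here so the sketch behaves predictably when run from a larger program.
The order of the two printed lines is not guaranteed.
```python
import os
import sys
def fork_demo():
    """Fork once; return (child_pid, wait_status) in the parent."""
    sys.stdout.flush()  # keep already-buffered output out of the child
    child_pid = os.fork()  # returns twice: the child's pid in the parent, 0 in the child
    if child_pid:
        print("I'm the parent: child_pid", child_pid, "pid", os.getpid())
        _, status = os.waitpid(child_pid, 0)  # reap the child
        return child_pid, status
    print("I'm the child: child_pid", child_pid, "pid", os.getpid())
    sys.stdout.flush()
    os._exit(0)  # leave without running the rest of the caller's code
if __name__ == "__main__":
    fork_demo()
```
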

94
00:04:29.730 --> 00:04:33.100
Now, these two happen to have executed in serial,

95
00:04:33.100 --> 00:04:34.920
or at least it looks that way to us due

96
00:04:34.920 --> 00:04:37.770
to potentially I/O buffering or something

97
00:04:37.770 --> 00:04:40.840
but I'm going to talk throughout the screencast

98
00:04:40.840 --> 00:04:43.440
as if I had a single core and the process

99
00:04:43.440 --> 00:04:45.420
scheduling were very simple,

100
00:04:45.420 --> 00:04:48.930
it's not actually but everything is going to work out

101
00:04:48.930 --> 00:04:51.140
as if it were, so we don't really have to think

102
00:04:51.140 --> 00:04:53.893
about the complexities of multi-core scheduling.

103
00:04:55.140 --> 00:04:57.210
Anyway, going back to our example,

104
00:04:57.210 --> 00:05:00.760
I'm going to erase these prints and we're going to step

105
00:05:00.760 --> 00:05:02.960
this up a little bit and involve a pipe.

106
00:05:02.960 --> 00:05:07.350
So, if we just look at our tar pipe again,

107
00:05:07.350 --> 00:05:09.600
we've got these two processes being created

108
00:05:09.600 --> 00:05:11.373
and they're joined by this pipe.

109
00:05:12.260 --> 00:05:15.640
The question is, how do they get access to that pipe?

110
00:05:15.640 --> 00:05:17.240
How does the writer get to write into it?

111
00:05:17.240 --> 00:05:19.550
How does the reader get to read from it?

112
00:05:19.550 --> 00:05:22.080
And the answer falls directly out of Unix

113
00:05:22.080 --> 00:05:23.550
and it's very very simple,

114
00:05:23.550 --> 00:05:25.780
so I'm gonna import a couple more things

115
00:05:26.720 --> 00:05:29.400
and basically what happens is there's a read end

116
00:05:29.400 --> 00:05:31.770
and a write end of a pipe,

117
00:05:31.770 --> 00:05:34.360
and you make them by calling the pipe system call,

118
00:05:34.360 --> 00:05:36.490
which once again takes no arguments.

119
00:05:36.490 --> 00:05:38.830
Of course in C, the Unix call takes an array argument

120
00:05:38.830 --> 00:05:41.190
because it needs to return these somehow,

121
00:05:41.190 --> 00:05:44.530
but in Python they wrap it so that,

122
00:05:44.530 --> 00:05:46.820
it just returns two integers.

123
00:05:46.820 --> 00:05:49.230
And it's important that I'm doing this before I fork,

124
00:05:49.230 --> 00:05:51.474
if I did it afterwards, then both processes

125
00:05:51.474 --> 00:05:54.680
would just create their own pairs of file descriptors

126
00:05:54.680 --> 00:05:56.650
here and it wouldn't work at all.

127
00:05:56.650 --> 00:05:58.790
You have to create the pair of file descriptors

128
00:05:58.790 --> 00:06:01.670
before creating the child processes.

129
00:06:01.670 --> 00:06:03.990
And the thing I'm gonna have them do,

130
00:06:03.990 --> 00:06:07.680
is the child is going to print, child is about to write,

131
00:06:07.680 --> 00:06:10.483
then he will write into the write end saying, hello,

132
00:06:11.870 --> 00:06:15.900
then he will print child wrote and the parent will print,

133
00:06:15.900 --> 00:06:18.240
parent is about to read,

134
00:06:18.240 --> 00:06:21.310
then he will read from the read end.

135
00:06:21.310 --> 00:06:23.120
He has to specify a maximum size

136
00:06:23.120 --> 00:06:25.100
because these are very very thin wrappers

137
00:06:25.100 --> 00:06:27.403
around the direct Unix calls,

138
00:06:28.310 --> 00:06:30.840
and then he will say parent read

139
00:06:30.840 --> 00:06:33.263
and he will print out the data.

140
00:06:35.220 --> 00:06:36.673
And if we run this,

141
00:06:37.870 --> 00:06:39.570
we see the parent is about to read,

142
00:06:39.570 --> 00:06:41.060
but then he blocks and he blocks

143
00:06:41.060 --> 00:06:43.010
because he's reading from an empty pipe.

144
00:06:43.920 --> 00:06:45.957
Then the child wakes up, he's about to write

145
00:06:45.957 --> 00:06:48.430
and he does immediately, and then he exits

146
00:06:48.430 --> 00:06:49.950
'cause he hits the end of his script.

147
00:06:49.950 --> 00:06:52.450
So the parent wakes back up, he reads

148
00:06:52.450 --> 00:06:54.593
and he prints out the data that he gets.

149
00:06:55.750 --> 00:06:58.500
So we can see that the processes here are being interleaved,

150
00:06:58.500 --> 00:07:02.010
you've got parent, then child, then parent,

151
00:07:02.010 --> 00:07:03.560
and they're all mixed together.
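
NOTE
A minimal sketch of the pipe example narrated above, assuming the same
messages the instructor reads out. The function name pipe_demo, the closing of
unused ends, and the 100-byte maximum read size are choices made here for
illustration. The pipe is created before the fork so that both processes share
the same pair of file descriptors.
```python
import os
import sys
def pipe_demo():
    """Send 'hello' from a forked child to the parent through a pipe."""
    read_end, write_end = os.pipe()  # created before fork so both processes share it
    sys.stdout.flush()  # keep already-buffered output out of the child
    child_pid = os.fork()
    if child_pid == 0:
        os.close(read_end)  # the child only writes
        print("child is about to write", flush=True)
        os.write(write_end, b"hello")
        print("child wrote", flush=True)
        os._exit(0)
    os.close(write_end)  # the parent only reads
    print("parent is about to read", flush=True)
    data = os.read(read_end, 100)  # blocks while the pipe is empty; 100 is a maximum size
    print("parent read", data, flush=True)
    os.close(read_end)
    os.waitpid(child_pid, 0)  # reap the child
    return data
if __name__ == "__main__":
    pipe_demo()
```
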

152
00:07:04.690 --> 00:07:08.220
Now, let's go back to our tar pipe and see how this applies.

153
00:07:08.220 --> 00:07:12.520
So in reality, here, there are three processes, not two:

154
00:07:12.520 --> 00:07:15.070
there's the parent shell that's doing all my prompt

155
00:07:15.070 --> 00:07:17.480
and interactive stuff and he's forking off

156
00:07:17.480 --> 00:07:19.930
both of these, but when he forks the first one,

157
00:07:19.930 --> 00:07:22.390
he's setting its standard out to be the writing end

158
00:07:22.390 --> 00:07:25.640
of this pipe and then he forks off the second one

159
00:07:25.640 --> 00:07:27.350
and he's setting its reading end,

160
00:07:27.350 --> 00:07:28.460
or I'm sorry, its standard

161
00:07:28.460 --> 00:07:30.393
in to be the reading end of this pipe.

162
00:07:31.580 --> 00:07:34.500
So both of these processes are hooked up to two different

163
00:07:34.500 --> 00:07:36.720
file descriptors, each of which represents

164
00:07:36.720 --> 00:07:38.410
one side of the pipe.

165
00:07:38.410 --> 00:07:39.330
Now we haven't actually talked

166
00:07:39.330 --> 00:07:42.740
about what's going on inside of them yet so let's do that.

167
00:07:42.740 --> 00:07:45.550
In the first one, you've got cd'ing into source

168
00:07:45.550 --> 00:07:48.050
and then tarring up that directory.

169
00:07:48.050 --> 00:07:50.700
So let's look at what actually happens if we do that.

170
00:07:52.410 --> 00:07:55.110
This is the actual tar file format,

171
00:07:55.110 --> 00:07:57.250
and you very rarely see this, so I thought

172
00:07:57.250 --> 00:07:59.410
it would be nice to just show it.

173
00:07:59.410 --> 00:08:01.070
And actually, if I do it on the current directory,

174
00:08:01.070 --> 00:08:03.400
we'll see a little more: you can see the script

175
00:08:03.400 --> 00:08:05.870
I just wrote is in there and you've got all these numbers

176
00:08:05.870 --> 00:08:10.553
that represent various metadata about file modes,

177
00:08:11.510 --> 00:08:14.880
like read, write, execute, who owns it, which group

178
00:08:14.880 --> 00:08:18.270
is it in, all that kind of stuff that's all in here.

179
00:08:18.270 --> 00:08:21.260
So, this is the actual data that's being emitted

180
00:08:21.260 --> 00:08:23.630
by this, in fact, this block of data right here

181
00:08:23.630 --> 00:08:28.090
is the actual stream of bytes being emitted right here.

182
00:08:28.090 --> 00:08:30.700
And then you've got it going to this other side,

183
00:08:30.700 --> 00:08:32.640
which cds into the destination directory

184
00:08:32.640 --> 00:08:34.853
and extracts with -xp,

185
00:08:36.060 --> 00:08:37.880
which means extract and preserve permissions;

186
00:08:37.880 --> 00:08:40.260
it extracts that stream of tar data

187
00:08:40.260 --> 00:08:42.153
into the current directory, which is the destination

188
00:08:42.153 --> 00:08:45.100
directory since it's cd'd into it.

189
00:08:45.100 --> 00:08:46.772
So, that's what the tar pipe does;

190
00:08:46.772 --> 00:08:49.870
it's very simple, but it illustrates

191
00:08:49.870 --> 00:08:51.726
all these concepts: it illustrates the forking

192
00:08:51.726 --> 00:08:54.240
with the subshells and the pipe,

193
00:08:54.240 --> 00:08:56.470
and it makes you think about when does the pipe get created?

194
00:08:56.470 --> 00:08:59.020
How does it get hooked to standard in and standard out?

195
00:08:59.020 --> 00:09:02.610
That uses the dup system calls, which I didn't explain.

196
00:09:02.610 --> 00:09:04.730
But there's all kinds of other stuff going on here

197
00:09:04.730 --> 00:09:07.460
if you just think about how does every little piece work,

198
00:09:07.460 --> 00:09:10.750
you'll discover all these things about Unix.
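
NOTE
A rough sketch of how a shell might wire two commands to a pipe with the dup
family of calls mentioned above, assuming Python's os.dup2 and os.execvp
wrappers. The helper names spawn and run_pipeline are hypothetical, and a real
shell does considerably more (job control, signals, redirections).
```python
import os
def spawn(argv, stdin_fd=None, stdout_fd=None):
    """Fork, rewire standard in/out with dup2, then exec argv."""
    pid = os.fork()
    if pid:
        return pid  # parent: hand back the child's pid
    try:
        if stdin_fd is not None:
            os.dup2(stdin_fd, 0)  # fd 0 now refers to the pipe's read end
        if stdout_fd is not None:
            os.dup2(stdout_fd, 1)  # fd 1 now refers to the pipe's write end
        os.execvp(argv[0], argv)  # replaces this process; returns only on error
    finally:
        os._exit(127)  # reached only if dup2 or exec failed
def run_pipeline(producer, consumer):
    """Run producer | consumer and return both wait statuses."""
    read_end, write_end = os.pipe()  # made before forking, so the children inherit it
    first = spawn(producer, stdout_fd=write_end)
    second = spawn(consumer, stdin_fd=read_end)
    os.close(read_end)  # the parent keeps neither end open,
    os.close(write_end)  # so the consumer can see end-of-file
    return os.waitpid(first, 0)[1], os.waitpid(second, 0)[1]
```
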

199
00:09:10.750 --> 00:09:12.260
Now there's one more thing I didn't talk about

200
00:09:12.260 --> 00:09:14.620
and it's a very practical concern.

201
00:09:14.620 --> 00:09:17.602
It's why did I use this double ampersand?

202
00:09:17.602 --> 00:09:19.373
Well, if I go back up to this,

203
00:09:20.370 --> 00:09:23.301
if I had done a cd something, and then semi-colon

204
00:09:23.301 --> 00:09:28.020
then echo, or do whatever, do something after the cd,

205
00:09:28.020 --> 00:09:31.010
then even if the cd fails, the thing happens anyway.

206
00:09:31.010 --> 00:09:35.540
So, if this cd had failed and I had a semi-colon after it,

207
00:09:35.540 --> 00:09:37.940
then the tar would have happened anyway,

208
00:09:37.940 --> 00:09:40.010
and likewise, if this cd had failed

209
00:09:40.010 --> 00:09:41.380
and there was a semi-colon after that, this

210
00:09:41.380 --> 00:09:42.470
tar would have happened.

211
00:09:42.470 --> 00:09:45.280
So we'd be doing stuff like compressing the wrong directory,

212
00:09:45.280 --> 00:09:48.310
or tarring the wrong directory, or we would be extracting

213
00:09:48.310 --> 00:09:51.030
into the wrong directory and neither of those is any good.

214
00:09:51.030 --> 00:09:52.960
So, most of the time you wanna use

215
00:09:52.960 --> 00:09:55.530
the double ampersand rather than the semi-colon

216
00:09:55.530 --> 00:09:59.930
because when you use it, the errors stop execution

217
00:09:59.930 --> 00:10:01.990
so you won't accidentally continue

218
00:10:01.990 --> 00:10:03.943
in an unknown state in your script.
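
NOTE
A small sketch of the semi-colon versus double-ampersand difference described
above, run through sh from Python. The path /no/such/dir is a hypothetical
directory assumed not to exist, and the helper name run is made up here.
```python
import subprocess
def run(script):
    """Run a shell snippet with sh and capture what it prints."""
    return subprocess.run(["sh", "-c", script], capture_output=True, text=True)
# /no/such/dir is a hypothetical path assumed not to exist.
with_semicolon = run("cd /no/such/dir; echo ran-anyway")
with_and = run("cd /no/such/dir && echo ran-anyway")
print(repr(with_semicolon.stdout))  # the echo runs even though cd failed
print(repr(with_and.stdout), with_and.returncode)  # no echo, and a non-zero status
```
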

219
00:10:04.930 --> 00:10:09.375
So that's the tar pipe, that's what it does and how,

220
00:10:09.375 --> 00:10:12.850
and like I said, nowadays, there are better simpler ways

221
00:10:12.850 --> 00:10:16.920
to do this, but, it's still a really awesome example.

222
00:10:16.920 --> 00:10:18.920
And I wanna actually dive down

223
00:10:18.920 --> 00:10:21.810
into the tar itself one more time.

224
00:10:21.810 --> 00:10:25.446
This is the tar data, and we've got how many bytes here?

225
00:10:25.446 --> 00:10:30.300
About 10K, and this is why you always see things tarred

226
00:10:30.300 --> 00:10:32.620
and then gzipped, if you look at that tar data,

227
00:10:32.620 --> 00:10:34.980
it's very fluffy, there's a lot of numbers,

228
00:10:34.980 --> 00:10:37.710
it's plain text, it's not compressed.

229
00:10:37.710 --> 00:10:40.230
So if we stick a gzip in this chain,

230
00:10:40.230 --> 00:10:42.750
then we get about one 25th of the size,

231
00:10:42.750 --> 00:10:44.826
400 bytes instead of 10K.
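
NOTE
A minimal sketch of tar and gzip as two separate steps, using Python's tarfile
and gzip modules instead of the command-line tools shown on screen. The file
names and contents are invented for illustration, so the sizes printed will
not match the 10K and 400 bytes from the screencast.
```python
import gzip
import io
import tarfile
def tar_bytes(files):
    """Build an uncompressed tar stream in memory from {name: bytes}."""
    buffer = io.BytesIO()
    # Mode "w" writes plain tar with no compression.
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name)  # header record: name, mode, owner, size
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()
# The file names and contents are made up for illustration.
raw = tar_bytes({"a.txt": b"hello\n" * 50, "b.txt": b"world\n" * 50})
packed = gzip.compress(raw)  # compression is a separate step on the byte stream
print(len(raw), len(packed))
```
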

232
00:10:44.826 --> 00:10:49.810
And this wonderfully illustrates the Unix principle,

233
00:10:49.810 --> 00:10:52.816
that tools should do one thing and do it well.

234
00:10:52.816 --> 00:10:55.670
You have the tar tool, which is really good

235
00:10:55.670 --> 00:10:58.479
at combining multiple files and storing their metadata,

236
00:10:58.479 --> 00:11:00.020
and you've got the gzip tool,

237
00:11:00.020 --> 00:11:01.590
which is really good at taking a stream

238
00:11:01.590 --> 00:11:03.360
of bytes and compressing them.

239
00:11:03.360 --> 00:11:06.260
But gzip does not support multiple files

240
00:11:06.260 --> 00:11:08.780
and the tar format itself does not do compression.

241
00:11:08.780 --> 00:11:10.780
And this is a really great thing,

242
00:11:10.780 --> 00:11:13.330
because for example, if you're doing a tar pipe,

243
00:11:13.330 --> 00:11:14.730
you don't want compression.

244
00:11:14.730 --> 00:11:17.520
There's no reason to compress something that you're just

245
00:11:17.520 --> 00:11:19.610
writing into a finite size pipe anyway,

246
00:11:19.610 --> 00:11:21.481
you're just burning CPU for no reason.

247
00:11:21.481 --> 00:11:24.490
And likewise, if you're gzipping,

248
00:11:24.490 --> 00:11:27.117
for example, the response from an http server,

249
00:11:27.117 --> 00:11:29.910
you don't wanna tar that up, you don't have the concept

250
00:11:29.910 --> 00:11:32.690
of multiple files, you just want compression.

251
00:11:32.690 --> 00:11:34.400
By separating these concepts you can use

252
00:11:34.400 --> 00:11:36.220
them independently and you can think

253
00:11:36.220 --> 00:11:38.620
about them more intelligently, like you can think

254
00:11:38.620 --> 00:11:41.390
about this intermediate representation.

255
00:11:41.390 --> 00:11:43.640
Anyway, I just wanted to digress there a little bit

256
00:11:43.640 --> 00:11:47.950
because tar and gzip are such a great example of Unix,

257
00:11:47.950 --> 00:11:50.069
and that's all I have to say about tar

258
00:11:50.069 --> 00:11:53.420
and the tar pipe and all that stuff.

259
00:11:53.420 --> 00:11:55.710
I think this is a wonderful example.

260
00:11:55.710 --> 00:11:57.790
I wish more people would think about it

261
00:11:57.790 --> 00:12:01.020
and talk about it and I think it's a great learning tool.

262
00:12:01.020 --> 00:12:04.623
So that's all I have to say, and I will see you in a week.

