﻿WEBVTT

1
00:00:00.610 --> 00:00:02.800
<v Instructor>Our goal here is to</v>

2
00:00:02.800 --> 00:00:04.330
build a compiler from scratch.

3
00:00:04.330 --> 00:00:07.150
And we're going to have a very simple language.

4
00:00:07.150 --> 00:00:09.070
Here's a little test source file.

5
00:00:09.070 --> 00:00:11.760
We're gonna be able to do things like define a function f

6
00:00:11.760 --> 00:00:15.050
with no arguments that returns the constant one

7
00:00:15.050 --> 00:00:18.719
and this is sort of the simplest possible function

8
00:00:18.719 --> 00:00:20.600
in this language.

9
00:00:20.600 --> 00:00:24.030
To actually compile this into JavaScript,

10
00:00:24.030 --> 00:00:25.880
which is going to be our destination,

11
00:00:25.880 --> 00:00:30.090
we will have a lexer, also called a tokenizer,

12
00:00:30.090 --> 00:00:33.170
we will have a parser and we will have a code generator

13
00:00:33.170 --> 00:00:37.110
and these are three phases that almost every compiler has.

14
00:00:37.110 --> 00:00:39.180
There are many other phases that we won't deal with,

15
00:00:39.180 --> 00:00:40.100
but these are the three

16
00:00:40.100 --> 00:00:43.210
to get a sort of basic thing up and running.

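[Editor's note: the three phases just named can be sketched as a pipeline. The stand-in bodies below are placeholders, not the video's code; they only show how the phases compose.]

```ruby
# A minimal sketch of the three-phase pipeline: source -> tokens -> tree -> JS.
# The real Tokenizer/Parser/Generator are built over the rest of the video;
# these lambdas are trivial stand-ins for the shape of the data flow.
tokenize = ->(code)   { code.split }              # source string -> token list
parse    = ->(tokens) { { body: tokens } }        # token list -> tree
generate = ->(tree)   { tree[:body].join(" ") }   # tree -> JavaScript string

source = "def f() 1 end"
js = generate.call(parse.call(tokenize.call(source)))
```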
17
00:00:43.210 --> 00:00:45.650
So beginning with the lexer,

18
00:00:45.650 --> 00:00:47.170
actually let's call it a tokenizer,

19
00:00:47.170 --> 00:00:49.230
I think that's a little more straightforward of a word,

20
00:00:49.230 --> 00:00:50.160
it would probably help

21
00:00:50.160 --> 00:00:52.750
if I were actually editing a Ruby file

22
00:00:52.750 --> 00:00:54.390
since we are going to do this in Ruby.

23
00:00:54.390 --> 00:00:56.448
So let's have a tokenizer

24
00:00:56.448 --> 00:01:00.200
and actually I will need a #!/usr/bin/env ruby line

25
00:01:00.200 --> 00:01:04.580
and also chmod this u plus x, so it's executable.

26
00:01:04.580 --> 00:01:06.150
And now we will call

27
00:01:06.150 --> 00:01:09.410
we will instantiate our tokenizer

28
00:01:09.410 --> 00:01:12.430
reading in our little test.src file

29
00:01:12.430 --> 00:01:14.730
and then we will tokenize it.

30
00:01:14.730 --> 00:01:17.160
And that will give us some tokens

31
00:01:17.160 --> 00:01:19.198
we will print out for testing.

32
00:01:19.198 --> 00:01:23.110
I'm gonna map a key now just to write the current file

33
00:01:23.110 --> 00:01:24.730
and run it.

34
00:01:24.730 --> 00:01:26.173
And is that right?

35
00:01:27.490 --> 00:01:28.580
Colon bang.

36
00:01:28.580 --> 00:01:30.160
That's right, oops.

37
00:01:30.160 --> 00:01:31.760
I cannot type today.

38
00:01:31.760 --> 00:01:32.750
There we go.

39
00:01:32.750 --> 00:01:33.870
Okay.

40
00:01:33.870 --> 00:01:34.770
Wrong number of arguments

41
00:01:34.770 --> 00:01:36.840
because the initializer here takes none.

42
00:01:36.840 --> 00:01:39.230
So let's say that's some code

43
00:01:39.230 --> 00:01:41.920
and we'll store that off for later

44
00:01:41.920 --> 00:01:43.680
and then undefined method tokenize.

45
00:01:43.680 --> 00:01:46.240
So let's define that.

46
00:01:46.240 --> 00:01:49.570
And what we wanna do here is break this code

47
00:01:49.570 --> 00:01:52.610
into the smallest pieces that are meaningful.

48
00:01:52.610 --> 00:01:53.610
That's what the tokens are.

49
00:01:53.610 --> 00:01:55.892
So in our case, each line here

50
00:01:55.892 --> 00:01:58.820
as I'm editing is a single token.

51
00:01:58.820 --> 00:02:01.510
For example, def is multiple characters,

52
00:02:01.510 --> 00:02:04.110
but if we broke it down into d, e, and f

53
00:02:04.110 --> 00:02:07.080
neither of those means anything in the language,

54
00:02:07.080 --> 00:02:08.900
it has to be def.

55
00:02:08.900 --> 00:02:10.560
And on the other side,

56
00:02:10.560 --> 00:02:13.546
if we combined open and close paren into a single token,

57
00:02:13.546 --> 00:02:14.770
that's no good

58
00:02:14.770 --> 00:02:16.940
because sometimes there are things inside of there.

59
00:02:16.940 --> 00:02:18.160
So they need to be separate.

60
00:02:18.160 --> 00:02:21.300
So they can be on either side of the argument list.

61
00:02:21.300 --> 00:02:23.500
So these are the smallest components of the language

62
00:02:23.500 --> 00:02:25.010
but no smaller.

63
00:02:25.010 --> 00:02:26.910
Going back over to our tokenizer,

64
00:02:26.910 --> 00:02:28.790
our goal here is to break this code up

65
00:02:28.790 --> 00:02:30.640
and we're going to use regular expressions

66
00:02:30.640 --> 00:02:33.740
one per token type in order to do that.

67
00:02:33.740 --> 00:02:35.910
So while we still have some code

68
00:02:35.910 --> 00:02:39.540
or actually, until the code is empty,

69
00:02:39.540 --> 00:02:42.520
we are going to pull tokens off the front of it,

70
00:02:42.520 --> 00:02:44.740
of course, until in Ruby is like saying

71
00:02:44.740 --> 00:02:47.980
while not code dot empty, same thing.

72
00:02:47.980 --> 00:02:49.970
So while the code is not empty,

73
00:02:49.970 --> 00:02:53.350
we are going to go loop over a set of

74
00:02:53.350 --> 00:02:54.830
sort of token definitions.

75
00:02:54.830 --> 00:02:58.040
So while token, excuse me not while

76
00:02:58.040 --> 00:03:00.520
token_types dot each do

77
00:03:00.520 --> 00:03:03.620
and this is going to be an array.

78
00:03:03.620 --> 00:03:05.220
So let's define that.

79
00:03:05.220 --> 00:03:08.720
An array where each element defines a single token.

80
00:03:08.720 --> 00:03:10.910
So beginning with our def token

81
00:03:10.910 --> 00:03:14.210
that's just going to be word boundary def word boundary.

82
00:03:14.210 --> 00:03:16.430
Whenever this occurs in our language

83
00:03:16.430 --> 00:03:18.330
that's going to be the def keyword

84
00:03:18.330 --> 00:03:20.750
and the word boundary just means it must be preceded

85
00:03:20.750 --> 00:03:23.850
by some kind of non identifier character.

86
00:03:23.850 --> 00:03:27.960
This is just part of a Perl-compatible regex syntax.

87
00:03:27.960 --> 00:03:30.400
We will also have an end keyword

88
00:03:30.400 --> 00:03:32.410
and we will have identifiers,

89
00:03:32.410 --> 00:03:35.150
which are going to be A through Z

90
00:03:35.150 --> 00:03:38.420
I'm ignoring numbers and underscores just for simplicity,

91
00:03:38.420 --> 00:03:40.280
one or more of those characters.

92
00:03:40.280 --> 00:03:42.160
And we will have integers.

93
00:03:42.160 --> 00:03:44.230
So that's going to be zero through nine

94
00:03:44.230 --> 00:03:45.740
one or more of those.

95
00:03:45.740 --> 00:03:48.007
And we will have what else open parens

96
00:03:48.007 --> 00:03:50.260
that's just a literal open paren

97
00:03:50.260 --> 00:03:52.460
literal close paren for close paren.

98
00:03:52.460 --> 00:03:55.170
And I believe that is everything for now

99
00:03:55.170 --> 00:03:56.870
except I have made a mistake here.

100
00:03:56.870 --> 00:03:59.400
We should have a word boundary at the beginning and end

101
00:03:59.400 --> 00:04:02.980
of each of these, because an identifier

102
00:04:02.980 --> 00:04:05.580
we don't want to accidentally interpret

103
00:04:05.580 --> 00:04:08.740
the inside of an identifier as its own identifier.

104
00:04:08.740 --> 00:04:10.090
So if we have foobar

105
00:04:10.090 --> 00:04:12.560
we wanna make sure this is the whole thing foobar,

106
00:04:12.560 --> 00:04:15.510
not like foo and bar for some reason.

107
00:04:15.510 --> 00:04:18.330
So that's why we have these word boundaries here.

108
00:04:18.330 --> 00:04:20.610
Now, each of these also needs a token type

109
00:04:20.610 --> 00:04:22.690
so that we can track what the heck it actually is.

110
00:04:22.690 --> 00:04:25.860
So let's give those just a sort of clear names.

111
00:04:25.860 --> 00:04:28.650
These are Ruby symbols, which are just kind of like strings.

112
00:04:28.650 --> 00:04:32.330
You can think of them as strings for our purposes here.

113
00:04:32.330 --> 00:04:34.110
And we have an identifier here.

114
00:04:34.110 --> 00:04:35.900
We have integers.

115
00:04:35.900 --> 00:04:37.920
We will not be supporting

116
00:04:37.920 --> 00:04:39.910
other types of numbers.

117
00:04:39.910 --> 00:04:42.950
We have an open paren, we have a cparen

118
00:04:42.950 --> 00:04:44.800
and each of these is going to be a little

119
00:04:44.800 --> 00:04:48.210
two element array to just store these things.

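[Editor's note: the ordered token table just described might look like this. The exact regex spellings are a guess at what's on screen; the ordering point from the video is what matters.]

```ruby
# Ordered [type, regex] pairs. Keywords come first so that "def" and "end"
# match as keywords before the broader identifier pattern gets a chance.
TOKEN_TYPES = [
  [:def,        /\bdef\b/],
  [:end,        /\bend\b/],
  [:identifier, /\b[a-zA-Z]+\b/],
  [:integer,    /\b[0-9]+\b/],
  [:oparen,     /\(/],
  [:cparen,     /\)/],
]
```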
120
00:04:48.210 --> 00:04:50.340
Now you might wonder why did I not use a hash

121
00:04:50.340 --> 00:04:52.170
because we do have these two values.

122
00:04:52.170 --> 00:04:55.320
And it seems pretty straightforward for this to map to this.

123
00:04:55.320 --> 00:04:56.153
And the reason is

124
00:04:56.153 --> 00:04:58.560
that these are in a very particular order

125
00:04:58.560 --> 00:05:02.270
specifically if we had identifiers first, for example

126
00:05:02.270 --> 00:05:05.040
then def would be interpreted as an identifier

127
00:05:05.040 --> 00:05:07.350
rather than as the def keyword

128
00:05:07.350 --> 00:05:09.720
because we're just going to go through this list in order.

129
00:05:09.720 --> 00:05:12.810
And so we have to maintain that order to have any kind

130
00:05:12.810 --> 00:05:15.340
of precedence between these token types.

131
00:05:15.340 --> 00:05:18.810
So now we have this token types array we can iterate over

132
00:05:18.810 --> 00:05:22.490
each of those elements has a type and a regex

133
00:05:22.490 --> 00:05:26.360
and we are going to compare that regex against the code.

134
00:05:26.360 --> 00:05:27.280
But in order to do that

135
00:05:27.280 --> 00:05:29.240
we need to do a couple tricky little bits.

136
00:05:29.240 --> 00:05:31.520
So actually let's first write the comparison.

137
00:05:31.520 --> 00:05:33.780
If the code matches this regex

138
00:05:33.780 --> 00:05:37.980
then we have found the token that is currently there.

139
00:05:37.980 --> 00:05:39.140
But if we just do this

140
00:05:39.140 --> 00:05:41.430
it's going to search the entire string.

141
00:05:41.430 --> 00:05:44.180
So any of these is going to match right now

142
00:05:44.180 --> 00:05:45.680
because there's no anchoring.

143
00:05:45.680 --> 00:05:49.230
So let's extend this regex by saying,

144
00:05:49.230 --> 00:05:51.580
first of all we have this regex

145
00:05:51.580 --> 00:05:55.230
you can use interpolation in Ruby for regexes,

146
00:05:55.230 --> 00:05:56.760
as well as for strings.

147
00:05:56.760 --> 00:05:59.491
And we're going to throw a hat on the beginning of it,

148
00:05:59.491 --> 00:06:01.380
which is wrong.

149
00:06:01.380 --> 00:06:03.950
The hat actually means the beginning of the line

150
00:06:03.950 --> 00:06:04.980
not beginning of string.

151
00:06:04.980 --> 00:06:07.840
So this is a bug and it's one that's very easy to make

152
00:06:07.840 --> 00:06:10.390
if you're not super familiar with regexes.

153
00:06:10.390 --> 00:06:12.880
And specifically what would happen here is

154
00:06:12.880 --> 00:06:15.018
in this case where we do actually have whitespace

155
00:06:15.018 --> 00:06:19.210
in our language, we would match the beginning

156
00:06:19.210 --> 00:06:21.800
of any of these three strings

157
00:06:21.800 --> 00:06:23.840
because the hat means beginning of line

158
00:06:23.840 --> 00:06:24.760
not beginning of a string.

159
00:06:24.760 --> 00:06:28.630
So if we throw a backslash A there that's like super hat

160
00:06:28.630 --> 00:06:30.810
which means only the true beginning

161
00:06:30.810 --> 00:06:32.720
of the string being matched.

162
00:06:32.720 --> 00:06:33.770
So now that we have that,

163
00:06:33.770 --> 00:06:35.470
we are also going to just throw a group

164
00:06:35.470 --> 00:06:36.697
around our original regex

165
00:06:36.697 --> 00:06:40.020
and this will allow us to pull the value of that token out.

166
00:06:40.020 --> 00:06:41.873
And now we are ready to actually match this thing

167
00:06:41.873 --> 00:06:43.040
against the code.

168
00:06:43.040 --> 00:06:44.940
So if the beginning of the code

169
00:06:44.940 --> 00:06:47.010
now that we have that backslash A

170
00:06:47.010 --> 00:06:49.750
if the beginning of the code matches our regex

171
00:06:49.750 --> 00:06:52.980
then we are first of all, going to store a token value

172
00:06:52.980 --> 00:06:54.923
that is dollar sign one.

173
00:06:54.923 --> 00:06:56.170
This is just going to be

174
00:06:56.170 --> 00:06:57.760
whatever was in the first capture group

175
00:06:57.760 --> 00:06:58.940
which happens to be the group

176
00:06:58.940 --> 00:07:01.660
that we put around the entire original regex

177
00:07:01.660 --> 00:07:04.640
defining the token value.

178
00:07:04.640 --> 00:07:07.100
Then we are going to truncate the code.

179
00:07:07.100 --> 00:07:09.460
So we're going to say from the value's length

180
00:07:09.460 --> 00:07:11.840
until the end of the string

181
00:07:11.840 --> 00:07:13.700
there are other ways we could do these two things

182
00:07:13.700 --> 00:07:15.059
but these are sort of straightforward

183
00:07:15.059 --> 00:07:17.610
if you don't know Ruby very well.

184
00:07:17.610 --> 00:07:19.920
And then we are going to construct a new token

185
00:07:19.920 --> 00:07:21.400
with the type,

186
00:07:21.400 --> 00:07:24.950
which we got from iterating over our type definition array

187
00:07:24.950 --> 00:07:27.670
and the new value of that token.

188
00:07:27.670 --> 00:07:28.930
We will return that early

189
00:07:28.930 --> 00:07:31.610
which is going to return from the whole tokenized method

190
00:07:31.610 --> 00:07:33.450
but we'll get to that in a moment.

191
00:07:33.450 --> 00:07:36.020
Let's define that class, which is just going to be a struct

192
00:07:36.020 --> 00:07:39.020
with two parameters type and value.

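[Editor's note: assembled from the steps just described, the single-token matcher might look like this. It is a sketch; the video's exact code may differ slightly, but the \A anchor, the capture group, the $1 value, and the truncation are all as narrated.]

```ruby
# A token is just a (type, value) pair.
Token = Struct.new(:type, :value)

# Ordered token definitions, keywords before identifiers.
TOKEN_TYPES = [
  [:def,        /\bdef\b/],
  [:end,        /\bend\b/],
  [:identifier, /\b[a-zA-Z]+\b/],
  [:integer,    /\b[0-9]+\b/],
  [:oparen,     /\(/],
  [:cparen,     /\)/],
]

class Tokenizer
  def initialize(code)
    @code = code
  end

  def tokenize_one_token
    TOKEN_TYPES.each do |type, re|
      # \A pins the match to the true start of the string, and the group
      # around the original regex captures the token's value into $1.
      if @code =~ /\A(#{re})/
        value = $1
        @code = @code[value.length..-1]  # truncate the matched text off the front
        return Token.new(type, value)
      end
    end
    raise "Couldn't match token on #{@code.inspect}"
  end
end
```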
193
00:07:39.020 --> 00:07:41.280
And if we run this, it returns nil

194
00:07:41.280 --> 00:07:43.880
which is not what I expected at all.

195
00:07:43.880 --> 00:07:44.780
Why is that happening?

196
00:07:44.780 --> 00:07:46.730
It means nothing is matching.

197
00:07:46.730 --> 00:07:47.840
Interesting.

198
00:07:47.840 --> 00:07:49.810
Okay so, this is a good time

199
00:07:49.810 --> 00:07:54.090
to put some error handling in here and say

200
00:07:55.200 --> 00:07:59.630
couldn't match token on the code

201
00:07:59.630 --> 00:08:04.630
and we will inspect that to make it nicely computer readable

202
00:08:04.700 --> 00:08:06.510
couldn't match token on empty string.

203
00:08:06.510 --> 00:08:07.590
Okay actually that seems

204
00:08:07.590 --> 00:08:10.440
like a perfectly reasonable problem to have.

205
00:08:10.440 --> 00:08:11.810
What's happening here I think

206
00:08:11.810 --> 00:08:13.970
is that we have not written this file

207
00:08:13.970 --> 00:08:15.570
on the right-hand side test.src.

208
00:08:16.670 --> 00:08:17.503
There we go.

209
00:08:17.503 --> 00:08:21.960
So now we're just returning the first token, which is fine.

210
00:08:21.960 --> 00:08:23.630
So let's extract this out.

211
00:08:23.630 --> 00:08:26.470
This is really the code that pulls a single token out.

212
00:08:26.470 --> 00:08:31.470
So we will def tokenize_one_token

213
00:08:32.400 --> 00:08:33.710
and put that in there.

214
00:08:33.710 --> 00:08:36.510
We will also put our little helper error in there

215
00:08:36.510 --> 00:08:39.280
in case anything goes wrong again.

216
00:08:39.280 --> 00:08:43.250
And now we will call that several times

217
00:08:43.250 --> 00:08:45.070
until the code is empty.

218
00:08:45.070 --> 00:08:47.690
So now could not match token on whitespace

219
00:08:47.690 --> 00:08:49.200
this is expected.

220
00:08:49.200 --> 00:08:51.997
We never told the tokenizer what to do with whitespace

221
00:08:51.997 --> 00:08:54.458
and there are many different approaches to handling this.

222
00:08:54.458 --> 00:08:57.230
And the one we're going to use is

223
00:08:57.230 --> 00:08:58.820
maybe the bluntest weapon of them all,

224
00:08:58.820 --> 00:09:01.570
which is to just say every single time we go through here

225
00:09:01.570 --> 00:09:04.050
we're going to just truncate all whitespace

226
00:09:04.050 --> 00:09:05.310
off of this thing.

227
00:09:05.310 --> 00:09:07.490
And this is maybe a good time to mention

228
00:09:07.490 --> 00:09:09.500
none of this is efficient at all

229
00:09:09.500 --> 00:09:11.650
but I'm just trying to use the simplest methods

230
00:09:11.650 --> 00:09:13.730
to show what the tokenizer does

231
00:09:13.730 --> 00:09:15.220
rather than what would be a good way

232
00:09:15.220 --> 00:09:18.560
to actually implement a fast one in the real world.

233
00:09:18.560 --> 00:09:21.310
So continuing running this, we got nil

234
00:09:21.310 --> 00:09:24.070
that is because we are not maintaining an array of tokens.

235
00:09:24.070 --> 00:09:26.660
So let's start an empty one, return it at the end

236
00:09:26.660 --> 00:09:30.640
and just shovel the new token in every single time.

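[Editor's note: the full tokenizer at this point, collected into one runnable sketch. The blunt whitespace-stripping each pass is as described in the video; details may differ from what's on screen.]

```ruby
Token = Struct.new(:type, :value)

TOKEN_TYPES = [
  [:def,        /\bdef\b/],
  [:end,        /\bend\b/],
  [:identifier, /\b[a-zA-Z]+\b/],
  [:integer,    /\b[0-9]+\b/],
  [:oparen,     /\(/],
  [:cparen,     /\)/],
]

class Tokenizer
  def initialize(code)
    @code = code
  end

  def tokenize
    tokens = []
    @code = @code.strip
    until @code.empty?
      tokens << tokenize_one_token
      @code = @code.strip  # the bluntest weapon: drop surrounding whitespace
    end
    tokens
  end

  def tokenize_one_token
    TOKEN_TYPES.each do |type, re|
      if @code =~ /\A(#{re})/
        value = $1
        @code = @code[value.length..-1]
        return Token.new(type, value)
      end
    end
    raise "Couldn't match token on #{@code.inspect}"
  end
end
```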
237
00:09:30.640 --> 00:09:33.620
And that gives us what looks very much like what we want.

238
00:09:33.620 --> 00:09:37.470
So let's say we'll inspect each of these

239
00:09:37.470 --> 00:09:38.890
then join them with a new line

240
00:09:38.890 --> 00:09:41.540
just to make a nicer output for our debugging.

241
00:09:41.540 --> 00:09:44.790
And that didn't work because that's the wrong method.

242
00:09:44.790 --> 00:09:48.460
So we have a def token, an identifier whose value is f,

243
00:09:48.460 --> 00:09:50.510
we have an open paren, we have a close paren.

244
00:09:50.510 --> 00:09:53.200
We have the integer one, and then we have an end.

245
00:09:53.200 --> 00:09:57.330
So we have successfully tokenized our code.

246
00:09:57.330 --> 00:10:00.770
We will have one more extension to this tokenizer later

247
00:10:00.770 --> 00:10:02.860
but the basic thing is done.

248
00:10:02.860 --> 00:10:05.040
And all we're going to have to do is add a new type

249
00:10:05.040 --> 00:10:07.390
later on when we get to commas.

250
00:10:07.390 --> 00:10:10.730
So moving on to the second phase of this compiler

251
00:10:10.730 --> 00:10:14.380
we have a tokenizer and we are now going to have a parser

252
00:10:14.380 --> 00:10:17.600
that takes the tokens and then parses them.

253
00:10:17.600 --> 00:10:19.230
And that's going to return a tree

254
00:10:19.230 --> 00:10:23.450
which we will once again, print out for debugging purposes.

255
00:10:23.450 --> 00:10:24.940
So let's define that class.

256
00:10:24.940 --> 00:10:27.690
It has an initializer that takes some tokens.

257
00:10:27.690 --> 00:10:29.910
It assigns them to an instance variable.

258
00:10:29.910 --> 00:10:31.970
It has a parse method.

259
00:10:31.970 --> 00:10:35.550
And the goal here is to transform the token stream,

260
00:10:35.550 --> 00:10:38.200
which is just an array of our token objects

261
00:10:38.200 --> 00:10:39.860
into a tree structure,

262
00:10:39.860 --> 00:10:41.760
representing the structure of the code

263
00:10:41.760 --> 00:10:43.270
that was input to the compiler.

264
00:10:43.270 --> 00:10:45.450
And it needs to match the human's understanding

265
00:10:45.450 --> 00:10:48.040
in their mind of how the different pieces

266
00:10:48.040 --> 00:10:50.750
how the different tokens are related.

267
00:10:50.750 --> 00:10:53.040
So let's look a little bit more deeply

268
00:10:53.040 --> 00:10:54.740
what that actually means.

269
00:10:54.740 --> 00:10:57.370
We are going to have a definition at the top level.

270
00:10:57.370 --> 00:10:58.203
That's what this is.

271
00:10:58.203 --> 00:10:59.460
We're not even going to worry about

272
00:10:59.460 --> 00:11:02.700
multi-definition files just for simplicity.

273
00:11:02.700 --> 00:11:05.380
And the definition is made up of three parts.

274
00:11:05.380 --> 00:11:07.090
It has a name being defined

275
00:11:07.090 --> 00:11:09.350
which happens to be f in our case,

276
00:11:09.350 --> 00:11:11.454
it has some arguments, some formal arguments

277
00:11:11.454 --> 00:11:14.210
being declared in our case that's nothing,

278
00:11:14.210 --> 00:11:16.240
but this might be for example, I don't know

279
00:11:16.240 --> 00:11:17.740
x and y and z.

280
00:11:17.740 --> 00:11:20.420
And I wanna note that these are just simple tokens.

281
00:11:20.420 --> 00:11:22.190
There's no structure here.

282
00:11:22.190 --> 00:11:24.400
We were not supporting anything more complex

283
00:11:24.400 --> 00:11:27.690
than argument names inside of a definition.

284
00:11:27.690 --> 00:11:29.220
We then have a body

285
00:11:29.220 --> 00:11:31.730
which is just going to be some kind of expression

286
00:11:31.730 --> 00:11:33.110
in our case it's the number one,

287
00:11:33.110 --> 00:11:35.370
but it could be something like a function call,

288
00:11:35.370 --> 00:11:37.070
or it could have multiple arguments

289
00:11:37.070 --> 00:11:39.460
or a nested function calls and so on,

290
00:11:39.460 --> 00:11:41.220
all of this will come in a few minutes

291
00:11:41.220 --> 00:11:45.640
for now we just have a body that is an integer literal

292
00:11:45.640 --> 00:11:46.530
let's call it.

293
00:11:46.530 --> 00:11:48.750
And the value of that is one.

294
00:11:48.750 --> 00:11:51.550
So we need to construct a structure like this

295
00:11:51.550 --> 00:11:54.010
from that token stream, which is what parsing is.

296
00:11:54.010 --> 00:11:57.876
So moving back to our actual parser.

297
00:11:57.876 --> 00:12:00.310
To do the parse we are going to parse a definition

298
00:12:00.310 --> 00:12:02.760
and I'm just doing this for a sort of clarity.

299
00:12:02.760 --> 00:12:04.640
Our parse_def function is going to need

300
00:12:04.640 --> 00:12:06.440
to pull out those three parts I mentioned,

301
00:12:06.440 --> 00:12:08.760
which were the name, the arguments

302
00:12:08.760 --> 00:12:10.210
or actually the argument names

303
00:12:10.210 --> 00:12:13.630
since they don't have any other structure and the body.

304
00:12:13.630 --> 00:12:16.480
The name is going to be a simple token.

305
00:12:16.480 --> 00:12:19.990
So we're going to consume an identifier token

306
00:12:19.990 --> 00:12:22.440
and we haven't defined this yet we'll do that in a moment.

307
00:12:22.440 --> 00:12:24.270
Actually, let's do it right now.

308
00:12:24.270 --> 00:12:25.720
Let's define a consume function

309
00:12:25.720 --> 00:12:27.970
which has an expected token type.

310
00:12:27.970 --> 00:12:30.060
And we are going to pull out

311
00:12:30.060 --> 00:12:32.910
the token on the front of the list.

312
00:12:32.910 --> 00:12:36.137
So that's going to be tokens.unshift

313
00:12:36.137 --> 00:12:37.810
unshift is a weird Perl-ism,

314
00:12:37.810 --> 00:12:42.300
or really, I forget which, maybe a shell-ism.

315
00:12:42.300 --> 00:12:45.640
If we have an array one, two, three, and say x.unshift

316
00:12:45.640 --> 00:12:48.170
it's going to that doesn't make any sense.

317
00:12:48.170 --> 00:12:50.680
If we say x.shift, excuse me

318
00:12:50.680 --> 00:12:52.200
it's going to give us the first element

319
00:12:52.200 --> 00:12:54.860
and destructively remove it from the list

320
00:12:54.860 --> 00:12:56.330
which is exactly what we want

321
00:12:56.330 --> 00:12:58.620
because we're just pulling off the front element.

322
00:12:58.620 --> 00:13:03.620
So we shift one token off, and then we are going to assert

323
00:13:04.880 --> 00:13:06.440
that it is in fact, this type.

324
00:13:06.440 --> 00:13:10.690
So if the token type equals the expected type

325
00:13:10.690 --> 00:13:12.340
then we just return the token.

326
00:13:12.340 --> 00:13:15.530
Otherwise we're going to raise a new runtime error

327
00:13:15.530 --> 00:13:19.670
and we're gonna say expected token type expected type

328
00:13:19.670 --> 00:13:22.620
but got actually let me inspect that

329
00:13:22.620 --> 00:13:24.450
to make it easier to read

330
00:13:24.450 --> 00:13:28.350
but we got the actual token.type.inspect.

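[Editor's note: the consume helper as narrated, sketched on its own. Token is the same two-field struct from the tokenizer; the shift-based token list and the error message wording follow the description above.]

```ruby
Token = Struct.new(:type, :value)

class Parser
  def initialize(tokens)
    @tokens = tokens
  end

  # Pull one token off the front of the list and assert its type.
  def consume(expected_type)
    token = @tokens.shift  # shift removes and returns the first element
    if token.type == expected_type
      token
    else
      raise RuntimeError.new(
        "Expected token type #{expected_type.inspect} " \
        "but got #{token.type.inspect}")
    end
  end
end
```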
331
00:13:28.350 --> 00:13:31.070
And this is going to be triggered a number of times

332
00:13:31.070 --> 00:13:34.860
as we continue to develop the system.

333
00:13:34.860 --> 00:13:36.719
So moving on

334
00:13:36.719 --> 00:13:39.230
let's actually run this and see what happens.

335
00:13:39.230 --> 00:13:42.400
We expected an identifier, but we got a def.

336
00:13:42.400 --> 00:13:43.233
So this is good.

337
00:13:43.233 --> 00:13:44.066
The def's coming in.

338
00:13:44.066 --> 00:13:45.520
We expect an identifier.

339
00:13:45.520 --> 00:13:48.060
Our little consume function is correctly telling us

340
00:13:48.060 --> 00:13:48.960
that something is wrong

341
00:13:48.960 --> 00:13:51.840
and what's wrong here is that when we parse a def,

342
00:13:51.840 --> 00:13:53.690
the name is not the first thing

343
00:13:53.690 --> 00:13:56.179
the def is the first thing.

344
00:13:56.179 --> 00:13:59.010
So we need to consume a def keyword token.

345
00:13:59.010 --> 00:14:02.160
And now if we run this, we get undefined local variable.

346
00:14:02.160 --> 00:14:03.360
That's not interesting.

347
00:14:03.360 --> 00:14:05.936
Let's say arg_names is going to be the result

348
00:14:05.936 --> 00:14:09.810
of parsing some arg_names.

349
00:14:09.810 --> 00:14:11.760
And the reason I'm introducing this function here

350
00:14:11.760 --> 00:14:14.860
is that the arg_names are more than a single token.

351
00:14:14.860 --> 00:14:16.780
So I like to separate that out

352
00:14:16.780 --> 00:14:18.970
rather than have a whole bunch of complexity

353
00:14:18.970 --> 00:14:20.430
in this one function.

354
00:14:20.430 --> 00:14:22.040
So let's write that function.

355
00:14:22.040 --> 00:14:24.720
It's going to consume the open paren

356
00:14:24.720 --> 00:14:28.080
at the beginning of the args, so that's this guy right here.

357
00:14:28.080 --> 00:14:30.489
Then it's going to consume the close paren.

358
00:14:30.489 --> 00:14:32.240
They will eventually be stuffed between those two

359
00:14:32.240 --> 00:14:35.190
but for now, for this simple case, we can just do this.

360
00:14:35.190 --> 00:14:37.110
We don't need to worry about multiple args

361
00:14:37.110 --> 00:14:39.020
and that'll let us move forward.

362
00:14:39.020 --> 00:14:40.373
So running this again,

363
00:14:41.423 --> 00:14:42.690
we get undefined local variable or method body,

364
00:14:42.690 --> 00:14:45.420
which means we actually got to here so that's good.

365
00:14:45.420 --> 00:14:48.940
And the body of the function is going to be an expression

366
00:14:48.940 --> 00:14:52.010
which is almost always a very complex part

367
00:14:52.010 --> 00:14:54.430
or maybe the most complex part of a grammar

368
00:14:54.430 --> 00:14:56.300
because it's deeply recursive

369
00:14:56.300 --> 00:14:58.230
with all the different types of expressions.

370
00:14:58.230 --> 00:15:00.480
But we'll see that in a moment.

371
00:15:00.480 --> 00:15:03.240
For now after that, we get an end token

372
00:15:03.240 --> 00:15:06.000
let's just put this in here, 'cause otherwise I will forget.

373
00:15:06.000 --> 00:15:07.530
And then we will move down

374
00:15:07.530 --> 00:15:11.040
and figure out how to actually parse an expression.

375
00:15:11.040 --> 00:15:14.030
In our case, the only expression we have is the number one.

376
00:15:14.030 --> 00:15:16.610
So we don't need to handle all that complexity yet.

377
00:15:16.610 --> 00:15:18.580
We can just say parse integer.

378
00:15:18.580 --> 00:15:23.580
And that's a function that just consumes an integer token.

379
00:15:23.930 --> 00:15:27.190
And running this now we get the original token output

380
00:15:27.190 --> 00:15:28.910
it would probably help if I actually had

381
00:15:28.910 --> 00:15:31.500
something useful to print here actually I do.

382
00:15:31.500 --> 00:15:32.930
Oh, okay it is doing the right thing.

383
00:15:32.930 --> 00:15:34.880
So here are the original tokens.

384
00:15:34.880 --> 00:15:37.050
And this is actually the result of the parse

385
00:15:37.050 --> 00:15:38.200
which is not what we want.

386
00:15:38.200 --> 00:15:39.770
We don't want a single token out.

387
00:15:39.770 --> 00:15:43.900
So let's have our parse_def function construct a DefNode.

388
00:15:43.900 --> 00:15:46.000
This is the actual parse tree node

389
00:15:46.000 --> 00:15:47.780
which is going to have a name

390
00:15:47.780 --> 00:15:51.050
some arg names and a body which is an expression.

391
00:15:51.050 --> 00:15:55.140
And that will be a struct that has once again

392
00:15:55.140 --> 00:15:57.990
name, arg_names, and a body.

393
00:15:57.990 --> 00:16:00.750
And now if I run this, we get a DefNode

394
00:16:00.750 --> 00:16:04.200
whose name is an identifier token with the value f,

395
00:16:04.200 --> 00:16:08.010
its arg names is a single token of type cparen

396
00:16:08.010 --> 00:16:09.510
so that doesn't look right

397
00:16:09.510 --> 00:16:12.060
and its body is an integer token.

398
00:16:12.060 --> 00:16:15.260
Now every single one of those attributes is not what I want.

399
00:16:15.260 --> 00:16:17.360
I don't want any tokens in the parse tree.

400
00:16:17.360 --> 00:16:20.710
So let's begin with name and just say dot value here

401
00:16:20.710 --> 00:16:22.670
so that we get the actual string f

402
00:16:22.670 --> 00:16:25.620
we don't need that whole token structure around.

403
00:16:25.620 --> 00:16:28.360
The arg names for now should just be empty array

404
00:16:28.360 --> 00:16:30.330
because we've hard coded it that way.

405
00:16:30.330 --> 00:16:31.163
So let's do that.

406
00:16:31.163 --> 00:16:32.550
That looks a little cleaner

407
00:16:32.550 --> 00:16:35.690
and then the body should just be, should not be a token

408
00:16:35.690 --> 00:16:38.540
but rather it should be a new type of parse tree node

409
00:16:38.540 --> 00:16:40.910
which is going to be an integer node.

410
00:16:40.910 --> 00:16:45.200
And it will have the value from that token.

411
00:16:45.200 --> 00:16:47.910
In fact, let me convert that to a Ruby integer,

412
00:16:47.910 --> 00:16:50.690
which will be our internal representation of integers,

413
00:16:50.690 --> 00:16:53.080
just because I might forget to do that later.

414
00:16:53.080 --> 00:16:55.750
So let's define that integer node that is going to be

415
00:16:55.750 --> 00:16:59.210
of course, a struct and it is going to have a value.
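
NOTE
As a sketch (struct and field names are taken from the transcript; the surrounding parser code is assumed), the two parse-tree structs and the node they produce look like this:
```ruby
# Parse-tree node for a definition: name (a String), arg_names, body.
DefNode = Struct.new(:name, :arg_names, :body)
# Parse-tree node for an integer literal; value is a Ruby Integer.
IntegerNode = Struct.new(:value)
# What parse_def builds for "def f() 1 end": name "f", no args, body 1.
tree = DefNode.new("f", [], IntegerNode.new(1))
```
Because both are Structs, they print readably, which is what makes the parse-tree output easy to eyeball.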

416
00:16:59.210 --> 00:17:00.440
And now if we run this

417
00:17:00.440 --> 00:17:02.890
we get something that looks a lot more like a parse tree.

418
00:17:02.890 --> 00:17:04.550
We have a DefNode the name is f,

419
00:17:04.550 --> 00:17:06.190
there are no argument names, and the body

420
00:17:06.190 --> 00:17:08.450
is an integer node with a value of one.

421
00:17:08.450 --> 00:17:11.870
And you can see it's not a string anymore, which is good.

422
00:17:11.870 --> 00:17:13.830
So we're now parsing this.

423
00:17:13.830 --> 00:17:15.700
And just as a note, it doesn't matter

424
00:17:15.700 --> 00:17:19.440
whether there are newlines or not; same result either way.

425
00:17:19.440 --> 00:17:22.400
So let's move on and do a more complex example

426
00:17:22.400 --> 00:17:25.680
where the function f takes an argument like x.

427
00:17:25.680 --> 00:17:27.520
This is of course going to blow us up.

428
00:17:27.520 --> 00:17:30.210
We expected a cparen when we were looking at those args

429
00:17:30.210 --> 00:17:31.750
but we got an identifier.

430
00:17:31.750 --> 00:17:34.560
And specifically we were expecting this close paren

431
00:17:34.560 --> 00:17:37.490
right here and instead we got this x

432
00:17:37.490 --> 00:17:39.300
and that's happening right here.

433
00:17:39.300 --> 00:17:41.740
So between these two consumes

434
00:17:41.740 --> 00:17:45.270
we need to put the code for consuming arbitrary sequences

435
00:17:45.270 --> 00:17:48.030
of formal function arguments.

436
00:17:48.030 --> 00:17:51.138
So to do that, let's first of all say

437
00:17:51.138 --> 00:17:53.080
if we peek at the next token

438
00:17:53.080 --> 00:17:55.420
so this is gonna be another new function.

439
00:17:55.420 --> 00:17:58.510
If we peek at the next token and see an identifier

440
00:17:58.510 --> 00:18:01.910
then we know that we are looking at actual arguments

441
00:18:01.910 --> 00:18:04.940
as opposed to an empty argument list.

442
00:18:04.940 --> 00:18:06.550
And when we see those, we need to

443
00:18:06.550 --> 00:18:09.480
or when we see that identifier, we first need to consume it.

444
00:18:09.480 --> 00:18:10.960
And this is going to give us an arg.

445
00:18:10.960 --> 00:18:13.310
So let's say we have some arg names

446
00:18:13.310 --> 00:18:15.680
we shovel in the value of that identifier

447
00:18:15.680 --> 00:18:17.140
which is the actual name.

448
00:18:17.140 --> 00:18:20.300
And that means, of course, I have to start that array off.

449
00:18:20.300 --> 00:18:23.930
And then while we continue to see commas, so x comma

450
00:18:23.930 --> 00:18:26.900
y comma z and so on, we're going to continue doing this.

451
00:18:26.900 --> 00:18:29.619
So while we peek and see a comma

452
00:18:29.619 --> 00:18:34.470
we are going to shovel yet another identifier in.

453
00:18:34.470 --> 00:18:36.160
And we also before doing that

454
00:18:36.160 --> 00:18:38.110
are going to consume the comma.

455
00:18:38.110 --> 00:18:39.890
So looking through this,

456
00:18:39.890 --> 00:18:43.670
this is a lot like a regex that says

457
00:18:43.670 --> 00:18:47.060
identifier and then comma,

458
00:18:47.060 --> 00:18:51.510
comma followed by identifier plus if you get my meaning.

459
00:18:51.510 --> 00:18:54.990
We see a single identifier then we see an arbitrary number

460
00:18:54.990 --> 00:18:58.580
of comma identifier, comma identifier, and so on.
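
NOTE
That identifier-then-commas loop can be sketched as follows; the Token struct and the consume/peek helpers are assumptions modeled on what the transcript describes:
```ruby
Token = Struct.new(:type, :value)
class Parser
  def initialize(tokens)
    @tokens = tokens
  end
  # identifier, then any number of "comma identifier" pairs.
  def parse_arg_names
    arg_names = []
    if peek(:identifier)
      arg_names << consume(:identifier).value
      while peek(:comma)
        consume(:comma)
        arg_names << consume(:identifier).value
      end
    end
    arg_names
  end
  def consume(expected_type)
    token = @tokens.shift
    raise "Expected #{expected_type} but got #{token.type}" unless token.type == expected_type
    token
  end
  def peek(expected_type)
    @tokens.fetch(0).type == expected_type
  end
end
```
Feeding it the tokens for "x, y)" yields ["x", "y"] and leaves the cparen for the caller to consume.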

461
00:18:58.580 --> 00:19:00.940
So this is almost ready to run.

462
00:19:00.940 --> 00:19:03.130
We need to return arg names down here

463
00:19:03.130 --> 00:19:04.620
and now this is gonna blow up,

464
00:19:04.620 --> 00:19:05.950
but it should blow up in a good way,

465
00:19:05.950 --> 00:19:08.650
which is undefined method peek of course

466
00:19:08.650 --> 00:19:10.330
so let's define that.

467
00:19:10.330 --> 00:19:12.500
Peek is going to be very simple.

468
00:19:12.500 --> 00:19:14.870
We're just going to have an expected type.

469
00:19:14.870 --> 00:19:18.020
And we are going to return whether or not

470
00:19:18.020 --> 00:19:21.620
the zeroth token equals, excuse me

471
00:19:21.620 --> 00:19:24.770
the zeroth token's type equals that expected type.

472
00:19:24.770 --> 00:19:28.100
And now if I run this, it does work on our example

473
00:19:28.100 --> 00:19:31.030
but that's because we only have a one arg here

474
00:19:31.030 --> 00:19:33.272
'cause I never saved this file once again.

475
00:19:33.272 --> 00:19:36.250
So this example works with only one arg,

476
00:19:36.250 --> 00:19:38.050
but if I throw a second arg in there,

477
00:19:38.050 --> 00:19:39.670
it's going to confuse it

478
00:19:39.670 --> 00:19:44.670
because the comma token type does not exist yet.

479
00:19:44.880 --> 00:19:46.360
So let's run this

480
00:19:46.360 --> 00:19:48.670
and couldn't match token on comma y

481
00:19:48.670 --> 00:19:50.990
so it doesn't know what to do with this character

482
00:19:50.990 --> 00:19:53.070
but fortunately for us we can fix that

483
00:19:53.070 --> 00:19:56.690
by just putting a new token type in here called comma.

484
00:19:56.690 --> 00:19:57.890
It's a literal comma.

485
00:19:57.890 --> 00:20:00.520
Well, it's a regex comma strictly speaking,

486
00:20:00.520 --> 00:20:02.700
but there's nothing fancy in there.
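
NOTE
A sketch of the table-driven tokenizer this implies; the token type names and regexes are assumptions modeled on the transcript (the comma entry is the new one):
```ruby
# Token types in match order; longest/most-specific patterns first.
TOKEN_TYPES = [
  [:def, /\bdef\b/],
  [:end, /\bend\b/],
  [:identifier, /\b[a-zA-Z]+\b/],
  [:integer, /\b[0-9]+\b/],
  [:oparen, /\(/],
  [:cparen, /\)/],
  [:comma, /,/],
]
Token = Struct.new(:type, :value)
def tokenize(code)
  tokens = []
  code = code.strip
  until code.empty?
    type, re = TOKEN_TYPES.find { |_, r| code =~ /\A(#{r})/ }
    raise "Couldn't match token on #{code.inspect}" if type.nil?
    value = code[/\A(#{re})/]
    tokens << Token.new(type, value)
    code = code[value.length..-1].strip
  end
  tokens
end
```
This is why adding comma was a one-line change: the loop never mentions specific token types.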

487
00:20:02.700 --> 00:20:04.970
And now if we run this, it does work

488
00:20:04.970 --> 00:20:08.160
and you can see the two arg names come out successfully.

489
00:20:08.160 --> 00:20:10.420
So this is what I meant about going back

490
00:20:10.420 --> 00:20:11.840
and just adding new token types

491
00:20:11.840 --> 00:20:13.240
without changing the core logic

492
00:20:13.240 --> 00:20:15.161
because none of this core tokenizing logic

493
00:20:15.161 --> 00:20:19.420
cares about the particular token types.

494
00:20:19.420 --> 00:20:22.910
Now moving on, we have arbitrary function arguments

495
00:20:22.910 --> 00:20:24.770
or at least definition arguments working.

496
00:20:24.770 --> 00:20:27.320
So let's move on to allowing function calls

497
00:20:27.320 --> 00:20:29.130
inside the body of a function.

498
00:20:29.130 --> 00:20:31.430
For example, we might wanna call g

499
00:20:31.430 --> 00:20:33.100
and that's gonna mess everything up.

500
00:20:33.100 --> 00:20:36.200
It expects token type integer, but got identifier.

501
00:20:36.200 --> 00:20:39.540
And that's because remember when we did parse expr

502
00:20:39.540 --> 00:20:41.587
we hard-coded it to parse an integer,

503
00:20:41.587 --> 00:20:43.890
and instead of an integer in the body

504
00:20:43.890 --> 00:20:46.640
we are getting this g open paren close paren thing

505
00:20:46.640 --> 00:20:48.670
and it doesn't know what to do with that.

506
00:20:48.670 --> 00:20:51.930
So going into parse expr, what we need to say here

507
00:20:51.930 --> 00:20:53.370
we need to have some kind of dispatch.

508
00:20:53.370 --> 00:20:57.320
We need to examine the current token and make a decision.

509
00:20:57.320 --> 00:21:00.100
So if we peek and find an integer

510
00:21:00.100 --> 00:21:02.130
then we are going to parse an integer.

511
00:21:02.130 --> 00:21:06.330
Otherwise we are going to parse a call, a function call.
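
NOTE
The dispatch in parse_expr can be sketched like this; only integers and zero-argument calls are handled at this stage, matching where the transcript is, and the helper names are assumptions:
```ruby
Token = Struct.new(:type, :value)
IntegerNode = Struct.new(:value)
CallNode = Struct.new(:name, :arg_exprs)
class Parser
  def initialize(tokens)
    @tokens = tokens
  end
  # Dispatch: an integer literal parses as an integer;
  # anything else falls through to a function call.
  def parse_expr
    if peek(:integer)
      parse_integer
    else
      parse_call
    end
  end
  def parse_integer
    IntegerNode.new(consume(:integer).value.to_i)
  end
  # Hard-coded to zero-argument calls like g() for now.
  def parse_call
    name = consume(:identifier).value
    consume(:oparen)
    consume(:cparen)
    CallNode.new(name, [])
  end
  def consume(expected_type)
    token = @tokens.shift
    raise "Expected #{expected_type} but got #{token.type}" unless token.type == expected_type
    token
  end
  def peek(expected_type)
    @tokens.fetch(0).type == expected_type
  end
end
```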

512
00:21:06.330 --> 00:21:08.680
So let's define that new function

513
00:21:08.680 --> 00:21:10.850
and for now we can sort of hard-code it

514
00:21:10.850 --> 00:21:12.680
to just do exactly this structure

515
00:21:12.680 --> 00:21:14.900
just like we've been doing all along.

516
00:21:14.900 --> 00:21:17.500
So we're going to consume an identifier,

517
00:21:17.500 --> 00:21:21.570
consume an oparen, consume a cparen.

518
00:21:21.570 --> 00:21:24.990
And if I run this, it does successfully exit.

519
00:21:24.990 --> 00:21:27.900
Of course the body is wrong, but that's our next step.

520
00:21:27.900 --> 00:21:29.460
So moving on to that

521
00:21:29.460 --> 00:21:32.650
we are going to introduce a new type of node

522
00:21:32.650 --> 00:21:35.770
which is a call node that represents a function call.

523
00:21:35.770 --> 00:21:39.680
It has a name and some arguments.

524
00:21:39.680 --> 00:21:43.080
And in fact I will name this arg expr to make it very clear

525
00:21:43.080 --> 00:21:47.090
that the arguments to a function call are expressions.

526
00:21:47.090 --> 00:21:49.380
And this is a very big difference from the arguments

527
00:21:49.380 --> 00:21:51.870
to a definition, which are just names.
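
NOTE
That distinction in one sketch (struct names from the transcript): a definition's arguments are plain name strings, while a call's arguments are expression nodes.
```ruby
DefNode = Struct.new(:name, :arg_names, :body)  # arg_names: plain strings
CallNode = Struct.new(:name, :arg_exprs)        # arg_exprs: expression *nodes*
IntegerNode = Struct.new(:value)
# Calling g(1): the argument is an expression node, not a name.
call = CallNode.new("g", [IntegerNode.new(1)])
```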

528
00:21:51.870 --> 00:21:54.360
One other note here I am including the name

529
00:21:54.360 --> 00:21:56.330
and the call node, which means we can't do things

530
00:21:56.330 --> 00:21:59.890
like call f which returns a function that we then call.

531
00:21:59.890 --> 00:22:01.830
Of course, we could do that,

532
00:22:01.830 --> 00:22:04.090
but in this particular demo we're not going to

533
00:22:04.090 --> 00:22:06.350
just in the name of simplicity.

534
00:22:06.350 --> 00:22:08.380
So going back up to parse call

535
00:22:08.380 --> 00:22:10.440
we are going to instantiate a call node.

536
00:22:10.440 --> 00:22:14.070
It is going to have a name and some arg exprs.

537
00:22:14.070 --> 00:22:16.390
The name is just going to be the result

538
00:22:16.390 --> 00:22:20.530
of consuming that identifier and the arg exprs, excuse me.

539
00:22:20.530 --> 00:22:23.550
Yes arg exprs for now are just going to be empty

540
00:22:23.550 --> 00:22:27.250
because we're hard coded to only handle functions

541
00:22:27.250 --> 00:22:29.100
with no arguments.

542
00:22:29.100 --> 00:22:33.430
So running this, we now get a body that is a call node.

543
00:22:33.430 --> 00:22:36.240
It has a name of g and it has no arguments

544
00:22:36.240 --> 00:22:38.610
which is exactly what we see over here.

545
00:22:38.610 --> 00:22:41.350
So the next step then the next sort of straightforward step

546
00:22:41.350 --> 00:22:42.740
is to add an argument.

547
00:22:42.740 --> 00:22:44.870
And as usual watch it blow up.

548
00:22:44.870 --> 00:22:47.780
I'd expected a cparen but it got an identifier

549
00:22:47.780 --> 00:22:50.730
and that's happening right here where we are not consuming

550
00:22:50.730 --> 00:22:52.570
whatever's between those two.

551
00:22:52.570 --> 00:22:54.410
So what we're going to do is actually

552
00:22:54.410 --> 00:22:58.530
I will delete that stuff and that's not right.

553
00:22:58.530 --> 00:23:02.750
Let's actually introduce a parse arg exprs function

554
00:23:02.750 --> 00:23:04.820
and that will be assigned to that variable

555
00:23:04.820 --> 00:23:07.260
to keep that complexity out of that function

556
00:23:07.260 --> 00:23:09.150
out of that parse call function.

557
00:23:09.150 --> 00:23:12.460
So this should do the same thing and it does.

558
00:23:12.460 --> 00:23:15.260
And now we're ready to actually consume those tokens

559
00:23:15.260 --> 00:23:17.160
that are inside the parens,

560
00:23:17.160 --> 00:23:18.750
but it turns out that that process

561
00:23:18.750 --> 00:23:20.970
is basically this process.

562
00:23:20.970 --> 00:23:22.270
It's not exactly this process

563
00:23:22.270 --> 00:23:25.987
but it's very similar to the handling of the argument names

564
00:23:25.987 --> 00:23:27.270
and the function definition.

565
00:23:27.270 --> 00:23:28.960
So I'm going to reuse this code.

566
00:23:28.960 --> 00:23:32.560
Well, not strict, oops, I just lost it from my buffer.

567
00:23:32.560 --> 00:23:34.590
I'm going to reuse it and then modify it

568
00:23:34.590 --> 00:23:36.080
because they are not exactly the same

569
00:23:36.080 --> 00:23:37.620
but they are quite similar.

570
00:23:37.620 --> 00:23:39.500
We need to begin with

571
00:23:39.500 --> 00:23:43.150
a list of arg exprs and we're going to return it at the end.

572
00:23:43.150 --> 00:23:45.543
We also do have an open paren and a closed paren

573
00:23:45.543 --> 00:23:47.000
just like before.

574
00:23:47.000 --> 00:23:48.470
But the major difference

575
00:23:48.470 --> 00:23:51.660
is that we don't have just identifiers in here anymore.

576
00:23:51.660 --> 00:23:53.430
We have entire expressions.

577
00:23:53.430 --> 00:23:55.250
So we're going to parse an expr.

578
00:23:55.250 --> 00:23:57.740
And then if we see a comma after that expr

579
00:23:57.740 --> 00:23:58.660
we're going to consume it.

580
00:23:58.660 --> 00:24:01.010
And then we're going to parse another expr

581
00:24:01.010 --> 00:24:04.320
and continue doing that until we stop seeing commas.
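
NOTE
A sketch of parse_arg_exprs; it mirrors the arg-names loop but recurses into parse_expr, which is trimmed to integer literals here to keep the sketch short:
```ruby
Token = Struct.new(:type, :value)
IntegerNode = Struct.new(:value)
class Parser
  def initialize(tokens)
    @tokens = tokens
  end
  # oparen, then expr ("," expr)*, then cparen.
  def parse_arg_exprs
    arg_exprs = []
    consume(:oparen)
    unless peek(:cparen)
      arg_exprs << parse_expr
      while peek(:comma)
        consume(:comma)
        arg_exprs << parse_expr
      end
    end
    consume(:cparen)
    arg_exprs
  end
  # Trimmed to integer literals for brevity.
  def parse_expr
    IntegerNode.new(consume(:integer).value.to_i)
  end
  def consume(expected_type)
    token = @tokens.shift
    raise "Expected #{expected_type} but got #{token.type}" unless token.type == expected_type
    token
  end
  def peek(expected_type)
    @tokens.fetch(0).type == expected_type
  end
end
```
Note the guard peeks for cparen rather than identifier, exactly because an argument can be any expression.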

582
00:24:04.320 --> 00:24:06.270
We also need to change this guard up here

583
00:24:06.270 --> 00:24:08.920
because we won't necessarily see an identifier.

584
00:24:08.920 --> 00:24:12.310
So what we should say is if we peek and we do not see

585
00:24:12.310 --> 00:24:14.460
an oparen, no, not an open paren,

586
00:24:14.460 --> 00:24:18.620
if we don't see a cparen next then that must mean

587
00:24:18.620 --> 00:24:20.470
that we had some args in there.

588
00:24:20.470 --> 00:24:23.730
And now if I run this guy: expected token type oparen

589
00:24:23.730 --> 00:24:26.689
but got cparen. That is not what I expected.

590
00:24:26.689 --> 00:24:28.830
What is going on here?

591
00:24:28.830 --> 00:24:30.620
We are peeking at the cparen

592
00:24:30.620 --> 00:24:34.830
and we are only going in here if it doesn't exist,

593
00:24:34.830 --> 00:24:36.660
we are parsing an expression.

594
00:24:36.660 --> 00:24:38.550
Oh, right of course,

595
00:24:38.550 --> 00:24:40.640
x is not an expression currently.

596
00:24:40.640 --> 00:24:42.700
So let's make that one for now.

597
00:24:42.700 --> 00:24:44.750
So that seems right.

598
00:24:44.750 --> 00:24:47.500
We have a function f its arguments are x and y,

599
00:24:47.500 --> 00:24:50.360
the body is a function call to the function g

600
00:24:50.360 --> 00:24:54.660
and the argument is a single integer with a value of one.

601
00:24:54.660 --> 00:24:57.430
Now let's go back to the thing that just tripped me up

602
00:24:57.430 --> 00:24:59.350
having an actual variable here.

603
00:24:59.350 --> 00:25:00.183
This is failing

604
00:25:00.183 --> 00:25:04.220
because we just have no way to have a variable reference

605
00:25:04.220 --> 00:25:05.170
that is an expression

606
00:25:05.170 --> 00:25:07.620
which seems like a pretty big omission.

607
00:25:07.620 --> 00:25:09.250
So let's go up here.

608
00:25:09.250 --> 00:25:12.410
And actually, I'm sorry, one more time

609
00:25:12.410 --> 00:25:15.040
before we do that let me make sure that multi argument stuff

610
00:25:15.040 --> 00:25:17.120
is working and let's see

611
00:25:17.120 --> 00:25:19.980
arguments are an integer node of one, integer node of two.

612
00:25:19.980 --> 00:25:21.830
Yes okay, that's working.

613
00:25:21.830 --> 00:25:25.750
So now let's handle variable references.

614
00:25:25.750 --> 00:25:28.980
This is failing on line 120, which is right here.

615
00:25:28.980 --> 00:25:30.210
That's not very interesting.

616
00:25:30.210 --> 00:25:33.270
We're failing on 101 from 94.

617
00:25:33.270 --> 00:25:35.910
So what's happening here is

618
00:25:35.910 --> 00:25:38.427
it wants to parse an expression in the,

619
00:25:38.427 --> 00:25:39.950
and this is the argument itself.

620
00:25:39.950 --> 00:25:41.680
It already parsed the g and the oparen.

621
00:25:41.680 --> 00:25:42.950
It wants to figure out what this thing

622
00:25:42.950 --> 00:25:44.560
inside the parens is.

623
00:25:44.560 --> 00:25:47.020
And it knows it's not an integer, so that doesn't happen.

624
00:25:47.020 --> 00:25:49.910
So it just falls through and says, it's a function call

625
00:25:49.910 --> 00:25:51.953
but of course, x is not a function call

626
00:25:51.953 --> 00:25:54.170
it's a reference to a variable.

627
00:25:54.170 --> 00:25:56.723
So we need to differentiate between those two.

628
00:25:58.430 --> 00:26:00.100
And I'm going to do that by saying

629
00:26:00.100 --> 00:26:04.210
if we have an identifier

630
00:26:04.210 --> 00:26:07.940
and the next thing after the identifier is an open parens.

631
00:26:07.940 --> 00:26:10.100
So we're peeking one token ahead here

632
00:26:10.100 --> 00:26:12.390
and I'm just making up this new argument.

633
00:26:12.390 --> 00:26:14.007
If we see an identifier followed by an oparen

634
00:26:14.007 --> 00:26:15.450
then it must be a call

635
00:26:15.450 --> 00:26:19.060
otherwise it is going to be a variable reference.

636
00:26:19.060 --> 00:26:20.940
So we need to introduce a couple new things here.

637
00:26:20.940 --> 00:26:24.000
First of all the peek function needs to be extended

638
00:26:24.000 --> 00:26:26.830
to take an offset which by default is zero.

639
00:26:26.830 --> 00:26:29.680
And we will just pass that into the fetch from the array.

640
00:26:29.680 --> 00:26:31.690
So if we pass in no offset

641
00:26:31.690 --> 00:26:33.590
it just gets the front of the array.

642
00:26:33.590 --> 00:26:36.090
But if we pass in a one it's going to look at the token

643
00:26:36.090 --> 00:26:37.990
after the next token.

644
00:26:37.990 --> 00:26:41.340
And the reason for that is we don't wanna be consuming that

645
00:26:41.340 --> 00:26:42.450
when we're making this check,

646
00:26:42.450 --> 00:26:45.060
because if it turns out that it's not a function call

647
00:26:45.060 --> 00:26:48.320
we really don't want to have pulled off this identifier.
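
NOTE
The extended peek just threads an offset (default zero) into the token array lookup; a sketch, with a hypothetical two-token stream:
```ruby
Token = Struct.new(:type, :value)
class Parser
  def initialize(tokens)
    @tokens = tokens
  end
  # offset 0 inspects the next token, offset 1 the one after it;
  # neither consumes anything.
  def peek(expected_type, offset = 0)
    @tokens.fetch(offset).type == expected_type
  end
end
parser = Parser.new([Token.new(:identifier, "x"), Token.new(:oparen, "(")])
```
Here parser.peek(:identifier) and parser.peek(:oparen, 1) are both true, so this token stream starts a call.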

648
00:26:48.320 --> 00:26:51.840
The other thing we need to add is parse var_ref.

649
00:26:51.840 --> 00:26:55.160
So let's put that after all this call related stuff

650
00:26:55.160 --> 00:26:57.500
we will have a parse_var_ref function

651
00:26:57.500 --> 00:27:01.140
and it is simply going to consume an identifier.

652
00:27:01.140 --> 00:27:02.700
And we will extend this in just a moment

653
00:27:02.700 --> 00:27:05.120
but for now that should be sufficient to run this

654
00:27:05.120 --> 00:27:09.100
except that I have totally messed up my syntax somewhere.

655
00:27:09.100 --> 00:27:12.290
I'm missing an end where, ah, right here.

656
00:27:12.290 --> 00:27:15.130
This should be an else if, maybe the most awkward token

657
00:27:15.130 --> 00:27:17.283
in the entire Ruby programming language.

658
00:27:18.140 --> 00:27:21.740
Running this now we get arguments of x and y,

659
00:27:21.740 --> 00:27:24.230
a body that is a call to function g

660
00:27:24.230 --> 00:27:26.760
where the arguments are an identifier x.

661
00:27:26.760 --> 00:27:28.430
And again, we have a token in the output.

662
00:27:28.430 --> 00:27:30.220
So let's get rid of that.

663
00:27:30.220 --> 00:27:32.510
That's happening in parse_var_ref

664
00:27:32.510 --> 00:27:34.270
where it's just returning a token.

665
00:27:34.270 --> 00:27:37.010
So let's construct a new var_ref node.

666
00:27:37.010 --> 00:27:38.970
And it's just going to get the value

667
00:27:38.970 --> 00:27:41.030
of that identifier token.

668
00:27:41.030 --> 00:27:43.500
And defining the corresponding struct

669
00:27:43.500 --> 00:27:46.130
it only has a value just like the integer

670
00:27:46.130 --> 00:27:48.760
'cause there's really no other details there.
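
NOTE
Putting the lookahead and the new node together, a sketch of the full dispatch (zero-argument calls only, for brevity; helper names are assumptions):
```ruby
Token = Struct.new(:type, :value)
IntegerNode = Struct.new(:value)
VarRefNode = Struct.new(:value)
CallNode = Struct.new(:name, :arg_exprs)
class Parser
  def initialize(tokens)
    @tokens = tokens
  end
  # Two-token lookahead: identifier + oparen is a call,
  # a bare identifier is a variable reference.
  def parse_expr
    if peek(:integer)
      IntegerNode.new(consume(:integer).value.to_i)
    elsif peek(:identifier) && peek(:oparen, 1)
      parse_call
    else
      parse_var_ref
    end
  end
  def parse_var_ref
    VarRefNode.new(consume(:identifier).value)
  end
  def parse_call
    name = consume(:identifier).value
    consume(:oparen)
    consume(:cparen) # zero-arg calls only in this sketch
    CallNode.new(name, [])
  end
  def consume(expected_type)
    token = @tokens.shift
    raise "Expected #{expected_type} but got #{token.type}" unless token.type == expected_type
    token
  end
  def peek(expected_type, offset = 0)
    @tokens.fetch(offset).type == expected_type
  end
end
```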

671
00:27:48.760 --> 00:27:50.130
So running this again

672
00:27:50.130 --> 00:27:52.580
we now have an argument expression that is a reference

673
00:27:52.580 --> 00:27:54.180
to the variable x,

674
00:27:54.180 --> 00:27:56.410
which of course is defined in the arguments

675
00:27:56.410 --> 00:27:59.400
although the compiler doesn't even know that.

676
00:27:59.400 --> 00:28:02.500
So going back, we are not yet handling.

677
00:28:02.500 --> 00:28:03.980
Are we handling multiple arguments here?

678
00:28:03.980 --> 00:28:04.813
Yes we are.

679
00:28:04.813 --> 00:28:07.750
So that should work by default

680
00:28:07.750 --> 00:28:10.070
just because we've implemented variable references

681
00:28:10.070 --> 00:28:11.360
and multiple arguments.

682
00:28:11.360 --> 00:28:13.870
So there it is.

683
00:28:13.870 --> 00:28:16.830
The arguments to the g call are a reference

684
00:28:16.830 --> 00:28:20.010
to the variable x and a reference to the variable y

685
00:28:20.010 --> 00:28:22.900
and now if I am not mistaken

686
00:28:22.900 --> 00:28:25.600
we have implemented everything I want to implement

687
00:28:25.600 --> 00:28:26.570
in terms of parsing.

688
00:28:26.570 --> 00:28:29.970
Yes we can mix a variable references and integers.

689
00:28:29.970 --> 00:28:31.310
Everything is good.

690
00:28:31.310 --> 00:28:34.970
So we are now going to move on to the third and final phase

691
00:28:34.970 --> 00:28:37.900
of this compiler, which is going to be a code generator.

692
00:28:37.900 --> 00:28:41.590
It will take the tree and actually, no it won't

693
00:28:41.590 --> 00:28:43.340
we'll instantiate it with no arguments.

694
00:28:43.340 --> 00:28:46.230
And then we will tell it to generate from our tree

695
00:28:46.230 --> 00:28:48.860
which will give us some generated code.

696
00:28:48.860 --> 00:28:51.020
So let's introduce that class

697
00:28:51.020 --> 00:28:52.580
and it's going to be generator.

698
00:28:52.580 --> 00:28:56.440
It is going to have a generate function that takes a node.

699
00:28:56.440 --> 00:28:57.780
Down here we're calling it tree,

700
00:28:57.780 --> 00:29:00.150
but it's just really the top tree

701
00:29:00.150 --> 00:29:02.580
the top node of the tree.

702
00:29:02.580 --> 00:29:05.070
And we are just going to,

703
00:29:05.070 --> 00:29:06.990
we are going to case on that

704
00:29:06.990 --> 00:29:09.720
which is sort of like select in other languages

705
00:29:09.720 --> 00:29:11.780
although it's a little weirder than that in Ruby,

706
00:29:11.780 --> 00:29:13.140
because Ruby is Ruby.

707
00:29:13.140 --> 00:29:16.730
But anyway, the default for this is always going to be

708
00:29:16.730 --> 00:29:18.237
to raise a runtime error.

709
00:29:18.237 --> 00:29:21.450
And this will let us use an early failure

710
00:29:21.450 --> 00:29:22.990
to guide us forward.

711
00:29:22.990 --> 00:29:26.880
So let's raise a runtime error, unexpected node type

712
00:29:26.880 --> 00:29:30.610
node dot class, just to help us debug.

713
00:29:30.610 --> 00:29:32.910
I happen to know because of Ruby syntax

714
00:29:32.910 --> 00:29:35.350
I have to put that there.

715
00:29:35.350 --> 00:29:38.240
But if we run this unexpected node type DefNode.

716
00:29:38.240 --> 00:29:40.850
So the very top of the tree is a def node.

717
00:29:40.850 --> 00:29:42.190
It's being passed into generate

718
00:29:42.190 --> 00:29:44.630
and we haven't added a case to handle it.

719
00:29:44.630 --> 00:29:47.730
So it falls through and just blows up in the else.

720
00:29:47.730 --> 00:29:49.770
So handling that, we're going to say

721
00:29:49.770 --> 00:29:52.470
when we see a def node

722
00:29:52.470 --> 00:29:55.630
and notice this is one of the ways that Ruby's case is weird

723
00:29:55.630 --> 00:29:57.330
we cased on the value

724
00:29:57.330 --> 00:30:00.540
but we're whening on the class and that's totally okay

725
00:30:00.540 --> 00:30:01.960
that will match.

726
00:30:01.960 --> 00:30:03.320
So when we see a definition

727
00:30:03.320 --> 00:30:05.110
we're going to generate some JavaScript code

728
00:30:05.110 --> 00:30:06.560
which is our target language.

729
00:30:06.560 --> 00:30:09.840
And it's gonna look something like this.

730
00:30:09.840 --> 00:30:11.560
Now, of course I'm hard coding everything here

731
00:30:11.560 --> 00:30:12.930
just for demonstration.

732
00:30:12.930 --> 00:30:15.330
So let's put some placeholders in

733
00:30:15.330 --> 00:30:17.390
in the places that we want to actually replace things

734
00:30:17.390 --> 00:30:19.290
like the name is going to be variable.

735
00:30:19.290 --> 00:30:21.200
The arguments are going to be variable

736
00:30:21.200 --> 00:30:24.070
and the thing returned is going to be variable.

737
00:30:24.070 --> 00:30:25.640
And we will substitute those

738
00:30:25.640 --> 00:30:28.500
in using the Ruby percent operator.

739
00:30:28.500 --> 00:30:31.250
If you know Python, it's basically the same thing.

740
00:30:31.250 --> 00:30:33.180
And the three things we're going to put in there

741
00:30:33.180 --> 00:30:35.560
are going to come off of this DefNode structure

742
00:30:35.560 --> 00:30:37.490
which is what we're currently looking at.

743
00:30:37.490 --> 00:30:40.670
So we have a node.name,

744
00:30:40.670 --> 00:30:43.160
we have some node.arg_names

745
00:30:44.190 --> 00:30:46.000
which we are going to map over

746
00:30:46.000 --> 00:30:48.160
and do we need to convert those strings?

747
00:30:48.160 --> 00:30:49.770
No, they're already strings.

748
00:30:49.770 --> 00:30:52.910
We're just going to join those with a comma.

749
00:30:52.910 --> 00:30:55.540
And then we're going to have the actual body

750
00:30:55.540 --> 00:30:58.510
which is going to require us to recursively generate

751
00:30:58.510 --> 00:30:59.343
that body.

752
00:30:59.343 --> 00:31:00.860
So node.body.

753
00:31:00.860 --> 00:31:02.010
And the reason for this is that

754
00:31:02.010 --> 00:31:03.760
the body is an arbitrary expression.

755
00:31:03.760 --> 00:31:06.930
And we're gonna end up with case with whens here

756
00:31:06.930 --> 00:31:10.690
for every single node type, which means all four of these.
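
NOTE
A sketch of the DefNode case; the exact JavaScript template is an assumption, but the two key moves from the transcript are there: the % (format) substitution and the recursion into the body:
```ruby
DefNode = Struct.new(:name, :arg_names, :body)
IntegerNode = Struct.new(:value)
class Generator
  def generate(node)
    case node
    when DefNode
      # %s placeholders filled with Ruby's % (format) operator.
      "function %s(%s) { return %s };" % [
        node.name,
        node.arg_names.join(","),
        generate(node.body), # recurse: the body is an arbitrary expression
      ]
    when IntegerNode
      node.value.to_s
    else
      raise RuntimeError.new("Unexpected node type: #{node.class}")
    end
  end
end
```
The else clause is the early-failure guide: any node type without a when blows up loudly with its class name.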

757
00:31:10.690 --> 00:31:15.020
So moving on, let's add a when for the actual

758
00:31:15.020 --> 00:31:16.520
let's see it blow up.

759
00:31:16.520 --> 00:31:18.990
It blows up unexpected node type call node.

760
00:31:18.990 --> 00:31:19.823
So that's good.

761
00:31:19.823 --> 00:31:20.948
Let's add a when for that.

762
00:31:20.948 --> 00:31:23.450
And I will just say, this is a call

763
00:31:23.450 --> 00:31:26.200
just to make it really clear what's going on here.

764
00:31:26.200 --> 00:31:28.860
And actually it would help if we printed

765
00:31:28.860 --> 00:31:31.410
the result of generating that stuff.

766
00:31:31.410 --> 00:31:33.300
And here we have function f(x,y)

767
00:31:33.300 --> 00:31:36.270
so this f, x and y came from our parse tree.

768
00:31:36.270 --> 00:31:37.940
We really round trip to that.

769
00:31:37.940 --> 00:31:41.110
And of course here's the actual expression.

770
00:31:41.110 --> 00:31:43.910
So let's put a more real thing in there.

771
00:31:43.910 --> 00:31:47.560
A function call in JavaScript is just f open paren

772
00:31:47.560 --> 00:31:50.040
and then the args, and then a closed paren.

773
00:31:50.040 --> 00:31:53.060
So let's put placeholders in there once again

774
00:31:53.060 --> 00:31:56.350
and we will fill those in using number one

775
00:31:56.350 --> 00:31:59.010
the name of the function being called,

776
00:31:59.010 --> 00:32:01.660
and then number two the actual expressions

777
00:32:01.660 --> 00:32:04.500
that make up the arguments.

778
00:32:04.500 --> 00:32:06.560
So starting with the first of those two

779
00:32:06.560 --> 00:32:10.310
it is going to be node.name that's this field.

780
00:32:10.310 --> 00:32:13.860
And then we are going to take the arg exprs.

781
00:32:13.860 --> 00:32:15.350
We are going to map over them

782
00:32:15.350 --> 00:32:18.370
and for each expr, we are going to recursively generate

783
00:32:18.370 --> 00:32:21.210
the corresponding JavaScript code for that.

784
00:32:21.210 --> 00:32:24.350
And now if I run this, it fails because of a var_ref node

785
00:32:24.350 --> 00:32:25.183
that's okay.

786
00:32:25.183 --> 00:32:27.690
Let's just add a placeholder here to make that shut up.

787
00:32:27.690 --> 00:32:29.620
This is a var.

788
00:32:29.620 --> 00:32:33.620
And now, that didn't work: integer node. So, when IntegerNode,

789
00:32:33.620 --> 00:32:38.210
this is an int, and now we get function f of x and y

790
00:32:38.210 --> 00:32:40.490
returning, calling the function g so that's good.

791
00:32:40.490 --> 00:32:41.870
We made one more step.

792
00:32:41.870 --> 00:32:46.140
And the arguments here are two variables and an integer.

793
00:32:46.140 --> 00:32:47.620
Moving on to the var_ref node.

794
00:32:47.620 --> 00:32:50.050
That is going to be this type here

795
00:32:50.050 --> 00:32:51.700
which has this value field.

796
00:32:51.700 --> 00:32:54.090
The value is the name of the variable being referenced.

797
00:32:54.090 --> 00:32:56.170
So we can just say node.value

798
00:32:56.170 --> 00:33:00.300
and the same thing for node for the integer node.

799
00:33:00.300 --> 00:33:01.570
And now if I run this

800
00:33:01.570 --> 00:33:04.710
this looks an awful lot like something that is correct.

801
00:33:04.710 --> 00:33:07.690
If you look at the input code up here, def f of x and y

802
00:33:07.690 --> 00:33:11.040
g of x and y one down here, this is not quite right

803
00:33:11.040 --> 00:33:14.450
but def f of x and y, return g of x, y, one.

804
00:33:14.450 --> 00:33:19.450
These two things are the only problems and they are problems

805
00:33:19.920 --> 00:33:24.920
because we are not joining these expressions.

806
00:33:27.470 --> 00:33:30.540
So we recursively generate JavaScript code

807
00:33:30.540 --> 00:33:33.040
for each argument expression to the function call

808
00:33:33.040 --> 00:33:35.460
but then we were not joining them as a single string.

809
00:33:35.460 --> 00:33:38.900
And so we were seeing some Ruby stuff in the output.

810
00:33:38.900 --> 00:33:41.980
So running this again, we get function f(x,y)

811
00:33:41.980 --> 00:33:44.940
return g of x y one, which is exactly what we put in.
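
NOTE
A sketch of the expression cases with that join fix in place (template strings are assumptions; the join(",") is the point):
```ruby
IntegerNode = Struct.new(:value)
VarRefNode = Struct.new(:value)
CallNode = Struct.new(:name, :arg_exprs)
class Generator
  def generate(node)
    case node
    when CallNode
      # join(",") is the fix: without it Ruby's array formatting
      # leaks into the generated JavaScript.
      "%s(%s)" % [node.name, node.arg_exprs.map { |expr| generate(expr) }.join(",")]
    when VarRefNode
      node.value
    when IntegerNode
      node.value.to_s
    else
      raise RuntimeError.new("Unexpected node type: #{node.class}")
    end
  end
end
```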

812
00:33:44.940 --> 00:33:48.850
And so we have now successfully round tripped that function.

813
00:33:48.850 --> 00:33:51.570
Now, before we move on and do an example

814
00:33:51.570 --> 00:33:53.727
where we actually run the compiled JavaScript code

815
00:33:53.727 --> 00:33:57.340
I wanna note a very important symmetry

816
00:33:57.340 --> 00:34:01.010
between the parser and the code generator.

817
00:34:01.010 --> 00:34:05.210
The parser is a fairly complex recursive set of functions.

818
00:34:05.210 --> 00:34:09.040
For example, parse expr calls parse call

819
00:34:09.040 --> 00:34:13.740
and parse call calls parse arg exprs to parse the arguments

820
00:34:13.740 --> 00:34:14.930
to the function call

821
00:34:14.930 --> 00:34:17.770
but then that turns around and calls parse expr again.

822
00:34:17.770 --> 00:34:20.750
So we have this sort of sequence which is parse expr

823
00:34:20.750 --> 00:34:25.750
parse call, parse arg exprs, parse expr, parse call,

824
00:34:26.400 --> 00:34:29.110
and so on at an arbitrary depth.

825
00:34:29.110 --> 00:34:32.930
So we can have as many nested function calls as we want.

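That call cycle (parse_expr into parse_call into parse_arg_exprs and back into parse_expr) can be sketched over a flat token list. This is a toy version: the token shapes and method names are guesses from the transcript, and error handling is omitted.

```ruby
CallNode    = Struct.new(:name, :arg_exprs)
IntegerNode = Struct.new(:value)

# Tokens are assumed to be simple [type, value] pairs.
def parse_expr(tokens)
  if tokens[1] && tokens[1][0] == :oparen
    parse_call(tokens)
  else
    # Leaf case: an integer literal. Returns [node, remaining tokens].
    [IntegerNode.new(tokens[0][1].to_i), tokens[1..]]
  end
end

def parse_call(tokens)
  name = tokens[0][1]
  args, rest = parse_arg_exprs(tokens[2..]) # skip identifier and "("
  [CallNode.new(name, args), rest]
end

def parse_arg_exprs(tokens)
  args = []
  until tokens[0][0] == :cparen
    expr, tokens = parse_expr(tokens)  # back into parse_expr: the recursion
    tokens = tokens[1..] if tokens[0] && tokens[0][0] == :comma
    args << expr
  end
  [args, tokens[1..]] # consume ")"
end
```

Each nested call in the source adds one more loop around this cycle, which is why arbitrarily deep nesting falls out of the recursion for free.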
826
00:34:32.930 --> 00:34:34.960
And we have a corresponding structure

827
00:34:34.960 --> 00:34:37.920
inside of the generator class right here

828
00:34:37.920 --> 00:34:39.630
where it's simpler this time

829
00:34:39.630 --> 00:34:41.660
but the generate function keeps calling itself

830
00:34:41.660 --> 00:34:42.840
over and over and over again.

831
00:34:42.840 --> 00:34:46.330
So when we have a Def we call generate on the body,

832
00:34:46.330 --> 00:34:47.500
and if that's a call

833
00:34:47.500 --> 00:34:50.120
then we're going to call generate on each expression

834
00:34:50.120 --> 00:34:51.490
that is an argument.

835
00:34:51.490 --> 00:34:53.440
So there's a matching recursion

836
00:34:53.440 --> 00:34:55.170
in the parser and the code generator.

837
00:34:55.170 --> 00:34:56.330
And that's because the parser

838
00:34:56.330 --> 00:34:58.519
is generating a recursive tree structure

839
00:34:58.519 --> 00:35:02.640
that the generator is then crawling over recursively.

840
00:35:02.640 --> 00:35:04.610
So now let's move on to actually run

841
00:35:04.610 --> 00:35:06.140
our generated JavaScript code.

842
00:35:06.140 --> 00:35:08.520
First of all, I'll get rid of all the debug printing.

843
00:35:08.520 --> 00:35:10.510
So we are only printing the JavaScript.

844
00:35:10.510 --> 00:35:13.080
I'm going to add a little runtime here

845
00:35:13.080 --> 00:35:16.050
which is just going to be a function add of x, y

846
00:35:16.050 --> 00:35:18.530
that returns x plus y.

847
00:35:18.530 --> 00:35:21.080
And the reason I'm doing this is because

848
00:35:21.080 --> 00:35:22.890
currently as our language stands

849
00:35:22.890 --> 00:35:25.560
it has no way to touch anything outside of itself.

850
00:35:25.560 --> 00:35:27.830
It can't interact with the outside world

851
00:35:27.830 --> 00:35:29.750
if this were a real programming language

852
00:35:29.750 --> 00:35:31.670
with infix operators and so on

853
00:35:31.670 --> 00:35:34.130
we would be able to say one plus one or x plus y

854
00:35:34.130 --> 00:35:36.310
because this stuff would be built in.

855
00:35:36.310 --> 00:35:37.840
But we don't even have operators.

856
00:35:37.840 --> 00:35:42.150
So instead, I'm just building in the function add of x, y

857
00:35:42.150 --> 00:35:43.990
and I'll do that by joining the runtime

858
00:35:43.990 --> 00:35:45.840
and the generated code together

859
00:35:45.840 --> 00:35:49.220
so that we have a single blob of JavaScript code

860
00:35:49.220 --> 00:35:50.211
that includes our little helper function

861
00:35:50.211 --> 00:35:53.150
and the compiled output.

862
00:35:53.150 --> 00:35:54.350
I'm gonna add one other thing,

863
00:35:54.350 --> 00:35:56.400
which is a little sort of test case,

864
00:35:56.400 --> 00:35:58.900
which is just going to do console.log.

865
00:35:58.900 --> 00:36:01.330
And it's going to call our function f

866
00:36:01.330 --> 00:36:05.210
on the arguments one and two, and that's just to make

867
00:36:05.210 --> 00:36:07.580
sure we can actually see some output and see what it's doing.

868
00:36:07.580 --> 00:36:11.430
So that just gets appended to the generated JavaScript code.

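Putting those pieces together might look like this. The runtime and test-case strings follow the transcript; the generated string here is a hand-written stand-in for the generator's real output.

```ruby
# A tiny JavaScript "runtime" our language can call into, plus a test
# case appended after the compiled code, as described in the video.
RUNTIME = "function add(x, y) { return x + y };"
TEST    = "console.log(f(1, 2));"

# Stand-in for the generator's output for: def f(x, y) add(x, y) end
generated = "function f(x, y) { return add(x, y) };"

# One blob of JavaScript: helper function, compiled output, test call.
puts [RUNTIME, generated, TEST].join("\n")
```

Piping the printed blob into node (for example `ruby compiler.rb | node`, where the filename is a guess) then prints 3.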
869
00:36:11.430 --> 00:36:14.500
And finally, I'm going to change this to do add of x y.

870
00:36:14.500 --> 00:36:17.320
So right now, it's just turning the arguments around.

871
00:36:17.320 --> 00:36:19.510
And of course we would expect to get three

872
00:36:19.510 --> 00:36:22.090
because we're passing in one and two

873
00:36:22.090 --> 00:36:24.070
in our little test case here.

874
00:36:24.070 --> 00:36:25.830
So there's the JavaScript code.

875
00:36:25.830 --> 00:36:29.150
And now if I run this and pipe it into node

876
00:36:29.150 --> 00:36:31.484
we get undefined. Great job, me.

877
00:36:31.484 --> 00:36:35.010
That is because I didn't add a return inside this add

878
00:36:35.010 --> 00:36:38.390
because my brain is in Ruby mode, not JavaScript mode.

879
00:36:38.390 --> 00:36:40.860
So running once again, that looks right.

880
00:36:40.860 --> 00:36:43.970
And then if I pipe it into node, there's the three.

881
00:36:43.970 --> 00:36:47.010
So we have successfully parsed our input code

882
00:36:47.010 --> 00:36:50.810
compiled it down into JavaScript and run it inside of node.

883
00:36:50.810 --> 00:36:53.000
Now, I want to add a couple more complications

884
00:36:53.000 --> 00:36:55.180
just to make it clear that this is really working.

885
00:36:55.180 --> 00:36:57.080
So let's add the literal 10

886
00:36:57.080 --> 00:36:59.200
to the result of adding x and y.

887
00:36:59.200 --> 00:37:01.630
And if we run that we get 13

888
00:37:01.630 --> 00:37:04.500
and just one more time to make sure it works arbitrarily.

889
00:37:04.500 --> 00:37:07.400
So now we have three nested levels of function calls

890
00:37:07.400 --> 00:37:09.170
and actually that was wrong.

891
00:37:09.170 --> 00:37:10.610
We want an add around this.

892
00:37:10.610 --> 00:37:15.480
So it's add 100 to add 10 to add x and y.

893
00:37:15.480 --> 00:37:20.010
And now if I run this, I expect 113 and there it is.

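As a sanity check on those expected values, here is the arithmetic in Ruby (a stand-in for the JavaScript add runtime; the literal 100 in the deeper nesting is inferred from the expected result of 113):

```ruby
# Ruby stand-in for the JavaScript runtime function, just to check the sums.
def add(x, y)
  x + y
end

puts add(10, add(1, 2))            # => 13, the single-nesting example
puts add(100, add(10, add(1, 2)))  # => 113, the deeper nesting
```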
894
00:37:20.010 --> 00:37:22.310
So we are successfully compiling

895
00:37:22.310 --> 00:37:24.210
arbitrarily nested function calls

896
00:37:24.210 --> 00:37:26.153
and our single function definition

897
00:37:26.153 --> 00:37:29.200
into JavaScript and executing it.

898
00:37:29.200 --> 00:37:31.050
Now, of course, in a real world compiler

899
00:37:31.050 --> 00:37:32.940
there's a lot of extra complexity here.

900
00:37:32.940 --> 00:37:35.290
Like there's going to be an optimization stage.

901
00:37:35.290 --> 00:37:37.250
There may be an intermediate representation

902
00:37:37.250 --> 00:37:40.020
where there are, in a sense, multiple

903
00:37:40.020 --> 00:37:41.700
compilers involved.

904
00:37:41.700 --> 00:37:44.290
And of course the list of tokens is going to be a lot longer

905
00:37:44.290 --> 00:37:46.270
than this in any real-world programming language.

906
00:37:46.270 --> 00:37:48.610
And the parser is going to be a lot bigger

907
00:37:48.610 --> 00:37:50.730
than this one that we've written.

908
00:37:50.730 --> 00:37:53.810
But the basic ideas for tokenizing, parsing

909
00:37:53.810 --> 00:37:56.880
and code generation are exactly what you see here.

910
00:37:56.880 --> 00:38:00.770
There's not a whole lot of fundamental complexity

911
00:38:00.770 --> 00:38:01.870
beyond what we've done here.

912
00:38:01.870 --> 00:38:04.990
Just more token types, more types of parse nodes,

913
00:38:04.990 --> 00:38:09.760
a bigger parser, and a more complex target language.

914
00:38:09.760 --> 00:38:10.593
So that is it.

915
00:38:10.593 --> 00:38:14.723
We have a working compiler and I will see you next time.

