3. An Outline of What Worgle does
This section gives a broad overview of how Orgle (and Worgle) work. Orgle is a bootstrap program written in C, used to generate C code for Worgle (this program here). At the highest level, the two programs share the same basic program structure.
3.1. Initialization
3.1.1. Initialize worgle data
The Worgle data is initialized before any files are loaded.
worgle_d worg;
worgle_init(&worg);
3.1.2. Get and set filename
The filename is currently acquired from the command line, so the program must check that the right number of arguments has been supplied. If not, it returns an error.
char *filename;
filename = NULL;
if(argc < 2) {
fprintf(stderr, "Usage: %s filename.org\n", argv[0]);
return 1;
}
<<parse_cli_args>>
<<check_filename>>
Check the filename. If the filename was not set by the command line, return an error.
if(filename == NULL) {
fprintf(stderr, "No filename specified\n");
return 1;
}
3.1.3. Initialize return codes
The main return code determines the overall state of the program.
int rc;
By default, it is set to be okay, which is 0 on POSIX systems.
rc = 0;
3.2. Load file into memory
The first thing the program will do is load the file.
While most parsers tend to parse things on a line-by-line basis via a file stream, this parser will load the entire file into memory. This is done due to the textual nature of the program. It is much easier to simply allocate everything in one big block and reference chunks than to allocate smaller chunks as you go.
3.2.1. Loadfile function
for(i = 0; i < worg.nbuffers; i++) {
rc = loadfile(&worg, i);
if(!rc) {
rc = 1; /* loadfile returns 0 on failure; exit with a non-zero code */
goto cleanup;
}
}
A file is loaded into a text buffer via the function loadfile. In the worg startup sequence, the buffer list
has been preallocated with the filename after parsing the
command line arguments (see <file
.
On success, the function will return TRUE (1). On failure, FALSE (0).
static int loadfile(worgle_d *worg, int file);
static int loadfile(worgle_d *worg, int file)
{
<<loadfile_localvars>>
<<loadfile>>
return 1;
}
3.2.2. Open file
The file is opened with a local file handle fp.
FILE *fp;
char *filename;
worgle_textbuf *txt;
txt = &worg->buffers[file];
filename = txt->filename.str;
fp = fopen(filename, "r");
if(fp == NULL) {
fprintf(stderr, "Could not find file %s\n", filename);
return 0; /* FALSE: loadfile reports failure as 0 */
}
3.2.3. Get file size
The size is acquired by going to the end of the file and getting the current file position.
size_t size;
fseek(fp, 0, SEEK_END);
size = ftell(fp);
3.2.4. Allocate memory, read, and close
Memory is allocated in a local buffer variable via calloc. The buffer is then stored inside of the worg struct.
char *buf;
buf = calloc(1, size);
worgle_textbuf_init(&worg->buffers[file], buf, size);
The file is rewound back to the beginning and then read into the buffer. The file is no longer needed at this point, so it is closed.
fseek(fp, 0, SEEK_SET);
fread(buf, size, 1, fp);
fclose(fp);
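The load steps above can be gathered into one defensive routine. The following is only a sketch under stated assumptions: slurp_file is a hypothetical helper that is not part of Worgle, it opens the file in binary mode so that the size reported by ftell matches the number of bytes read, and it allocates one extra byte so that an empty file still yields a valid block. Unlike the chunks above, it checks the results of fseek, ftell, calloc, and fread.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not part of Worgle: reads a whole file into a
 * freshly allocated buffer. Returns 1 on success, 0 on failure. */
static int slurp_file(const char *filename, char **out, size_t *out_size)
{
    FILE *fp;
    long end;
    char *buf;
    size_t size;

    *out = NULL;
    *out_size = 0;

    fp = fopen(filename, "rb");
    if (fp == NULL) return 0;

    /* Seek to the end to learn the size; ftell reports -1 on error. */
    if (fseek(fp, 0, SEEK_END) != 0 || (end = ftell(fp)) < 0) {
        fclose(fp);
        return 0;
    }
    size = (size_t)end;

    /* One extra zeroed byte keeps the allocation valid for empty files. */
    buf = calloc(1, size + 1);
    if (buf == NULL) {
        fclose(fp);
        return 0;
    }

    /* Rewind and read everything in one call. */
    rewind(fp);
    if (size > 0 && fread(buf, 1, size, fp) != size) {
        free(buf);
        fclose(fp);
        return 0;
    }

    fclose(fp);
    *out = buf;
    *out_size = size;
    return 1;
}
```

The caller owns the returned buffer and is expected to release it with free once parsing and generation are finished.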
3.3. Parsing
3.3.1. Top Level Parsing Function
The second phase of the program is the parsing stage.
The parsing stage will parse files line-by-line. The program will find a line by skimming through the block up to a line break character, then pass that off to be parsed. Line by line, the parser will read the program and produce a structure of the tangled code in memory.
Parsing is done via the function parse_file.
int i;
for (i = 0; i < worg.nbuffers; i++) {
rc = parse_file(&worg, i);
if (rc) goto cleanup;
}
<<flush_last_block>>
The parse_file function will parse a file whose filename is located in the index position denoted by file.
int parse_file(worgle_d *worg, int file);
int parse_file(worgle_d *worg, int file)
{
char *buf;
size_t size;
worgle_textbuf *curbuf;
<<parser_local_variables>>
curbuf = &worg->buffers[file];
buf = curbuf->buf;
size = curbuf->size;
worg->curbuf = curbuf;
#ifndef WORGLITE
worg->curorg = &worg->orgs[file];
if (file > 0) {
worg->curorg->prev = &worg->orgs[file - 1];
} else {
worg->curorg->prev = NULL;
}
#endif
<<parser_initialization>>
while (1) {
<<getline>>
if(mode == MODE_ORG) {
<<parse_mode_org>>
} else if(mode == MODE_CODE) {
<<parse_mode_code>>
} else if(mode == MODE_BEGINCODE) {
<<parse_mode_begincode>>
}
}
return rc;
}
3.3.2. Parsing Modes
The parser is implemented as a relatively simple state machine, whose behavior shifts between parsing org-mode markup (MODE_ORG) and code blocks (MODE_BEGINCODE and MODE_CODE). The state machine makes a distinction between the start of a new code block (MODE_BEGINCODE), which provides information like the name of the code block and optionally the name of the file to tangle to, and the code block itself (MODE_CODE).
enum {
<<parse_modes>>
};
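To make the mode transitions concrete, here is a small sketch of the same three-state idea in isolation. Everything in it is an assumption made for illustration: the SK_ mode names, starts_with, and count_named_blocks are hypothetical and do not appear in Worgle. It only counts named blocks rather than storing them.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative mode names; the real enum is assembled from <<parse_modes>>. */
enum { SK_ORG, SK_BEGINCODE, SK_CODE };

/* Returns 1 if the line of length len starts with prefix. */
static int starts_with(const char *line, size_t len, const char *prefix)
{
    size_t n = strlen(prefix);
    return len >= n && strncmp(line, prefix, n) == 0;
}

/* Hypothetical helper: walks a buffer line by line and counts the named
 * code blocks that were opened and closed. Returns -1 if a #+NAME: line
 * is not immediately followed by #+BEGIN_SRC. */
static int count_named_blocks(const char *buf, size_t size)
{
    int mode = SK_ORG;
    int nblocks = 0;
    size_t pos = 0;

    while (pos < size) {
        const char *line = buf + pos;
        const char *nl = memchr(line, '\n', size - pos);
        size_t len;

        if (nl == NULL) break; /* an unterminated last line is ignored */
        len = (size_t)(nl - line) + 1;
        pos += len;

        if (mode == SK_ORG) {
            /* Org markup: only a name tag changes the state. */
            if (starts_with(line, len, "#+NAME:")) mode = SK_BEGINCODE;
        } else if (mode == SK_BEGINCODE) {
            /* The line after a name tag must open the source block. */
            if (!starts_with(line, len, "#+BEGIN_SRC")) return -1;
            mode = SK_CODE;
        } else {
            /* Inside code: wait for the end marker. */
            if (starts_with(line, len, "#+END_SRC")) {
                mode = SK_ORG;
                nblocks++;
            }
        }
    }
    return nblocks;
}
```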
3.3.2.1. MODE_ORG
MODE_ORG,
3.3.2.1.1. Org Parse Top
When the parser state is set to be in MODE_ORG, this is what happens.
#ifndef WORGLITE
if (generate_db) {
<<parse_headers>>
}
#endif
<<find_next_named_block>>
#ifndef WORGLITE
if (generate_db) {
<<parse_content>>
}
#endif
3.3.2.1.2. Finding the next named block
When the parser is in MODE_ORG, it is mostly searching for the start of the next named block. When it finds a match, it extracts the name, gets ready to begin a new block, and changes the mode to MODE_BEGINCODE.
A common hard-to-find error happens when a colon is forgotten in the NAME tag. A special check occurs here to make sure that the colon isn't forgotten.
if(read >= 7) {
if(!strncmp(line, "#+NAME", 6)) {
#ifndef WORGLITE
if (generate_db) {
<<append_content_before_code>>
}
#endif
if(line[6] != ':') {
fprintf(stderr,
"line %lu: expected ':'\n",
worg->linum);
rc = 1;
break;
}
mode = MODE_BEGINCODE;
parse_name(line, read, &str);
worgle_begin_block(worg, &str);
#ifndef WORGLITE
continue;
#endif
}
}
3.3.2.1.3. Extracting information from #+NAME
Name extraction of the current line is done with a function called parse_name.
static int parse_name(char *line, size_t len, worgle_string *str);
static int parse_name(char *line, size_t len, worgle_string *str)
{
size_t n;
size_t pos;
int mode;
line+=7;
len-=7;
/* *namelen = 0; */
str->size = 0;
str->str = NULL;
if(len <= 0) return 1;
pos = 0;
mode = 0;
for(n = 0; n < len; n++) {
if(mode == 2) break;
switch(mode) {
case 0:
if(line[n] == ' ') {
} else {
str->str = &line[n];
str->size++;
pos++;
mode = 1;
}
break;
case 1:
if(line[n] == 0xa) {
mode = 2;
break;
}
pos++;
str->size++;
break;
default:
break;
}
}
/* *namelen = pos; */
return 1;
}
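The effect of the name extraction is easiest to see on a concrete line. The sketch below restates the idea with hypothetical names (name_slice stands in for worgle_string, and extract_name is not a Worgle function). It is a simplified stand-in rather than the tangled code, and as a choice of this sketch an empty name produces an empty slice. Note that the result points into the original line; nothing is copied.

```c
#include <stddef.h>

/* Minimal stand-in for worgle_string: a pointer plus a length. */
typedef struct {
    char *str;
    size_t size;
} name_slice;

/* Hypothetical restatement of the name extraction: skip the 7 characters
 * of "#+NAME:", skip leading spaces, then take everything up to the
 * line break. */
static void extract_name(char *line, size_t len, name_slice *out)
{
    size_t n;

    out->str = NULL;
    out->size = 0;
    if (len <= 7) return;

    n = 7;
    while (n < len && line[n] == ' ') n++;
    if (n == len || line[n] == '\n') return; /* no name on this line */

    out->str = &line[n];
    while (n < len && line[n] != '\n') {
        out->size++;
        n++;
    }
}
```

Given the line "#+NAME:   parse_modes" followed by a line break, the slice starts at the p of parse_modes and has a size of 11.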
3.3.2.1.4. Beginning a new block
A new code block is started with the function worgle_begin_block.
void worgle_begin_block(worgle_d *worg, worgle_string *name);
When a new block begins, the current block in Worgle is set to be a value retrieved from the block dictionary.
void worgle_begin_block(worgle_d *worg, worgle_string *name)
{
worg->curblock = worgle_hashmap_get(&worg->dict, name);
<<worgle_block_set_id>>
<<increment_nblocks>>
#ifndef WORGLITE
<<append_code_reference>>
#endif
}
3.3.2.1.5. DONE Parsing Header Information
CLOSED: [2019-09-12 Thu 07:10]
A valid header in org mode starts with one or more asterisks, followed by a space. Anything after this space is considered to be the name of the header. The number of asterisks indicates the header level.
If indeed the line is a header, both the header name and level are appended to the currently opened org file.
A quick sanity check is done before the header is parsed via parse_header.
if (read >= 2) {
if (parse_header(worg, line, read)) {
continue;
}
}
The actual parsing logic happens in the function parse_header.
#ifndef WORGLITE
static int parse_header(worgle_d *worg,
char *line,
size_t len);
#endif
#ifndef WORGLITE
static int parse_header(worgle_d *worg,
char *line,
size_t len)
{
int mode;
int rc;
size_t s;
char *header;
worgle_string str;
int lvl;
mode = 0;
if(line[0] != '*') return 0;
rc = 0;
worgle_string_init(&str);
lvl = 1;
for (s = 1; s < len; s++) {
if (mode == 2) break;
switch (mode) {
case 0:
if (line[s] == '*') {
lvl++;
} else if (line[s] == ' '){
mode = 1;
} else {
mode = 2;
rc = 0;
}
break;
case 1:
rc = 1;
mode = 2;
header = &line[s];
str.str = header;
str.size = len - s;
str.size -= line[len - 1] == '\n';
<<append_content_before_header>>
worgle_orgfile_append_header(worg,
&str,
lvl);
<<set_content_flag_after_header>>
break;
}
}
return rc;
}
#endif
3.3.2.1.6. DONE Content Parsing
CLOSED: [2019-12-10 Tue 20:26]
In between headers and code blocks are things called content. It is assumed to be text like this, but it can also contain comments and commands that worgle doesn't yet understand.
Content parsing happens in MODE_ORG, and is the fallback option when no other pattern is picked up. When it reaches that point, the parser will take the current line and append it to the content block.
Appending content to the content block is a matter of extending the size of the block (text is mapped to a contiguous memory block).
#ifndef WORGLITE
<<setup_new_content_block>>
worg->segblock.size += read;
#endif
When a content block is started, the block variable must be reset. A content block starts in two circumstances: whenever a new header is found, or whenever content is found immediately after a code block ends.
The solution is to have a flag that is set any time a new content block has the potential to be started. The next time the parser arrives at a line that is considered to be content, it will check this flag and, if it is set, reset the block so that it starts at that line.
if (worg->new_content) {
worg->new_content = 0;
worgle_string_reset(&worg->segblock);
worg->segblock.str = line;
}
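Because every line lives in the same loaded buffer, a content block never copies text: it is a start pointer plus a running size. The sketch below isolates that idea. The names span and span_add_line are hypothetical and only loosely mirror segblock and the new_content flag.

```c
#include <stddef.h>

/* Hypothetical stand-in for the content block: a view into the buffer. */
typedef struct {
    const char *str;
    size_t size;
} span;

/* Adds one line to the span. If *is_new is set, the span is first
 * restarted at this line. Lines must be adjacent in memory for the
 * resulting span to cover exactly the lines that were added. */
static void span_add_line(span *s, int *is_new, const char *line, size_t len)
{
    if (*is_new) {
        *is_new = 0;
        s->str = line;
        s->size = 0;
    }
    s->size += len;
}
```

Feeding it the two adjacent lines of a buffer holding "one" and "two", each with its line break, leaves a span that starts at the beginning of the buffer and has a size of 8.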
The new_content flag is set at startup. It is also set when a code block ends, or after a header.
worg->new_content = 1;
worg->new_content = 1;
A content block is considered finished when a code block or new header section is reached, or if a document has ended(?)
No WORGLITE macro magic or generate_db conditionals are needed to append a content block before a header. At this level, it is already assumed.
worgle_orgfile_append_content(worg, &worg->segblock);
worgle_string_reset(&worg->segblock);
A content block should be appended before a code block starts, which is when a code reference is appended.
worgle_orgfile_append_content(worg, &worg->segblock);
worgle_string_reset(&worg->segblock);
Any remaining blocks at the end of all parsing will be appended too. Not sure where this logic will go yet.
At the end of all parsing, the last block must be flushed out.
#ifndef WORGLITE
if (generate_db) {
worgle_orgfile_append_content(&worg, &worg.segblock);
}
#endif
3.3.2.1.7. DONE Code Reference
CLOSED: [2019-12-10 Tue 20:26]
Anytime a new code block begins, a reference to this new block is stored in the data representation of the file. This should happen when a new block begins. Probably in worgle_begin_block.
worgle_orgfile_append_reference(worg, worg->curblock);
3.3.2.2. MODE_BEGINCODE
MODE_BEGINCODE,
A parser set to mode MODE_BEGINCODE is only interested in finding the beginning of the block. If it doesn't find it, it reports a syntax error. If it does, it goes on to extract a potential new filename to tangle, which then gets appended to the Worgle file list.
if (read >= 11) {
if(!strncmp (line, "#+BEGIN_SRC",11)) {
<<begin_the_code>>
if (parse_begin(line, read, &str) == 2) {
worgle_append_file(worg, &str);
}
continue;
} else {
fwrite(line, read, 1, stderr);
fprintf(stderr,
"line %lu: Expected #+BEGIN_SRC\n",
worg->linum);
rc = 1;
break;
}
}
fprintf(stderr,
"line %lu: Expected #+BEGIN_SRC\n",
worg->linum);
rc = 1;
break; /* stop parsing, as the branch above does for the same error */
3.3.2.2.1. Extracting information from #+BEGIN_SRC
The begin source flag in org-mode can have a number of options, but the only one we really care about for this tangler is the ":tangle" option.
static int parse_begin(char *line, size_t len, worgle_string *str);
The state machine begins right after the BEGIN_SRC declaration, which is why the string is offset by 11.
The state machine for this parser is linear, and has 5 modes:
- mode 0: Skip whitespace after BEGIN_SRC
- mode 1: Find ":tangle" pattern
- mode 2: Ignore immediate whitespace after "tangle", and begin getting filename
- mode 3: Get filename size by reading up to the next space or line break
- mode 4: Don't do anything, wait for line to end.
static int parse_begin(char *line, size_t len, worgle_string *str)
{
size_t n;
int mode;
int rc;
line += 11;
len -= 11;
if(len <= 0) return 0;
mode = 0;
n = 0;
rc = 1;
str->str = NULL;
str->size = 0;
while(n < len) {
switch(mode) {
case 0: /* initial spaces after BEGIN_SRC */
if(line[n] == ' ') {
n++;
} else {
mode = 1;
}
break;
case 1: /* look for :tangle */
if(line[n] == ' ') {
mode = 0;
n++;
} else {
if(line[n] == ':') {
if(!strncmp(line + n + 1, "tangle", 6)) {
n+=7;
mode = 2;
rc = 2;
}
}
n++;
}
break;
case 2: /* save file name, spaces after tangle */
if(line[n] != ' ') {
str->str = &line[n];
str->size++;
mode = 3;
}
n++;
break;
case 3: /* read up to next space or line break */
if(line[n] == ' ' || line[n] == '\n') {
mode = 4;
} else {
str->size++;
}
n++;
break;
case 4: /* countdown til end */
n++;
break;
}
}
return rc;
}
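For comparison, the same extraction can be written as a plain scan instead of a five-mode state machine. This is a sketch under assumptions, not Worgle's implementation: tangle_slice and find_tangle_target are hypothetical names, and unlike parse_begin the function returns 1 only when a non-empty filename follows ":tangle".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for worgle_string. */
typedef struct {
    const char *str;
    size_t size;
} tangle_slice;

/* Looks for ":tangle" anywhere in the line, skips the spaces after it,
 * and reports the filename up to the next space or line break.
 * Returns 1 if a non-empty filename was found, 0 otherwise. */
static int find_tangle_target(const char *line, size_t len, tangle_slice *out)
{
    size_t n;
    size_t start;

    out->str = NULL;
    out->size = 0;

    /* Find the option itself (7 characters including the colon). */
    for (n = 0; n + 7 <= len; n++) {
        if (memcmp(line + n, ":tangle", 7) == 0) break;
    }
    if (n + 7 > len) return 0;
    n += 7;

    /* Skip the whitespace between the option and its value. */
    while (n < len && line[n] == ' ') n++;

    /* Measure the filename. */
    start = n;
    while (n < len && line[n] != ' ' && line[n] != '\n') n++;
    if (n == start) return 0;

    out->str = line + start;
    out->size = n - start;
    return 1;
}
```

The state machine version has the advantage of touching each character exactly once and never looking ahead, which is in keeping with the rest of the parser.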
3.3.2.2.2. Setting up code for a new read
When a new code block has indeed been found, the mode is switched to MODE_CODE, and the block_started boolean flag gets set. In addition, the string used to keep track of the new block is reset.
mode = MODE_CODE;
worg->block_started = 1;
worgle_string_reset(&worg->block);
3.3.2.2.3. Appending a new file
If a new file is found, the filename gets appended to the file list via the function worgle_append_file.
void worgle_append_file(worgle_d *worg, worgle_string *filename);
void worgle_append_file(worgle_d *worg, worgle_string *filename)
{
worgle_file *f;
f = worgle_filelist_append(&worg->flist, filename, worg->curblock);
<<worgle_file_set_id>>
}
3.3.2.3. MODE_CODE
MODE_CODE
In MODE_CODE, actual code is parsed inside of the code block. The parser will keep reading chunks of code until one of two things happens: a code reference is found, or the END_SRC command is found.
if(read >= 9) {
if(!strncmp(line, "#+END_SRC", 9)) {
mode = MODE_ORG;
worg->block_started = 0;
worgle_append_string(worg);
#ifndef WORGLITE
<<set_content_flag_after_block>>
#endif
continue;
}
}
if(check_for_reference(line, read, &str)) {
worgle_append_string(worg);
worgle_append_reference(worg, &str);
worg->block_started = 1;
worgle_string_reset(&worg->block);
continue;
}
worg->block.size += read;
if(worg->block_started) {
worg->block.str = line;
worg->block_started = 0;
worg->curline = worg->linum;
}
void worgle_append_string(worgle_d *worg);
In this function, the currently active string block is appended to the currently active code block. It is called when the parser is inside a code block (aka MODE_CODE).
The current line number is checked to make sure it is a valid (non-negative) value. A negative value indicates a properly initialized, but unset value. This will happen if the initial code block begins with a reference. A negative value will cause invalid line declarations in the generated code.
In some cases, Worgle will try to append an empty string to a block. While harmless for tangling, this can cause issues when doing metadata export. Empty strings will be ignored.
void worgle_append_string(worgle_d *worg)
{
worgle_segment *seg;
if (worg->curblock == NULL) return;
if (worg->curline < 0) return;
if (worg->block.size == 0) return;
seg = worgle_block_append_string(worg->curblock,
&worg->block,
worg->curline,
&worg->curbuf->filename);
<<worgle_segment_string_set_id>>
<<store_last_string_id>>
}
void worgle_append_reference(worgle_d *worg, worgle_string *ref);
void worgle_append_reference(worgle_d *worg, worgle_string *ref)
{
worgle_segment *seg;
if(worg->curblock == NULL) return;
seg = worgle_block_append_reference(worg->curblock,
ref,
worg->linum,
&worg->curbuf->filename);
<<worgle_segment_reference_set_id>>
<<store_last_reference_id>>
}
static int check_for_reference(char *line , size_t size, worgle_string *str);
static int check_for_reference(char *line , size_t size, worgle_string *str)
{
int mode;
size_t n;
mode = 0;
str->size = 0;
str->str = NULL;
for(n = 0; n < size; n++) {
if(mode < 0) break;
switch(mode) {
case 0: /* spaces */
if(line[n] == ' ') continue;
else if(line[n] == '<') mode = 1;
else mode = -1;
break;
case 1: /* second < */
if(line[n] == '<') mode = 2;
else mode = -1;
break;
case 2: /* word setup */
str->str = &line[n];
str->size++;
mode = 3;
break;
case 3: /* the word */
if(line[n] == '>') {
mode = 4;
break;
}
str->size++;
break;
case 4: /* last > */
if(line[n] == '>') mode = 5;
else mode = -1;
break;
}
}
return (mode == 5);
}
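A few sample lines show what counts as a reference. The sketch below is a compact restatement written for illustration only: ref_slice and match_reference are hypothetical names. It accepts optional leading spaces, then two opening angle brackets, a non-empty name, and two closing angle brackets; anything after that is ignored.

```c
#include <stddef.h>

/* Hypothetical stand-in for worgle_string. */
typedef struct {
    const char *str;
    size_t size;
} ref_slice;

/* Returns 1 and fills out with the referenced name if the line holds a
 * code reference, 0 otherwise. */
static int match_reference(const char *line, size_t len, ref_slice *out)
{
    size_t n = 0;
    size_t start;

    out->str = NULL;
    out->size = 0;

    /* Leading spaces are allowed before the reference. */
    while (n < len && line[n] == ' ') n++;

    /* Two opening angle brackets. */
    if (n + 1 >= len || line[n] != '<' || line[n + 1] != '<') return 0;
    n += 2;

    /* The name runs up to the first closing bracket. */
    start = n;
    while (n < len && line[n] != '>') n++;
    if (n == start) return 0;

    /* Two closing angle brackets. */
    if (n + 1 >= len || line[n + 1] != '>') return 0;

    out->str = line + start;
    out->size = n - start;
    return 1;
}
```

An indented reference to parse_modes matches and yields an 11-character name, while a C statement such as "int x = a << 2;" does not, because its first non-space character is not an angle bracket.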
3.3.3. Parser Local Variables
The parsing stage requires a local variable called str to be used from time to time. Not sure where else to put this.
worgle_string str;
worgle_string_init(&str);
line is a pointer that will be set to the start of the current line.
char *line;
line = NULL;
pos refers to the current buffer position.
size_t pos;
pos = 0;
read holds the size of the line most recently returned by worgle_getline, including the line break character.
size_t read;
The overall parser mode state is set by the local variable mode.
int mode;
It is set to be the initial mode of MODE_ORG.
mode = MODE_ORG;
The main return code determines the overall state of the program.
int rc;
By default, it is set to be okay, which is 0 on POSIX systems.
rc = 0;
The getline function used by the parser returns a status code, which tells the program when it has reached the end of the file.
int status;
This is initialized to 0. The value is overwritten by worgle_getline before it is first checked.
status = 0;
3.3.4. Reading a line at a time
Despite being loaded into memory, the program still reads in code one line at a time. The parsing relies on line feeds to denote the beginnings and endings of sections and code references.
Before reading the line, the line number inside worgle is incremented. In order to handle multiple files, this value must explicitly be reset to zero every time inside of the parse_file function.
worg->linum = 0;
A special readline function has been written based on getline that reads lines of text from an allocated block of text. This function is called worgle_getline.
After the line has been read, the program checks the return code status. If all the lines of text have been read, the program breaks out of the while loop.
worg->linum++;
status = worgle_getline(buf, &line, &pos, &read, size);
if(!status) break;
static int worgle_getline(char *fullbuf,
char **line,
size_t *pos,
size_t *line_size,
size_t buf_size);
fullbuf refers to the full text buffer.
line is a pointer where the current line will be stored.
pos is the current buffer position.
line_size is a variable written to that returns the size of the line. This includes the line break character.
buf_size is the size of the whole buffer.
static int worgle_getline(char *fullbuf,
char **line,
size_t *pos,
size_t *line_size,
size_t buf_size)
{
size_t p;
size_t s;
*line_size = 0;
p = *pos;
*line = &fullbuf[p];
s = 0;
while(1) {
s++;
if(p >= buf_size) return 0;
if(fullbuf[p] == '\n') {
*pos = p + 1;
*line_size = s;
return 1;
}
p++;
}
}
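One consequence of this contract is worth spelling out: a line is only reported once its line break has been seen, so text after the final line break in the buffer is never returned. The sketch below mirrors that behavior with hypothetical names (next_line and count_lines are not Worgle functions) so that it can be tried in isolation.

```c
#include <stddef.h>

/* Hypothetical stand-in that follows the contract described above:
 * returns 1 and reports the next line (size includes the line break),
 * or returns 0 once no complete line remains. */
static int next_line(const char *buf, size_t size, size_t *pos,
                     const char **line, size_t *line_size)
{
    size_t p = *pos;

    *line = buf + p;
    *line_size = 0;

    while (p < size) {
        if (buf[p] == '\n') {
            *line_size = p - *pos + 1;
            *pos = p + 1;
            return 1;
        }
        p++;
    }
    return 0;
}

/* Counts the complete lines in a buffer by calling next_line in a loop,
 * the same way the parser drives its own line reader. */
static size_t count_lines(const char *buf, size_t size)
{
    size_t pos = 0;
    size_t len = 0;
    size_t n = 0;
    const char *line;

    while (next_line(buf, size, &pos, &line, &len)) n++;
    return n;
}
```

With a 5-byte buffer holding "a" and "bb", each followed by a line break, count_lines reports 2. If the final line break is missing, it reports 1, which is a reason to end org files with a line break.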
3.4. Generation
The last phase of the program is code generation.
A parsed file generates a structure of how the code will look. The generation stage involves iterating through the structure and producing the code.
Due to the hierarchical nature of the data structures, the generation stage is surprisingly elegant with a single expanding entry point.
At the very top, generation consists of writing all the files in the filelist. Each file will then go and write the top-most block associated with that file. A block will then write the segment list it has embedded inside of it. A segment will either write a string literal to disk, or recursively expand a block reference.
if(!rc && tangle_code) if(!worgle_generate(&worg)) rc = 1;
int worgle_generate(worgle_d *worg);
int worgle_generate(worgle_d *worg)
{
return worgle_filelist_write(&worg->flist, &worg->dict);
}
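The recursive expansion described above can be sketched with toy data structures. All names here (toy_segment, toy_block, toy_block_write) are hypothetical and much simpler than Worgle's own types: a block is an array of segments, and a segment is either literal text or a pointer to another block. The sketch assumes that references never form a cycle, since it performs no cycle detection.

```c
#include <stdio.h>
#include <stddef.h>

enum { SEG_TEXT, SEG_REF };

struct toy_block;

/* A segment is either literal text or a reference to another block. */
typedef struct {
    int type;
    const char *text;             /* used when type is SEG_TEXT */
    const struct toy_block *ref;  /* used when type is SEG_REF */
} toy_segment;

typedef struct toy_block {
    const toy_segment *segs;
    size_t nsegs;
} toy_block;

/* Writes a block by walking its segments in order. Text is written
 * as-is; a reference is expanded by recursing into the referenced block. */
static void toy_block_write(const toy_block *blk, FILE *fp)
{
    size_t i;

    for (i = 0; i < blk->nsegs; i++) {
        const toy_segment *seg = &blk->segs[i];
        if (seg->type == SEG_TEXT) {
            fputs(seg->text, fp);
        } else {
            toy_block_write(seg->ref, fp);
        }
    }
}
```

Writing a top-level block amounts to a single call to toy_block_write with the output file handle; every nested reference is expanded in place along the way.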
If the use_warnings flag is turned on, Worgle will scan the dictionary after generation and flag warnings about any unused blocks.
if(!rc && use_warnings) rc = worgle_warn_unused(&worg);
3.5. Cleanup
At the end of the program, all allocated memory is freed via worgle_free.
cleanup:
worgle_free(&worg);
return rc;