box.matto.nl

Generate odt files with awk

Some time ago the publisher stopped accepting plain text files and wanted "proper" office files. Fortunately, they consider odt-files as proper office files, so I don't have to use the crap from you-know-who from Redmond.

So I started to use pandoc to convert my writings to the odt format.

But of course it is much more fun to write a parser, so one day I set out to do so.

Elements to parse

The texts I send to the publisher have a very simple layout:

  • Level 1 header
  • Level 2 header
  • One or more paragraphs
  • Level 2 header
  • One or more paragraphs
  • et cetera

I write the texts with Vi or Vim. Vi is a very good and powerful editor, and Vim adds spell checking and word count on top of that.

I have been using Vimwiki for several years, so I adopted the Vimwiki-format, even for my awkiawki wiki.

In the Vimwiki-format, headers look like this:

= Level 1 header
== Level 2 header

And blank lines to separate paragraphs.

So all the parser has to do is scan line by line and check for lines starting with = or with ==, or for blank lines. At first sight awk looks like a great tool for this.

Awk has been around since 1977 and is a great programming language for working with text files. If you love the minimalistic approach, then you will love awk.
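As a rough sketch of that line-by-line scan (not the final script), the following shell function wraps a tiny awk program that only labels each line. The function name classify_lines and the labels are made up for illustration.

```shell
# classify_lines: hypothetical helper that labels every input line
# as a level 1 header, level 2 header, blank line, or plain text.
classify_lines() {
    awk '
        /^= /  { print "h1";    next }   # "= " at the start of the line
        /^== / { print "h2";    next }   # "== " at the start of the line
        /^$/   { print "blank"; next }   # empty line separates paragraphs
               { print "text" }          # everything else is paragraph text
    '
}

# Prints h1, h2, blank, text (one label per line).
printf '%s\n' '= Title' '== Section' '' 'Some text.' | classify_lines
```

Because the first pattern requires a space right after the single =, a line starting with == never matches it, so the order of the two header rules does not matter.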

Flat XML ODF text document format

Libreoffice can work with flat xml odf files. A flat xml odf file is a single xml-file that is not zipped. A normal odt file is a zip file containing six or more files, which is a lot of overkill for what we set out to do.

Minimal flat XML ODF file example

Before we can start building a parser, we need to know what the output format looks like. For this we set out to create a minimal flat XML ODF file that Libreoffice can still open.

I opened Libreoffice, created a document with a single line containing the famous 'Hello World' text and exported that as a flat odf file. This results in a huge xml-file with a lot of stuff in it that is probably not vital.

The next step was to reduce the contents to the simplest file that will still open correctly in Libreoffice.

After some trial and error this was the result:

<?xml version="1.0" encoding="UTF-8"?>

<office:document xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" office:mimetype="application/vnd.oasis.opendocument.text" xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" xmlns:dom="http://www.w3.org/2001/xml-events" xmlns:css3t="http://www.w3.org/TR/css3-text/">
 <office:meta><meta:creation-date>2016-01-16T11:06:47.753843480</meta:creation-date></office:meta>
 <office:body>
  <office:text>
   <text:p>Hello World</text:p>
  </office:text>
 </office:body>
</office:document>

We are breaking all the rules here: this file does not comply with the flat XML ODF format, but we don't care, as long as Libreoffice opens it correctly.

Create headers

The next step is to add some level 1 and level 2 headers.

Again I created a very simple document in Libreoffice and exported that to flat XML ODF to start with an example to minimize.

After some more trial and error this was the result. I only show the part in the office:text container; the rest is the same as above:

  <office:text>
   <text:h text:outline-level="1">Heading 1</text:h>
   <text:h text:outline-level="2">Heading 2</text:h>
   <text:p text:style-name="Standard">Hello World</text:p>
   <text:p text:style-name="Standard"/>
   <text:p text:style-name="Standard">Paragraph 2</text:p>
   <text:p text:style-name="Standard"/>
  </office:text>

So now we only have to create an awk script that will parse a flat text file and convert it to a flat XML ODF file according to this minimal format.

Awk script to parse a text file to flat XML ODF

This awk script will do the trick.

BEGIN {
    print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    print "<office:document xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\" office:mimetype=\"application/vnd.oasis.opendocument.text\" xmlns:style=\"urn:oasis:names:tc:opendocument:xmlns:style:1.0\" xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\" xmlns:dom=\"http://www.w3.org/2001/xml-events\"  xmlns:css3t=\"http://www.w3.org/TR/css3-text/\">";
    print " <office:meta><meta:creation-date>2016-01-16T11:06:47.753843480</meta:creation-date></office:meta>";
    print " <office:body>";
    print " <office:text>";
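    in_paragraph = 0;   # 1 while a <text:p> element is open
    blank_line = 1;     # start at 1 so text on the very first line opens a paragraph
}

# Level 1 header: close an open paragraph first, then print the heading
/^= /{   if (in_paragraph == 1) {
        print "</text:p><text:p text:style-name=\"Standard\"/>";
        in_paragraph = 0;
    }
    blank_line = 1;     # text right after a header starts a new paragraph
    $0 = "<text:h text:outline-level=\"1\">" substr($0, 2) "</text:h>"; print; next;
}
# Level 2 header: same as above, one level deeper
/^== /{  if (in_paragraph == 1) {
        print "</text:p><text:p text:style-name=\"Standard\"/>";
        in_paragraph = 0;
    }
    blank_line = 1;     # text right after a header starts a new paragraph
    $0 = "<text:h text:outline-level=\"2\">" substr($0, 3) "</text:h>"; print; next;
}
# Blank line: only remember it; the next text line starts a new paragraph
/^$/ { blank_line = 1; next; }

# Any other line is paragraph text
{
    if (blank_line == 1) {
        if (in_paragraph == 1) {
            print "</text:p><text:p text:style-name=\"Standard\"/>";
        }
        print "<text:p text:style-name=\"Standard\">";
        in_paragraph = 1;
        blank_line = 0;
    }
    print;
}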

END {

    if (in_paragraph == 1) {
        print "</text:p><text:p text:style-name=\"Standard\"/>";
    }
    print "  </office:text>";
    print " </office:body>";
    print "</office:document>";
}

Save this file as mdwntofodt.akw. Then convert a file with the command

 awk -f mdwntofodt.akw inputfile > outputfile.fodt
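One thing to keep in mind: the script copies header and paragraph text into the XML unchanged, so a literal &, < or > in the input ends up as raw markup in the output. A possible guard, sketched here under the assumption that the text should be escaped before it is wrapped in tags, is a small filter in front of the script. The name xml_escape is hypothetical and this step is not part of the script above.

```shell
# xml_escape: hypothetical pre-filter that replaces the three characters
# that have a special meaning in XML text content.
xml_escape() {
    awk '
        {
            gsub(/&/, "\\&amp;")   # must run first, or it would re-escape the entities below
            gsub(/</, "\\&lt;")    # "\\&" yields a literal & in the replacement
            gsub(/>/, "\\&gt;")
            print
        }
    '
}

# Prints: a &lt; b &amp; c &gt; d
printf '%s\n' 'a < b & c > d' | xml_escape
```

Since the header markers = and == are not touched by the filter, it could be placed in front of the converter, for example: xml_escape < inputfile | awk -f mdwntofodt.akw > outputfile.fodt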

Include correct datestamp

Now that this is working, the next step is to put the proper date into the header. This is probably not vital for the script to work, but it might be useful for future processing of the fodt files.

After reading man date, it seems that date -Iseconds will give something that is close to what we saw in the export from Libreoffice in the hello world example.

Let us try this in our awk script.

At the top of the file we add a few lines:

BEGIN {
    cmd = "date -Iseconds";
    cmd | getline dateline
    close(cmd)

The line with BEGIN was already there; the three lines below it are new.

With these three lines we have read the current date and time into the variable "dateline". So now we have to put this into the proper line in the fodt header:

print " <office:meta><meta:creation-date>" dateline "</meta:creation-date></office:meta>";

This is a modification of the line in the header in the hello world example above.
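The -I option is not specified by POSIX, so a date command on another system may not accept it. As a hedged alternative, the sketch below reads the timestamp with an explicit format string instead. The function name timestamp_line is made up, and the format is an assumption: it leaves out the timezone offset that -Iseconds prints.

```shell
# timestamp_line: hypothetical helper that prints the meta line with the
# current date, read into awk through a pipe and getline.
timestamp_line() {
    awk 'BEGIN {
        # Assumption: explicit format instead of -Iseconds (no timezone offset)
        cmd = "date +%Y-%m-%dT%H:%M:%S"
        if ((cmd | getline dateline) <= 0)   # getline returns 1 when a line was read
            dateline = ""                    # fall back to an empty date on failure
        close(cmd)
        print " <office:meta><meta:creation-date>" dateline "</meta:creation-date></office:meta>"
    }'
}

timestamp_line
```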

After some trial and error it seems that Libreoffice will not show this timestamp in the document properties, probably because we have minimized the header too much.

But when we open this file in Libreoffice, export it as a "real" odt file, close Libreoffice, open Libreoffice again, open the "real" odt file and export that as a flat XML ODF file, we see that the original timestamp is still there. So although Libreoffice doesn't show the timestamp in the properties window, it still retained it.

My conclusion is that although the timestamp is not shown in the properties window, it is still useful to keep these few lines in the awk script.
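To check the timestamp in a fodt file without opening the properties window, something like the following sketch could be used. The function name creation_date is hypothetical, and it assumes the opening meta:creation-date tag and the date sit on the same line, as they do in the files generated here.

```shell
# creation_date: hypothetical helper that prints the text that follows
# the opening <meta:creation-date> tag in the given file.
creation_date() {
    awk 'match($0, /<meta:creation-date>[^<]*/) {
        # the opening tag is 20 characters long; skip it and print the rest of the match
        print substr($0, RSTART + 20, RLENGTH - 20)
    }' "$1"
}
```

It would be called as creation_date outputfile.fodt, once on the generated file and once on the re-exported one, to compare the two values.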

Workflow

My workflow is the following:

  • Write text in Vi(m)
  • Parse it to convert to fodt
  • Open in Libreoffice, do some final editing
  • Save as .odt file and mail that to the publisher

The final editing involves characters like é and ï. Not many of these characters are used in the Dutch language, but typing them in Vim and converting them to odt is always a pain. So I just type an e or i and then, in the final editing, use the Libreoffice spellchecker to insert the right characters.

Parse proper markdown to flat XML ODF

If you want to use proper markdown format, just replace = and == as the level 1 and level 2 header markers with # and ##.

With this you can parse markdown, but only if the markdown file uses no other elements besides level 1 headers, level 2 headers and paragraphs.
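As a sketch of that change, these are the two header rules with # and ## as markers, wrapped in a shell function so they can be tried on their own. The name md_headers is made up, and the paragraph handling is left out here: other lines are simply passed through.

```shell
# md_headers: hypothetical helper showing only the two header rules
# with markdown markers; all other lines are printed unchanged.
md_headers() {
    awk '
        /^# /  { print "<text:h text:outline-level=\"1\">" substr($0, 2) "</text:h>"; next }
        /^## / { print "<text:h text:outline-level=\"2\">" substr($0, 3) "</text:h>"; next }
               { print }
    '
}

# First line printed: <text:h text:outline-level="1"> Title</text:h>
printf '%s\n' '# Title' '## Sub' 'text' | md_headers
```

The substr offsets stay the same as with = and ==, because the markers have the same length. As in the original rules, the space after the marker stays in the heading text.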

Final awk script

So here is the final awk script, including the timestamp in the ODF header:

BEGIN {
    cmd = "date -Iseconds";
    cmd | getline dateline 
    close(cmd)
    print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    print "<office:document xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\" office:mimetype=\"application/vnd.oasis.opendocument.text\" xmlns:style=\"urn:oasis:names:tc:opendocument:xmlns:style:1.0\" xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\" xmlns:dom=\"http://www.w3.org/2001/xml-events\"  xmlns:css3t=\"http://www.w3.org/TR/css3-text/\">";
    print " <office:meta><meta:creation-date>" dateline "</meta:creation-date></office:meta>";
    print " <office:body>";
    print " <office:text>";
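    in_paragraph = 0;   # 1 while a <text:p> element is open
    blank_line = 1;     # start at 1 so text on the very first line opens a paragraph
}

# Level 1 header: close an open paragraph first, then print the heading
/^= /{   if (in_paragraph == 1) {
        print "</text:p><text:p text:style-name=\"Standard\"/>";
        in_paragraph = 0;
    }
    blank_line = 1;     # text right after a header starts a new paragraph
    $0 = "<text:h text:outline-level=\"1\">" substr($0, 2) "</text:h>"; print; next;
}
# Level 2 header: same as above, one level deeper
/^== /{  if (in_paragraph == 1) {
        print "</text:p><text:p text:style-name=\"Standard\"/>";
        in_paragraph = 0;
    }
    blank_line = 1;     # text right after a header starts a new paragraph
    $0 = "<text:h text:outline-level=\"2\">" substr($0, 3) "</text:h>"; print; next;
}
# Blank line: only remember it; the next text line starts a new paragraph
/^$/ { blank_line = 1; next; }

# Any other line is paragraph text
{
    # open a new paragraph when a blank line (or header) came before
    if (blank_line == 1) {
        if (in_paragraph == 1) {
            print "</text:p><text:p text:style-name=\"Standard\"/>";
        }
        print "<text:p text:style-name=\"Standard\">";
        in_paragraph = 1;
        blank_line = 0;
    }
    print;
}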

END {

    if (in_paragraph == 1) {
        print "</text:p><text:p text:style-name=\"Standard\"/>";
    }
    print "  </office:text>";
    print " </office:body>";
    print "</office:document>";
}