
Writing Filters

This document is about writing your own Dexy filters. We'll start with a very simple example, then talk in depth about filter design. Please update to Dexy 0.5.1 before trying these examples (if 0.5.1 isn't out yet, you will need to install Dexy from source, and you should do a 'git pull' every time you start working to get the latest source code).

### Simple Example

You can add custom filters to any Dexy project by creating a folder called 'filters' in the top level of your project. (You can also put your custom filters in a directory called ~/dexy_filters, where they will be available to all your projects.) This directory must contain a blank file named \_\_init\_\_.py, which is how Python recognizes a directory as a [package](http://docs.python.org/tutorial/modules.html#packages). Any filters you define in the 'filters' directory will be available to your project.

Here is a simple example of a filter:
from dexy.dexy_filter import DexyFilter

class SimpleFilter(DexyFilter):
    ALIASES = ['simple']

    def process_text(self, input_text):
        return "This document has %s characters" % len(input_text)

The crucial elements for simple text-based Dexy filters like this are:

+ Subclass the DexyFilter class
+ Define at least 1 unique Alias
+ Define a process_text() method to do the 'work' of your filter. This method should return the text you want to output.

Here are the files in our example:
.
./.dexy
./filters
./filters/__init__.py
./filters/__init__.pyc
./filters/simple_filter.py
./filters/simple_filter.pyc
./hello.txt

The dexy config is:
{
    "hello.txt|simple": {}
}
The input file is:
Hello!

The output generated is:
This document has 7 characters

### Respecting Sections

The process_text method lets you return a single block of text. Sometimes we want to process a document in sections, and preserve those sections. You can write a filter that does this by implementing a process_dict method instead.
from dexy.dexy_filter import DexyFilter
from ordereddict import OrderedDict # on Python 2.7 you can also use 'from collections import OrderedDict'

class SimpleSectionsFilter(DexyFilter):
    ALIASES = ['simplesections']

    def process_dict(self, input_dict):
        output_dict = OrderedDict()
        for section_name, section_text in input_dict.iteritems():
            output_dict[section_name] = "This section has %s characters" % len(section_text)
        return output_dict

Your process_dict method will receive an OrderedDict of sections, and you should return an OrderedDict.

Here are the files in our example:
.
./.dexy
./filters
./filters/__init__.py
./filters/__init__.pyc
./filters/simple_sections_filter.py
./filters/simple_sections_filter.pyc
./hello.txt

The dexy config is:
{
    "hello.txt|lines|simplesections": {}
}
We pass our input file through the 'lines' filter to split it into sections. This means that each line of the file is put into a different section. The input file is:
Hello!
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

The results are:
Section: 1
Contents: This section has 6 characters

Section: 2
Contents: This section has 124 characters
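
To see the section handling in isolation, here is a minimal sketch that runs without Dexy. `split_into_line_sections` is a hypothetical stand-in for the 'lines' filter: judging from the results above, we assume that sections are named "1", "2", and so on, and that each one holds a line without its trailing newline. It is not Dexy's implementation.

```python
from collections import OrderedDict


def split_into_line_sections(text):
    """Hypothetical stand-in for the 'lines' filter: one section per line."""
    sections = OrderedDict()
    for number, line in enumerate(text.splitlines(), 1):
        # Assumption: sections are named "1", "2", ... and drop the newline.
        sections[str(number)] = line
    return sections


def count_section_characters(input_dict):
    """Same logic as SimpleSectionsFilter.process_dict, outside of Dexy."""
    output_dict = OrderedDict()
    for section_name, section_text in input_dict.items():
        output_dict[section_name] = "This section has %s characters" % len(section_text)
    return output_dict


if __name__ == "__main__":
    sections = split_into_line_sections("Hello!\nSecond line\n")
    for name, contents in count_section_characters(sections).items():
        print("Section: %s" % name)
        print("Contents: %s" % contents)
```

Because the input and output are both ordered, the sections come out in the same order they went in, which is the point of using an OrderedDict rather than a plain dict.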

### Testing Filters

Your filters are just Python classes, so you can test them using nose or any other Python testing tool. Dexy also has a 'test' filter.

Here is an example of a unit test:
from filters.simple_filter import SimpleFilter

def test_simple_filter():
    f = SimpleFilter()
    assert f.process_text("hello!\n") == "This document has 7 characters"
.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

Here is an example of using Dexy's test filter:
{
    "hello.txt|simple": {}, 
    "hello.txt|simple|test": {
        "test-expects": "This document has 7 characters"
    }
}
Adding /mnt/build-dexy-site/dexy-site/docs/guide/writing-filters/ex3 to python sys.path so your custom filters in /mnt/build-dexy-site/dexy-site/docs/guide/writing-filters/ex3/filters will be available
batch id is 1
sorting 2 documents into run order, there are 0 total dependencies
testing hello.txt|simple|test ... ok 

### File Extensions

Each Dexy filter defines INPUT_EXTENSIONS and OUTPUT_EXTENSIONS. These are lists of the file extensions that the filter is capable of taking in and putting out.

At the beginning of each run, the filter will figure out what extension it should output. It will usually be the first element in the list of OUTPUT_EXTENSIONS, but the filter will check whether the next filter in line can accept that as an input. If not, it tries the rest of the OUTPUT_EXTENSIONS until it finds one. If the filter can't output anything the next filter can accept, then Dexy will raise an exception to let you know that your combination of filters is impossible.

The file extension is available within your filters as self.artifact.ext. Here is a filter that outputs plain text or HTML:
from dexy.dexy_filter import DexyFilter

class FileExtensionAwareFilter(DexyFilter):
    ALIASES = ['fileextension']
    INPUT_EXTENSIONS = ['.txt']
    OUTPUT_EXTENSIONS = ['.txt', '.html']

    def process_text(self, input_text):
        if self.artifact.ext == '.txt':
            return "This is a text file.\n\n%s" % input_text
        elif self.artifact.ext == '.html':
            return "<p>This is a HTML file.</p><br /><p>%s</p>" % input_text
        else:
            raise Exception("unexpected file extension %s" % self.artifact.ext)

The default file extension will be '.txt' because that is listed first in the OUTPUT_EXTENSIONS array. To force our filter to output HTML, we need to have another filter after it that only accepts HTML. The 'h' filter is a filter that does exactly that. It doesn't change the text it receives, it just says "you must give me HTML" so the previous filter will output HTML.

Here is our .dexy config:
{
    "hello.txt|fileextension": {}, 
    "hello.txt|fileextension|h": {}
}
Here is the plain text output:
This is a text file.

Hello!

And here is the HTML output:
<p>This is a HTML file.</p><br /><p>Hello!
</p>
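
The way a filter settles on its output extension, as described at the start of this section, can be sketched as a small function. This is only an illustration of the rule: `choose_output_extension` is a hypothetical name, and Dexy's own logic may differ in its details.

```python
def choose_output_extension(output_extensions, next_input_extensions=None):
    """Pick the first output extension that the next filter can accept.

    Hypothetical helper, not part of Dexy.
    """
    if next_input_extensions is None:
        # No filter comes next, so use the default (first) extension.
        return output_extensions[0]
    for ext in output_extensions:
        if ext in next_input_extensions:
            return ext
    raise ValueError(
        "none of %r is accepted by the next filter, which takes %r"
        % (output_extensions, next_input_extensions))


if __name__ == "__main__":
    # 'fileextension' on its own: the default is the first entry.
    print(choose_output_extension([".txt", ".html"]))
    # 'fileextension|h': the next filter only accepts HTML.
    print(choose_output_extension([".txt", ".html"], [".html"]))
```

The first call prints `.txt` and the second prints `.html`, matching the two outputs above. A combination such as `[".txt"]` followed by a filter that only takes `[".pdf"]` raises an error, which corresponds to the "impossible combination" case.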

If both the INPUT_EXTENSIONS and OUTPUT_EXTENSIONS are set to ".*", then the file extension will not change when passing through the filter, i.e. if a ".txt" file comes in, it will still be a ".txt" file.

### Try It Yourself

Try creating one or two custom filters of your own now and get them to run. Remember that you can use any Python package in your filter code. Here are some ideas:

* Write a Dexy filter which converts each character in the input to its unicode/ascii number using the ord() function. This can be helpful for debugging the output from previous filters. You can present each character on its own line, or separated by spaces, and you can decide whether or not to print the character itself after its number.
* Write a Dexy filter which converts CleverCSS markup to CSS.
* Write a Dexy filter which takes Sudoku grids in .grid files and solves them, outputting a completed grid.
* Write a Dexy filter which takes CSV data and outputs a pretty table. Your filter should support both plain text and HTML output, and it should figure out which to return based on the file extension.

You can put as many filters as you want in a single module (i.e. a single file), but it's a good idea to put filters that depend on a 3rd party library into a separate module from filters that don't. This is because if you try to import a package that isn't installed, all the filters defined in that module will be unavailable, whether or not those particular filters need the missing package.

### The Process Method

When a Dexy filter is run, it is actually the filter class's process method that is called. The DexyFilter class implements a process method that checks whether methods named process_text, process_dict or process_text_to_dict exist (in that order), and it calls those methods if so. This makes it very easy to implement filters without having to worry about Dexy's internals: you just return text or a dict and that's it.
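
As a rough picture of that lookup, here is a sketch of a base class that tries the convenience methods in the order given. `SketchFilter` is a hypothetical stand-in for DexyFilter, and for simplicity its process method takes the input as an argument and returns the result, which the real process method does not do.

```python
class SketchFilter(object):
    """Hypothetical stand-in for DexyFilter, used only in this sketch."""

    # The order in which the convenience methods are looked up.
    CONVENIENCE_METHODS = ("process_text", "process_dict", "process_text_to_dict")

    def process(self, input_value):
        for method_name in self.CONVENIENCE_METHODS:
            method = getattr(self, method_name, None)
            if method is not None:
                # The first convenience method that exists does the work.
                return method(input_value)
        raise NotImplementedError(
            "define one of: %s" % ", ".join(self.CONVENIENCE_METHODS))


class CountingFilter(SketchFilter):
    def process_text(self, input_text):
        return "This document has %s characters" % len(input_text)


if __name__ == "__main__":
    print(CountingFilter().process("Hello!\n"))
```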

However, many filters need to do more complex things than this, and so rather than implementing one of these convenience methods, they will override the process method instead. When working with the process method, you need to be aware of the filter's artifact. The artifact is responsible for persisting the content that is generated in the filter, and the artifact is also what takes care of caching so your filter only gets run when it needs to.

If you implement a process method, then at a minimum you need to either:

* Save your output content in the artifact's data_dict, either by assigning an OrderedDict to this attribute directly, or by calling the artifact's set_data method. This only works with non-binary data.

or

* Save your output content under the correct filename for the artifact.

We will look at SubprocessStdout filters, which take the first approach, and Subprocess filters, which take the second approach.

### SubprocessStdoutFilters

Rather than subclassing DexyFilter, you can also subclass any other Dexy filter to recycle that filter's functionality. Many Dexy filters don't have any code, they just subclass a filter and change some class constants. The SubprocessStdoutFilter is designed to be easily subclassed to implement new filters that run an executable on an input file (with optional command line arguments) and return whatever gets written to STDOUT.

Here is the process method of the SubprocessStdoutFilter class:
    def process(self):
        command = self.command_string_stdout()
        proc, stdout = self.run_command(command, self.setup_env())
        self.handle_subprocess_proc_return(command, proc.returncode, stdout)
        self.artifact.set_data(stdout)
The previous_artifact_filename attribute of the artifact stores the cache file location of the previous artifact's output, which is this artifact's input. This filename is the one that our executable will run on. (The filename is written to the log so you can inspect the file, and even run it manually, which can be useful for troubleshooting.) In the last line, the contents of stdout are passed to the artifact's set_data method.
    def set_data(self, data):
        self.data_dict['1'] = data
Later, Dexy will automatically save the contents of the data_dict in the cache. The default command string is:
    def command_string_stdout(self):
        args = {
            'prog' : self.executable(),
            'args' : self.command_line_args() or "",
            'scriptargs' : self.command_line_scriptargs() or "",
            'script_file' : self.artifact.previous_artifact_filename
        }
        return "%(prog)s %(args)s %(script_file)s %(scriptargs)s" % args
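
To get a feel for what this template produces, here is a small standalone sketch that fills in the same format string with sample values. The function name and the file names are hypothetical; in Dexy the script file would be the cache file location of the previous artifact.

```python
# The same template as command_string_stdout, outside of Dexy.
COMMAND_TEMPLATE = "%(prog)s %(args)s %(script_file)s %(scriptargs)s"


def build_stdout_command(prog, script_file, args="", scriptargs=""):
    """Hypothetical helper: assemble a command line from its four parts."""
    return COMMAND_TEMPLATE % {
        "prog": prog,
        "args": args,
        "script_file": script_file,
        "scriptargs": scriptargs,
    }


if __name__ == "__main__":
    # No arguments at all: the empty slots leave extra spaces behind.
    print(repr(build_stdout_command("python", "hello.py")))
    # 'python  hello.py '

    # Arguments intended for the script itself go after the file name.
    print(repr(build_stdout_command("bash -e", "script.sh", scriptargs="one two")))
    # 'bash -e  script.sh one two'
```

The doubled and trailing spaces are harmless when the string is passed to a shell, so the template does not need to special-case empty arguments.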
So, in the simplest cases, you can create a new filter just by subclassing this filter and setting some constants, like this:
class BashSubprocessStdoutFilter(SubprocessStdoutFilter):
    ALIASES = ['sh', 'bash']
    EXECUTABLE = 'bash -e'
    INPUT_EXTENSIONS = [".sh", ".bash", ".txt", ""]
    OUTPUT_EXTENSIONS = [".txt"]
    VERSION_COMMAND = 'bash --version'
Or this:
class PythonSubprocessStdoutFilter(SubprocessStdoutFilter):
    ALIASES = ['py', 'pyout']
    EXECUTABLE = 'python'
    INPUT_EXTENSIONS = [".py", ".txt"]
    OUTPUT_EXTENSIONS = [".txt"]
    VERSION_COMMAND = 'python --version'
If you need to pass arguments in a different order, or pass different arguments, you can override the command_string_stdout method, for example:
class CowsaySubprocessStdoutFilter(SubprocessStdoutFilter):
    ALIASES = ['cowsay']
    EXECUTABLE = 'cowsay'
    INPUT_EXTENSIONS = [".txt"]
    OUTPUT_EXTENSIONS = [".txt"]

    def command_string_stdout(self):
        args = self.command_line_args() or ""
        text = self.artifact.input_text()
        return "%s %s \"%s\"" % (self.executable(), args, text)
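
One thing to watch when you put input text straight into a command string like this: if the text itself contains double quotes or other shell characters, the command can break. The sketch below is a variation, not the cowsay filter's actual code, and it assumes the command string is handed to a shell. It uses the standard library's `quote` function to escape the text; `cowsay_command` is a hypothetical helper name.

```python
import shlex

try:
    from shlex import quote  # Python 3.3 and later
except ImportError:
    from pipes import quote  # Python 2


def cowsay_command(executable, args, text):
    """Hypothetical helper: build a command with the text safely quoted."""
    return "%s %s %s" % (executable, args, quote(text))


if __name__ == "__main__":
    command = cowsay_command("cowsay", "", 'He said "moo"')
    print(command)
    # shlex.split parses the string the way a POSIX shell would,
    # so we can check that the text arrives as a single argument.
    print(shlex.split(command))
```

Inside a filter, the same idea would mean wrapping the input text in `quote(...)` instead of surrounding it with `\"` by hand.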

If you created your own custom filter earlier, then try to create another one now by subclassing SubprocessStdoutFilter. Set the EXECUTABLE to be the name of the command you want to call. Remember, this command should just print its output, not write it to a file (the next section deals with programs that write their output to a file).

You can override command_string_stdout if you need to, but remember that you can include arguments in the EXECUTABLE string and you can pass additional arguments to your filter from the .dexy file, so think about those options first. In the Ragel-for-Ruby filter, we always want to call ragel with the -R option, so we include this in the EXECUTABLE string. You can pass arguments to any SubprocessFilter-based filter by including an 'args' dict in your .dexy file, where each key is a filter alias and the value is the args you want to pass to that filter. For examples check out the cowsay filter docs.
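
As an illustration of the 'args' dict just described, the snippet below builds a config entry and prints it as JSON in the same shape as the .dexy files shown earlier. The exact placement of 'args' inside the document's entry is our reading of the description above, and the `-f tux` value is just a sample cowsay option, so check the cowsay filter docs before relying on it.

```python
import json

# Assumed layout: an 'args' dict inside the document's entry, where each
# key is a filter alias and each value is the argument string to pass.
config = {
    "hello.txt|cowsay": {
        "args": {
            "cowsay": "-f tux"
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(config, indent=4))
```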

### SubprocessStdoutInputFilters

In the previous section, we just ran a command on a file and captured the output. Sometimes we also need to have additional inputs. For example, we might write a sed script and want to run this through the sed filter, along with one or more text files. Or, your Python or Ruby script might read STDIN to get user input, and you want to simulate this in your documentation.

In these cases, your sed, python or ruby script is the file that gets put through the filter. Dexy uses inputs to pass other information to a filter. The SubprocessStdoutInputFilter class handles this for you; you can subclass it if you need to create a filter where you run your script and also pass additional information to it.

Check out the sed, shinput (bash) and pyinput (python) filter docs for some examples. The sed filter also overrides command_string_stdout.

These examples mostly have plain text files being used as inputs, but inputs can be any other Dexy document, so you can use jinja templates, or the output from other scripts, or pretty much anything else as your additional inputs.
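
The underlying idea, running a script while feeding it some text on STDIN and capturing what it prints, can be tried out with nothing but the standard library. This is a generic sketch of the technique, not the SubprocessStdoutInputFilter implementation, and `run_with_stdin` is a hypothetical name.

```python
import subprocess
import sys


def run_with_stdin(command, input_text):
    """Run a command, send input_text to its STDIN, return (code, output)."""
    proc = subprocess.Popen(
        command,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT)  # fold error messages into the output
    stdout, _ = proc.communicate(input_text.encode("utf-8"))
    return proc.returncode, stdout.decode("utf-8")


if __name__ == "__main__":
    # A tiny script that reads "user input" and echoes it in upper case.
    script = "import sys; sys.stdout.write(sys.stdin.read().upper())"
    returncode, output = run_with_stdin([sys.executable, "-c", script], "hello\n")
    print(returncode)
    print(output)
```

Here the script plays the role of the document going through the filter, and the text passed on STDIN plays the role of the input.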

### SubprocessFilters

With some tools, the natural thing to do is to capture STDOUT. With others, it makes more sense to have the executable write its output directly to a file, particularly where the output is binary content, like an image.

Here is the process method of the SubprocessFilter class:

    def process(self):
        command = self.command_string()
        proc, stdout = self.run_command(command, self.setup_env())
        self.handle_subprocess_proc_return(command, proc.returncode, stdout)
        self.artifact.stdout = stdout

The default command string is:

    def command_string(self):
        args = {
            'prog' : self.executable(),
            'args' : self.command_line_args() or "",
            'scriptargs' : self.command_line_scriptargs() or "",
            'script_file' : self.artifact.previous_artifact_filename,
            'output_file' : self.artifact.filename()
        }
        return "%(prog)s %(args)s %(script_file)s %(scriptargs)s %(output_file)s" % args

In this case, the generated content is written directly to the cache under the correct file name, which is available from self.artifact.filename(). The self.artifact.stdout attribute is set to whatever gets written to stdout or stderr, which is going to be debugging or error messages.
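
The pattern described here, where the command writes its real output to a file and anything it prints is kept only as messages, looks roughly like this outside of Dexy. It is a sketch of the general approach under our own assumptions: `run_writing_file` is a hypothetical helper, and the error handling is not what Dexy's handle_subprocess_proc_return does.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def run_writing_file(command, output_path):
    """Run a command that should create output_path; return what it printed."""
    proc = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    messages, _ = proc.communicate()
    messages = messages.decode("utf-8", "replace")
    if proc.returncode != 0:
        raise RuntimeError("command failed (%s): %s" % (proc.returncode, messages))
    if not os.path.exists(output_path):
        raise RuntimeError("command did not create %s" % output_path)
    return messages


if __name__ == "__main__":
    # A stand-in "tool" that writes a file and prints a status message.
    script = "import sys; open(sys.argv[1], 'w').write('generated'); print('done')"
    work_dir = tempfile.mkdtemp()
    try:
        output_path = os.path.join(work_dir, "output.txt")
        messages = run_writing_file(
            [sys.executable, "-c", script, output_path], output_path)
        print("messages: %r" % messages)
        print("file exists: %s" % os.path.exists(output_path))
    finally:
        shutil.rmtree(work_dir)
```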

Here is a simple filter that just needs constants set:

class Ps2PdfSubprocessFilter(SubprocessFilter):
    """
    Converts a postscript file to PDF format.
    """
    ALIASES = ['ps2pdf']
    EXECUTABLE = 'ps2pdf'
    INPUT_EXTENSIONS = [".ps", ".txt"]
    OUTPUT_EXTENSIONS = [".pdf"]

Here is a filter that overrides the command_string method:

class RagelRubySubprocessFilter(SubprocessFilter):
    """
    Generates ruby source code from a ragel file.
    """
    ALIASES = ['rlrb', 'ragelruby']
    BINARY = False
    EXECUTABLE = 'ragel -R'
    FINAL = False
    INPUT_EXTENSIONS = [".rl"]
    OUTPUT_EXTENSIONS = [".rb"]
    VERSION_COMMAND = 'ragel --version'

    def command_string(self):
        wf = self.artifact.previous_artifact_filename
        of = self.artifact.filename()
        return "%s %s -o %s" % (self.executable(), wf, of)

You can also do other setup work in the command_string method:

class Html2PdfSubprocessFilter(SubprocessFilter):
    """
    Renders HTML to PDF using wkhtmltopdf. If the HTML relies on assets such as
    CSS or image files, these should be specified as inputs.

    If you have an older version of wkhtmltopdf, and are running on a server,
    you may get XServer errors. You can install xvfb and run Dexy as
    "xvfb-run dexy". Or upgrade to the most recent wkhtmltopdf which only needs
    X11 client libs.
    """
    ALIASES = ['html2pdf', 'wkhtmltopdf']
    EXECUTABLE = 'wkhtmltopdf'
    INPUT_EXTENSIONS = [".html", ".txt"]
    OUTPUT_EXTENSIONS = [".pdf"]
    VERSION_COMMAND = 'wkhtmltopdf --version'

    def command_string(self):
        # Create a temporary directory and populate it with all inputs.
        self.artifact.create_temp_dir(populate=True)
        workfile = os.path.join(self.artifact.hashstring, self.artifact.previous_canonical_filename)

        args = {
            'prog' : self.executable(),
            'in' : workfile,
            'out' : self.artifact.filename()
        }
        return "%(prog)s %(in)s %(out)s" % args

And, here is a filter that overrides the process method while still taking advantage of several helper methods defined in the SubprocessFilter class:

class Pdf2ImgSubprocessFilter(SubprocessFilter):
    """
    Converts a PDF file to a PNG image using ghostscript (subclass this to
    convert to other image types).

    Returns the image generated by page 1 of the PDF by default, the optional
    'page' parameter can be used to specify other pages.
    """
    ALIASES = ['pdf2img', 'pdf2png']
    EXECUTABLE = "gs"
    GS_DEVICE = 'png16m -r300'
    INPUT_EXTENSIONS = ['.pdf']
    OUTPUT_EXTENSIONS = ['.png']
    VERSION_COMMAND = "gs --version"

    def command_string(self):
        s = "%(prog)s -dSAFER -dNOPAUSE -dBATCH -sDEVICE=%(device)s -sOutputFile=%%d-%(out)s %(in)s"
        args = {
            'prog' : self.executable(),
            'device' : self.GS_DEVICE,
            'in' : self.artifact.previous_artifact_filename,
            'out' : self.artifact.filename()
        }
        return s % args

    def process(self):
        command = self.command_string()
        proc, stdout = self.run_command(command, self.setup_env())
        self.artifact.stdout = stdout
        self.handle_subprocess_proc_return(command, proc.returncode, stdout)

        if self.artifact.args.has_key('page'):
            page = self.artifact.args['page']
        else:
            page = 1

        page_file = os.path.join(self.artifact.artifacts_dir, "%s-%s" % (page, self.artifact.filename()))
        shutil.copyfile(page_file, self.artifact.filepath())

And it, in turn, can be subclassed:

class Pdf2JpgSubprocessFilter(Pdf2ImgSubprocessFilter):
    ALIASES = ['pdf2jpg']
    GS_DEVICE = 'jpeg'
    OUTPUT_EXTENSIONS = ['.jpg']
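
A detail in the Pdf2ImgSubprocessFilter above that is easy to miss is the `%%d` in its command template. Python's string formatting turns `%%` into a literal `%`, so ghostscript receives `%d` in the output file name and replaces it with each page number. The sketch below walks through that with a made-up file name (`report.png`); the last step imitates ghostscript's substitution with Python formatting purely for illustration.

```python
def page_file_name(page, output_name):
    """The per-page file name the filter looks for, e.g. '1-report.png'."""
    return "%s-%s" % (page, output_name)


# Step 1: Python formatting fills in the output name and unescapes '%%'.
option = "-sOutputFile=%%d-%(out)s" % {"out": "report.png"}

# Step 2: ghostscript substitutes the page number for '%d' (imitated here).
pattern = option.split("=", 1)[1]
first_page = pattern % 1

if __name__ == "__main__":
    print(option)                           # -sOutputFile=%d-report.png
    print(first_page)                       # 1-report.png
    print(page_file_name(1, "report.png"))  # 1-report.png
```

This is why the process method can find the requested page by building a name of the form "page-filename" and copying that file into place.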

Content © 2011 Dr. Ana Nelson | Site Design © Copyright 2011 Andre Gagnon | All Rights Reserved.