I am going to steal a tiny bit of intellectual property from my company. In my defence, it’s something I wrote and it’s not vital to our business model, just some tedious low-level string processing.

Take a few seconds to try to understand what this snippet does (no more than that; I will explain what happens in a bit):

from nltk.tokenize import sent_tokenize

def split_sentences(text):
    sentences = iter(sent_tokenize(text))
    try:
        left = next(sentences)
    except StopIteration:
        return
    start = text.index(left)
    try:
        right = next(sentences)
    except StopIteration:
        yield text
        return
    end = text.index(right, start + len(left))
    yield text[0:end]
    while True:
        left = right
        start = end
        try:
            right = next(sentences)
        except StopIteration:
            yield text[start:]
            return
        end = text.index(right, start + len(left))
        yield text[start:end]

This piece of code is hard to understand, but not because it’s especially clever. Rather, it’s doing some tedious things that are hard to follow.

Here is the docstring:

Split text into sentences, preserving the spaces/newlines in-between.

Sentences are resolved using nltk.tokenize.sent_tokenize. Because nltk strips the sentences of surrounding spaces, we have to restore them:

  • The first sentence will keep preceding and succeeding spaces.
  • The rest of the sentences will only keep the succeeding spaces.
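To make those two rules concrete, here is a small sketch of the intended behaviour. It is not the original code: `stand_in_tokenize` is a hypothetical regex replacement for `sent_tokenize` (so the example runs without nltk or its data files), and `split_keeping_spaces` is a compact illustration of the same slicing idea, assuming every sentence appears verbatim in the text.

```python
import re


def stand_in_tokenize(text):
    # Hypothetical stand-in for nltk.tokenize.sent_tokenize: returns the
    # sentences with the surrounding whitespace stripped.
    return re.findall(r"[^\s.!?][^.!?]*[.!?]", text)


def split_keeping_spaces(text, tokenize=stand_in_tokenize):
    # Illustrative sketch only: cut the text at the start of every
    # sentence except the first, so no whitespace is lost.
    sentences = tokenize(text)
    if not sentences:
        return []
    cuts, pos = [0], 0
    for i, sentence in enumerate(sentences):
        # Search from the end of the previous sentence onwards.
        pos = text.index(sentence, pos)
        if i > 0:
            cuts.append(pos)
        pos += len(sentence)
    cuts.append(len(text))
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


text = "  Hello there.  How are you?\nFine.  "
pieces = split_keeping_spaces(text)
print(pieces)  # ['  Hello there.  ', 'How are you?\n', 'Fine.  ']
```

The first piece keeps the spaces on both sides, the others keep only what follows them, and joining the pieces gives back the original text unchanged.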

Now, if we study the code again, we can start getting a better picture of what’s happening. However, it’s still going to be hard to review and maintain this without comments. What kind of comments are best for something like this?

Let’s try this:

# Find the starting position of the next sentence, starting from the end of the
# previous sentence
end = text.index(right, start + len(left))

This is good because it explains why we use start + len(left) as the second argument to .index, but you would need a comment like this for nearly every line, and the code would become bloated fast.
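To see why that offset matters, here is a small illustration with made-up text (the string and variable values are assumptions for the example, not taken from the original code). When two identical sentences follow each other, a plain `.index` call finds the first one again:

```python
text = "No.  No.  Yes."
left = right = "No."  # two identical sentences in a row

start = text.index(left)                    # 0
naive_end = text.index(right)               # 0: finds the first "No." again
end = text.index(right, start + len(left))  # 5: skips past the first one

print(repr(text[start:naive_end]))  # ''
print(repr(text[start:end]))        # 'No.  '
```

Without the second argument the slice would come out empty; with it, the slice covers the first sentence and the spaces after it.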

I believe there’s a way to come up with comments that are both easier to read and better explain what the code does. Here’s what I would do (have done):

def split_sentences(text):
    """ Split text into sentences, preserving the spaces/newlines in-between.

        Sentences are resolved using `nltk.tokenize.sent_tokenize`. Because
        nltk strips the sentences of surrounding spaces, we have to restore
        them:

        - The first sentence will keep preceding and succeeding spaces.
        - The rest of the sentences will only keep the succeeding spaces.
    """

    sentences = iter(sent_tokenize(text))

    try:
        # '  one  two  three  four  '
        #    ^^^
        left = next(sentences)
    except StopIteration:
        return

    # '  one  two  three  four  '
    #  |-^
    start = text.index(left)
    try:
        # '  one  two  three  four  '
        #         ^^^
        right = next(sentences)
    except StopIteration:
        # '  one  '
        #  ^^^^^^^
        yield text
        return

    # '  one  two  three  four  '
    #       |-^
    end = text.index(right, start + len(left))

    # '  one  two  three  four  '
    #  ^^^^^^^
    yield text[0:end]

    while True:
        # '  ...  two  three  four  '
        #         ^^^
        left = right

        # '  ...  two  three  four  '
        #         ^
        start = end
        try:
            # '  ...  two  three  four  '
            #              ^^^^^
            right = next(sentences)
        except StopIteration:
            # '  ...  two  '
            #         ^^^^^
            yield text[start:]
            return
        # '  ...  two  three  four  '
        #            |-^
        end = text.index(right, start + len(left))
        # '  ...  two  three  four  '
        #         ^^^^^
        yield text[start:end]

I have used this type of comment frequently in my work, and the feedback I have gotten from coworkers has been positive. Plus, when I’ve had to work on my own code weeks or months later, I have found these comments very helpful. I hope this serves as inspiration to others.