IIS URL Multiple Specific Character Find/Replace

Posted: November 7, 2014 in Servers
Tags: , , ,

Today’s challenge at CF Webtools for myself was to find and replace any “_” (underscore) characters in a URL .htm file name and replace it with “-” (dash). The list I was given had file names with up to 7 underscores in any position. Example: my_file_name.htm

While I figured this would be a straight-forward task with IIS URL Rewrite, I was wrong.

End the end I found that I either had to create one rule for each possible underscore count or write a custom rewrite rule. I went the one rule per count route. I read in one blog you can only use up to 9 variables ({R:x}).

The other part of the rule was they had to be only in the “/articles/” directory.

My first challenge was just to get the right regular expression in place. What I found out was that the IIS (7.5) UI’s “Test Pattern” utility doesn’t accurately test. In the test this worked:

Input: http://www.test.com/articles/my_test.htm
Pattern: ^.*\/articles\/(.*)_(.*).htm$
Capture Groups: {R:1} : "my", {R:2} : "test"

However, this does not match in real-world testing. #1, don’t escape “/” (forward-slash) (really??). #2 the pattern is only matched against everything after the domain and first slash (http://www.test.com/).

So really, only this works:

Input: http://www.test.com/articles/my_test.htm
Pattern: ^articles/(.*)_(.*).htm$
Capture Groups: {R:1} : "my", {R:2} : "test"

In order to match against up to 8 underscores, you need 8 rules, each one looking for more underscores. So the next one would be:

Input: http://www.test.com/articles/my_test_file.htm
Pattern: ^articles/(.*)_(.*)_(.*).htm$
Capture Groups: {R:1} : "my", {R:2} : "test", {R:3} : "file"

To do this efficiently you just edit the web.config in the web root for that site. The end result ended up being:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <system.webServer>
        <rewrite>
            <rules>
                <rule name="AUSx1" stopProcessing="true">
                    <match url="^articles/(.*)_(.*).htm$" />
                    <action type="Redirect" url="articles/{R:1}-{R:2}.htm" />
                </rule>
                <rule name="AUSx2" stopProcessing="true">
                    <match url="^articles/(.*)_(.*)_(.*).htm$" />
                    <action type="Redirect" url="articles/{R:1}-{R:2}-{R:3}.htm" />
                </rule>
                <rule name="AUSx3" stopProcessing="true">
                    <match url="^articles/(.*)_(.*)_(.*)_(.*).htm$" />
                    <action type="Redirect" url="articles/{R:1}-{R:2}-{R:3}-{R:4}.htm" />
                </rule>
                <rule name="AUSx4" stopProcessing="true">
                    <match url="^articles/(.*)_(.*)_(.*)_(.*)_(.*).htm$" />
                    <action type="Redirect" url="articles/{R:1}-{R:2}-{R:3}-{R:4}-{R:5}.htm" />
                </rule>
                <rule name="AUSx5" stopProcessing="true">
                    <match url="^articles/(.*)_(.*)_(.*)_(.*)_(.*)_(.*).htm$" />
                    <action type="Redirect" url="articles/{R:1}-{R:2}-{R:3}-{R:4}-{R:5}-{R:6}.htm" />
                </rule>
                <rule name="AUSx6" stopProcessing="true">
                    <match url="^articles/(.*)_(.*)_(.*)_(.*)_(.*)_(.*)_(.*).htm$" />
                    <action type="Redirect" url="articles/{R:1}-{R:2}-{R:3}-{R:4}-{R:5}-{R:6}-{R:7}.htm" />
                </rule>
                <rule name="AUSx7" stopProcessing="true">
                    <match url="^articles/(.*)_(.*)_(.*)_(.*)_(.*)_(.*)_(.*)_(.*).htm$" />
                    <action type="Redirect" url="articles/{R:1}-{R:2}-{R:3}-{R:4}-{R:5}-{R:6}-{R:7}-{R:8}.htm" />
                </rule>
                <rule name="AUSx8" stopProcessing="true">
                    <match url="^articles/(.*)_(.*)_(.*)_(.*)_(.*)_(.*)_(.*)_(.*)_(.*).htm$" />
                    <action type="Redirect" url="articles/{R:1}-{R:2}-{R:3}-{R:4}-{R:5}-{R:6}-{R:7}-{R:8}-{R:9}.htm" />
                </rule>
            </rules>
        </rewrite>
    </system.webServer>
</configuration>

In the end this URL:

http://www.domain.com/articles/my_file_foo_bar.htm

becomes:

http://www.domain.com/articles/my-file-foo-bar.htm
Advertisements
Comments
  1. > don’t escape “/” (forward-slash) (really??).

    Yes, really. Why would you escape the slash, it means nothing in regex.

    (Some programming languages use slash as the marker for when a regex literal starts and ends, and in those languages you need to escape any slashes that are not the ending marker, but that’s not a regex escape, it’s the host language.)

    Not sure if/how this translates to IIS URL Rewrite, but the Apache mod_rewrite single rule version would be something like:

    RewriteCond %{REQUEST_URI} ^article/
    RewriteRule ([^_]*)_(.*) $1-$2 [N]

    The N flag will loop back to the start of the rewrites and go through them again – basically a while loop: https://httpd.apache.org/docs/current/rewrite/flags.html#flag_n

    See if you can find a similar functionality in IIS… but if you can’t, change each (.*) to ([^_]) to be more accurate/efficient. (And escape the . before the htm)

  2. Doh – that ([^_]) should of course be ([^_]*)

  3. markkruger says:

    Chris – this is a great post! Nicely done. I’ve highlighted on item below (end the end should be “in the end”). I love posts like this!!

  4. Sugar Daddy says:

    Nice post, Chris. And I agree with Peter about changing your (.*) and the escaping of the dot.

    Also, I’m wondering about the order these rules execute. In other find/replace/regex operations like this (in various languages and ), I’ve discovered that I have to do the larger replaces first, otherwise the smaller ones will catch the first part of the larger ones and leave the end part unchanged, and then, since part has been changed, the longer search attempt no longer “catches”.

    Because if it executes in the listed order — and it were to look at a file with, for example, 4 underscores — it looks like it would replace only the first one and then stop processing, leaving the other three undone.

    Maybe I’m not understanding exactly how the rules are applied, but I just wanted to mention this.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s