Writing a regular expression (regex) for a web URL in PHP

I use the TweetDeck client to access my Twitter account and post status updates. When I paste a web URL into a tweet, the URL is first shortened by a URL shortener service such as bit.ly or tiny.cc, and the shortened URL is then hyperlinked.

Naturally, this process can be divided into three parts.

  1. Recognizing whether there is a web URL in the text area and, if so, extracting it.
  2. Calling a URL shortener service API to shorten the extracted URL.
  3. Replacing the web URL text with hyperlinked HTML text for the shortened URL.
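
To make the three steps concrete, here is a minimal sketch of how they might fit together. It borrows the pattern developed later in this post for step 1. The function names link_urls() and shorten_url_stub() and the short.example domain are hypothetical, invented for illustration; the stub only fakes step 2 and does not call any real shortener API.

```php
<?php
// Step 2 stand-in (hypothetical): a real version would call a shortener API.
function shorten_url_stub($url) {
    return "http://short.example/" . substr(md5($url), 0, 6);
}

// Steps 1 and 3: find each URL and replace it with an HTML link
// to whatever the supplied shortener callback returns.
function link_urls($text, $shortener) {
    $pattern  = "(((https?)\:\/\/)|(www\.))";
    $pattern .= "[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(\/[^ ]*)?";

    return preg_replace_callback("/$pattern/", function ($m) use ($shortener) {
        $short = call_user_func($shortener, $m[0]);   // step 2
        $safe  = htmlspecialchars($short, ENT_QUOTES); // escape for HTML output
        return '<a href="' . $safe . '">' . $safe . '</a>'; // step 3
    }, $text);
}

echo link_urls("tf http://technoflirt.com/tech", "shorten_url_stub") . "\n";
```

Passing the shortener as a callback keeps the matching logic independent of whichever service ends up being used in step 2.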

In this post, I look at step 1 of this process: how to identify whether the text contains a web URL. To detect a URL, I use PHP's regex matching function preg_match(). So the question is how to write a regular expression for a web URL. I observed which kinds of URLs the TweetDeck client hyperlinks automatically, and found that it recognizes http://something.com, https://somethingelse.com and www.something.com, but does not treat something.com as a web URL.

So, a web URL should start with http://, https:// or www. (including the trailing dot). The corresponding regular expression is:

$pattern = "(((https?)\:\/\/)|(www\.))";
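
As a quick check of the prefix on its own, the sketch below runs it against the four kinds of strings mentioned above. The $samples array and the $results variable are my own, added only for illustration.

```php
<?php
// Prefix-only pattern from the post: "http://", "https://" or "www."
$pattern = "(((https?)\:\/\/)|(www\.))";

// Hypothetical sample inputs, chosen only for illustration.
$samples = array(
    "http://something.com",
    "https://somethingelse.com",
    "www.something.com",
    "something.com",
);

$results = array();
foreach ($samples as $sample) {
    // preg_match() returns 1 when the pattern is found and 0 when it is not.
    $results[$sample] = preg_match("/$pattern/", $sample);
    echo $sample . " => " . ($results[$sample] ? "prefix found" : "no prefix") . "\n";
}
```

Only the last sample, which has neither a scheme nor a www. prefix, should report "no prefix".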

Next comes the regex for what follows the scheme. The response to a question asked on stackoverflow.com helps in writing the regex for this part, which is based on the following points:

  • string must start with an ASCII letter or number
  • ASCII letters, numbers, dots and dashes follow (no slashes or colons allowed)
  • optional: a port is allowed (":8080")
  • optional: anything after a slash may follow except space

The regex for this part is appended to the first part, as shown below.

$pattern .= "[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(\/[^ ]*)?";
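
To see how the pieces of the combined pattern map onto a URL, the following sketch prints the full match together with the port and path groups. Counting opening parentheses from the left, the port is group 5 and the path is group 6. The two input strings are hypothetical. The second one illustrates a limitation of the "anything except space" rule: punctuation that directly follows a path is swallowed into the match.

```php
<?php
$pattern  = "(((https?)\:\/\/)|(www\.))";
$pattern .= "[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(\/[^ ]*)?";

// Hypothetical inputs, chosen only for illustration.
$withPort  = "see http://localhost:8080/admin?x=1 for details";
$withComma = "read www.something.com/page, then reply";

preg_match("/$pattern/", $withPort, $m);
echo "URL:  {$m[0]}\n"; // whole match
echo "Port: {$m[5]}\n"; // group 5 is (:\d+)
echo "Path: {$m[6]}\n"; // group 6 is (\/[^ ]*)

// The path runs up to the next space, so the comma is part of the match.
preg_match("/$pattern/", $withComma, $n);
echo "URL:  {$n[0]}\n";
```

If trailing punctuation matters for your input, one option is to trim characters such as commas and full stops from the end of each match afterwards.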

Now you can call preg_match("/$pattern/", $text, $result), where $text contains the original text and $result receives the result. $result[0] contains the matched string if there is a match. However, preg_match() returns only the first match of $pattern in $text. If there may be more than one URL in $text, use preg_match_all() to match all of them.
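
The difference between the two functions is easy to see side by side. This is a small sketch with a made-up input string; the variable names $firstCount and $allCount are mine.

```php
<?php
$pattern  = "(((https?)\:\/\/)|(www\.))";
$pattern .= "[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(\/[^ ]*)?";

// Hypothetical input containing two URLs.
$text = "first http://something.com/a then www.something.com/b";

// preg_match() stops after the first match; $result[0] is that match.
$firstCount = preg_match("/$pattern/", $text, $result);
echo "preg_match found $firstCount: {$result[0]}\n";

// preg_match_all() keeps going; with the default flags,
// $matches[0] is the list of full matches.
$allCount = preg_match_all("/$pattern/", $text, $matches);
echo "preg_match_all found $allCount: " . implode(", ", $matches[0]) . "\n";
```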

Here's the PHP script, which takes a block of text in quotes as input and prints any URLs it matches.

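<?php
// Scheme prefix: "http://", "https://" or "www."
$pattern  = "(((https?)\:\/\/)|(www\.))";
// Host, optional port, optional path (anything but a space after the slash)
$pattern .= "[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(\/[^ ]*)?";

// The text to scan is passed as the first command-line argument.
$text = isset($argv[1]) ? $argv[1] : "";
echo "$text\n";

// preg_match_all() collects every match; $matches[0] holds the full URLs.
if (preg_match_all("/$pattern/", $text, $matches)) {
    echo "Match\n";
    foreach ($matches[0] as $url) {
        echo "URL: $url\n";
    }
} else {
    echo "Not match\n";
}
?>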

Here are a few sample runs of this script.

  1. php regex.php "tf http://technoflirt.com/tech ac http://technoflirt.com/noflirt"
    tf http://technoflirt.com/tech ac http://technoflirt.com/noflirt
    Match
    URL: http://technoflirt.com/tech
    URL: http://technoflirt.com/noflirt
  2. php regex.php "tf technoflirt.com"
    tf technoflirt.com
    Not match
  3. php regex.php "tf www.technoflirt.com"
    tf www.technoflirt.com
    Match
    URL: www.technoflirt.com
