Ultimate Regular Expression for HTML tag parsing with PHP

Tonight I found the ultimate regex to get HTML tags out of a string. It was written a year ago by Phil Haack on his blog. His regex is quite bullet-proof: it’s able to parse HTML tags written on multiple lines which contain any sort of attributes (with or without a value, with single or double quotes).

Unfortunately, his regular expression was designed for Microsoft .NET, so I’ve spent some time converting it to PHP. Here is the result:

$regex = "/<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

And finally, my version based on the one above:

$regex = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

The latter includes the following enhancement:

  • accepts hyphens as middle characters in attribute names (thanks Ged)
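
As a rough illustration of how the updated pattern might be used, here is a minimal sketch that runs it through preg_match_all() on a made-up HTML fragment. The sample string and the variable names $html and $matches are assumptions for the example only; the pattern itself is copied unchanged from above.

```php
<?php
// Updated pattern from this post, copied unchanged.
$regex = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

// Hypothetical sample input: single-quoted, unquoted, value-less and
// hyphenated attributes, with one tag spread over two lines.
$html = "<p class='intro'>Hello<br/>\n"
      . "<input type=checkbox checked\n   data-id=\"42\">world</p>";

// $matches[0] receives every full tag match, in document order.
preg_match_all($regex, $html, $matches);

foreach ($matches[0] as $tag) {
    echo $tag, "\n";
}
```

With this input, $matches[0] should hold four entries: the opening p tag, the self-closing br, the two-line input tag, and the closing p tag. Note that quoted attribute values are matched with `.`, which does not cross line breaks unless the `s` modifier is added.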

24 Responses to “Ultimate Regular Expression for HTML tag parsing with PHP”


  1. 1 Rodrigo Polo

    sorry, but isn’t working.

  2. 2 kev

    Because of the WordPress parsing algorithm, an extra space was introduced between $regex = "/< and \/?\w+.... I’ve fixed this and it should be OK now. Can you retry, please?

  3. 3 BazzA

    Hello, cheers for this. However I’ve noticed the last quote of the regex (ending the string) is one of those horrible Word-style quote marks. So anyone copy and pasting this should replace it with a ‘proper’ one.

  4. 4 kev

    However I’ve noticed the last quote of the regex (ending the string) is one of those horrible Word-style quote marks.

    Good catch! Thanks BazzA, I think you’ve found a WordPress bug. Let me investigate this issue…

  5. 5 kev

    However I’ve noticed the last quote of the regex (ending the string) is one of those horrible Word-style quote marks.

    I’ve just fixed the post content.

  6. 6 kev

    BTW, if you need to see a real-world use of this regexp, read the importImagesFromPost() function from my Wordpress to e107 v0.8 script. You will find there an adaptation of the regexp featured in this article.

  7. 7 Ger

    heh.. this does not catch

    [HTML tags stripped by Wordpress]

    for example. and this:

    [HTML tags stripped by Wordpress]

    aha, that’s because of the http-equiv attribute
    But for my needs it fits anyway – thanks a lot :)

  8. 8 Ger

    also, I guess it’s not a good idea to strip HTML on a forum about HTML. Converting it to entities would be better, wouldn’t it? ;)

  9. 9 kev

    Hi Ger, sorry about the tag deletion, but that’s the default WordPress policy. I think you should convert to entities manually.

    Regarding your regular expression issue, I guess you were talking about something similar to:

    <meta http-equiv="Refresh" content="5">

    So yes, this tag doesn’t match because my regular expression wasn’t considering http-equiv as a valid attribute. This is of course wrong, as the HTML specs allow hyphens in attribute names.

    I’ve updated my regular expression in the blog post.
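
To make the difference concrete, here is a small sketch comparing the two patterns from the post on that meta tag. The variable names $original, $updated and $tag are made up for the example. The only change between the two patterns is the attribute-name part: \w+ became (\w|\w[\w-]*\w), which accepts hyphens between the first and last character of a name.

```php
<?php
// First pattern from the post: attribute names are matched by \w+ only.
$original = "/<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

// Updated pattern: attribute names may contain hyphens in the middle.
$updated = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

$tag = '<meta http-equiv="Refresh" content="5">';

// preg_match() returns 1 when the pattern matches and 0 when it does not.
$matchedByOriginal = preg_match($original, $tag);
$matchedByUpdated  = preg_match($updated, $tag);

var_dump($matchedByOriginal); // int(0): the hyphen in http-equiv stops the match
var_dump($matchedByUpdated);  // int(1)
```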

  10. 10 Pakistan Peshawar

    Which regex pattern can I use to get all HTML anchor tags from an HTML document? I used this one:

    preg_match_all("/<a>[\s-\w+&@#\/%?=~_|!:,.;\"']+/i",$cr,$pm,PREG_SET_ORDER);

    It works, but it does not parse anchors that have images inside, e.g.:

    </a><a href="test.html" rel="nofollow"> </a>

    Can someone give me the right one, so that I can get all types of anchors from an HTML document?
    Thanks

  11. 11 bharani

    Please tell me the regular expression for the br tag.

  12. 12 Craig

    Hi,

    Thanks for publishing this code, it’s very useful. However, I’m trying to catch and remove non-standard tags generated by MS Word, which contain hyphens in the tag (eg ) and colons in the attributes (eg. w:st="on"), and I notice that this code doesn’t currently pick these up. Any chance of an amended version?

    Many thanks

  13. 13 Casey Wise

    Very helpful regex assistance, thank you!

  14. 14 Hamza

    Hi there,

    can anyone write a good regular expression like this one for Python, to remove all HTML tags? I really need one.

    thanks in advance

  15. 15 kev

    @Hamza: take a look here.

  16. 16 Johnny B

    HTML isn’t a regular language, so advising people to parse it with regular expressions is.. not smart.

    See http://htmlparsing.icenine.ca for more information.

  17. 17 kev

    advising people to parse it with regular expressions is.. not smart.

    I agree.

    To clarify: this code is far from being a good practice. It’s just a hack intended to get rid of HTML tag soup.

    Now, a little bit of context: PHP is not my language of choice and at the time I wrote this article I didn’t find any PHP library that is tolerant of tag soup. Hence the hack.

    For Python, the language I practice every day, I recommend using lxml, especially its lxml.html module. And on that subject, don’t miss Ian Bicking’s post: “lxml: an underappreciated web scraping library”.

  18. 18 Jonathan Worent

    You should try Tidy. It’s very forgiving of tag soup, and it will allow you to parse the tag soup as an actual DOM (of sorts).

  19. 19 Fred-Eric Lafaille

    Warning: ereg_replace() [function.ereg-replace]: REG_BADRPT

  20. 20 saucy

    Which part allows hyphens? I just need that part.

  21. 21 Nayana Adassuriya

    I want to get all the tag that include tag withing it.

    eg:
    here i want to get "www.google.com" and "google.jpg"

    please help me how can do that with some example.

    thanks
    Nayana Adassuriya

  22. 22 Nayana Adassuriya

    Sorry for the above post; my code was removed automatically.

  23. 23 Nayana Adassuriya

    Simply put, I want to get the “image url” and “link url” when an image is included inside an anchor tag.

    How can I do it?

  1. 1 Python ultimate regular expression to catch HTLM tags at Coolkevmen
    Pingback on Jul 8th, 2008 at 0:24
