Using grouping to pull together text and then test it

grouping, xsl-grouping, xslt, xslt-2.0

So in this grotty extruded typesetting product, I sometimes see links and email addresses that have been split apart. Example:

<p>Here is some random text with an email address <Link>example</Link><Link>@example.com</Link> and here is more random text with a url <Link>http://www.</Link><Link>example.com</Link> near the end of the sentence.</p>

Desired output:

<p>Here is some random text with an email address <email>example@example.com</email> and here is more random text with a url <ext-link ext-link-type="uri" xlink:href="http://www.example.com/">http://www.example.com/</ext-link> near the end of the sentence.</p>

Whitespace between the elements does not appear to occur, which is one blessing.

I can tell I need to use an xsl:for-each-group within the p template, but I can't quite see how to put the combined text from the group through the contains() function so as to distinguish emails from URLs. Help?

Best Solution

Here is an XSLT 1.0 solution based on the identity template, with special treatment for <Link> elements.

<xsl:template match="node()|@*">
  <xsl:copy>
    <xsl:apply-templates select="node()|@*" />
  </xsl:copy>
</xsl:template>

<xsl:template match="Link">
  <xsl:if test="not(preceding-sibling::node()[1][self::Link])">
    <xsl:variable name="link">
      <xsl:copy-of select="
        text()
        |
        following-sibling::Link[
          preceding-sibling::node()[1][self::Link]
          and
          generate-id(current())
          =
          generate-id(
            preceding-sibling::Link[
              not(preceding-sibling::node()[1][self::Link])
            ][1]
          )
        ]/text()
      " />
    </xsl:variable>
    <xsl:choose>
      <xsl:when test="contains($link, '://')">
        <ext-link ext-link-type="uri" xlink:href="{$link}" />
      </xsl:when>
      <xsl:when test="contains($link, '@')">
        <email>
          <xsl:value-of select="$link" />
        </email>
      </xsl:when>
      <xsl:otherwise>
        <link type="unknown">
          <xsl:value-of select="$link" />
        </link>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:if>
</xsl:template>

I know the XPath expressions used here are quite hairy monsters, but selecting adjacent siblings is not easy in XPath 1.0 (if someone has a better idea of how to do it in XPath 1.0, go ahead and tell me).

not(preceding-sibling::node()[1][self::Link])

means "the immediately preceding node must not be a <Link>", i.e. it matches only <Link> elements that are "first in a row".

following-sibling::Link[
  preceding-sibling::node()[1][self::Link]
  and
  generate-id(current())
  =
  generate-id(
    preceding-sibling::Link[
      not(preceding-sibling::node()[1][self::Link])
    ][1]
  )
]

means

  • from all following-sibling <Link>s, choose the ones that
    • immediately follow a <Link> (i.e. they are not "first in a row"), and
    • whose closest preceding "first in a row" <Link> is the current() node (itself always a <Link> that's "first in a row"); the two are compared with generate-id()

If that makes sense.

Applied to your input, I get:

<p>Here is some random text with an email address <email>example@example.com</email> and here is more random text with a url <ext-link ext-link-type="uri" xlink:href="http://www.example.com" /> near the end of the sentence.</p>
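
Since the question is tagged xslt-2.0 and asks about xsl:for-each-group, here is a minimal sketch of how the same idea might look there. It is not part of the 1.0 solution above and rests on a few assumptions: the split <Link> elements are direct children of <p>, no whitespace sits between them (as the question states), and xlink is bound to the standard XLink namespace URI. The variable name $link is simply reused from the 1.0 version. group-adjacent puts each run of consecutive <Link> nodes into one group, string-join() concatenates their text, and contains() is then applied to the combined string.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xlink="http://www.w3.org/1999/xlink">

  <!-- Identity template: copy everything not handled below. -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="p">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <!-- The grouping key is true() for <Link> elements and false() for
           everything else, so each run of adjacent <Link>s forms one group. -->
      <xsl:for-each-group select="node()" group-adjacent="boolean(self::Link)">
        <xsl:choose>
          <xsl:when test="current-grouping-key()">
            <!-- Combined text of all <Link> elements in this run. -->
            <xsl:variable name="link" select="string-join(current-group(), '')"/>
            <xsl:choose>
              <xsl:when test="contains($link, '://')">
                <ext-link ext-link-type="uri" xlink:href="{$link}">
                  <xsl:value-of select="$link"/>
                </ext-link>
              </xsl:when>
              <xsl:when test="contains($link, '@')">
                <email>
                  <xsl:value-of select="$link"/>
                </email>
              </xsl:when>
              <xsl:otherwise>
                <link type="unknown">
                  <xsl:value-of select="$link"/>
                </link>
              </xsl:otherwise>
            </xsl:choose>
          </xsl:when>
          <xsl:otherwise>
            <!-- Non-Link nodes pass through the identity template. -->
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Unlike the 1.0 version, this sketch also writes the URL as the text content of <ext-link>, as in the desired output. One caveat applies to both approaches: two genuinely separate links that happen to sit directly next to each other would be merged into one.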