How are non-alphanumeric characters handled in the XML, e.g., &, %, and the copyright symbol?
For the character issue, I'll walk through the scenarios we've thought through and how we are handling each:
- Items in an ordered list entered through the WYSIWYG editor will come across as HTML:
    <ol>
      <li>Item 1</li>
      <li>Item 2</li>
    </ol>
- Items entered with a bullet symbol using the WYSIWYG Symbol Library are initially saved as HTML-named entities by the WYSIWYG editor. We convert the HTML-named entity to an HTML-decimal numbered entity.
- Any UTF-8 character (non-alphanumeric) entered via the WYSIWYG editor is saved as an HTML-named entity. We convert all HTML-named entities to their equivalent HTML-decimal numbered entity.
- Any content the editor types directly into the raw HTML and saves as UTF-8 (example: H&HN vs. H&amp;HN), whether in the teaser and body fields or in fields not controlled by the WYSIWYG editor (like Headline and Sub-headline), is converted to HTML-decimal numbered entities.
- And if a user enters the HTML-hexadecimal numbered entity, we convert THAT to HTML-decimal numbered entities, wherever we find it.
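The conversions described in the list above can be sketched in Python. This is only an illustration of the normalization, not our system's actual code: the function name `to_decimal_entities` is hypothetical, and the sketch assumes that bare ASCII characters (including markup such as `<` and a lone `&`) are left untouched.

```python
import re
from html.entities import html5  # maps names like "amp;" to their characters

# Matches named (&amp;), decimal (&#38;) and hexadecimal (&#x26;) entity references.
ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def _normalize(match):
    body = match.group(1)
    if body.startswith("#"):
        digits = body[1:]
        if digits[:1] in ("x", "X"):
            code = int(digits[1:], 16)  # hexadecimal numbered entity
        else:
            code = int(digits)          # already decimal
        return f"&#{code};"
    chars = html5.get(body + ";")
    if chars is None:
        return match.group(0)           # unknown name: leave as-is
    # A named entity can expand to more than one code point.
    return "".join(f"&#{ord(c)};" for c in chars)


def to_decimal_entities(text):
    """Hypothetical helper: rewrite entities and raw non-ASCII
    characters as HTML-decimal numbered entities."""
    text = ENTITY_RE.sub(_normalize, text)
    # Raw non-ASCII characters (e.g. a typed copyright symbol) are converted too.
    return re.sub(r"[^\x00-\x7f]", lambda m: f"&#{ord(m.group(0))};", text)


if __name__ == "__main__":
    print(to_decimal_entities("H&amp;HN &copy; &#x26; \u00a9"))
```

Named entities are looked up in the standard library's HTML5 table, so names it does not recognize pass through unchanged rather than being guessed at.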
Basically, you can count on our system returning the HTML-decimal numbered entity for non-alphanumeric characters.
EXCEPTION: If you have articles that were imported, they may contain the raw ampersand character (&) rather than the HTML-named entity (&amp;). In that case, we do not translate the special character. If you have a specific set of articles that fall into this category, do the following:
- Open the article in the admin tool
- Open the relevant field's (Teaser or Body) WYSIWYG editor
- Close and Save the editor's updates.
- The special characters will now be saved as HTML-named entities.
- Save the article.
- Re-try your export feed.
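To work out which imported articles need the re-save steps above, you could scan a field's raw HTML for ampersands that are not part of an entity. The following is a rough sketch only: `find_bare_ampersands` is a hypothetical helper, not part of the admin tool, and the check is purely syntactic, so text such as `Q&A;` would look like an entity and not be flagged.

```python
import re

# An ampersand that is NOT followed by a named, decimal, or hexadecimal
# entity body and a closing semicolon.
BARE_AMP_RE = re.compile(
    r"&(?!(?:#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);)"
)


def find_bare_ampersands(text):
    """Return the character offsets of ampersands that are not entities."""
    return [m.start() for m in BARE_AMP_RE.finditer(text)]


if __name__ == "__main__":
    for sample in ("H&HN", "H&amp;HN", "A & B &#38; C"):
        print(repr(sample), find_bare_ampersands(sample))
```

An empty result suggests the field already uses entities throughout; any offsets returned point at characters the export will pass through untranslated.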
NOTE: For Teasers, depending on your site, you may need to remove the <p> tags from around the teaser's copy to prevent awkward breaks on the front pages of your site. Example: <p>sample teaser & data</p> becomes just sample teaser & data.

| HTML Decimal Numbered Entity | UTF-8 | HTML Named Entity | HTML Hexadecimal Numbered Entity |
| --- | --- | --- | --- |
| &#38; | & | &amp; | &#x26; |
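
The same four forms (decimal numbered entity, UTF-8 character, named entity, hexadecimal numbered entity) can be worked out for any other character. Here is a small sketch using Python's standard library; `entity_forms` is a hypothetical name, and the named form comes from the HTML 4 entity table, so characters without an HTML 4 name (such as %) get `None` in that column.

```python
from html.entities import codepoint2name  # HTML 4 code point -> entity name


def entity_forms(char):
    """Return the four equivalent representations of a single character."""
    code = ord(char)
    name = codepoint2name.get(code)
    return {
        "decimal": f"&#{code};",
        "utf8": char,
        "named": f"&{name};" if name else None,
        "hex": f"&#x{code:X};",
    }


if __name__ == "__main__":
    for ch in ("&", "%", "\u00a9"):
        print(entity_forms(ch))
```

When checking an export feed, the "decimal" value is the one to expect for each character.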