Friday, February 2, 2024

Breaking text into lines

We now turn to a recurring problem that is less simple than it looks: breaking a text into lines of a given maximum length while, as far as possible, breaking only between words. A typical use case is splitting an address into address lines. We add one constraint: if a word is longer than the line, it is split inside the word, once for every full line length it contains (so with 40-character lines, a 100-character word is split into 40, then 40, then 20 characters).
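To make that last rule concrete, here is a minimal DataWeave sketch of the hard split alone, applied to a 100-character word with 40-character lines. The helper name chunks is hypothetical and is not part of the script in this post; it uses range selectors rather than substring, and it assumes a DataWeave 2.x runtime where repeat from dw::core::Strings is available.

```dataweave
%dw 2.0
import repeat from dw::core::Strings
output application/json

// Hypothetical helper (not from the post): splits one word into
// chunks of at most `max` characters, keeping every character.
fun chunks(word: String, max: Number): Array<String> =
    if (sizeOf(word) <= max) [word]
    else [word[0 to (max - 1)]] ++ chunks(word[max to -1], max)
---
// Lengths of the chunks of a 100-character word
chunks(repeat("x", 100), 40) map sizeOf($)
```

Under these assumptions the result should be the three lengths given in the rule above: 40, 40 and 20.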

As an example, let's take an excerpt from Harry Potter, in which the spaces of one sentence have been removed to create a word long enough to need splitting:

Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but Privet Drive had hardly changed at all. ThesunroseonthesametidyfrontgardensandlitupthebrassnumberfourontheDursleys'frontdoor; it crept into their living room, which was almost exactly the same as it had been on the night when Mr. Dursley had seen that fateful news report about the owls. Only the photographs on the mantelpiece really showed how much time had passed.

We should obtain the following: an array of lines, none of which is longer than 40 characters.

[
  "Nearly ten years had passed since the",
  "Dursleys had woken up to find their",
  "nephew on the front step, but Privet",
  "Drive had hardly changed at all.",
  "Thesunroseonthesametidyfrontgardensandli",
  "tupthebrassnumberfourontheDursleys'front",
  "door; it crept into their living room,",
  "which was almost exactly the same as it",
  "had been on the night when Mr. Dursley",
  "had seen that fateful news report about",
  "the owls. Only the photographs on the",
  "mantelpiece really showed how much time",
  "had passed."
]

The DataWeave script that performs the split is as follows:

%dw 2.0
import * from dw::core::Strings
output application/json

var LINE_LENGTH = 40

// Splits a word longer than l into chunks of l characters.
// "lines" receives the full chunks, "line" the remainder.
fun cut(s, l) =
    if (sizeOf(s) <= l)
        {
            lines: [],
            line: s
        }
    else do {
        // substring's end index is exclusive, so the rest starts at l
        // (starting at l + 1 would drop one character per chunk)
        var rest = cut(substring(s, l, sizeOf(s)), l)
        ---
        {
            lines: [substring(s, 0, l)] ++ rest.lines,
            line: rest.line
        }
    }
---
payload
    splitBy(" ")
    reduce ((v, a = {
        lines: [],
        line: ""
    }) ->
        // the current line is empty and the word fits
        if (a.line == "" and sizeOf(v) <= LINE_LENGTH) {
            lines: a.lines,
            line: v
        }
        else
        // the word fits at the end of the current line
        if (a.line != "" and sizeOf(a.line ++ " " ++ v) <= LINE_LENGTH) {
            lines: a.lines,
            line: a.line ++ " " ++ v
        }
        else // too big words 1/2: close the current line, then cut the word
        if (a.line != "" and sizeOf(v) > LINE_LENGTH) {
            lines: (a.lines + a.line) ++ cut(v, LINE_LENGTH).lines,
            line: cut(v, LINE_LENGTH).line
        }
        else // too big words 2/2: no current line to close, just cut the word
        if (a.line == "" and sizeOf(v) > LINE_LENGTH) {
            lines: a.lines ++ cut(v, LINE_LENGTH).lines,
            line: cut(v, LINE_LENGTH).line
        }
        else
        // the word fits on a line but not on this one: start a new line
        {
            lines: a.lines + a.line,
            line: v
        })
    then (v) -> v.lines + v.line

Compared with what we have seen in previous posts, this script introduces no notable new technique: it is an advanced but classic use of a reduction, provided you know how to write a non-trivial one. The text is first split into words with splitBy. The list of words is then passed to the reduction, whose only real difficulty is the handling of words that are too long. Note that the accumulator holds both the lines already completed ("lines") and the line currently being built ("line"). The closing function:

then (v) -> v.lines + v.line

appends the line still being built to the list. I think you can use this code as is.
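Before reusing the script, it may be worth checking its output against two properties that follow from the problem statement: no line is longer than the maximum, and no character other than the separating spaces is lost. The sketch below is one possible way to do that. The function name check and the sample values are hypothetical and do not come from the post; it assumes every from dw::core::Arrays is available in your DataWeave version.

```dataweave
%dw 2.0
import every from dw::core::Arrays
output application/json

// Hypothetical checker (not from the post): `text` is the input,
// `lines` the wrapped result, `max` the maximum line length.
fun check(text: String, lines: Array<String>, max: Number) = {
    // every line respects the maximum length
    noLineTooLong: lines every (sizeOf($) <= max),
    // joining the lines without spaces gives back the input without spaces
    noCharacterLost: (lines joinBy "") == (text replace " " with "")
}
---
// Hand-made sample: "cdefgh" is longer than 4 and is split inside the word
check("ab cdefgh ij", ["ab", "cdef", "gh", "ij"], 4)
```

Both fields should be true for a correct wrap. The second check is the one that would flag a character dropped when a long word is split.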
