Data W(e)ave On The Beach: Question de performances autour du traitement d'une liste/Performance issues around list processing

Question de performances, épisode 2. Nous avions vu dans un précédant post combien le choix de l'implémentation des objets manipulés (Java plutôt que Json). Nous allons voir ici que le choix de l'algorithme est tout aussi essentiel.

Vous allez me dire: évidemment qu'il faut choisir le "bon" principe algorithmique et nous en tenir à ceux dont la complexité est au maximum d'ordre O(N)log(N). Les choses ne sont pas si simple, car le langage utilisé par Mulesoft, Dataweave, est un langage fonctionnel, dont le fonctionnement présente quelques particularités difficilement anticipables.

Nous allons tester trois implémentations d'une fonctionnalité à priori simple: supprimer des champs d'un objet (ce qui est aussi supprimer les lignes d'une Hashmap ou ce qui en fait office dans Dataweave). La comparaison en terme de performance est éloquente.

La première implémenattion utilise l'opérateur "--". Elle supprime une partie des champs d'un objet. Le nom de ces champs figure dans une liste placée à droite de l'opérateur "--" (variable keys ici):

a -- (keys)

Le code a été coupé en deux parties (deux "Transform" dans le flot Mulesoft) afin de s'assurer que les objets manipulés soient effectivement implémentés sous la forme de HashMap Java, la version la plus performante d'un objet Dataweave. C'est la raison d'être du premier composant de transformation:

%dw 2.0

output application/java

fun Now() = now() then (t)->(t as Number)*1000 + t.milliseconds

fun eval(prs) =

Now() then (t1)-> prs() then (r)->

log(Now() -t1)

then r

---

eval(()->{

o1:(0 to 10000) reduce(v, a={})->a ++ (v as String):v,

k:keysOf((0 to 10000) reduce(v, a={})->

if ((v mod 4)==0) a else a ++ (v as String):v)

}

)

"o1" est la hashmap à laquelle on retirera un quart de ses lignes. Les clés sont stockées dans "k". Nous allons donc retirer 2500 lignes d'une collection qui en contient 10000. Le code pour le faire (second transformeur) à l'avantage d'être d'une extrême simplicité:

%dw 2.0
output application/java
fun Now() = now() then (t)->(t as Number)*1000 + t.milliseconds
fun eval(prs) = 
    Now() then (t1)-> prs() then (r)->
    log(Now() -t1)
    then r
---
eval(()->
     payload.o1 -- payload.k
)

La méthode "eval" calcule le temps nécessaire pour exécuter un code (nombre donné en millisecondes). Que nous indiquent ces méthodes "eval" ? Eh bien çà:

INFO  ... DefaultLoggingService$: 35
INFO  ... DefaultLoggingService$: 3917

Quatre secondes, c'est considérable. Que ce passe-t-il si nous doublons la taille de l'essai ? Eh bien ça:

INFO  ... DefaultLoggingService$: 35
INFO  ... DefaultLoggingService$: 12587

Le premier transformeur s'exécute tellement rapidement que les chiffres affichés ne sont pas significatifs (il peut être très fortement impacté par le garbage collector par exemple).

Le problème s'aggrave considérablement, c'est à dire très au delà d'un rapport 2. Clairement, le fonctionnement sous-jacent de l'opérateur "--" est polynomial (O(N2+))).

Tentons une autre approche. Au lieu de supprimer tous les champs d'un coup (opérateur "--"), nous allons les supprimer un par un (operateur "-" ou key est le nom d'un champ):

a - key

Le premier transformeur est strictement le même. Toutes les différences dont dans la seconde snippet:

%dw 2.0
output application/json
fun Now() = now() then (t)->(t as Number)*1000 + t.milliseconds
fun eval(prs) = 
    Now() then (t1)-> prs() then (r)->
    log(Now() -t1)
    then r
---
eval(()->
     payload.k reduce (k, a=payload.o1)-> a - k
)

Pour retirer 2500 lignes parmi 10000, nous obtenons les résultats suivants:

INFO ... DefaultLoggingService$: 47

INFO ... DefaultLoggingService$: 3376

A priori, le résultat est à peu près le même. Doublons la taille de l'essai (5000/20000) pour nous en assurer:

INFO ... DefaultLoggingService$: 37

INFO ... DefaultLoggingService$: 11849

Nous confirmons, les nombres sont du même ordre de grandeur. Le problème vient bien du fait qu'on cherche à retirer un champ d'une hashmap (j'ai bien vérifié o1 est bien une HashMap Java). Or retirer une ligne d'une telle structure est de complexité O(1). Difficile d'expliquer la quadratique...

Essayons quelque chose de très différent. Plutôt que retirer des lignes d'une hashmap, nous allons en fait en recréer une autre en filtrant les lignes qui n'appartiennent pas à une autre hashmap. Cette fois, il faut modifier les deux transformeurs. Le premier génère une liste de 10000 objets que l'on pourra filter (on ne peut filtrer un objet Dataweave/Hashmap Java) et l'objet Dataweave/Hashmap Java qui contient les champs à supprimer.

%dw 2.0
output application/java
fun Now() = now() then (t)->(t as Number)*1000 + t.milliseconds
fun eval(prs) = 
    Now() then (t1)-> prs() then (r)->
    log(Now() -t1)
    then r
---
eval(()->{
		l1:((0 to 10000) reduce(v, a={})->
                    a ++ (v as String):v) pluck (v, k)-> { id:k, value:v },
		o2:(0 to 10000) reduce(v, a={})->
                    if ((v mod 4)==0) a else a ++ (v as String):v
	}
)

Le second transformeur filtre la liste et construit la hashmap avec les champs acceptés:

%dw 2.0
output application/java
fun Now() = now() then (t)->(t as Number)*1000 + t.milliseconds
fun eval(prs) = 
    Now() then (t1)-> prs() then (r)->
    log(Now() -t1)
    then r
---
eval(()->
     payload.l1 reduce (r, a={})->
        if (isEmpty(payload.o2[r.id])) a ++ (r.id):r.value else a
)

L'exécution est beaucoup, beaucoup plus rapide:

INFO  ... DefaultLoggingService$: 48
INFO  ... DefaultLoggingService$: 77

Doublons l'essai pour mesurer l'impact que peut avoir la taille de l'essai:

INFO  ... DefaultLoggingService$: 37
INFO  ... DefaultLoggingService$: 72

On reste dans l'épaisseur du trait, l'impact n'est même pas perceptible.

La meilleure solution étant évidente, tentons de trouver des raisons pour une telle différence. Je ne peux ici qu'avancer des hypothèses, en partant de ce que j'ai appris d'autres technologies utilisant une approche fonctionnelle. Avançons que la raison tient au principe d'immutabilité: une valeur ne peut pas être modifiée, la seule action autorisée et d'en recréer une autre qui reflète la modification demandée. Donc retirer une ligne d'une hashmap implique en fait d'en recréer une autre, plus courte d'une ligne. L'opération n'est donc pas anecdotique si cette hashmap est importante. Sachant cela, la complexité en O(N2) s'explique très bien.

Mais me diriez vous, suppression et ajout sont toutes deux des modifications et nous devrions avoir un comportement de même type en recréant l'objet. Or ce n'est pas le cas. Pourquoi ?

Evidemment, si l'immutabilité - louée de partout comme un puissant pattern de robustesse - s'accompagnait systématiquement d'un tel inconvénient, l'approche fonctionnel, dont elle est un élément essentiel, serait resté un sujet d'étude confiné au monde universitaire. Heureusement, il existe des "trucs" qui permettent de contourner les obstacles. Essentiellement, la plupart des langages fonctionnels implémentent les listes de manière récursives: L2 qui est L1 auquel on ajoute un élément E nouveau, peut être définie de la manière suivante:

Il n'est donc plus nécessaire de recopier L1 ! (qui peut être référencée ailleurs sans danger puisqu'elle est immutable). Ajouter un élément à une liste n'est donc pas un algorithme en O(N) et donc construire la ligne n'est pas un algorithme en O(N2); mais juste O(N).

OK. Mais la, ce qui est généré est une hashmap, pas une liste. Eh bien, pas sûr. En fait la transformation de l'objet construit par notre snippet en une hashmap est probablement l'étape finale pour se conformer à la directive output application/java. Il est fort probable qu'en interne, un objet Dataweave en cours de construction soit aussi implémenté sous la forme d'une liste. Cela expliquerait d'ailleurs que la recherche d'un champ dans un objet Dataweave soit d'ordre O(N) et non d'ordre O(1) comme c'est le cas d'une hashmap Java.

Que déduire de tout ça: eh bien que décidément, Mulesoft nécessite une compréhension profonde de ses mécanismes pour pouvoir être utilisé efficacement. Dès qu'il faut traiter des objets ou listes de grande taille, on a toutes les chances de tomber dans un piège ou un autre et voir les performances s'effondrer, si on ne possède pas de base solides en algorithmie et en programmation fonctionnelle. Les erreurs ou approximation ne pardonnent pas. Pour le prouver, le prochain post concernera un flot capable de détecter des changements dans une liste d'un million de lignes. Spoiler: avec la bonne implémentation, on parle de secondes pour un tel traitement, au pire de minutes, pas d'heures !

_________________________________________________________________________

Performance issues, episode 2. In a previous post, we saw how important it is to choose the implementation of the manipulated objects (Java rather than Json). We'll see here that the choice of algorithm is just as essential.

You'll tell me: of course we have to choose the “right” algorithmic principle and stick to those whose complexity is of order O(N)log(N) at most. Things aren't quite so simple, however, because the language used by Mulesoft, Dataweave, is a functional language, with a few peculiarities that are difficult to anticipate.

We're going to test three implementations of an apparently straightforward functionality: deleting fields from an object (which is also deleting rows from a Hashmap, or what Dataweave calls a Hashmap). The comparison in terms of performance speaks for itself.

The first implementation uses the “--” operator. It deletes part of an object's fields. The name of these fields appears in a list to the right of the “--” operator operator (variable keys here):

a -- (keys)

The code has been split into two parts (two “Transforms” in the Mulesoft flow) to ensure that the objects manipulated are actually implemented in the form of a Java HashMap, the most powerful version of a Dataweave object. This is the purpose of the first transformation component:

%dw 2.0

output application/java

fun Now() = now() then (t)->(t as Number)*1000 + t.milliseconds

fun eval(prs) =

Now() then (t1)-> prs() then (r)->

log(Now() -t1)

then r

---

eval(()->{

o1:(0 to 10000) reduce(v, a={})->a ++ (v as String):v,

k:keysOf((0 to 10000) reduce(v, a={})->

if ((v mod 4)==0) a else a ++ (v as String):v)

}

)

“o1” is the hashmap from which we will remove a quarter of its lines. The keys are stored in “k”. So we're going to remove 2,500 lines from a collection containing 10,000. The code to do this (second transformer) has the advantage of being extremely simple:

%dw 2.0
output application/java
fun Now() = now() then (t)->(t as Number)*1000 + t.milliseconds
fun eval(prs) = 
    Now() then (t1)-> prs() then (r)->
    log(Now() -t1)
    then r
---
eval(()->
     payload.o1 -- payload.k
)

The “eval” method calculates the time needed to execute a code (given in milliseconds). What do these “eval” methods tell us? Well, this:

INFO ... DefaultLoggingService$: 35

INFO ... DefaultLoggingService$: 3917

Four seconds is a long time. What happens if we double the trial size? Well, this:

INFO  ... DefaultLoggingService$: 35
INFO  ... DefaultLoggingService$: 12587

The first transformer runs so fast that the numbers displayed are not significant (it may be heavily impacted by garbage collection, for example).

The problem worsens considerably, i.e. far beyond a ratio of 2. Clearly, the underlying operation of the “--” operator is polynomial (O(N2+))).

Let's try another approach. Instead of deleting all the fields at once (“--” operator), we'll delete them one by one (“-” operator, where key is the name of a field:

a - key

The first transformer is strictly the same. All the differences are in the second snippet:

%dw 2.0
output application/json
fun Now() = now() then (t)->(t as Number)*1000 + t.milliseconds
fun eval(prs) = 
    Now() then (t1)-> prs() then (r)->
    log(Now() -t1)
    then r
---
eval(()->
     payload.k reduce (k, a=payload.o1)-> a - k
)

To remove 2500 lines from 10000, we obtain the following results:

INFO ... DefaultLoggingService$: 47

INFO ... DefaultLoggingService$: 3376

A priori, the result is more or less the same. Let's double the test size (5000/20000) to be sure:

INFO ... DefaultLoggingService$: 37

INFO ... DefaultLoggingService$: 11849

We confirm that the numbers are of the same order of magnitude. The problem comes from the fact that we're trying to remove a field from a hashmap (I've checked that o1 is a Java HashMap). Removing a row from such a structure is O(1) complex. Hard to explain the quadratic...

Let's try something very different. Rather than removing rows from one hashmap, we'll actually recreate another by filtering out rows that don't belong to another hashmap. This time, we need to modify both transformers. The first generates a list of 10,000 objects that can be filtered (a Dataweave/Hashmap Java object cannot be filtered) and the Dataweave/Hashmap Java object containing the fields to be removed.

%dw 2.0
output application/java
fun Now() = now() then (t)->(t as Number)*1000 + t.milliseconds
fun eval(prs) = 
    Now() then (t1)-> prs() then (r)->
    log(Now() -t1)
    then r
---
eval(()->{
		l1:((0 to 10000) reduce(v, a={})->
                    a ++ (v as String):v) pluck (v, k)-> { id:k, value:v },
		o2:(0 to 10000) reduce(v, a={})->
                    if ((v mod 4)==0) a else a ++ (v as String):v
	}
)

The second transformer filters the list and builds the hashmap with the accepted fields:

%dw 2.0
output application/java
fun Now() = now() then (t)->(t as Number)*1000 + t.milliseconds
fun eval(prs) = 
    Now() then (t1)-> prs() then (r)->
    log(Now() -t1)
    then r
---
eval(()->
     payload.l1 reduce (r, a={})->
        if (isEmpty(payload.o2[r.id])) a ++ (r.id):r.value else a
)

Execution is much, much faster:

INFO  ... DefaultLoggingService$: 48
INFO  ... DefaultLoggingService$: 77

Let's duplicate the test to measure the impact of test size:

INFO  ... DefaultLoggingService$: 37
INFO  ... DefaultLoggingService$: 72

The impact is not even perceptible.

The best solution being obvious, let's try to find reasons for such a difference. I can only speculate here, based on what I've learned from other technologies using a functional approach. Let's put forward that the reason lies in the principle of immutability: a value cannot be modified, the only action allowed is to recreate another one that reflects the requested modification. So removing a line from a hashmap actually means recreating another one, one line shorter. The operation is therefore not trivial if the hashmap is large. With this in mind, the complexity in O(N2) is easy to explain.

But you might say, deleting and adding are both modifications, and we should have the same type of behavior when recreating the object. But this is not the case. Why isn't this the case?

Obviously, if immutability - praised everywhere as a powerful robustness pattern - were systematically accompanied by such a drawback, the functional approach, of which it is an essential element, would remain a subject of study confined to the academic world. Fortunately, there are a number of “tricks” that can be used to overcome these obstacles. Essentially, most functional languages implement lists recursively: L2, which is L1 to which we add a new element E, can be defined as follows:

So there's no need to copy L1! (which can safely be referenced elsewhere, since it is immutable). So adding an element to a list is not an O(N) algorithm, and building the line is not an O(N2) algorithm; it's just O(N).

OK. But then, what's generated is a hashmap, not a list. Well, not sure. In fact, transforming the object built by our snippet into a hashmap is probably the final step in complying with the output application/java directive. It's highly likely that a Dataweave object under construction is also implemented as a list. This would explain why the search for a field in a Dataweave object is of order O(N) and not of order O(1), as is the case with a Java hashmap.

What can we deduce from all this? Well, Mulesoft definitely requires a deep understanding of its mechanisms to be used effectively. As soon as you have to deal with large objects or lists, there's every chance of falling into one trap or another and seeing performance plummet, if you don't have a solid grounding in algorithmic and functional programming. Errors and approximations are unforgiving. To prove this, the next post will concern a stream capable of detecting changes in a list of a million lines. Spoiler: with the right implementation, we're talking seconds for such processing, at worst minutes, not hours.

Data W(e)ave On The Beach

mercredi 8 janvier 2025

Question de performances autour du traitement d'une liste/Performance issues around list processing

Aucun commentaire:

Enregistrer un commentaire

Pourquoi ce blog ? / Why this blog?

Signaler un abus