Data W(e)ave On The Beach: Accès par clé et Performance / Key access and Performance

Mulesoft est une plateforme qui offre des performances très honorables à partir du moment où l'on sait s'en servir. Le souci est qu'à part de vagues conseils très généraux (et peu utiles pour qui fait preuve d'un peu de bon sens), très peu d'articles sont disponibles sur Internet à ce sujet. Et bien en voici un pour combler ce manque et qui peut GRANDEMENT vous aider. Il y en aura d'autres lorsque je ferai de nouvelles "découvertes". Il s'agit d'un article à la fois concret et pointu, et qui sur les sujets traiter permet de gagner ENORMEMENT (x10, x100, x1000!) en performance.

Un besoin très, très récurent est de faire une recherche dans une liste à partir d'une clé. Par exemple, pour s'assurer qu'un objet d'un ID donné est présent et d'aller consulter les informations qu'il recèle. Pour effectuer cette recherche, la plupart des développeurs vont écrire quelque chose comme:

%dw 2.0
output application/json
fun search(list, id) = list filter (v)->v.id==id
---
payload search 86

Pour retrouver dans la liste en payload, l'objet d'ID 86. Voici une TRES TRES mauvaise implémentation de ladite fonctionnalité ! NE FAITES PAS CA ! Pourquoi ? Parce que Mulesoft va parcourir tous les éléments de la liste, pour n'en récupérer qu'un seul ! Si vous utilisez "search" à l'intérieur d'un "map" ou d'un "reduce", vous aller probablement exécuter un processus quadratique (type NxN) et vos performances vont s'effondrer. C'est le problème de performance N°1 dans la plupart des projets. C'est un vrai attrape nigaud: en test avec 10 à 100 éléments on a 100 à 10000 opérations élémentaires à exécuter et cela prend de quelques centaines de millisecondes au pire. En PROD, avec un million d'éléments, nous nous retrouvons avec 1000 milliards d'opérations !

La bonne solution est de créer une "HashMap" qui référence chaque objet en fonction de son ID et ensuite de les rechercher en fonction de l'ID qui sert de clé:

%dw 2.0

output application/json

var hashmap = payload groupBy $.id

fun search(hm, id) = hm[id][0]

---

payload map (o)->search(hashmap, o.id)

Qu'est ce qu'une "HashMap" dans Mulesoft ? Et bien, c'est tout simplement un objet. Une clé est le nom d'un champ de cet objet HashMap, les valeurs de ces champs, sont les objets de la liste initiale. Autant de champs, autant de clés et donc autant d'objets référencés. Comment transformer une liste en un objet et donc en une "HashMap" ? Facile, on utilise la fonction groupBy qui "regroupe" les objets en fonction d'un critère, ici tout simplement l'ID. La valeur de chaque champ est un tableau d'un seul élément (car les ID étant discriminants, le regroupement ne contient que l'objet propriétaire), d'où le [0] dans la fonction "search".

La recherche par HashCode est d'ordre constant (1.1 si certaines conditions sont remplies: la taille de la HashMap est un nombre premier, la fonction de hachage est uniforme, la table est remplie à 75%). Donc tout processus garde son ordre (X x ~1 = ~X), s'il fait référence à cette fonction de recherche. Noter qu'il faut compter l'ordre de la création de la HashMap dans la complexité de l'algorithme, qui est linéaire (ordre N) Ce qui donne dans notre cas N+N = 2N. Cela reste linéaire et donc performant (les esprits chagrins me feront remarquer que la recherche d'un unique élément dans la liste est plus performant avec "filter" et ils auront raison mais ...juste pour ce cas la !).

Tout va bien alors ? Ben non. Ou plutôt cela dépend d'un autre facteur: la plateforme d'implémentation. Si la HashMap est en "JSON", les performances sont catastrophiques (pour de grandes listes). En "Java", c'est très performant. Pourquoi ? Parce que la résolution de [id] en JSON provoque le parcourt de tous les champs (et donc cette recherche est d'ordre N), mais en Java, l'objet est une LinkedHashMap et l'accès est par hashcode (ordre constant). Le code présenté plus haut, donc n'est hélas pas performant car quadratique. Il faut écrire plutôt:

%dw 2.0
output application/java
---
payload groupBy $.id

Dans un premier Transform afin d'avoir la HashMap

%dw 2.0
output application/json
var hashmap = payload
fun search(hm, id) = hm[id][0]
---
payload map (o)->search(hashmap, o.id)

Que l'on utilise dans un second Transform.

J'ai écrit un petit programme de test qui mesure tout çà et je vous présente les résultats (le délai qui nous intéresse est celui qui sépare les timestamps C et D).

Si j'utilise JSON, voila ce que j'obtiens:

Que constate t'on ?

Pour une liste de 1000 éléments, le délai est de 83ms
Pour une liste de 10000 éléments, le délai est de 2,5 secondes
Pour une liste de 100000 éléments, le délai est de 4,47 minutes!

Passons à Java:

Que constate t'on ?

Pour une liste de 1000 éléments, le délai est de 67ms
Pour une liste de 10000 éléments, le délai est de 84 ms
Pour une liste de 100000 éléments, le délai est de 173 ms (si, si)
Pour une liste d'un million (!) d'éléments : 2.5 s

Conclusion: le choix de l'implémentation de "l'objet" qui nous sert de HashMap est CRUCIAL. On change d'ordre de l'algorithme et donc les performances bondissent avec la taille de la liste a créer. On passe de l'impossible au possible, rien de moins. Ma question est: pourquoi cette information qui devrait être sur le site de Mulesoft en première page, taille 50em, rouge clignotant est NULLE PART ?

_________________________________________________________________________

Mulesoft is a platform that offers very respectable performance, provided you know how to use it. The problem is that, apart from vague general advice (of little use to anyone with a little common sense), there are very few articles on the subject available on the Internet. Well, here's one to fill the gap, and one that can help you a LOT. There will be more as I make new "discoveries". It's an article that's both concrete and to the point, and which, on the subjects dealt with, will help you gain ENORMOUSLY (x10, x100, x1000!) in performance.

A very, very recurrent need is to search a list using a key. For example, to ensure that an object with a given ID is present, and to consult the information it contains. To perform this search, most developers will write something like:

%dw 2.0
output application/json
fun search(list, id) = list filter (v)->v.id==id
---
payload search 86

To find the ID 86 object in the payload list. This is a VERY VERY poor implementation of this feature! DON'T DO IT! Why not? Because Mulesoft will go through all the elements in the list, and only retrieve one! If you use "search" inside a "map" or a "reduce", you'll probably run a quadratic process (NxN type) and your performance will plummet. This is the No. 1 performance problem in most projects. It's a real eye-catcher: in testing, with 10 to 100 elements, you have 100 to 10,000 elementary operations to execute, taking from a few hundred milliseconds at worst. In PROD, with a million elements, we end up with 1,000 billion operations!

The best solution is to create a "HashMap" that references each object according to its ID, and then search for them according to the ID that serves as the key:

%dw 2.0

output application/json

var hashmap = payload groupBy $.id

fun search(hm, id) = hm[id][0]

---

payload map (o)->search(hashmap, o.id)

What is a "HashMap" in Mulesoft? Well, it's simply an object. A key is the name of a field in this HashMap object, and the values of these fields are the objects in the initial list. As many fields, as many keys and therefore as many referenced objects. How do you transform a list into an object, and therefore into a HashMap? Easily, we use the groupBy function, which "groups" objects according to a criterion, in this case simply the ID. The value of each field is an array of a single element (because IDs are discriminating, the grouping contains only the owner object), hence the [0] in the "search" function.

HashCode search is of constant order (1.1 if certain conditions are met: the size of the HashMap is a prime number, the hash function is uniform, the table is 75% full). So any process keeps its order (X x ~1 = ~X), if it refers to this search function. Note that the order of the HashMap creation must be taken into account in the complexity of the algorithm, which is linear (order N). In our case, this gives N+N = 2N. It's still linear and therefore efficient (those with a grudge will point out that the search for a single element in the list is more efficient with "filter", and they'll be right, but that's just the case!)

So all's well then? Well, no. Or rather, it depends on another factor: the implementation platform. If the HashMap is in "JSON", performance is catastrophic (for large lists). In "Java", it performs very well. Why is this? Because resolving [id] in JSON causes all fields to be searched (and so this search is of order N), but in Java, the object is a LinkedHashMap and access is by hashcode (constant order). Unfortunately, the code presented above is not efficient, as it is quadratic. Instead, write:

%dw 2.0

output application/java

---

payload groupBy $.id

First Transform to obtain the HashMap

%dw 2.0
output application/json

var hashmap = payload

fun search(hm, id) = hm[id][0]

---

payload map (o)->search(hashmap, o.id)

Which we use in a second Transform.

I've written a little test program that measures all this, and here are the results (the delay we're interested in is the one between timestamps C and D).

If I use JSON, here's what I get:

What do we see?

For a list of 1000 elements, the delay is 83ms
For a list of 10000 elements, the delay is 2.5 seconds.
For a list of 100000 elements, the delay is 4.47 minutes!

Let's move on to Java:

What do we see?

For a list of 1000 elements, the delay is 67ms
For a list of 10000 elements, the delay is 84ms
For a list of 100,000 elements, the delay is 173 ms (yes, yes)
For a list of a million (!) elements: 2.5 s

Conclusion: the choice of implementation of the "object" that serves as our HashMap is CRUCIAL. We're changing the order of the algorithm, so performance jumps with the size of the list to be created. We go from the impossible to the possible, nothing less. My question is: why is this information, which should be on the front page of the Mulesoft website, size 50em, flashing red, NOWWHERE to be found?

Data W(e)ave On The Beach

jeudi 8 février 2024

Accès par clé et Performance / Key access and Performance

Aucun commentaire:

Enregistrer un commentaire

Pourquoi ce blog ? / Why this blog?

Signaler un abus