Data W(e)ave On The Beach: Grouping and structuring

Here is an article that should come in handy: how to transform a CSV into a list of structured objects. Some will say that this is trivial with MuleSoft: you just switch from application/csv to application/json and you're done. That is indeed the case for strictly tabular data, i.e. where each row represents an object whose fields all hold literal values. But what if the object is structured, i.e. if N CSV rows represent ONE object with a field whose value is a sub-list of N objects? In that case a transformation is required, and the principle behind it is the subject of this article.
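
For reference, the purely tabular case mentioned above needs no real logic. A minimal sketch, assuming the payload has already been read as application/csv with a header row:

```dataweave
%dw 2.0
output application/json
---
// Each CSV row becomes one flat JSON object; no grouping takes place.
payload
```

With the forest data used below, this would give eight flat objects, one per CSV row, with the tree columns repeated on every branch row. That repetition is what the rest of the article removes.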

Let's get concrete. Consider the following sylvan example, which describes trees in a forest. The common columns are forest, tree, trunk and root. The columns that go into the secondary objects (the branches of the trees) are branch and leaves. Finally, the columns that identify a tree are forest and tree (which are also common columns). Note that forest names are case-sensitive here: f and F are two different forests. So this is the input:

forest,tree,trunk,root,branch,leaves
f,A,12cm,3m,1,110
f,A,12cm,3m,2,220
F,A,18cm,8m,1,150
F,A,18cm,8m,2,270
f,B,110cm,12m,1,310
f,B,110cm,12m,2,320
f,B,110cm,12m,3,340
f,C,16cm,4m,1,230

And this is the output we want:

[
  {
    "forest": "f",
    "tree": "A",
    "trunk": "12cm",
    "root": "3m",
    "branches": [
      {
        "branch": "1",
        "leaves": "110"
      },
      {
        "branch": "2",
        "leaves": "220"
      }
    ]
  },
  {
    "forest": "F",
    "tree": "A",
    "trunk": "18cm",
    "root": "8m",
    "branches": [
      {
        "branch": "1",
        "leaves": "150"
      },
      {
        "branch": "2",
        "leaves": "270"
      }
    ]
  },
  {
    "forest": "f",
    "tree": "B",
    "trunk": "110cm",
    "root": "12m",
    "branches": [
      {
        "branch": "1",
        "leaves": "310"
      },
      {
        "branch": "2",
        "leaves": "320"
      },
      {
        "branch": "3",
        "leaves": "340"
      }
    ]
  },
  {
    "forest": "f",
    "tree": "C",
    "trunk": "16cm",
    "root": "4m",
    "branches": [
      {
        "branch": "1",
        "leaves": "230"
      }
    ]
  }
]

The function that builds the structured objects is shown below. It was written to be generic for objects that contain a single sub-list of simple objects (i.e. objects that are not themselves structured), so that it can be used as is. This is the most common case, so the function should be sufficient most of the time. It will need to be adapted to generate objects containing several sub-lists, or objects whose sub-list items have sub-lists of their own.

This function takes four parameters:

lst: the list of raw tabular objects read from the CSV
id: an array containing the names of the fields that identify an object
common: an array containing the names of the common fields (i.e. those to be placed in the root objects). Fields that are not in this list are automatically assigned to the included objects.
sublist: the name of the field of the main object that holds the list of included objects.

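%dw 2.0
output application/json

// Groups flat CSV rows into structured objects that each hold one sub-list.
fun group(lst, id, common, sublist) = lst
    // 1. Group the rows by a key made of the identifying field values.
    groupBy ((ln) -> (id reduce (fld, acc = "") -> acc ++ ln[fld] ++ ","))
    // 2. For each group, copy the common fields from its first row...
    mapObject ((o, i) -> {(i): (
        (common reduce (v, acc = {}) -> acc ++ {(v): o[0][v]})
            // 3. ...and attach the rows, minus the common fields, as the sub-list.
            ++ {(sublist): o map ((b) -> b -- common)}
    )})
    // 4. Keep only the values of the grouped object, as a list.
    pluck $
---
group(payload, ["forest", "tree"],
    ["forest", "tree", "trunk", "root"], "branches")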
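
As a rough idea of one of the adaptations mentioned earlier (sub-list items that have their own sub-list), the same function can be applied in two passes. The sketch below is only an illustration: the "trees" field name and the forest-level grouping are assumptions made for this example and are not part of the original use case. The function is repeated so that the script is complete on its own.

```dataweave
%dw 2.0
output application/json

// Same function as in the article, repeated so that this script is complete.
fun group(lst, id, common, sublist) = lst
    groupBy ((ln) -> (id reduce (fld, acc = "") -> acc ++ ln[fld] ++ ","))
    mapObject ((o, i) -> {(i): (
        (common reduce (v, acc = {}) -> acc ++ {(v): o[0][v]})
            ++ {(sublist): o map ((b) -> b -- common)}
    )})
    pluck $
---
// First pass: one object per forest. "trees" (hypothetical name) holds the
// forest's rows with the forest column removed.
group(payload, ["forest"], ["forest"], "trees") map ((f) -> {
    forest: f.forest,
    // Second pass: regroup each forest's rows into trees with their branches.
    trees: group(f.trees, ["tree"], ["tree", "trunk", "root"], "branches")
})
```

With the sample data this should produce two forest objects, f (trees A, B and C) and F (tree A), each tree carrying its own branches list. Objects with several sibling sub-lists would need a different change, since the function sends every non-common field to a single sub-list.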

This function works as follows:

CSV rows are first grouped (with groupBy) according to a "key", which is the concatenation of the values of the identifying fields.
Next, each sub-list obtained in the previous step is passed to a reduction that copies the "common" fields of its first row into a new object.
The common fields are then removed from each row of the sub-list (b -- common), and the resulting sub-list is added as a field of the object created in the previous step (the ++ (sublist): part).
Finally, a call to pluck turns the object produced by groupBy and mapObject into a list of its values.
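
To look at the first step in isolation, the grouping keys can be listed on their own. This is a small sketch that reuses the key expression from the function; the id variable is declared here only for the example, and the payload is assumed to be the sample CSV.

```dataweave
%dw 2.0
output application/json
var id = ["forest", "tree"]
---
// Group the rows exactly as the function does, then keep only the keys.
payload
    groupBy ((ln) -> (id reduce (fld, acc = "") -> acc ++ ln[fld] ++ ","))
    pluck ((rows, key) -> key as String)
```

For the sample input this should yield the four keys "f,A,", "F,A,", "f,B," and "f,C,", one per tree. Because the values are simply concatenated with commas, an identifying value that itself contains a comma could make two different identifiers produce the same key, and a missing (null) identifying value would likely make the concatenation fail. Both are worth checking before reusing the function on other data.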

Tuesday, 20 February 2024
