Exploring the SAP Signavio Open Dataset with hundreds of thousands of Process Models

SAP Signavio provides the Business Process Management Academic Initiative platform for process modeling, governance, and analysis free of charge to the academic community. As an additional means to facilitate academic research and to give back to the academic community, we have made many of the process models, as well as other business models created on the platform, available for non-commercial research purposes on an open data sharing platform. If you are a researcher interested in the SAP Signavio Academic Models (SAP-SAM) dataset, you can access it on zenodo.org. 

This blog post gives a basic introduction to the dataset and provides use case ideas as well as usage recommendations. 

SAP Signavio Academic Models contains around 600,000 models of business processes, business decisions, and related notions (this count does not include auto-created example models). About 70,000 users have created the models over the course of around ten years. We can assume with reasonable confidence that SAP-SAM is by far the largest openly available process model dataset. 

The most frequently occurring models in the dataset are Business Process Model and Notation (BPMN) models, as well as Decision Model and Notation (DMN) models and value chains (which depict high-level perspectives on business processes). In addition, models in a plethora of other notations exist, ranging from Case Management Model and Notation (CMMN) models (the less appreciated sibling of BPMN and DMN) via old-school process models created as Event-driven Process Chains to Petri nets. 

The models in the dataset are serialized in an SAP Signavio-specific data format but are conceptually standard-compliant (for models whose underlying notation is in fact standardized). Example code that shows how to work with the models is available on GitHub. 

Running a quick analysis of the dataset can provide intriguing insights into how (process) modeling languages are used, as well as into the community that has contributed the models. Below, we provide two example insights, focusing on BPMN as the most popular notation. 

Example 1 – Natural Language Use 

Using a natural language analysis tool, we determined the language of the model names of all BPMN models. Unsurprisingly, most models are written (or, more precisely, named) in English, today’s lingua franca. German as the second most prevalent language is no surprise either, considering that SAP Signavio is primarily Germany-based and has its roots in German academia. However, it is interesting that Italian and Spanish models are more prevalent than French ones, even though French has a head start: a small set of French example models is created whenever a user registers a new modeling repository (workspace) with French as the default language, whereas for any default language other than French or German, these example models are created in English. This indicates that the Italian and Spanish academic communities are apparently substantially more active than the French community. Another surprise is that Estonian and Slovenian are among the top ten languages, although their speaker populations are relatively small. This can be explained by the fact that both Estonia and Slovenia have a vibrant academic business process management community, led by influential and internationally renowned professors. 


Number of BPMN models per language.
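
A tally like the one in the figure can be sketched in a few lines of Python. This is only an illustration, not the analysis code behind the figure: the post does not say which language analysis tool was used, so the detector is passed in as a function, and `count_languages` and `toy_detect` are hypothetical names introduced here. The toy detector merely stands in for a real one so that the example runs on its own.

```python
from collections import Counter

def count_languages(model_names, detect):
    """Count models per detected language.

    `detect` is any callable that maps a model name to a language code.
    Empty or missing names are skipped because they carry no signal.
    """
    counts = Counter()
    for name in model_names:
        name = (name or "").strip()
        if not name:
            continue
        counts[detect(name)] += 1
    # Most frequent language first.
    return counts.most_common()

def toy_detect(name):
    """Stand-in detector for this example only; not a real language model."""
    german_hints = {"bestellung", "rechnung", "prozess"}
    words = name.lower().split()
    return "de" if any(word in german_hints for word in words) else "en"

names = [
    "Order to cash",
    "Rechnung prüfen",
    "Hiring process",
    "",
    "Bestellung anlegen",
    "Travel approval",
]
language_counts = count_languages(names, toy_detect)
print(language_counts)
```

Model names are short, so any real detector will misclassify some of them; counts for rarely occurring languages should be read with that in mind.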


Example 2 – BPMN Element Use 

We have also plotted the prevalence of the most popular element types in the collection’s BPMN models. Here, we can see that most element types are used by only a small subset of the models. Looking at the long tail of the distribution (not shown in the diagram), we can see that more than 33 element types are used in fewer than 1% of the models. This observation supports the well-known practical assumption that the full set of BPMN elements is unnecessarily large for most use cases (hence, SAP Signavio Process Manager supports the configuration of element subsets to facilitate coherently simple models and convenient modeling). 


Prevalence of element types across all BPMN models.
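
To make the notion of prevalence concrete, here is a minimal Python sketch that computes, for each element type, the share of models that use it at least once. It assumes that each model has already been parsed into a dictionary in which shapes are nested under a `childShapes` key and carry their type in `stencil.id`. These key names are assumptions about the SAP Signavio-specific format and should be checked against the example code on GitHub. The function names and the toy models are made up for illustration.

```python
from collections import Counter

def iter_shapes(shape):
    """Yield every shape nested below `shape` (assumed key: 'childShapes')."""
    for child in shape.get("childShapes", []):
        yield child
        yield from iter_shapes(child)

def element_types(model):
    """Return the set of element type names that occur in one model."""
    return {
        s["stencil"]["id"]
        for s in iter_shapes(model)
        if "id" in s.get("stencil", {})
    }

def element_type_prevalence(models):
    """Map each element type to the share of models using it at least once."""
    models = list(models)
    counts = Counter()
    for model in models:
        # A set per model, so repeated use within one model counts once.
        counts.update(element_types(model))
    total = len(models)
    return {t: n / total for t, n in counts.items()} if total else {}

def rare_types(prevalence, threshold=0.01):
    """Element types used by fewer than `threshold` of the models."""
    return sorted(t for t, share in prevalence.items() if share < threshold)

# Two tiny hand-made models; the second nests a task inside a pool.
toy_models = [
    {"childShapes": [
        {"stencil": {"id": "StartNoneEvent"}},
        {"stencil": {"id": "Task"}},
    ]},
    {"childShapes": [
        {"stencil": {"id": "Pool"},
         "childShapes": [{"stencil": {"id": "Task"}}]},
    ]},
]
prevalence = element_type_prevalence(toy_models)
print(prevalence)
print(rare_types(prevalence, threshold=0.6))
```

With the default threshold of 0.01, `rare_types` returns the long tail described above: the element types that occur in fewer than 1% of the models.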

The SAP-SAM dataset can be used to develop and evaluate a broad range of concepts, algorithms, and software artifacts that are designed to either explicitly address business process model management and analytics needs, or more generally the management of large knowledge bases. As process management-specific examples, the following use cases are worth highlighting: 

  • Modeling pattern identification: What are typical ‘best practice’ patterns that modelers use, and what are typical mistakes that they make? 
  • Modeling recommendation generation: Given an unfinished/incomplete process model, what next changes should a modeler reasonably apply? 
  • Process querying: How can one effectively and efficiently obtain useful and precise information from large, messy, and potentially inconsistent collections of process models? 

When working with the dataset, we recommend the following: 

  • It typically makes sense to filter out the auto-generated example processes that are part of the dataset. 
  • Researchers may want to filter out very small or very large models. For this, the number of nodes (e.g., the tasks, events, and gateways in a process model) is a good proxy. 
  • For some use cases, researchers may want to remove models in which many task labels are very short; such models typically represent technical modeling exercises or strongly simplified research examples that demonstrate process flow peculiarities that are relevant in a particular theoretical-technical context. 
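
The recommendations above can be sketched as a single filter function in Python. This is a sketch under several assumptions, not the authors' preprocessing: the key names (`childShapes`, `stencil.id`, `properties.name`) and the stencil ids in `TASK_TYPES` and `EDGE_TYPES` are assumptions about the data format, the set of example model names has to be supplied by the user, and the thresholds are arbitrary illustrative values rather than recommended settings.

```python
from statistics import median

# Assumed stencil ids; verify them against the actual data before use.
TASK_TYPES = {"Task"}
EDGE_TYPES = {"SequenceFlow", "MessageFlow"}

def iter_shapes(shape):
    """Yield every shape nested below `shape` (assumed key: 'childShapes')."""
    for child in shape.get("childShapes", []):
        yield child
        yield from iter_shapes(child)

def shape_type(shape):
    return shape.get("stencil", {}).get("id")

def node_count(model):
    """Count all shapes that are not (assumed) edge types."""
    return sum(1 for s in iter_shapes(model) if shape_type(s) not in EDGE_TYPES)

def task_labels(model):
    return [
        (s.get("properties", {}).get("name") or "").strip()
        for s in iter_shapes(model)
        if shape_type(s) in TASK_TYPES
    ]

def keep_model(model, name="", example_names=frozenset(),
               min_nodes=3, max_nodes=200, min_median_label_len=4):
    """Apply the three filters; all thresholds are illustrative defaults."""
    if name in example_names:           # auto-generated example process
        return False
    if not min_nodes <= node_count(model) <= max_nodes:
        return False                    # very small or very large model
    labels = task_labels(model)
    if labels and median(len(label) for label in labels) < min_median_label_len:
        return False                    # mostly very short task labels
    return True

def shape(kind, name=""):
    """Build a toy shape for the examples below."""
    return {"stencil": {"id": kind}, "properties": {"name": name}}

realistic = {"childShapes": [
    shape("StartNoneEvent"), shape("Task", "Check invoice"),
    shape("Task", "Approve payment"), shape("EndNoneEvent"),
    shape("SequenceFlow"),
]}
exercise = {"childShapes": [
    shape("StartNoneEvent"), shape("Task", "A"), shape("Task", "B"),
    shape("Task", "C"), shape("EndNoneEvent"),
]}
tiny = {"childShapes": [shape("Task", "Do something")]}

print([keep_model(m) for m in (realistic, exercise, tiny)])
# prints [True, False, False]
```

The median label length is used here instead of the mean so that a single long label does not mask a model whose remaining tasks are named "A", "B", and "C".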

Given the academic nature of SAP-SAM, it is important to acknowledge some limitations when working with the dataset. The models have been created by students, teachers, and researchers for academic purposes and are in many respects not representative of the models created by real-world organizations. In particular, real-world models typically serve a broader organizational objective, whereas academic models may focus on technical aspects that a student is learning or a researcher is investigating. 

The SAP Signavio Academic Models dataset contains what is, to date, the largest openly available collection of business process models. Academic researchers can use this dataset to inform the design and evaluation of algorithms and software artifacts for the analysis of process models, and in particular large collections thereof. 

For more details, note that:

  • Example code that shows how researchers can work with the dataset is available on GitHub. 
  • The dataset itself is available on Zenodo. 

This post is based on a paper that the author has co-authored with Diana Sola, Christian Warmuth, Bernhard Schäfer, Peyman Badakhshan, and Jana-Rebecca Rehse (University of Mannheim).