High performance and high throughput computing (HPC/HTC) is an integral part of scientific progress in many areas of science. Researchers require ever more computing and storage resources to solve increasingly complex problems. But simply scaling up existing resources is not sufficient to handle the ever increasing amount of data. New tools and technologies keep appearing and users are asking for them. As a result of these continuously emerging tools, the software stacks on which jobs rely become increasingly complex, data management is done with more and more sophisticated tools, and the demands of users on HPC/HTC clusters evolve with breathtaking speed. Therefore, the diversity and complexity of services run by HPC/HTC cluster operators has increased significantly over time.

To cope with the increased demands, an ongoing trend to consolidate different communities and fulfil their requirements with larger, commonly operated systems can be observed. This work describes the commissioning and first operational experience gained with an HPC/HTC cluster at the Physikalisches Institut at the University of Bonn. We call this cluster the second generation Bonn Analysis Facility (BAF2) in the following. Occasionally, we compare the setup of this cluster with that of its predecessor (BAF1). Both BAF1 and BAF2 were purchased to perform fundamental research in physics fields ranging from high energy physics (HEP), hadron physics and theoretical particle physics to theoretical condensed matter physics and mathematical physics. BAF1 was a rather conventional cluster whose commissioning started in 2009. It used a TORQUE/Maui-based resource management system whose jobs ran directly on the worker nodes, a Lustre distributed file system without any redundancy (except for RAID 5 disk arrays) for data storage, and an OpenAFS file system to distribute software, which was later supplemented by CVMFS clients (see “CVMFS” for more information on CVMFS). The latter provides software maintained by CERN and HEP collaborations. BAF1 maintenance was characterized by a large collection of home-brewed shell scripts and other makeshift solutions. In contrast, BAF2 is adapted to the increasingly varying demands of the different communities. These requirements and the solutions we chose to tackle them will be discussed in “BAF2 Requirements”, followed by a short overview of the new concepts of this cluster in “Cluster Concept”. After this introduction, we will present the key components of the cluster in depth in “CVMFS”, “Containerization”, “CephFS”, “XRootD”, “Cluster Management” and “HTCondor”. The paper will conclude with a presentation of experiences and observations collected in the first two years of operation in “Operational Experience”.

Benchmarks of the system are deliberately not presented. While we performed some benchmarking of the components before putting the system into operation, we believe that our results cannot be easily transferred one-to-one to setups using different hardware or operating at different scales. Another aspect is that even refined synthetic benchmarks differ considerably from the constantly evolving load submitted by users. A realistic simulation of the load caused by a diverse mix of jobs is very difficult, and even after two years of operation, new use cases and workloads appear on a regular basis. For this reason, we consider the presentation of experiences with the operational system in its entirety under realistic production workloads to be more useful and will present the observed effects in detail in “Operational Experience” before concluding in “Conclusion”.

The requirements on BAF2 are as broad as the range of research fields it serves. The most challenging constraints are imposed by running analysis jobs of the ATLAS high energy physics experiment. This experiment uses a huge software stack which is provided and maintained centrally by a dedicated team on a specific platform (at the time of writing, the migration from Scientific Linux 6 to CentOS 7 is not yet fully completed). Given the rapid software development within a collaboration of \(\mathcal{O}(3000)\) members, it would be unfeasible for each collaborating institute to maintain and validate its own software installation. Therefore, the software is distributed to all participating institutes via the CVMFS file system.
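To illustrate this distribution model from the perspective of a single worker node, the following sketch shows a minimal CVMFS client configuration and access check. The repository list, proxy URL and cache size are illustrative assumptions and do not reflect the actual BAF2 configuration.

```
# Minimal sketch of a CVMFS client setup on a worker node
# (repository list, proxy URL and cache size are assumptions,
#  not the actual BAF2 values).
#
# /etc/cvmfs/default.local
CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch,sft.cern.ch
CVMFS_HTTP_PROXY="http://squid.example.edu:3128"   # site-local caching proxy (assumed)
CVMFS_QUOTA_LIMIT=20000                            # local disk cache size in MB

# Apply the configuration and check that a repository is reachable:
#   cvmfs_config setup
#   cvmfs_config probe atlas.cern.ch
#
# Once mounted on demand via autofs, the centrally maintained software
# tree is visible read-only below /cvmfs/, e.g.:
#   ls /cvmfs/atlas.cern.ch/
```

In this scheme, each client only caches the files it actually accesses, and a site-local proxy absorbs most of the load, so the centrally maintained software remains consistent across all participating institutes without local installation effort.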