

#### The Mu3e Data Acquisition System

#### Handling Terabits per second without hardware trigger

#### Sebastian Dittmeier on behalf of the Mu3e Collaboration Physikalisches Institut – Heidelberg University IFDEPS – Annecy – 13.03.2018



INTERNATIONAL MAX PLANCK RESEARCH SCHOOL FOR PRECISION TESTS OF FUNDAMENTAL SYMMETRIES



# Trigger-less DAQ in HEP

- Trigger-less:
  - Without hardware trigger
  - Software-only event selection
- Data acquisition challenges:
  - High resolution: detectors with millions of channels
  - High luminosities/rates: fast detectors, fast signal processing
  - ➤ High data throughput
- Why trigger-less data acquisition?
  - Improve "trigger" efficiency (e.g. LHCb Run III upgrade)
  - High statistics required for precision experiments (e.g. PANDA, Mu3e)






#### The Mu3e Experiment

Search for the charged lepton flavor violating decay  $\mu^+ \rightarrow e^+ e^- e^+$ 



<u>Standard Model</u>: highly suppressed branching ratio, BR<sub>SM</sub> < 10<sup>-54</sup>

<u>Probe physics beyond the SM</u>: any observation is a clear sign of new physics!

Current limit on $\mu^+ \rightarrow e^+e^-e^+$: BR<sub>meas</sub> < 10<sup>-12</sup> (SINDRUM, 1988)

#### Goal of Mu3e

Enhance sensitivity to branching ratios  $\mathcal{O}(10^{-16})$ 




- Stopped muons decay in a solenoidal magnetic field of $B = 1 \text{ T}$
- Low-momentum electrons: $p_e \leq 53 \text{ MeV/c}$
  - ➤ Thin silicon pixel tracking detector: precise momentum ($\sigma_p < 1.0 \text{ MeV/c}$) and vertex ($\mathcal{O}(100 \ \mu\text{m})$) measurement
  - ➤ Scintillating fibres and tiles: precise time information ($\sigma_t < 500 \text{ ps}$)
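To give a feeling for the track geometry implied by these numbers, the following minimal sketch evaluates the standard relation $p_T\,[\text{GeV/c}] = 0.2998 \cdot B\,[\text{T}] \cdot r\,[\text{m}]$ for a singly charged particle. The function name and the chosen sample momenta are illustrative assumptions, not part of the talk.

```python
# Sketch: radius of the circular track projection in a solenoidal field.
# Uses the standard relation p_T [GeV/c] = 0.2998 * |q| * B [T] * r [m].
# Function name and sample momenta are illustrative assumptions.

def bending_radius_m(pt_mev: float, b_tesla: float, charge: int = 1) -> float:
    """Return the transverse bending radius in metres."""
    pt_gev = pt_mev / 1000.0
    return pt_gev / (0.2998 * abs(charge) * b_tesla)


if __name__ == "__main__":
    # 53 MeV/c is the upper momentum bound quoted on the slide; B = 1 T.
    for pt in (10.0, 30.0, 53.0):
        r_cm = 100.0 * bending_radius_m(pt, b_tesla=1.0)
        print(f"p_T = {pt:5.1f} MeV/c -> r = {r_cm:5.1f} cm")
```

Under these assumptions even the highest-momentum electrons curl with a radius below 20 cm, which is why the tracks can be contained in a compact detector.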

#### An Experiment at the Intensity Frontier

- For the final sensitivity goal of $\mathcal{O}(10^{-16})$ we need to observe $\mathcal{O}(10^{16})$ events
- High rate of muons, available at the Paul Scherrer Institut (CH)
- Phase I: $\mathcal{O}(10^8 \text{ s}^{-1})$
  - Existing Compact Muon Beamline
  - Single event sensitivity goal: $2 \times 10^{-15}$
- Phase II: $\mathcal{O}(10^9 \text{ s}^{-1})$
  - Future High Intensity Muon Beamline
  - Under investigation
  - Sensitivity goal: $\mathcal{O}(10^{-16})$
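The link between muon rate and running time can be made explicit with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, 100 % live time and that every stopped muon counts towards the $\mathcal{O}(10^{16})$ total; neither assumption is stated in the talk.

```python
# Sketch: beam time needed to collect a given number of muon decays.
# Assumptions (not from the talk): 100 % live time, every stopped muon counts.

SECONDS_PER_DAY = 86_400


def days_to_collect(n_muons: float, stop_rate_hz: float) -> float:
    """Return the running time in days needed to stop n_muons at stop_rate_hz."""
    return n_muons / stop_rate_hz / SECONDS_PER_DAY


if __name__ == "__main__":
    for label, rate in (("Phase I, 1e8/s", 1e8), ("Phase II, 1e9/s", 1e9)):
        print(f"{label}: {days_to_collect(1e16, rate):7.1f} days for 1e16 muons")
```

With these idealized inputs, $10^{16}$ muons take more than three years of continuous beam at $10^8\ \text{s}^{-1}$ but only about four months at $10^9\ \text{s}^{-1}$.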



- Measure and reconstruct all events
- ➤ Trigger-less data acquisition
  - Continuous readout of the full detector
  - Online event reconstruction and filtering



#### Readout Bandwidth Requirements

- Hit rates derived from full detector simulation
  - Pixel detector only: 2844 sensors = 178 MPixel
  - Hit rates increase by a factor of 20 for Phase II

| Quantity                                      | Value                              |
|-----------------------------------------------|------------------------------------|
| Muon stopping rate (Phase I)                  | 100 MHz                            |
| Maximum hit rate of the busiest pixel sensor  | 1.5 MHz/cm<sup>2</sup>             |
| Average total pixel hit rate                  | 1.06 GHz                           |
| Data rate due to pixel hits (32 bits per hit) | 34 Gb/s                            |
| Data rate due to pixel noise                  | 5.7 Gb/s $\cdot R_{noise,pix}$ /Hz |
| Total readout bandwidth                       | 3.8 Tb/s                           |

$R_{noise,pix}$: noise rate per pixel, $\ll$ 10 Hz
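The two data-rate rows of the table follow directly from the hit rate, the pixel count and the 32 bits per hit. A minimal sketch of that arithmetic is shown here; the function and constant names are illustrative assumptions.

```python
# Sketch: reproduce the data-rate rows of the table from the quoted inputs.
# Names are illustrative; the numbers are the ones given on the slide.

BITS_PER_HIT = 32
N_PIXELS = 178e6          # 2844 sensors = 178 MPixel
AVG_HIT_RATE_HZ = 1.06e9  # average total pixel hit rate


def hit_data_rate_gbps(hit_rate_hz: float, bits_per_hit: int = BITS_PER_HIT) -> float:
    """Data rate from physics hits in Gb/s."""
    return hit_rate_hz * bits_per_hit / 1e9


def noise_data_rate_gbps(noise_rate_per_pixel_hz: float,
                         n_pixels: float = N_PIXELS,
                         bits_per_hit: int = BITS_PER_HIT) -> float:
    """Data rate from noise hits in Gb/s for a given per-pixel noise rate."""
    return noise_rate_per_pixel_hz * n_pixels * bits_per_hit / 1e9


if __name__ == "__main__":
    print(f"hits:  {hit_data_rate_gbps(AVG_HIT_RATE_HZ):.1f} Gb/s")
    print(f"noise: {noise_data_rate_gbps(1.0):.1f} Gb/s per Hz of pixel noise")
```

The results round to the 34 Gb/s and 5.7 Gb/s per Hz quoted in the table.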























#### Mu3e Pixel Sensors – *MuPix*

- High Voltage Monolithic Active Pixel Sensors
- 180 nm HV-CMOS process (AMS AH18)
- Current prototype: *MuPix8*







#### MuPix8 Readout Architecture










### Clock and Reset Distribution

- Synchronous timestamps: global synchronous clock and reset signal required
- Custom-designed optical clock distribution system



- Via FMC Connector





#### Mu3e Front-end Board

- Arria V FPGA
- Interface for up to 45 sensors
  - LVDS links running at 1.25 Gb/s
- 2 Samtec Firefly duplex x4 transceivers
  - FPGA multi-gigabit transmitters at 6.25 Gb/s
  - Receivers: reset, clock signal, sensor configuration
- Sensor ASIC clock distribution
- First stage of data reduction
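As a rough illustration of the link budget of one front-end board, the sketch below compares worst-case raw line rates on the sensor side and on the optical side. It assumes one 1.25 Gb/s LVDS link per sensor with all 45 populated, and all 8 optical transmit lanes used for data; line-encoding overhead and real link occupancy (which is far below the line rate) are ignored. These are assumptions for illustration, not statements from the talk.

```python
# Sketch: worst-case raw line rates into and out of one front-end board.
# Assumptions (not from the talk): one LVDS link per sensor, all 45 in use,
# all 2 x 4 optical transmit lanes carry data, no encoding overhead.

def aggregate_rate_gbps(n_links: int, rate_gbps: float) -> float:
    """Sum of the raw line rates of n_links identical links, in Gb/s."""
    return n_links * rate_gbps


if __name__ == "__main__":
    sensor_side = aggregate_rate_gbps(45, 1.25)      # LVDS inputs
    optical_side = aggregate_rate_gbps(2 * 4, 6.25)  # Firefly outputs
    print(f"sensor side:  {sensor_side:.2f} Gb/s")
    print(f"optical side: {optical_side:.2f} Gb/s")
```

Under these assumptions the two sides are of similar size, so the board has to merge and compact the sensor data streams rather than forward them link by link.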



# Front-end Firmware Description






# **Optical Components**

- All transceivers tested extensively
  - Front-end & clock distribution: Samtec Firefly (x4 duplex, x12 simplex), also in magnetic field (0.6 T)
  - Switching board: MiniPod (x12 simplex)
  - Receiving card: QSFP (x4 duplex)







# Optical Data Transmission Tests

#### MiniPods

- 12-fold optical transmitter and receiver
- 1 m long multi-mode fibre
- 12 channels at 6.25 Gb/s
- Error-free: BER < $10^{-16}$


#### Samtec Firefly

- 4-fold optical transceiver
- Tested setup: error-free up to 8 Gb/s
- BER < $10^{-15}$



6 Gbps PRBS7 data after optical transmission with Samtec Firefly
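PRBS7 is the pseudo-random bit sequence produced by a 7-bit linear feedback shift register with the polynomial $x^7 + x^6 + 1$; it repeats every 127 bits. The sketch below generates the pattern in software and counts bit errors against it. It only illustrates the test pattern; the actual test firmware is not described in the talk, and the function names and seed are assumptions.

```python
# Sketch: software PRBS7 generator and bit-error counter.
# Illustrates the test pattern only; names and seed are assumptions.
from itertools import islice


def prbs7(seed: int = 0x7F):
    """Yield PRBS7 bits from a 7-bit LFSR with polynomial x^7 + x^6 + 1."""
    if not 0 < seed < 128:
        raise ValueError("seed must be a non-zero 7-bit value")
    state = seed
    while True:
        new_bit = ((state >> 6) ^ (state >> 5)) & 1  # taps at bits 7 and 6
        state = ((state << 1) | new_bit) & 0x7F
        yield new_bit


def count_bit_errors(received, seed: int = 0x7F) -> int:
    """Count positions where received differs from the reference PRBS7."""
    reference = prbs7(seed)
    return sum(bit != next(reference) for bit in received)


if __name__ == "__main__":
    sent = list(islice(prbs7(), 254))
    corrupted = sent.copy()
    corrupted[100] ^= 1  # inject a single bit flip
    print("period repeats:", sent[:127] == sent[127:])
    print("errors found:  ", count_bit_errors(corrupted))
```

A receiver that knows the polynomial can regenerate the reference locally, so no copy of the transmitted data has to be shipped alongside the link under test.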



### Switching Boards

- PCIe40 board (LHCb, ALICE)
  - Arria10 FPGA
  - 48 optical Tx and Rx
  - 2 PCIe3 x8 interfaces
  - Delivery in 2018/2019

Figure: switching board data flow (labels: 48 x 6.25 Gb/s, Rx, data merger, Tx, 4 x 10 Gb/s)



# GPU Farm: Receiving Card

- Commercial DE5a-NET board (Terasic)
- Large Arria10 FPGA
- Two banks of DDR3 memory
- PCIe 3.0 x8 interface
- 4 QSFP optical transceivers
- Daisy chain of optical links between PCs



#### GPU Filter Farm

- Time slices of 50 ns for track & vertex search
  - ➤ Process $20 \cdot 10^6$ time slices per second
- 12 filter farm PCs with one GPU each
  - ➤ Process at least $1.7 \cdot 10^6$ time slices per second
- ➤ GPUs are ideal for this task!
  - Thousands of cores
  - Optimal parallel performance
  - Best suited for many floating-point operations per second
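The throughput figures above are a direct consequence of the slice length and the farm size. A minimal sketch of the arithmetic, with illustrative function names and assuming the load is shared evenly between the GPUs:

```python
# Sketch: time-slice throughput required of the farm and of each GPU.
# Assumes an even share of slices per GPU; names are illustrative.

def slices_per_second(slice_length_ns: float) -> float:
    """Number of consecutive time slices per second of data taking."""
    return 1e9 / slice_length_ns


def slices_per_gpu(total_rate: float, n_gpus: int) -> float:
    """Rate each GPU must sustain if the slices are shared evenly."""
    return total_rate / n_gpus


if __name__ == "__main__":
    total = slices_per_second(50.0)   # 50 ns slices
    per_gpu = slices_per_gpu(total, 12)  # 12 PCs with one GPU each
    print(f"total:   {total:.2e} slices/s")
    print(f"per GPU: {per_gpu:.2e} slices/s")
```

The per-GPU value of about $1.67 \cdot 10^6$ is what the slide rounds up to $1.7 \cdot 10^6$.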



- On-FPGA: track preselection using geometrical criteria
- Coordinate transformation
- Direct memory access to PC memory
- Direct memory access to GPU memory
- Track fitting: *Triplet Fit* (<u>arXiv:1606.04990</u>)
  - Multiple scattering dominated, linearized, can be parallelized
- Vertex selection for signal topology: 2 e<sup>+</sup> + 1 e<sup>-</sup>
- Implementation test on GTX 1080 Ti: $2.0 \cdot 10^6$ time slices per second processed > required $1.7 \cdot 10^6$
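The charge requirement of the signal topology can be illustrated with a small combinatorial sketch: from the tracks reconstructed in one time slice, form every combination of two positrons and one electron. The `Track` type and `signal_candidates` function are hypothetical names introduced here; the real selection also applies vertex and kinematic criteria that are not described on the slide, and runs on the GPU rather than in Python.

```python
# Sketch: enumerate 2 e+ + 1 e- track combinations within one time slice.
# Track and signal_candidates are hypothetical; only the charge requirement
# is shown, not the vertex or kinematic selection.
from itertools import combinations
from typing import NamedTuple


class Track(NamedTuple):
    track_id: int
    charge: int  # +1 for a positron, -1 for an electron


def signal_candidates(tracks):
    """Return all (positron, positron, electron) combinations."""
    positrons = [t for t in tracks if t.charge > 0]
    electrons = [t for t in tracks if t.charge < 0]
    return [(p1, p2, e)
            for p1, p2 in combinations(positrons, 2)
            for e in electrons]


if __name__ == "__main__":
    slice_tracks = [Track(0, +1), Track(1, +1), Track(2, -1), Track(3, +1)]
    for candidate in signal_candidates(slice_tracks):
        print([t.track_id for t in candidate])
```

Each candidate is independent of the others, which is one reason this step maps well onto many parallel GPU threads.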

# Mu3e Pixel Readout Demonstrator



- Pixel sensors: large prototype MuPix8, *operational*
- Front-end FPGA: prototype boards with Stratix IV, *operational*
- Switching board: PCIe40 (LHCb development), *delivery 2018*
- PC: FPGA on PCIe card, Stratix IV





#### Mu3e Front-End Board Prototype










# Hardware Operational Tests

Successful operation of eight MuPix8 in parallel in a test beam at DESY

- Configuration of sensors $\checkmark$
- Data transmission:
  - Sensors to front-end $\checkmark$
  - Front- to back-end $\checkmark$
- Sensors respond to positron beam $\checkmark$





# Summary

- Mu3e sensitivity goal requires high statistics
  - ➤ Trigger-less DAQ
- Three FPGA-based DAQ layers
  - All subsystems run synchronously
- Data reduction: from 3.8 Tb/s raw data to < 100 MB/s to disk
- Demonstrator readout tests successful
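The overall reduction factor implied by the summary can be worked out in a few lines. The sketch assumes decimal units (1 Tb = $10^{12}$ bits, 1 MB = $10^6$ bytes); since the disk rate is an upper bound, the result is a lower bound on the reduction.

```python
# Sketch: overall data reduction factor of the DAQ chain.
# Assumes decimal units: 1 Tb = 1e12 bit, 1 MB = 1e6 byte.

def reduction_factor(raw_tbps: float, disk_mbyte_per_s: float) -> float:
    """Ratio of raw readout bandwidth to the rate written to disk."""
    raw_bits_per_s = raw_tbps * 1e12
    disk_bits_per_s = disk_mbyte_per_s * 1e6 * 8
    return raw_bits_per_s / disk_bits_per_s


if __name__ == "__main__":
    factor = reduction_factor(3.8, 100.0)
    print(f"reduction by at least a factor of {factor:.0f}")
```

With the quoted numbers the three DAQ layers and the filter farm together reduce the data volume by more than three orders of magnitude.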



