How does Simulation-based Testing for Self-driving Cars match Human Perception?

Software metrics such as coverage and mutation scores have been extensively explored for the automated quality assessment of test suites. While traditional tools rely on such quantifiable software metrics, the field of self-driving cars (SDCs) has primarily focused on simulation-based test case generation using quality metrics such as the out-of-bound (OOB) parameter to determine if a test case fails or passes. However, it remains unclear to what extent this quality metric aligns with the human perception of the safety and realism of SDCs, which are critical aspects in assessing SDC behavior. To address this gap, we conducted an empirical study involving 50 participants to investigate the factors that determine how humans perceive SDC test cases as safe, unsafe, realistic, or unrealistic. To this aim, we developed a framework leveraging virtual reality (VR) technologies, called SDC-Alabaster, to immerse the study participants into the virtual environment of SDC simulators. Our findings indicate that the human assessment of the safety and realism of failing and passing test cases can vary based on different factors, such as the test's complexity and the possibility of interacting with the SDC. Especially for the assessment of realism, the participants' age as a confounding factor leads to a different perception. This study highlights the need for more research on SDC simulation testing quality metrics and the importance of human perception in evaluating SDC behavior.


INTRODUCTION
In recent years, the development of autonomous systems has impacted many aspects of our lives [14,19]. For instance, humans no longer need to vacuum their houses or mow their lawns manually; nowadays, robots do (and will do) much of our chores [9]. However, safety-critical instances of such autonomous systems, such as unmanned aerial vehicles (UAVs) and self-driving cars (SDCs) [37,38,63,65,67], may experience failures that can harm humans or damage the environment [28].
Testing safety-critical autonomous systems is crucial to avoid harmful incidents in real environments [3,11,22,74,75]. To that end, simulation environments have been widely adopted to test cyber-physical systems (CPS) in general [10,21,50], and SDCs in particular [10,21]. As opposed to real-world testing, simulation-based testing is easier to replicate, is more cost-efficient, and can be as effective as field testing [21,30]. Figure 1 illustrates two test cases where an SDC model is deployed in a virtual environment, and the simulated car is expected to behave according to the control algorithms. A test case is said to pass if the car's behavior can be considered safe, while unsafe behavior constitutes a failing test case. Figure 1a shows an unsafe behavior (failing test) as the SDC drives off the lane, while Figure 1b shows a passing test.
Current research on simulation-based test case generation (STSG) of SDCs relies on an oracle that determines whether a system under test is safe or unsafe based on a limited set of safety metrics [11,24,52], particularly the out-of-bound (OOB) metric. This metric is widely adopted for assessing safety behavior in STSG [24,49,52]. Both test cases illustrated in Figure 1 are classified using the OOB metric [12] and align with the human perception of safety.
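As a concrete illustration, an OOB-style oracle can be sketched as a check over the car's logged lateral deviation from the lane center. The following is a hypothetical minimal sketch (names and thresholds are illustrative assumptions), not the implementation used by existing test runners:

```python
from dataclasses import dataclass

@dataclass
class RoadSample:
    t: float         # simulation timestamp in seconds
    offset_m: float  # signed distance of the car's center from the lane center

def oob_verdict(samples, half_lane_width_m=2.0, tolerance_m=0.0):
    """Classify a test case: FAIL as soon as the car leaves the lane bounds.

    A sample counts as out of bound when its absolute lateral offset exceeds
    the half lane width plus an optional tolerance.
    """
    for s in samples:
        if abs(s.offset_m) > half_lane_width_m + tolerance_m:
            return "FAIL"
    return "PASS"
```

Under this sketch, the off-lane drive of Figure 1a exceeds the bound at some sample and fails, while the lane-keeping run of Figure 1b passes.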
However, it is yet unclear whether STSG metrics (e.g., OOB) serve as meaningful oracles for assessing the safety behavior of SDCs. For instance, the test cases in Figure 2 are marked as passing according to the OOB metric, since the SDC keeps the lane. From a human standpoint, however, the behavior of the SDC can hardly be considered safe. In the first test case, using the BeamNG.tech simulator [26] and shown in Figure 2a, the SDC approaches solid delineators after ignoring a speed bump. Although the car maintains its lane at a speed of 50 km/h, there is a high risk of an accident, yet the OOB metric classifies this test case as a technical pass. In the second test case, using the CARLA simulator [21] and shown in Figure 2b, the SDC ignores the red signal. Since the car stays in the lane, it meets the OOB metric, leading to a false passing test case.
Inspecting the OOB metric reveals that it is measured at a single point in time in simulation, which is insufficient to identify unsafe behaviors. For instance, Figure 2a shows the speed bumps on the right lane, and evaluating the SDC at a single point is insufficient to assess its safety over these speed bumps. In such cases, a time window would be more informative to assess the overall SDC behavior. Unlike real-world speed bumps, which are smooth and rounded, the test bumps have sharp edges that damage the SDC even at reasonable speeds (from a human viewpoint). Similarly, Figure 2b shows another instance where we observe the red light signal, but the SDC ignores it. It is unclear whether the red signal was already there before the SDC drove past it or whether the signal turned red just after the SDC analyzed the simulation scene. We hypothesize that current simulation-based testing of SDCs does not always align with the human perception of safety [24,49,52] and realism [5,48,56,73], which are relevant aspects impacting the effectiveness of testing.

To answer our general research question (i.e., addressing the problem of safety and realism of test cases described in our motivating examples), we conducted an empirical study involving 50 participants using our framework named SDC-Alabaster. The framework employs virtual reality (VR) technologies [62] (i) to immerse humans in virtual SDCs so that they can sense and experience the virtual environment as similarly as possible to the real world, and (ii) to enable SDC developers and researchers to analyze the human perception of safety and realism of SDC test cases. The participants in our study are asked to assess the level of safety and realism of multiple, diverse simulation-based test cases. Moreover, we let the participants experience simulation-based test cases in which they have the possibility to influence the behavior of (i.e., interact with) the SDC. For this purpose, we experimented with two representative SDC simulators as virtual environments, BeamNG.tech and CARLA, which are widely used in academia and industry [1,24].
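The time-window idea discussed above could be prototyped as a simple post-processing step over the logged lateral offsets: instead of judging a single instant, each sliding window of samples is summarized by its worst deviation. This is a hypothetical sketch under assumed names, not a metric implemented by the existing tools:

```python
def worst_deviation_per_window(offsets, window=5):
    """Summarize a trace of lateral offsets (meters) by the maximum absolute
    deviation within each sliding window of `window` consecutive samples.

    A window whose worst deviation stays high signals sustained risky
    behavior that a single-point OOB check may miss.
    """
    if len(offsets) < window:
        raise ValueError("trace shorter than the window")
    return [max(abs(o) for o in offsets[i:i + window])
            for i in range(len(offsets) - window + 1)]
```

A downstream oracle could then flag a test whenever any window's worst deviation exceeds a safety threshold, rather than relying on one measurement instant.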
The paper contributes to and complements previous research as follows:
• we propose the SDC-Alabaster framework to assess simulation-based SDC test cases from a human point of view with VR;
• we investigate the perceived level of safety and realism of simulation-based SDC test cases by conducting an empirical study with 50 participants. We publicly share a replication package with the code to reproduce our results (Section 9);
• we develop a taxonomy of factors impacting the perceived realism of SDC simulators and provide a discussion of confounding factors and implications of our work.
The paper covers background (Section 2) and study design (Section 3), including our framework, experiments, and methodology. Section 4 presents our results, followed by discussions in Section 5 and threats to validity in Section 6. We discuss related work and conclusions in Section 7 and Section 8.

BACKGROUND
This section provides a background on existing technologies used in our study, such as simulators, test generators, and test runners for SDCs, as well as VR technology.

SDC simulators
We investigate when the safety metrics of STSG for SDCs match human perception. To answer this question, we use two state-of-the-art SDC simulators, namely BeamNG.tech and CARLA. They are among the SDC simulators most widely used in academia and practice [21,24,29,47,52,78]. Furthermore, they implement fundamentally different physics behaviors.
2.1.1 BeamNG.tech. We use the BeamNG.tech simulator as a well-known reference technology used in recent years in several studies and software engineering competitions on testing SDCs [11,12,24,27,52]. The BeamNG.tech simulator comes with a soft-body physics engine that allows the simulation of body deformations and therefore more realistic simulations of crashes and impact forces on objects.
2.1.2 CARLA. Another widely used simulator in academia and practice is CARLA [21,29,34,47,78,80]. A key difference between CARLA and BeamNG.tech is the physics engine: CARLA comes with a rigid-body physics engine, which works differently than the soft-body physics engine of BeamNG.tech. A rigid-body simulation environment does not deform objects; e.g., when a crash happens, the objects remain rigid.

Test generators & Test Runner
Both simulators require descriptions of the test case scenarios, and we use existing test generators to automatically generate test cases for them. Concretely, we use test generators from the tool competition of the Search-Based Software Testing workshop [24,52]. The actual road in the simulation environments is the result of interpolating the road points that are generated by the test generator.
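To illustrate the interpolation step, the sketch below densifies a sparse list of generated road points linearly. The actual generators use smoother interpolation (e.g., splines), so this linear version is only an assumed simplification of the idea:

```python
def interpolate_road(points, steps=4):
    """Linearly densify a polyline of (x, y) road points.

    Each consecutive pair of generated road points is subdivided into
    `steps` sub-segments; the returned list ends with the original last point.
    """
    dense = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        for k in range(steps):
            t = k / steps
            dense.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    dense.append(points[-1])
    return dense
```

The dense point list is what a simulator would turn into an actual drivable road surface.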
To run test cases in simulation environments, we need a test runner that manages the execution of the test cases and reports the test outcomes. We use the SDC-Scissor [11] tool, which integrates a test selection strategy for simulation-based test cases. We use SDC-Scissor since it implements a test runner that monitors the OOB metric, which is suitable for our study.

Virtual reality
The notion of VR refers to the immersive experience of users being inside a virtual world. In our study, we want to provide the study participants with an immersive experience of the test cases, to obtain more accurate feedback on their perception of the safety and realism of SDCs. We leverage VR headsets and tooling for the simulation environments to achieve this goal.

2.3.1 Headset & VR connection with simulation environments. We use the HTC Vive Pro 2 headset to provide the study participants with a 360° VR experience, which offers an unrestricted view compared to a standard monitor. The headset connects via wire to an external device with a dedicated GPU for high-resolution VR rendering. Most SDC simulators do not support VR out of the box. This is also the case for BeamNG.tech and CARLA. Therefore, for our study, we use third-party tools to enable the missing VR support for both simulators.
For BeamNG.tech, we use vorpX, a specialized tool that transforms any visual output for the screen into compatible input for VR headsets, providing an immersive feeling for the user. The vorpX software gives a broader viewing angle when wearing a VR headset. Users can move their head and explore the virtual environment according to their head movements. In the case of the CARLA simulator, Silvera et al. [62] implemented an extension of CARLA, allowing the simulator to be compatible with the HTC Vive Pro 2 VR headset. When launching the CARLA application, passing the -VR flag puts the simulator into VR mode so that it can be used with the headset.

METHODOLOGY
Overall, our research aims to explore how safety metrics, i.e., OOB, match human perception. Specifically, we investigate the factors that make simulation-based SDC test cases safe or unsafe. Hence, with SDC-Alabaster (see Section 3.3.2), we conducted an empirical study involving 50 participants (recruiting explained in Section 3.4), with several steps (summarized by Figure 3) devised to collect different types of evidence and data to answer our main question: When and why do safety metrics of simulation-based test cases of self-driving cars match human perception? For this purpose, SDC-Alabaster immerses the study participants in virtual SDCs within widely used virtual environments, thanks to VR technologies (as detailed in Section 3.3).

Research questions
We structured our study around three main research questions (RQs). RQ 1 explores participants' perceptions of SDC test failures and safety levels with and without VR technology. We hypothesize that the OOB safety metric in software engineering may not align with human safety perception. We evaluate alignment through Likert-scale responses from participants, correlating it with test case outcomes (Section 4.1). Statistical tests on experimental and survey data are used to investigate the impact of simulators (BeamNG.tech vs. CARLA), driving views (outside and driver's view), and test case complexity (with/without obstacles/vehicles) on SDC safety perception.
3.1.2 RQ 2: Impact of human interaction on the assessments of SDCs. Once we know how humans perceive the safety of SDC test cases and how this relates to the OOB metric (RQ 1), we investigate whether human-based interactions with the virtual SDC affect the safety perception of the test case. We argue that the safety perception of an SDC can vary when one has the ability to interact, i.e., the possibility to accelerate and decelerate the vehicle manually; previous VR research has shown that interactions can influence the perception of the environment positively or negatively [32,33,35,40,46,51,54,58]. This aspect deserves investigation since it can help developers and researchers design better test cases and evaluation metrics, which leads us to our second research question: To what extent does the safety assessment of simulation-based SDC test cases vary when humans can interact with the SDC?

3.1.3 RQ 3: Human-based assessment of realism. We argue that the level of realism of SDC simulation-based test cases is another important factor influencing the safety perception of SDCs. It is important to note that the notion of realism relates to the Reality Gap [5,48,56,73] (see Section 7), which is a critical concern regarding the oracle problem in simulation-based testing: "due to the different properties of simulated and real contexts, the former may not be a faithful mirroring of the latter". While recent studies provide solutions for addressing the reality gap in the development phase of CPS, e.g., by leveraging domain randomization techniques or using data from real-world observations [16,39,79], no prior study has characterized the perception of realism of SDC test cases by human participants using VR technologies [32,46,54]. Hence, to complement RQ 1 and RQ 2, our study addresses a third research question on the human-perceived realism of SDC test cases. After the experiments for RQ 1 and RQ 2, we ask the study participants to evaluate the level of realism of BeamNG.tech and CARLA. Then, we develop a taxonomy of aspects influencing these environments' realism to help improve simulation environments for effective testing of SDCs so that the differences between simulated and real contexts are minimized.

Design overview
Figure 3 overviews the design of our study involving 12 steps: In step 1, we welcome the study participant by explaining the context and the procedure of the experiments. In step 2, the participant sits before a computer screen and experiences three simulation-based test cases with the BeamNG.tech simulator. While sitting before the computer, the participant wears a VR headset for the next steps. In step 3, the participant experiences three test cases with the BeamNG.tech simulator, observing the SDC from an outside view perspective, while in step 4, the participant experiences three test cases with the BeamNG.tech simulator from a driver view perspective. Step 5 focuses on general feedback on the experiments with the BeamNG.tech simulator. Then, steps 2, 3, and 4 are repeated for the CARLA simulator in steps 6, 7, and 8. In step 9, for the CARLA simulator, the participant, wearing a VR headset and using a driver's view, experiences three test cases in which they can control the SDC's speed with a keyboard. In addition to step 9, one group of participants in step 10 experiences a crash with the SDC. Step 11 focuses on general feedback on the experiments with the CARLA simulator, while step 12 focuses on general feedback on the overall study.
For steps 2-4 and 6-9, the participant experiences three test cases. The first test case is a warm-up so that the participant can familiarize themselves with the simulation environment. The second test case has no obstacles, and the third test case has obstacles (i.e., higher complexity). At step 10, the participant only experiences the complex test case with obstacles.

Design implementation
We implement our design by conducting experiments with our test runner called SDC-Alabaster. The test runner uses three distinct test cases created by a test generator (see Section 2.2). The participants respond to our survey questionnaires using Google Forms.

Test cases.
We use three distinct test cases generated by the Frenetic test generator [15] for different purposes. The first test case is a warm-up that lets the participant familiarize themselves with the simulation environment and view setting, e.g., to get used to the VR headset and the simulator. Generated test cases are processed differently for BeamNG.tech and CARLA. An automatically generated test case in BeamNG.tech (Section 2) consists of a sequence of XY coordinates (i.e., the road points). The CARLA simulator, however, does not need all the road points defined in the test. SDC-Alabaster segments the road definition, using only the start and end points of the segments to declare scenarios in CARLA. Moreover, it enables user immersion and safety evaluation by automatically adapting test case specifications for CARLA and utilizing VR headsets for immersive experiences in its virtual environment.
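The segmentation step can be sketched as follows; the segment length and function names are illustrative assumptions, not SDC-Alabaster's actual API:

```python
def segment_endpoints(road_points, points_per_segment=10):
    """Split a dense list of road points into segments and keep only each
    segment's start and end point, which suffice to declare a scenario in a
    simulator (like CARLA) that does not need every intermediate point.
    """
    endpoints = []
    for i in range(0, len(road_points) - 1, points_per_segment):
        segment = road_points[i:i + points_per_segment + 1]
        endpoints.append((segment[0], segment[-1]))
    return endpoints
```

Note that consecutive pairs share a point: each segment starts where the previous one ended, so the reconstructed route remains connected.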

Survey questionnaires.
We employ Google Forms, a free and user-friendly survey tool, for our questionnaires. Table 1 summarizes the participant questions, comprising multiple choice (MC), open answer (OA), and Likert scale (LS) questions (with values from 1 to 5, where 1 means very unsafe, 5 very safe, and 3 neutral) to address our research questions (RQs). Participants answered Q1 and Q2 after the second and third test cases, respectively, with the first test case serving as a warm-up without safety assessment. For Q3-Q8, participants provide responses after all three test executions per simulator, i.e., at step 5 for BeamNG.tech and step 11 for CARLA. Note that at step 11, we include an additional question, Q9, for experiments involving CARLA, which includes interactive scenarios requiring keyboard inputs to control the SDC's speed.

Experimental Setting.
We conducted the experiments in a dedicated, soundproof room to eliminate external distractions. Participants sat at a table equipped with a desktop computer, a laptop, and a VR headset. They used the laptop, running the Google Forms application, to complete the survey questionnaires, and the desktop computer for the non-VR experiments. For VR experiments, participants used the HTC Vive Pro 2 headset, known for its high visual resolution, powered by an NVIDIA GeForce RTX 3080 and the Windows 10 operating system. Additional extensions, such as vorpX for BeamNG.tech's VR support and the DReyeVR extension for CARLA, were employed to give participants a full VR experience. We also integrated SDC-Alabaster to facilitate testing with both the BeamNG.tech and CARLA simulators. Furthermore, the participants were allowed to interact with specific SDC test cases, with the keyboard enabling them to adjust the SDC's speed.

Study participants
We recruit participants via email invitations sent to our industrial partners, university students, and researchers across departments.We target various mailing lists, including non-computer science organizations, and leverage social media platforms such as Twitter and LinkedIn.We use physical and digital flyers to attract diverse participants, ensuring a broad range of backgrounds and education levels.
3.4.1 Pre-survey. When participants sign up for our experiments, we email them a pre-survey created with Google Forms to collect demographic information. This survey includes an introduction to the topic, an overview of the experiment (including approximate time and location), and a recommendation to wear contact lenses. It also provides details about the simulator and VR headset used. Furthermore, the pre-survey includes a disclaimer regarding confidentiality and anonymity and a warning about potential VR-related side effects (e.g., motion sickness) that the participants could experience. Following this section, we gather background information on participants, as detailed in the Appendix (appx.) of our replication package (Section 9). These questions cover testing and driving experience, VR technology usage, age, and gender. This additional information helps us investigate potential confounding factors affecting safety and realism perception.

Data collection
We gather data from two primary sources: the survey (both pre-experiment and during the experiments) and the simulation logs collected during participant experiments.

Survey data.
For both the BeamNG.tech and CARLA simulators, participants evaluate test cases considering the various questions reported in Table 1. Specifically, for steps 2-4 and 6-9, Likert-scale and text data are collected for each test case except the warm-up case. For step 10, Likert-scale and text data are collected only for the test cases with obstacles. Additionally, at steps 5 and 11, general feedback on the simulators is collected after the test executions with all viewpoints. In addition, participants rate the perceived safety and realism of each simulator using Likert-scale values based on their own driving experiences. Finally, general feedback on the experiments is collected at step 12. In total, we collected 21 Likert-scale, 23 open, and 1 single-choice response per participant during the experiments. Besides the experimental survey, we gather data from the pre-survey (Section 3.4.1) to obtain participant demographics, mainly through single-choice and open-text responses.

Simulation data.
For each test case in each participant's experiment, we collect relevant data, saving SDC-Alabaster's logs (see Section 9) as JSON files. These logs include timestamped vehicle position coordinates, sensor data (e.g., fuel, gear, wheel speed), and OOB metric violations (i.e., driving off the lane), categorizing the test as pass or fail based on this metric. Additionally, on CARLA, the log structure also includes weather condition details. To enhance our findings further, we also analyze participants' quantitative and qualitative insights both with and without VR headsets, as well as when experiencing different driving views.
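For illustration, one log record might look like the sketch below; the field names are hypothetical, inferred from the data types described in the text, not SDC-Alabaster's actual schema:

```python
import json

# One timestamped record of a hypothetical SDC-Alabaster JSON log.
log_entry = {
    "timestamp_s": 12.48,
    "position": {"x": 103.2, "y": -7.9},
    "sensors": {"fuel": 0.92, "gear": 3, "wheel_speed_kmh": 48.7},
    "oob_violation": False,  # any True record makes the test fail
}

serialized = json.dumps(log_entry, sort_keys=True)
```

A post-processing script can then scan the records of a run and mark the test as failed as soon as any record reports an OOB violation.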

Data analysis
3.6.1 RQ 1 & RQ 2: Perceived level of safety. We use various visualizations, including stacked bar plots and boxplots, to assess safety and realism perceptions. We apply statistical tests: the Wilcoxon rank-sum test, complemented by the Vargha-Delaney measure to determine the effect size. For RQ 1, we mainly analyze responses from the test cases where the participant has no interaction with the SDC; for RQ 2, we analyze the data where the participant directly interacts with the SDC via a keyboard to control the vehicle's speed. In RQ 2, we explore how SDC interactions affect the safety and realism perceptions of participants. For this, we analyze Likert-scale scores and qualitative feedback. We employ stacked bar plots to examine the data spread across the two categories in steps 8 and 9.
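The Vargha-Delaney A measure used for the effect size is straightforward to compute; the sketch below is a plain-Python illustration (the Wilcoxon rank-sum test itself would typically come from a statistics library such as SciPy):

```python
def vargha_delaney_a12(xs, ys):
    """Vargha-Delaney A measure: the probability that a value drawn from
    xs is larger than one drawn from ys, counting ties as half.

    0.5 indicates no effect; values near 0 or 1 indicate large effects.
    """
    wins = sum((x > y) + 0.5 * (x == y) for x in xs for y in ys)
    return wins / (len(xs) * len(ys))
```

For example, comparing Likert scores of passing vs. failing test cases, `vargha_delaney_a12([5, 4, 4], [2, 1, 2])` yields 1.0, since every passing score exceeds every failing one.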
3.6.2 RQ 3: Taxonomy on realism. With RQ 3, we examine the realism of SDC test cases and its correlation with human safety assessments. We identify and categorize factors affecting test case realism in a taxonomy based on the participant responses to question Q4 at steps 5 and 11.
We adopt a two-step approach for the initial taxonomy creation. Initially, two authors analyze responses grouped by simulator: one author focuses on Q4 from step 5 with the BeamNG.tech simulator, and the other on Q4 from step 11 with the CARLA simulator. Each author proposes categories via an open card sorting method [64]. In the second step, both authors collaboratively define a meta-taxonomy by discussing their proposed categories. Subsequently, this meta-taxonomy is employed to label all Q4 responses for BeamNG.tech and CARLA (steps 5 and 11). To do this, the two authors responsible for the meta-taxonomy and a third author conduct a hybrid card sorting labeling process using online spreadsheets. They individually assign each response to the meta-taxonomy categories or create new categories when necessary. A collaborative approach is employed for validation, where each of the three co-authors reviews and addresses any disagreements in assignments during an online meeting.

RESULTS
In this section, we present the survey results for RQ 1 , focusing on participants' safety perception of the test cases, and RQ 2 , examining how this perception changes when participants can interact with the SDC.For RQ 3 , we developed a taxonomy by classifying participants' comments on test case realism.

RQ 1 : Human-based assessment of safety metrics
To address RQ 1, we analyzed Likert-scale values across various data subgroups. These subgroups included comparisons between test outcomes (failures and successes based on the OOB metric) and different test case complexities (with and without obstacles). This allowed us to identify factors influencing perceived safety among participants. We present boxplots and statistical tests (appx. B.1) for each subgroup.
4.1.1 Safety perception of failing vs. passing test cases. Figure 4 illustrates the perceived safety distributions for test cases grouped by test outcome (OOB metric). We found a significant difference (Table 5) in how participants rate the safety of failing and passing test cases on a Likert scale.

Finding 1: Passing test cases (i.e., cases where the OOB metric is not violated) are perceived as safer by the participants than failing ones (where the OOB metric is violated).
Finding 1 is somewhat expected and aligns with comments from study participants (appx. C.1). These comments pertain to the BeamNG.tech simulator, excluding VR and obstacles. We selected these comments for their exclusive focus on SDC lane-keeping, providing qualitative insights into the OOB metric without obstacle influence. Notably, among comments where the SDC violates the OOB metric (test case failure), safety concerns are recurrent: "As the car did not drive all the time on the street, I felt unsafe. [...]" -(P3/B1/S1); "When the car starts to go off the road when driving in a curve, it feels pretty unsafe." -(P31/B1/S1); "Not Very Safe since the car sometimes drove a bit from the road." -(P45/B1/S1).
On passing test cases where the OOB metric is not violated, we find that the participants gave consistent comments in terms of safety: "The car was driving in lane and at a safe speed considering the road is empty." -(P16/B1/S1); "The car was following the path in a safe way and was not speeding up too much." -(P25/B1/S1).
All comments that support Finding 1 are listed in appx. C.1.

Safety perception with and without obstacles. Additionally, participants assessed test cases of varying complexity, including additional obstacles. Figure 4 displays differences in perceived safety, with statistical significance reported in appx. B.1. Concretely, failing test cases are generally seen as less safe, and those with added obstacles are perceived as even less safe. For passing test cases, in contrast, perceived safety remains largely unaffected by the higher complexity of scenarios (e.g., additional obstacles). As shown in appx. B.1, no significant statistical differences were observed in the samples, leading us to conclude:

Finding 2: There is no statistical difference in safety perception between scenarios with and without obstacles when the OOB metric is not violated. However, when the car goes out of bounds, the scenario is perceived as significantly less safe with obstacles (p = 3.52 * 10^-16).
Participants provided qualitative support for Finding 2. For those feeling unsafe with scene obstacles, representative answers include: "The car crashed toward an obstacle and even running over bumps was not so smooth as humans would do. Definitively more unsafe than the previous scenario." -(P1/B1/S2); "Ran off the road in a curve and hit obstacles without slowing down, which resulted in flat tires." -(P24/B1/S2).
Participants who felt safe or neutral when obstacles were present gave consistent comments: "It car was running smooth with obstacles, there was a moment when it was too close to one of the obstacle" -(P16/B1/S2); "The vehicle does well to avoid obstacles while maintaining the safe speed" -(P18/B1/S2); "The driver accelerated over all the obstacles and did not have a perfect finish." -(P40/B1/S2); "Car was driving well. Only at the end it went off the road, but there was no object it bumped into." -(P45/B1/S2).
All comments that support Finding 2 are reported in appx. C.1.
This is also evident from the smaller interquartile range with VR (compared to without VR).
Finding 3: The utilization of VR had a minor impact on safety perception. However, participants using VR tend to perceive scenarios as somewhat less safe, though this difference was not statistically significant (Wilcoxon rank-sum test, p = 0.16).
Certain participant comments support Finding 3. For instance, a neutral participant stated: "The prespective doesnt change much with the vr" -(P22/B2/S1). Another example is a comment from a participant who felt very unsafe: "The same as without the VR glasses. The car was not able to keep the middle of the lane and was driving badly compared to a human." -(P28/B2/S1).

Different views with different complexity.
In Figure 6, we note a decrease in test case safety perception across the various viewpoints. Statistical differences are evident in appx. B.1, supporting the following general finding:

Finding 4: Overall, participants found the test cases less safe with obstacles.
Participants' general comments during the experiment for each simulator qualitatively support Finding 4. Representative comments on BeamNG.tech driving behavior include: "It did not look at safety lines, which is very dangerous if other traffic is involved. It also ran off the road multiple times, which can easily lead to a loss of control. Also, the car rashed into easily avoidable obstacles." -(P24/B); "At least the AI seems to have an understanding of the general elements of the simulation, like the road. However, it seems to struggle with bumps in the middle of the road and also seems to drive too fast in curvy situations." -(P31/B).
In the case of CARLA, we received the following representative comments on the driving behavior with regard to the different scenario complexities: "Except at the roundabouts, the car followed traffic rules, signals, and speed limits. However, it kept crashing and losing control in the roundabouts." -(P27/C); "In most scenarios, the AI did well. From what I have seen during the simulations, it is not able to drive around roundabouts and does not stop at stop signs." -(P31/C); "very slow driving, unsmooth behavior, always too close to roundabout and abrupt stopping in front of obstacles." -(P41/C).
We observe that the perception of safety drops when increasing the complexity (i.e., adding obstacles to the scenario).This observation is coherent among both simulators, BeamNG.tech and CARLA, as reported by the participants during the experiment.

RQ 2 : Impact of human interaction on the assessments of SDCs
To assess the safety perception of test cases with human interaction, participants controlled the SDC's speed during test execution. Figure 7 shows the Likert-scale responses. We compare responses when participants can or cannot control the car and when obstacles are present.

4.2.1 Safety perception with and without interaction with the SDC. In general, interacting with the SDC enhances participants' perception of safety. From appx. B.2, we observe a statistically significant difference, leading to the following finding:

Finding 5: Safety perception of test cases is not static: when users can interact with the SDC, participants feel significantly safer (p = 0.013) compared to when they cannot.
The participants' justifications support Finding 5; e.g., controlling the SDC speed enhances safety perception, as P1 reported: "The fact I could control the car when needed gave me a safer perception of the driving experience. Moreover, I could speed up the car when I wanted to." -(P1). However, not all participants perceive interaction-based test cases as inherently safe. For instance, participant P4 comments: "With a bit of control, it feels safer, especially being able to adjust the speed in dangerous situations. However, it is still not safe since the car ends up going off-road at the end of the scenario." -(P4). While the SDC remains self-steering, it may still crash despite having speed control capability.

4.2.2 Safety perception with and without obstacles. When interactive test cases involve obstacles, participants perceive them as less safe than obstacle-free scenarios, a statistically significant difference, leading to the following finding: Finding 6: Incorporating obstacles into the simulation, where participants interact with the SDC, leads to significantly lower perceived safety in test cases (p = 0.026) compared to obstacle-free interactive scenarios. This finding is also coherent with the answers of the study participants, e.g., by P4: "It felt safer, especially since it was stopping the speed when it had another car in front. However, it still went to the footpath, making it not safe" -(P4). From the comment, we observe a safer perception through speed control. P20 also states: "it could have stopped before hitting the camion" -(P20).
However, as the study participants cannot control the SDC's steering, some accidents remain unavoidable, as reported by P19: "Hit the bike driver" (P19). P40 gives a clearer comment: "Two matters: 1) driver keeps its distance to the can in the front, but with sharp breaks instead of slowing down the car. 2) unable to avoid strange behaviors and drove next to a car with unstable drive and had an accident" (P40). The participant can maintain distance by adjusting speed, but accidents can still occur during lane changes.
In non-interactive test cases, obstacles also induce insecurity among participants. However, participants feel even less safe with obstacles when they can interact with the SDC. This leads to the following finding: Finding 7: In the simulation, obstacles in non-interactive SDC test cases reduce safety perception (p = 0.013). Yet, the ability to interact with the car raises more discomfort (making participants feel less safe) when obstacles are present.
Besides the statistical tests, we also note participant comments supporting Finding 7. Some express discomfort in obstacle scenarios without the ability to control the car, as evident in the following example: "The car was breaking and accelerating a lot while being behind the other car, and also the other car was not behaving safely on the road, ending the simulation with an accident between the two, so it felt quite unsafe overall." (P25). Some participants also experienced the worst-case scenario without control, as reported by P28: "It drove extremely close up to the ambulance car and finally crashed into it. therefore, the worst case happens." (P28).
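The significance levels in Findings 5-7 come from the statistical tests detailed in Appendix B.2; the exact test is not restated here. As an illustrative, hedged sketch, a rank-based comparison of two groups of ordinal Likert scores (a Mann-Whitney U test with a normal approximation and no tie correction, which a full analysis of heavily tied Likert data should add) could be implemented as:

```python
import math

def mann_whitney_u(a, b):
    """U statistic and two-sided normal-approximation p-value for two
    independent samples (e.g. Likert scores of two participant groups)."""
    pooled = sorted((v, i) for i, v in enumerate(a + b))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):                       # assign average ranks to ties
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1               # ranks are 1-based
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = avg_rank
        i = j + 1
    n1, n2 = len(a), len(b)
    r1 = sum(ranks[:n1])                         # rank sum of the first group
    u1 = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))         # two-sided p-value
    return u1, p
```

A rank-based test is a natural choice here because Likert responses are ordinal, so mean-based tests like the t-test are harder to justify.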

RQ3: Taxonomy on realism
Realism is a crucial aspect to consider when evaluating test case safety. We created a taxonomy to gauge the realism perceived by the study participants. Two coders used open card sorting on 50 comments each to establish categories, which were later reviewed by a third coder. Table 2 presents the seven resulting categories with their descriptions.
Next, two coders independently classified 100 comments using the designed taxonomy. Disagreements were resolved by a third coder. Table 2 and Figure 8 show the classification of comments related to question Q4 in steps 5 and 11. We categorized comments as positive (increasing realism) and negative (decreasing realism) in the taxonomy. We observe that most classifications fall under World Objects, totaling 46, with 32 positives and 14 negatives.
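Independent classification by two coders is typically accompanied by an inter-rater agreement check before disagreements are handed to a third coder. The paper does not report which agreement statistic, if any, was computed, so the following Cohen's kappa sketch is purely illustrative:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders assigning one
    taxonomy category per comment."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of comments with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each coder's label distribution.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[c] * cb[c] for c in ca.keys() | cb.keys()) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```

Kappa of 1.0 means perfect agreement, 0 means agreement no better than chance; values above roughly 0.6 are conventionally read as substantial.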

Others
This category relates to participants' comments that do not fit into the above categories.

Finding 8: Several factors (e.g., the surroundings, car design, and object scale) impact the participants' perceived realism. The World Objects category dominates with 32 positive (e.g., car design) and 14 negative (e.g., traffic objects) aspects affecting realism perception.
Examples of positive comments with the BeamNG.tech simulator: "The realism is quite good, especially in the car design. The car structure was damaged after crashing; the wheels were getting broken, and there was smoke coming out. The inside view of the car was also pretty real, with the driver's hand moving the steering wheel and all the car panel commands. [...]" -(B/P4); "They respect the scale from the objects." -(B/P22). Examples of positive comments for the CARLA simulator: "The surroundings have more detail, which made it feel more realistic." -(C/P31); "The environment (lighting, obstacles) feels quite real." -(C/P17). An example of a negative comment: "The grass, the horizon as well, and the red vertical lines do not look very realistic." -(B/P3). Besides Finding 8, we noted that the Immersion category generally received positive comments about perceived realism.
Finding 9: The Immersion category primarily comprises comments on factors that affect realism (e.g., view, perspective). It includes 16 positive (e.g., the realism of the driver's seat) and 2 negative (e.g., low realism outside the vehicle) comments influencing participants' perceived realism. This finding is reasonable since a driver sits in the driver's seat, unlike the perspective in a video game. The following quotes support this: "The driver seat simulator felt very realistic." -(B/P14); "It was different when I sat in the car than from outside, so it felt more real. But still looked like a game, so not that realistic." -(B/P21). In summary, comments on Immersion were positive, indicating that the driver seat viewpoint and VR usage enhanced perceived realism.

DISCUSSION
We first discuss safety considerations for simulation-based tests, covering RQ1 and the interactive test cases of RQ2. Then, we delve into realism by discussing the taxonomy of influencing factors.

RQ1 & RQ2: Human-based safety assessment of simulation-based test cases
The study participants perceived passing test cases (OOB metric not violated) as safer than failing ones (Finding 1), aligning with the OOB metric-based test oracle. This observation is supported by [37], where participants' assessment of driving quality correlates with metrics related to the SDC's lateral position. The OOB metric generally reflects test case safety. However, the extent to which the safety perception varies depending on certain simulation factors (e.g., obstacle inclusion) remains unclear. Hence, we conducted experiments with test cases featuring additional obstacles.
As reported in Finding 2, adding obstacles to a passing test case does not significantly affect safety perception. However, participants perceive failing test cases as less safe with additional obstacles. Therefore, human safety perception does not proportionally align with the OOB metric: the metric does not distinguish whether a test case contains additional obstacles, whereas humans do and perceive such test cases as less safe.
We experimented with different immersion levels (i.e., various viewpoints), and as reported in Finding 3, participants using VR headsets perceived test cases as slightly less safe. Although this perception change is minimal, when using humans as oracles, outcomes vary based on the immersion levels in virtual environments. Hence, similar human-based studies on simulation-based test cases for SDCs [37] may exhibit a slight bias if immersion is not considered. When grouping safety perceptions of test cases by their assessed viewpoints, cases with obstacles were generally perceived as less safe than those without obstacles (Finding 4). Thus, using the OOB metric as an oracle may not always accurately represent safety perceptions from a human perspective. This observation aligns with the example illustrated by Figure 2a and Figure 2b.
As shown in Finding 5, participants perceived test cases as safer when they could control the vehicle's speed (i.e., they express a higher trust level in the SDC behavior), which means that the safety perception of simulation-based test cases depends on the user interaction levels. Having control over the vehicle impacts safety perception, which may not align with the OOB metric. In test cases involving participant interaction, safety perception generally decreases when obstacles are present, as indicated by Finding 6. This aligns with the findings for non-interactive test cases, as highlighted in Finding 7.

RQ3: Taxonomy on test cases' realism
As shown in Finding 8, most participants' comments on Question Q4 fall under the World Objects category. As discussed in Section 1, we conjecture that assessing test case safety should also consider realism. The importance of World Objects with respect to realism confirms that pure lane-keeping (the focus of the OOB metric) is not enough for a realistic safety assessment. Given that most comments related to test case realism are categorized as World Objects, this category becomes essential to prioritize when evaluating test case safety. The Immersion category predominantly features comments expressing a positive or heightened sense of realism, as revealed in Finding 9. Participants' immersion, particularly their viewpoint, influences perceived realism. Notably, the driver seat perspective yields a higher realism perception, as evident in the comments on Finding 9, consequently impacting safety perception. The importance of immersion with respect to realism confirms that a static 2D assessment (again, the focus of the OOB metric) is not enough for a realistic safety assessment.
Taking a closer look at the participants' demographics and how they assess the level of realism, we observed that participants aged between 18 and 30 years tend to rate the test cases as 17% more realistic (on the Likert scale) than the older participants. Another insight is that we do not observe a different assessment of realism across genders. Hence, there are confounding factors, such as the participants' age, that influence the perception of realism. This suggests that reality-gap characteristics are not deterministic measures, as they depend on human perception, which may vary, as in the case of the participants' age.

Implications & Lessons learned
The oracle definition for SDCs is manifold, as safety has several characterizing aspects. The OOB metric may not always reflect human safety perception in test cases due to various unaccounted factors. To enhance simulation-based testing, SDC testers and practitioners should consider devising alternative metrics that better align with human safety perception. Interacting with the car boosts perceived safety, potentially due to distrust in the AI driving the SDC. Future research should explore this further, ruling out other influencing factors. If low trust in AI is the main issue, this suggests shaping the direction of autonomous driving research toward increasing the trustworthiness of SDCs, which represents an important limiting factor to SDC real-world adoption.
As motivated in Section 1, realism significantly influences the safety perception of SDCs, as reflected in participants' comments on Q4. For this reason, we have created a taxonomy of factors that affect realism in simulation-based SDC testing, to guide future research in the field. The taxonomy provides an overview of the factors impacting the realism of SDC simulation-based testing. We argue that our taxonomy is instrumental in supporting future research on the perceived reality gap, which is critical to bridge the gap between the simulation-based outcome of a test case and what eventually happens in the real world. Furthermore, we believe the taxonomy provides a basis for investigating similar limitations in other CPS application domains that leverage simulation environments and aim to improve the human perception of the realism and safety of CPSs.

Threats to internal validity
The study participants rated safety and realism based on their immersion into the scenario. To limit the risk of biased assessments, we employed modern VR technology (HTC Vive Pro 2) to enhance immersion. The simulators, BeamNG.tech and CARLA, utilize distinct predefined maps. BeamNG.tech employs a flat map from the SBST tool competition [52], while CARLA uses built-in urban-like maps, which impose some constraints on road definition. These differing maps may lead to varying perceptions of test case safety and realism due to their distinct natures. We plan to investigate this in future work.
The different personal interactions with the study participants might influence the participants' focus during the experiments. To limit this risk, we used a protocol sheet during the experiments to ensure that all steps were performed equally for every participant.

Threats to external validity
We recruited study participants primarily from an academic computer science background, which may not represent the general population. To address this potential bias, we ensured diversity in terms of age, gender, and driving experience, reducing the influence of factors beyond professional background. Another concern is the focus on the OOB metric, which may introduce bias, as there are various metrics for evaluating SDCs in simulation environments. We chose OOB due to its widespread use among researchers and practitioners, as documented in recent studies [12,24,27,37,52]. Our study's use of only two simulators, BeamNG.tech and CARLA, restricts the generalizability of our findings to these specific platforms. However, we selected them because they are widely adopted in academia and industry, ensuring the reproducibility of our results compared to less-maintained options such as Udacity and SVL [57].

RELATED WORK
In this section, we elaborate on related work on testing in virtual environments and assessing the quality of oracles in the context of CPSs. We group recent and ongoing research into topics relevant to our investigation: (i) simulation-based testing, (ii) testing metrics and the oracle problem, and (iii) VR in software engineering.

Simulation-based testing
The automated testing of Cyber-Physical Systems (CPSs) remains an ongoing research challenge [63]. In this context, simulation-based testing emerges as a promising approach to enhance testing practices for safety-critical systems such as SDCs [11,12,49,55] and to support test automation [5,6,72,73,76]. Past research on testing CPSs in simulation environments focused on monitoring CPSs and predicting unsafe states [63,67] of the systems using simulation environments [67,77], as well as generating scenarios programmatically [53] or based on real-world observations [23,66]. Recent research also proposed cost-effective regression testing techniques, including test selection [11], prioritization [8,12], and minimization techniques to expose CPS faults or bugs earlier in the development and testing process. This research effort fundamentally contributed toward more robust and reliable simulation-based testing practices. However, it remains challenging to replicate the same bugs observed in physical tests within simulations [4,73] and to generate representative simulated test cases that uncover realistic bugs [5]. Hence, previous research in the field was conducted on the premise that simulation environments represent, with sufficiently high fidelity, safety-critical aspects of the real world according to human judgment. In our paper, we hypothesize that the current simulation-based testing of SDCs (and CPSs in general) does not always align with the human perception of safety and realism, which heavily impacts the effectiveness of simulation-based testing in general. To that end, in our research, we investigated when and why the safety metrics of simulation-based test cases of SDCs match human perception.

Testing metrics & the Oracle Problem
Automatically inferring the expected test outcome for a given input remains an unsolved challenge, known as the oracle problem. Many research papers propose techniques to address this problem in the context of traditional software systems, such as generating oracles [7] or improving already existing test oracles [36,69-71]. In either case, previous research does not provide an approach that produces fully optimal and effective oracles. However, while the oracle problem remains an open challenge that requires humans to define the oracle, for the sake of test automation, several code coverage and mutation score metrics have been proposed for quantitatively assessing the quality of traditional software systems.
Software engineering for CPSs is increasingly explored, with recent efforts mainly focused on bug characterization [25], testing [2,20,81], and verification [17] of self-adaptive CPSs. Another emerging area of research relates to the automated generation of oracles for testing and localizing faults in CPSs based on simulation technologies. For instance, Menghi et al. [44] proposed SOCRaTes, an approach to automatically generate online test oracles able to handle CPS Simulink models featuring continuous behaviors and involving uncertainties. The oracles are generated from requirements specified in a signal logic-based language. In this context, for the sake of test automation, just like traditional software testing, simulation-based testing of SDCs relies on an oracle that determines whether the observed behavior of a system under test is safe or unsafe. To that end, current research on automated safety assessment focuses primarily on a limited set of temporal and non-temporal safety metrics for SDCs [11,24,52,68]. In particular, the out-of-bound (OOB) non-temporal metric is widely adopted for assessing SDCs in simulation-based testing [24,49,52] to determine whether a test case fails or passes. However, it is still unclear whether this metric serves as a meaningful oracle for assessing the safety behavior of SDCs in simulation-based testing in general.
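To make the OOB criterion concrete, the following is a deliberately simplified 1-D sketch of such an oracle; the lane half-width, car half-width, and 50% tolerance are illustrative values of our choosing, and actual testing pipelines typically operate on the car's 2-D footprint polygon rather than a single lateral offset:

```python
def oob_fraction(lateral_offset, lane_half_width=2.0, car_half_width=0.9):
    """Fraction of the car's width outside the lane, given the signed
    lateral offset of the car's centre from the lane centre (metres)."""
    outer_edge = abs(lateral_offset) + car_half_width
    overhang = max(0.0, outer_edge - lane_half_width)
    return min(1.0, overhang / (2 * car_half_width))

def passes_oob_oracle(lateral_offsets, tolerance=0.5):
    """A test case passes when the OOB fraction never exceeds the
    tolerance at any recorded simulation step."""
    return all(oob_fraction(d) <= tolerance for d in lateral_offsets)
```

Note how binary this verdict is: a trajectory that grazes the tolerance once fails just like one that leaves the road entirely, which is precisely the kind of coarseness the human-perception comparison in this paper probes.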
This study is built on our hypothesis that current simulation-based testing of SDCs does not always align with the human perception of safety and realism, and for this reason, we focus on understanding and characterizing this mismatch in our research.Close to our work, a recent study [37] conducted a human-based study and observed that correlations between the computed quality metrics and the perceived quality by humans are meaningful for assessing the test quality for SDCs.However, such previous work did not investigate the factors that define the test quality and realism of the simulation environments from a human point of view with the use of virtual reality [62] as done in our work.
A critical concern regarding the oracle problem in simulation-based testing is the Reality Gap [5,48,56,73]. Due to the different properties of simulated and real contexts, the former may not faithfully mirror the latter. Simulations are necessarily simplified for computational feasibility, yet reflect real-world phenomena at a given level of veracity, the extent of which results from a trade-off between accuracy and computational time [18]. Robotics simulations rely on the replication of phenomena that are difficult to replicate accurately, e.g., simulating actuators (i.e., torque characteristics, gear backlash), sensors (i.e., noise, latency), and rendered images (i.e., reflections, refraction, textures). This gap between reality and simulation is commonly referred to as the reality gap [18]. A closely related problem concerns reproducing and exposing realistic bugs in simulation environments [5,73]. It is indeed challenging to capture the same bugs as physical tests [4,73] and to generate effective test cases that can expose real-world bugs in simulation [5]. While recent studies provide solutions for addressing the reality gap (e.g., leveraging domain randomization techniques or using data from real-world observations) [16,18,39,41,59,79] in the development phase of CPSs, no prior study has investigated and/or characterized the perception of realism of SDC test cases from human participants. This study focuses on addressing this specific open question in the context of RQ3.

Immersion Technology in Software engineering
Using VR for software engineering was also considered in [31,43], but with a different focus: the authors used VR to gain design knowledge from legacy systems through different visualization approaches based on immersion technologies. Furthermore, most papers [42,60,61] refer to the potential use of VR and AR for the workspace of software development teams. In general, the use of VR and AR in software engineering is not yet well studied, and the only papers available are mainly vision papers for future research [45]. In our work, however, we present a practical application of VR for assessing test oracles with a human-in-the-loop approach.

CONCLUSION
Testing self-driving car (SDC) software, like traditional software, relies on safety and quality oracles. However, depending solely on metrics such as the OOB for simulation-based SDC testing can be limited in terms of reliability and perceived realism from a human standpoint. In this study, we explored when and why safety metrics align with human perception in SDC testing. We conducted an empirical study with 50 participants from diverse backgrounds, evaluating their perception of test case safety and realism. We observed that the safety perception of the SDC significantly decreases as test case complexity rises. Interestingly, safety perception improves when participants can control the SDC's speed, indicating that the OOB metric is not sufficient to capture more subjective human factors. Additionally, realism perception varies with the complexity of scenarios (i.e., object additions) and different participant viewpoints. These findings emphasize the need for more meaningful safety metrics that align with the human perception of safety and realism to bridge the current reality-gap problem in simulation-based testing. Future work should also consider other safety metrics, as suggested by recent studies [68], to enhance SDC software testing in simulation environments and improve safety and realism.

DATA AVAILABILITY
A replication package with data, code, and appendices is publicly available on Zenodo [13].

Vol. 1, No. 1, Article. Publication date: January 2023.

Fig. 2. Examples of unsafe tests with valid OOB criteria

3.1 RQ1: Human-based assessment of safety. Our first research question is: RQ1: To what extent does the OOB safety metric for simulation-based test cases of SDCs align with human safety assessment?

Fig. 4. Perceived safety of failing and passing tests grouped by scenario's complexity

Fig. 7. Safety perception with and without interaction with the SDC (grouped by complexity)

B STATISTICAL RESULTS ON SAFETY PERCEPTION
B.1 RQ1

RQ3: What are the main reality-gap characteristics perceived by humans in SDC test cases?

Table 1. Survey questions with Likert-scale (LS), Open answer (OA), and Single-choice (SC) types

(...Loop simulAtion-BASed Testing sElf-driving caRs). Specifically, we implement an interface to run test cases with the CARLA simulator for steps 6-10. As for BeamNG.tech, with SDC-Alabaster we can also add obstacles to the test cases in CARLA to achieve similar complexity levels for the experiments. Additionally, with SDC-Alabaster, and for steps 9-10, the participants could control the SDC speed with the keyboard.

Table 2. Taxonomy description including # of positive and negative comments on the perception of realism

Table 7. Color-coded comments by safety perception on BeamNG.tech (VR, without obstacles) of participants supporting Finding 1

Table 8. Color-coded comments by safety perception on BeamNG.tech (no VR, with obstacles) of participants supporting Finding 2

Table 14. Comments on realism of the test cases in CARLA