Thread 'High disk reads'

Author	Message
Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 962 Credit: 786,528,022 RAC: 107,080	Message 50785 - Posted: 14 Oct 2024, 18:11:13 UTC Last modified: 14 Oct 2024, 18:11:42 UTC Has anyone else seen this? Good thing my SSD is fast: some tasks are pushing on towards 100 TB of data read. (Task Manager screenshot) ID: 50785 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 962 Credit: 786,528,022 RAC: 107,080	Message 50786 - Posted: 14 Oct 2024, 18:16:32 UTC Same as this I'm just more patient :D https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6229 ID: 50786 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2762 Credit: 307,237,868 RAC: 133,435	Message 50787 - Posted: 14 Oct 2024, 19:57:03 UTC - in response to Message 50785. Looks like the affected VMs are busy swapping. What happens if you configure them to use 1 GB RAM (or even 2 GB)? Besides reducing disk activity, this measure should increase average CPU usage (currently around 33 %). ID: 50787 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 962 Credit: 786,528,022 RAC: 107,080	Message 50788 - Posted: 14 Oct 2024, 20:18:07 UTC - in response to Message 50787. From the earlier discussion, is it just this batch of WUs that could do with more memory? I don't see a lot of writes, only reads, so it doesn't look like swapping? Anyhow, I turned it up to 1 GB. As you can see, I have plenty of memory for the ATLAS task, so no problem for me. ID: 50788 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2762 Credit: 307,237,868 RAC: 133,435	Message 50791 - Posted: 15 Oct 2024, 7:13:37 UTC - in response to Message 50788. One of my computers has been running a Herwig7 (native) task for more than 7.5 days. That task has 3 main processes, Herwig, rivetvm.exe and runRivet.sh, which use a total of >900MB physical RAM. A standard Theory VM is configured with 630 MB "physical RAM", which means the same task would be forced to swap out large amounts of data. Now just a guess: Imagine the scientific processes use a large data array (or a large DB) and traverse it for each event they process. Then they permanently drop parts of the array from RAM to read the next parts from disk. This could explain the huge read activity as well as the low CPU usage, since the CPU has to wait for that data. Besides that, the VM needs RAM for the OS and the CVMFS cache. The CVMFS cache can be several GB, and each object that is not already in the page cache has to be read from disk. Might be a good idea to increase the default RAM setting for Theory VMs on the project server. ID: 50791 · Reply Quote
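The guess in this post (a data set that is traversed once per event but does not fit in RAM) can be illustrated with a toy simulation. This is only a sketch under loud assumptions: one "page" stands for 1 MB, the cache is a strict LRU (real kernels use more elaborate policies), and the cyclic access pattern is hypothetical, not something measured on a Theory VM. The sizes 900, 630 and 1024 merely echo the figures mentioned in the thread.

```python
from collections import OrderedDict


def count_disk_reads(num_pages, cache_pages, passes):
    """Count cache misses ("disk reads") for repeated in-order scans.

    num_pages   -- size of the data set, in pages
    cache_pages -- how many pages fit in RAM (strict LRU cache)
    passes      -- how many times the whole data set is traversed
    """
    cache = OrderedDict()  # keys are page numbers, oldest entry first
    reads = 0
    for _ in range(passes):
        for page in range(num_pages):
            if page in cache:
                # Hit: mark the page as most recently used.
                cache.move_to_end(page)
            else:
                # Miss: the page has to be read from disk.
                reads += 1
                cache[page] = True
                if len(cache) > cache_pages:
                    # Evict the least recently used page.
                    cache.popitem(last=False)
    return reads


# Toy numbers: 900 "MB" of data, 100 traversals (hypothetical event count).
too_small = count_disk_reads(num_pages=900, cache_pages=630, passes=100)
big_enough = count_disk_reads(num_pages=900, cache_pages=1024, passes=100)
print(f"630-page cache:  {too_small} reads")
print(f"1024-page cache: {big_enough} reads")
```

In this model the undersized cache misses on every single access, because each page is evicted before the scan comes back to it, while the larger cache only pays for the first pass. That would be consistent with heavy reads and few writes, but it does not prove that the real workload behaves this way.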

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1561 Credit: 10,131,233 RAC: 1,773	Message 50794 - Posted: 15 Oct 2024, 8:14:31 UTC - in response to Message 50785. @Toby: Did you notice whether this high disk-read throughput occurred during the integration phase or the event-processing phase? I suppose the latter. For me it was during the event-processing phase, and computezrmle confirmed that he saw three processes that are only used during event processing. @computezrmle: At the moment I'm running a Herwig7.2.1 on a slow laptop (6th gen), but after 4 days it is still in the integration phase (integrate 460 of 760). I gave this VM 1024MB of RAM and 2 threads. After the integration, however, only 4000 events have to be processed. Atm, from the Swap line: 965116 free, 83456 used and 196224 avail Mem. From the Mem line: 72576 free, 559484 used and 370996 cache. ID: 50794 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2762 Credit: 307,237,868 RAC: 133,435	Message 50795 - Posted: 15 Oct 2024, 8:41:16 UTC - in response to Message 50794. [quote]I gave this VM 1024MB of RAM and 2 threads.[/quote] I'd leave the default of 1 thread for Theory VMs for the following reasons:
- BOINC (LHC@home) configures Theory vbox as singlecore and gets confused if you change this
- On saturated computers VirtualBox needs significantly more internal CPU cycles to run multicore VMs compared to singlecore
- There's only 1 process using nearly all CPU%/TIME (here: Herwig)
[pre]
    PID USER    PR  NI    VIRT    RES   SHR S  %CPU  %MEM    TIME+ COMMAND
 617962 boinc3  39  19 1202136 858012 49436 R 97.54 0.651    90,00 Herwig
  21468 boinc3  39  19  477256  32928  3864 S 1.536 0.025 39:25.23 rivetvm.exe
   7028 boinc3  39  19   32464  13032  1972 S 0.230 0.010 15:53.74 runRivet.sh
[/pre]
ID: 50795 · Reply Quote
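As a rough cross-check of the ">900MB" figure quoted earlier in the thread, the RES column of the listing in this post can be summed. This is a minimal sketch that assumes the listing is standard top output with RES given in KiB (the usual default, but not stated in the post). The helper name total_res_kib is made up for illustration.

```python
# Process listing copied from the post above (whitespace-separated columns).
TOP_OUTPUT = """\
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
617962 boinc3 39 19 1202136 858012 49436 R 97.54 0.651 90,00 Herwig
21468 boinc3 39 19 477256 32928 3864 S 1.536 0.025 39:25.23 rivetvm.exe
7028 boinc3 39 19 32464 13032 1972 S 0.230 0.010 15:53.74 runRivet.sh
"""

DEFAULT_THEORY_VM_MB = 630  # default VM RAM quoted in the thread


def total_res_kib(top_text):
    """Sum the RES column (assumed to be KiB) over all listed processes."""
    lines = top_text.strip().splitlines()
    res_col = lines[0].split().index("RES")  # locate the column by header
    return sum(int(line.split()[res_col]) for line in lines[1:])


total_kib = total_res_kib(TOP_OUTPUT)
total_mib = total_kib / 1024
print(f"Resident memory of the three processes: {total_kib} KiB")
print(f"That is about {total_mib:.0f} MiB, versus a {DEFAULT_THEORY_VM_MB} MB default VM")
```

Under that assumption, the three processes alone need noticeably more resident memory than the default VM provides, before counting the guest OS and the CVMFS cache.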

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 962 Credit: 786,528,022 RAC: 107,080	Message 50809 - Posted: 15 Oct 2024, 19:10:31 UTC - in response to Message 50791. Makes sense. I imagine there is a large write at the beginning to dump the DB to the page file, then many reads later. Since the other projects use more memory, it would make sense for the Theory tasks to get something similar, or for the Theory tasks to be reserved for lighter work to support the users with less powerful computers. A 3rd option would be to split the work into 2 types that users can opt into. ID: 50809 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 962 Credit: 786,528,022 RAC: 107,080	Message 50811 - Posted: 15 Oct 2024, 19:15:41 UTC - in response to Message 50794. Last modified: 15 Oct 2024, 19:18:12 UTC Looks like it's processing? ID: 50811 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2762 Credit: 307,237,868 RAC: 133,435	Message 50812 - Posted: 15 Oct 2024, 19:19:59 UTC - in response to Message 50811. Would be interesting if a higher RAM value for the VM leads to lower disk read activity combined with a higher CPU usage (in Vbox manager). ID: 50812 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 962 Credit: 786,528,022 RAC: 107,080	Message 50813 - Posted: 15 Oct 2024, 19:29:17 UTC - in response to Message 50812. BOINC decided to run ATLAS now, so there weren't any new Theory tasks running with 1 GB; they will come up at some point. A little sad with these deadlines. ID: 50813 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 1,758	Message 50814 - Posted: 15 Oct 2024, 19:35:30 UTC - in response to Message 50813. Last modified: 15 Oct 2024, 19:36:47 UTC Herwig only in mcplots atm: https://mcplots-dev.cern.ch/production.php?view=runs&rev=2794&display=succ A few successful, but most not successful! ID: 50814 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1561 Credit: 10,131,233 RAC: 1,773	Message 50817 - Posted: 16 Oct 2024, 5:36:35 UTC - in response to Message 50814. [quote]Herwig only in mcplots atm: https://mcplots-dev.cern.ch/production.php?view=runs&rev=2794&display=succ A few successful, but most not successful![/quote] You can't count 'unknown' as not successful. How I read the figures: 19423 attempts, 1684 success, 267 failure, 17472 unknown. ID: 50817 · Reply Quote

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1323 Credit: 100,683,619 RAC: 142,769	Message 50818 - Posted: 16 Oct 2024, 8:14:52 UTC Herwig7 always used to run with no problems, and I probably have some of those saved in my records, since I tend to save a few of each version of event generator (I have always watched them start so I could tell which one it was). But lately they want to run for 10 days, and they are actually running the entire time when I check the running log. ID: 50818 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1996 Credit: 164,496,269 RAC: 98,854	Message 50824 - Posted: 16 Oct 2024, 16:15:49 UTC - in response to Message 50817. [quote][quote]Herwig only in mcplots atm: https://mcplots-dev.cern.ch/production.php?view=runs&rev=2794&display=succ A few successful, but most not successful![/quote] You can't count 'unknown' as not successful. How I read the figures: 19423 attempts, 1684 success, 267 failure, 17472 unknown.[/quote] What exactly does "unknown" mean? The high number of "unknown" irritates me. Are they useful for the science, or are they not? ID: 50824 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1561 Credit: 10,131,233 RAC: 1,773	Message 50825 - Posted: 16 Oct 2024, 16:32:15 UTC - in response to Message 50824. [quote][quote][quote]Herwig only in mcplots atm: https://mcplots-dev.cern.ch/production.php?view=runs&rev=2794&display=succ A few successful, but most not successful![/quote] You can't count 'unknown' as not successful. How I read the figures: 19423 attempts, 1684 success, 267 failure, 17472 unknown.[/quote] What exactly does "unknown" mean? The high number of "unknown" irritates me. Are they useful for the science, or are they not?[/quote] Unknown means: jobs still in the pipeline (on the server, in the (re)send queue, being processed by a client, or elsewhere). Latest figures: 19423 attempts, 1757 success, 271 failure, 17395 unknown. ID: 50825 · Reply Quote
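To put the two sets of figures from this thread in proportion, the short sketch below computes how many attempts have finished and what share of the finished ones succeeded. The function name summarize is made up for illustration; the input numbers are the ones quoted in the posts, and the reading of "unknown" as "not finished yet" follows the explanation given above.

```python
def summarize(attempts, success, failure, unknown):
    """Return the finished share and the success rate among finished jobs."""
    # Sanity check: the three categories must add up to the attempts.
    if success + failure + unknown != attempts:
        raise ValueError("categories do not add up to the number of attempts")
    finished = success + failure
    return {
        "finished_share": finished / attempts,
        "success_rate_of_finished": success / finished,
    }


first = summarize(attempts=19423, success=1684, failure=267, unknown=17472)
latest = summarize(attempts=19423, success=1757, failure=271, unknown=17395)

for label, stats in (("first", first), ("latest", latest)):
    print(
        f"{label}: {stats['finished_share']:.1%} finished, "
        f"{stats['success_rate_of_finished']:.1%} of finished jobs succeeded"
    )
```

Both snapshots add up exactly to 19423 attempts, and in both only about a tenth of the jobs have reported back, so the large "unknown" count mostly reflects work that is still pending.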

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 962 Credit: 786,528,022 RAC: 107,080	Message 50829 - Posted: 16 Oct 2024, 18:23:20 UTC 1GB looks good. ID: 50829 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1996 Credit: 164,496,269 RAC: 98,854	Message 50834 - Posted: 17 Oct 2024, 13:03:10 UTC - in response to Message 50829. [quote]1GB looks good.[/quote] Is this app_config setting the correct way to increase the memory to 1GB: <cmdline>--memory_size_mb 1024</cmdline> ID: 50834 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1561 Credit: 10,131,233 RAC: 1,773	Message 50835 - Posted: 17 Oct 2024, 13:10:52 UTC - in response to Message 50834. Last modified: 17 Oct 2024, 13:11:13 UTC [quote][quote]1GB looks good.[/quote] Is this app_config setting the correct way to increase the memory to 1GB: <cmdline>--memory_size_mb 1024</cmdline>[/quote] An example of a whole app_config.xml for Windows:
<app_config>
  <project_max_concurrent>3</project_max_concurrent>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>CMS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>Theory</name>
    <max_concurrent>2</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>3</avg_ncpus>
    <cmdline>--memory_size_mb 4096 --nthreads 3</cmdline>
  </app_version>
  <app_version>
    <app_name>CMS</app_name>
    <plan_class>vbox64_mt_mcore_cms</plan_class>
    <avg_ncpus>3</avg_ncpus>
    <cmdline>--memory_size_mb 2048 --nthreads 4</cmdline>
  </app_version>
  <app_version>
    <app_name>Theory</app_name>
    <plan_class>vbox64_theory</plan_class>
    <avg_ncpus>1</avg_ncpus>
    <cmdline>--memory_size_mb 1024 --nthreads 1</cmdline>
  </app_version>
</app_config>
ID: 50835 · Reply Quote
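Before dropping a file like this into the project directory, it can help to check that it is well-formed XML and to see at a glance which memory and thread values each app would get. The sketch below does that with Python's standard library. It is not a BOINC tool, the helper names vm_settings and thread_mismatches are made up for illustration, and only the app_version entries from the post are copied in. It also assumes every cmdline consists of "--option value" pairs, as in the example.

```python
import xml.etree.ElementTree as ET

# app_version entries copied from the example app_config.xml in the post.
APP_CONFIG = """
<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>3</avg_ncpus>
    <cmdline>--memory_size_mb 4096 --nthreads 3</cmdline>
  </app_version>
  <app_version>
    <app_name>CMS</app_name>
    <plan_class>vbox64_mt_mcore_cms</plan_class>
    <avg_ncpus>3</avg_ncpus>
    <cmdline>--memory_size_mb 2048 --nthreads 4</cmdline>
  </app_version>
  <app_version>
    <app_name>Theory</app_name>
    <plan_class>vbox64_theory</plan_class>
    <avg_ncpus>1</avg_ncpus>
    <cmdline>--memory_size_mb 1024 --nthreads 1</cmdline>
  </app_version>
</app_config>
"""


def vm_settings(xml_text):
    """Return {app_name: settings} for each <app_version> in the config."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    settings = {}
    for version in root.findall("app_version"):
        tokens = (version.findtext("cmdline") or "").split()
        # Pair up "--option value" tokens into a dictionary.
        options = dict(zip(tokens[0::2], tokens[1::2]))
        settings[version.findtext("app_name")] = {
            "avg_ncpus": float(version.findtext("avg_ncpus")),
            "memory_size_mb": int(options["--memory_size_mb"]),
            "nthreads": int(options["--nthreads"]),
        }
    return settings


def thread_mismatches(settings):
    """List apps whose avg_ncpus differs from the --nthreads value."""
    return [
        name for name, s in settings.items()
        if s["avg_ncpus"] != s["nthreads"]
    ]


settings = vm_settings(APP_CONFIG)
for name, s in settings.items():
    print(f"{name}: {s['memory_size_mb']} MB, {s['nthreads']} thread(s)")
print("avg_ncpus/nthreads differ for:", thread_mismatches(settings))
```

On the example, the check reports that the CMS entry uses avg_ncpus 3 with --nthreads 4. Whether that difference is intentional is not stated in the thread, so the sketch only points it out.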

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 962 Credit: 786,528,022 RAC: 107,080	Message 50838 - Posted: 18 Oct 2024, 19:50:48 UTC After some time it's still good; there are fewer reads than writes overall. About 45 MB of page file. ID: 50838 · Reply Quote