Message boards :
ATLAS application :
ATLAS native version 2.81
Author | Message |
---|---|
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
With 2.80 there were problems on many hosts running CentOS 7 without Singularity. Version 2.81 now requires Singularity for all hosts. Please let us know if there are any other problems. As a reminder, this new version no longer requires Python. |
Joined: 15 Jun 08 Posts: 2628 Credit: 267,287,067 RAC: 129,051 |
"... any other problems." ATLAS now writes a garbled and incomplete stderr.txt. Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=260611424 |
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
Strange... it looks like the output is truncated. Does it happen with all your tasks? This one produced a valid HITS file, so the task did finish properly. Maybe you can run "tail -f" on a running task's stderr.txt to see whether the full stderr is produced, or whether it is the upload at the end of the task that truncates it. |
Joined: 15 Jun 08 Posts: 2628 Credit: 267,287,067 RAC: 129,051 |
Logs from v2.73 (taken from different hosts) look like this:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260556228
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260504032
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260565904
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260565904
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260566471
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260546480

Same hosts, but now the logs from v2.81:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260644836
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260676781
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260582954
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260566230
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260601877
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260602160

It looks like this also happens on other users' hosts (including David's and one from Agile Boincers):
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260806947
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260803398
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260817739

Remarkable: the call to "boinc_finish()" is no longer logged at the end, although it should be the very last function to be called.

I tailed one of my tasks:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=260805920

01:35:56 (35863): wrapper (7.7.26015): starting
01:35:56 (35863): wrapper: running run_atlas (--nthreads 1)
Mi 29. Jan 01:35:56 CET 2020: Arguments: --nthreads 1
Mi 29. Jan 01:35:56 CET 2020: Threads: 1
Mi 29. Jan 01:35:56 CET 2020: Checking for CVMFS
Mi 29. Jan 01:35:57 CET 2020: Probing /cvmfs/atlas.cern.ch... OK
Mi 29. Jan 01:35:57 CET 2020: Probing /cvmfs/atlas-condb.cern.ch... OK
Mi 29. Jan 01:35:58 CET 2020: Probing /cvmfs/grid.cern.ch... OK
Mi 29. Jan 01:35:58 CET 2020: Probing /cvmfs/cernvm-prod.cern.ch... OK
Mi 29. Jan 01:35:59 CET 2020: Probing /cvmfs/sft.cern.ch... OK
Mi 29. Jan 01:36:00 CET 2020: Probing /cvmfs/alice.cern.ch... OK
Mi 29. Jan 01:36:01 CET 2020: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
Mi 29. Jan 01:36:01 CET 2020: 2.7.0.0 18751 7991 53804 59736 2 61 6690448 7077889 4725 65024 0 3745861 99.9594 3272721 11952 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas.cern.ch http://<IP_censored_by_volunteer/>:3128 1
Mi 29. Jan 01:36:01 CET 2020: CVMFS is ok
Mi 29. Jan 01:36:01 CET 2020: Using singularity image /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img
Mi 29. Jan 01:36:01 CET 2020: Checking for singularity binary...
Mi 29. Jan 01:36:01 CET 2020: which: no singularity in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
Mi 29. Jan 01:36:01 CET 2020: Singularity is not installed, using version from CVMFS
Mi 29. Jan 01:36:01 CET 2020: Checking singularity works with /cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec -B /cvmfs /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img hostname
Mi 29. Jan 01:36:12 CET 2020: INFO: Convert SIF file to sandbox...
<hostname_censored_by_volunteer/>
INFO: Cleaning up image...
Mi 29. Jan 01:36:12 CET 2020: Singularity works
Mi 29. Jan 01:36:13 CET 2020: Starting ATLAS job with PandaID=4622986354
Mi 29. Jan 01:36:13 CET 2020: Running command: /cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec --pwd /home/boinc4/BOINC_ATLAS/slots/2 -B /cvmfs,/home /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img sh start_atlas.sh
Mi 29. Jan 09:07:12 CET 2020: *** The last 200 lines of the pilot log: ***
Mi 29. Jan 09:07:12 CET 2020: "preExe": {
Mi 29. Jan 09:07:12 CET 2020: "cpuTime": 1,
Mi 29. Jan 09:07:12 CET 2020: "wallTime": 28
Mi 29. Jan 09:07:12 CET 2020: },
Mi 29. Jan 09:07:12 CET 2020: "total": {
Mi 29. Jan 09:07:12 CET 2020: "cpuTime": 23998,
Mi 29. Jan 09:07:12 CET 2020: "wallTime": 26222
Mi 29. Jan 09:07:12 CET 2020: },
Mi 29. Jan 09:07:12 CET 2020: "validation": {
Mi 29. Jan 09:07:12 CET 2020: "cpuTime": 0,
Mi 29. Jan 09:07:12 CET 2020: "wallTime": 2
Mi 29. Jan 09:07:12 CET 2020: },
Mi 29. Jan 09:07:12 CET 2020: "wallTime": 26190
Mi 29. Jan 09:07:12 CET 2020: }
Mi 29. Jan 09:07:12 CET 2020: },
Mi 29. Jan 09:07:12 CET 2020: "machine": {
Mi 29. Jan 09:07:12 CET 2020: "cpu_family": "23",
Mi 29. Jan 09:07:12 CET 2020: "linux_distribution": [
Mi 29. Jan 09:07:12 CET 2020: "CentOS Linux",
Mi 29. Jan 09:07:12 CET 2020: "7.6.1810",
Mi 29. Jan 09:07:12 CET 2020: "Core"
Mi 29. Jan 09:07:12 CET 2020: ],
Mi 29. Jan 09:07:12 CET 2020: "model": "8",
Mi 29. Jan 09:07:12 CET 2020: "model_name": "AMD Ryzen Threadripper 2950X 16-Core Processor",
Mi 29. Jan 09:07:12 CET 2020: "node": "<hostname_censored_by_volunteer/>",
Mi 29. Jan 09:07:12 CET 2020: "platform": "Linux-4.12.14-lp151.28.36-default-x86_64-with-centos-7.6.1810-Core"
Mi 29. Jan 09:07:12 CET 2020: },
Mi 29. Jan 09:07:13 CET 2020: "transform": {
Mi 29. Jan 09:07:13 CET 2020: "cpuEfficiency": 0.9087,
Mi 29. Jan 09:07:13 CET 2020: "cpuPWEfficiency": 0.9087,
Mi 29. Jan 09:07:13 CET 2020: "cpuTime": 7,
09:07:13 (35863): run_atlas exited; CPU time 24289.052160
09:07:13 (35863): called boinc_finish(0)
Mi 29. Jan 09:07:13 CET 2020: "cpuTimeTotal": 23998,
Mi 29. Jan 09:07:13 CET 2020: "externalCpuTime": 118,
Mi 29. Jan 09:07:13 CET 2020: "processedEvents": 200,
Mi 29. Jan 09:07:13 CET 2020: "trfPredata": null,
Mi 29. Jan 09:07:13 CET 2020: "wallTime": 26289
Mi 29. Jan 09:07:13 CET 2020: }
Mi 29. Jan 09:07:13 CET 2020: }
Mi 29. Jan 09:07:13 CET 2020: }
Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,096 | DEBUG | queue_monitor | pilot.util.auxiliary.4622986354 | update_server | xml:will send fileinfo
Mi 29.
Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,096 | DEBUG | queue_monitor | pilot.control.job | get_proper_state | state=finished Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,096 | DEBUG | queue_monitor | pilot.control.job | get_proper_state | serverstate=running Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,096 | DEBUG | queue_monitor | pilot.control.job | get_proper_state | serverstate=finished Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,096 | INFO | queue_monitor | pilot.control.job.4622986354 | send_state | pilot will not update the server (heartbeat message will be written to file) Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,096 | INFO | queue_monitor | pilot.control.job.4622986354 | send_state | job 4622986354 has finished - writing final server update Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,096 | DEBUG | queue_monitor | pilot.control.job.4622986354 | get_data_structure | building data structure to be sent to server with heartbeat Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,097 | INFO | queue_monitor | pilot.util.auxiliary.4622986354 | get_job_metrics | will not add max space = -244901878 B to job metrics Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,104 | DEBUG | queue_monitor | pilot.api.analytics | get_fitted_data | removing tails from data to be fitted Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,104 | INFO | queue_monitor | pilot.api.analytics | get_fitted_data | fitting pss+swap vs Time Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,105 | INFO | queue_monitor | pilot.api.analytics | get_fitted_data | current memory leak: 7.89 B/s (using 426 data points, chi2=9404948) Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,106 | DEBUG | queue_monitor | pilot.util.auxiliary.4622986354 | get_job_metrics | job metrics="coreCount=1 actualCoreCount=1 nEvents=200 leak=7.89 chi2=9404948" Mi 29. 
Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,106 | INFO | queue_monitor | pilot.control.job.4622986354 | get_data_structure | total number of processed events: 200 (read) Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,107 | INFO | queue_monitor | pilot.user.atlas.utilities | get_memory_values | using path: /home/boinc4/BOINC_ATLAS/slots/2/PanDA_Pilot-4622986354/memory_monitor_summary.json (trf name=prmon) Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,109 | DEBUG | queue_monitor | pilot.user.atlas.utilities | get_memory_monitor_info | summary_dictionary={'Max': {'rx_packets': 3964589, 'nprocs': 10, 'nthreads': 1, 'rx_bytes': 5439601189, 'wtime': 26661, 'rss': 2149436, 'write_bytes': 0, 'vmem': 3364108, 'read_bytes': 0, 'stime': 98, 'tx_bytes': 6554250252, 'pss': 2088708, 'wchar': 0, 'rchar': 0, 'tx_packets': 4788198, 'swap': 0, 'utime': 24157}, 'Avg': {'write_bytes': 0, 'nprocs': 5, 'nthreads': 0, 'rx_bytes': 204021, 'rx_packets': 148, 'vmem': 3269241, 'read_bytes': 0, 'swap': 0, 'tx_bytes': 245828, 'pss': 1920860, 'wchar': 0, 'rchar': 0, 'tx_packets': 179, 'rss': 1975430}} Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,109 | INFO | queue_monitor | pilot.user.atlas.utilities | get_memory_monitor_info | extracted standard info from prmon json Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,109 | INFO | queue_monitor | pilot.user.atlas.utilities | get_memory_monitor_info | extracted standard memory fields from prmon json Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,109 | INFO | queue_monitor | pilot.util.auxiliary.4622986354 | timing_report | .............................. Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,109 | INFO | queue_monitor | pilot.util.auxiliary.4622986354 | timing_report | . Timing measurements: Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,109 | INFO | queue_monitor | pilot.util.auxiliary.4622986354 | timing_report | . get job = 0 s Mi 29. 
Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,109 | INFO | queue_monitor | pilot.util.auxiliary.4622986354 | timing_report | . initial setup = 5 s Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,109 | INFO | queue_monitor | pilot.util.auxiliary.4622986354 | timing_report | . payload setup = 0 s Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,109 | INFO | queue_monitor | pilot.util.auxiliary.4622986354 | timing_report | . total setup = 5 s Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,109 | INFO | queue_monitor | pilot.util.auxiliary.4622986354 | timing_report | . stage-in = 1 s Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,109 | INFO | queue_monitor | pilot.util.auxiliary.4622986354 | timing_report | . payload execution = 26884 s Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,110 | INFO | queue_monitor | pilot.util.auxiliary.4622986354 | timing_report | . stage-out = 4 s Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,110 | INFO | queue_monitor | pilot.util.auxiliary.4622986354 | timing_report | .............................. Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,111 | DEBUG | queue_monitor | pilot.control.job.4622986354 | send_state | wrote heartbeat to file /home/boinc4/BOINC_ATLAS/slots/2/heartbeat.json Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,111 | DEBUG | queue_monitor | pilot.control.job | queue_monitor | job 4622986354 was dequeued from the monitored payloads queue Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,111 | DEBUG | queue_monitor | pilot.control.job | queue_monitor | tmp job object deleted Mi 29. Jan 09:07:13 CET 2020: 2020-01-29 08:06:13,495 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,495 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | job summary report Mi 29. 
Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,495 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | -------------------------------------------------- Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,495 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | PanDA job id: 4622986354 Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,495 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | task id: 20360903 Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,495 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | errors: (none) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,495 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | status: LOG_TRANSFER = DONE Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,496 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | pilot state: finished Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,496 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | transexitcode: 0 Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,496 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | exeerrorcode: 0 Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,496 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | exeerrordiag: Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,496 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | exitcode: 0 Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,496 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | exitmsg: OK Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,496 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | cpuconsumptiontime: 24352 s Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,496 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | nevents: 200 Mi 29. 
Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,496 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | neventsw: 0 Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,496 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | pid: 65308 Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,496 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | pgrp: 65308 Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,496 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | corecount: 1 Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,497 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | event service: False Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,497 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | -------------------------------------------------- Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,497 | INFO | retrieve | pilot.util.auxiliary.4622986354 | make_job_report | Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,497 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue jobs has 0 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,497 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue payloads has 0 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,497 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue data_in has 0 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,497 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue data_out has 0 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,497 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue current_data_in has 0 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,497 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue validated_jobs has 0 job(s) Mi 29. 
Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,497 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue validated_payloads has 0 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,497 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue monitored_payloads has 0 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,498 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue finished_jobs has 0 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,498 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue finished_payloads has 0 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,498 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue finished_data_in has 1 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,498 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue finished_data_out has 0 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,498 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue failed_jobs has 0 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,498 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue failed_payloads has 0 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,498 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue failed_data_in has 0 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,498 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue failed_data_out has 0 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,498 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue completed_jobs has 0 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,498 | INFO | retrieve | pilot.util.queuehandling | queue_report | queue completed_jobids has 1 job(s) Mi 29. Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,498 | INFO | retrieve | pilot.control.job.4622986354 | has_job_completed | job 4622986354 has completed (purged errors) Mi 29. 
Jan 09:07:14 CET 2020: 2020-01-29 08:06:13,499 | INFO | retrieve | pilot.util.processes | cleanup | overall cleanup function is called Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:13,499 | DEBUG | retrieve | pilot.util.processes | cleanup | work directory was removed: /home/boinc4/BOINC_ATLAS/slots/2/PanDA_Pilot-4622986354 Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:14,511 | INFO | retrieve | pilot.info.jobdata | collect_zombies | --- collectZombieJob: --- 10, [65308] Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:14,511 | INFO | retrieve | pilot.info.jobdata | collect_zombies | zombie collector trying to kill pid 65308 Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:14,511 | INFO | retrieve | pilot.info.jobdata | collect_zombies | harmless exception when collecting zombies: [Errno 10] No child processes Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:15,515 | INFO | retrieve | pilot.util.processes | cleanup | collected zombie processes Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:15,515 | INFO | retrieve | pilot.util.processes | cleanup | will now attempt to kill all subprocesses of pid=65308 Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:15,823 | INFO | retrieve | pilot.util.processes | kill_processes | process IDs to be killed: [65308] (in reverse order) Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:15,941 | WARNING | retrieve | pilot.util.processes | kill_processes | found no corresponding commands to process id(s) Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:15,941 | INFO | retrieve | pilot.util.processes | kill_orphans | Do not look for orphan processes in BOINC jobs Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:15,941 | INFO | retrieve | pilot.control.job | retrieve | ready for new job Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:15,941 | INFO | retrieve | root | retrieve | pilot has finished for previous job - re-establishing logging Mi 29. Jan 09:07:15 CET 2020: mpi4py not found Mi 29. 
Jan 09:07:15 CET 2020: 2020-01-29 08:06:15,944 | INFO | retrieve | pilot.util.auxiliary | pilot_version_banner | **************************************** Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:15,944 | INFO | retrieve | pilot.util.auxiliary | pilot_version_banner | *** PanDA Pilot version 2.3.4 (12) *** Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:15,944 | INFO | retrieve | pilot.util.auxiliary | pilot_version_banner | **************************************** Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:15,945 | INFO | retrieve | pilot.util.auxiliary | pilot_version_banner | Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:15,946 | INFO | retrieve | pilot.util.auxiliary | display_architecture_info | architecture information: Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:16,724 | INFO | retrieve | pilot.util.auxiliary | display_architecture_info | Mi 29. Jan 09:07:15 CET 2020: LSB Version: :core-4.1-amd64:core-4.1-noarch Mi 29. Jan 09:07:15 CET 2020: Distributor ID: CentOS Mi 29. Jan 09:07:15 CET 2020: Description: CentOS Linux release 7.6.1810 (Core) Mi 29. Jan 09:07:15 CET 2020: Release: 7.6.1810 Mi 29. Jan 09:07:15 CET 2020: Codename: Core Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:16,724 | INFO | retrieve | pilot.util.auxiliary | pilot_version_banner | **************************************** Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:17,230 | DEBUG | retrieve | pilot.util.monitoring | check_local_space | checking local space on /home/boinc4/BOINC_ATLAS/slots/2 Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:17,270 | INFO | retrieve | pilot.util.monitoring | check_local_space | sufficient remaining disk space (100103356416 B) Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:17,271 | WARNING | retrieve | pilot.control.job | proceed_with_getjob | since timefloor is set to 0, pilot was only allowed to run one job Mi 29. 
Jan 09:07:15 CET 2020: 2020-01-29 08:06:17,271 | DEBUG | retrieve | pilot.control.job | retrieve | [job] retrieve thread has finished Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:17,287 | DEBUG | job | pilot.control.job | control | job control ending since graceful_stop has been set Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:17,287 | DEBUG | job | pilot.control.job | control | [job] control thread has finished Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:17,313 | WARNING | copytool_out | pilot.util.common | should_abort | data:copytool_out:received graceful stop - abort after this iteration Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:17,316 | WARNING | monitor | pilot.control.monitor | control | aborting monitor loop since graceful_stop has been set Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:17,316 | INFO | monitor | pilot.control.monitor | control | [monitor] control thread has ended Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:17,371 | DEBUG | MainThread | pilot.workflow.generic | run | thread count now at 14 threads Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:17,372 | DEBUG | MainThread | pilot.workflow.generic | run | enumerate: [<_MainThread(MainThread, started 140543163242304)>, <ExcThread(queue_monitor, started 140542000232192)>, <ExcThread(queue_monitoring, started 140542570641152)>, <ExcThread(validate, started 140542992168704)>, <ExcThread(execute_payloads, started 140541991839488)>, <ExcThread(validate_post, started 140542595819264)>, <ExcThread(data, started 140542983776000)>, <ExcThread(copytool_in, started 140542553855744)>, <ExcThread(create_data_payload, started 140542950205184)>, <ExcThread(failed_post, started 140542579033856)>, <ExcThread(validate_pre, started 140542958597888)>, <ExcThread(job_monitor, started 140542562248448)>, <ExcThread(copytool_out, started 140542975383296)>, <ExcThread(payload, started 140542966990592)>] Mi 29. 
Jan 09:07:15 CET 2020: 2020-01-29 08:06:17,488 | DEBUG | create_data_payload | pilot.control.job | create_data_payload | [job] create_data_payload thread has finished Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:17,543 | WARNING | queue_monitoring | pilot.util.common | should_abort | data:queue_monitoring:received graceful stop - abort after this iteration Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:17,583 | INFO | validate_pre | pilot.control.payload | validate_pre | [payload] validate_pre thread has finished Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:17,671 | DEBUG | copytool_in | pilot.control.data | copytool_in | [data] copytool_in thread has finished Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:17,793 | INFO | failed_post | pilot.control.payload | failed_post | [payload] failed_post thread has finished Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:18,063 | DEBUG | validate | pilot.control.job | validate | [job] validate thread has finished Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:18,155 | WARNING | queue_monitor | pilot.util.common | should_abort | job:queue_monitor:received graceful stop - abort after this iteration Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:18,155 | DEBUG | queue_monitor | pilot.control.job | queue_monitor | [job] queue monitor thread has finished Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:18,287 | DEBUG | data | pilot.control.data | control | data control ending since graceful_stop has been set Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:18,287 | DEBUG | data | pilot.control.data | control | [data] control thread has finished Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:18,318 | DEBUG | copytool_out | pilot.control.data | copytool_out | [data] copytool_out thread has finished Mi 29. Jan 09:07:15 CET 2020: 2020-01-29 08:06:18,348 | INFO | validate_post | pilot.control.payload | validate_post | [payload] validate_post thread has finished Mi 29. 
Jan 09:07:16 CET 2020: 2020-01-29 08:06:18,383 | DEBUG | payload | pilot.control.payload | control | payload control ending since graceful_stop has been set Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:18,383 | DEBUG | payload | pilot.control.payload | control | [payload] control thread has finished Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:18,387 | DEBUG | MainThread | pilot.workflow.generic | run | thread count now at 4 threads Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:18,387 | DEBUG | MainThread | pilot.workflow.generic | run | enumerate: [<_MainThread(MainThread, started 140543163242304)>, <ExcThread(queue_monitoring, started 140542570641152)>, <ExcThread(execute_payloads, started 140541991839488)>, <ExcThread(job_monitor, started 140542562248448)>] Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:18,563 | INFO | execute_payloads | pilot.control.payload | execute_payloads | [payload] execute_payloads thread has finished Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:19,403 | DEBUG | MainThread | pilot.workflow.generic | run | thread count now at 3 threads Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:19,403 | DEBUG | MainThread | pilot.workflow.generic | run | enumerate: [<_MainThread(MainThread, started 140543163242304)>, <ExcThread(queue_monitoring, started 140542570641152)>, <ExcThread(job_monitor, started 140542562248448)>] Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:20,546 | DEBUG | queue_monitoring | pilot.control.data | queue_monitoring | [data] queue_monitor thread has finished Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:21,423 | DEBUG | MainThread | pilot.workflow.generic | run | thread count now at 2 threads Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:21,423 | DEBUG | MainThread | pilot.workflow.generic | run | enumerate: [<_MainThread(MainThread, started 140543163242304)>, <ExcThread(job_monitor, started 140542562248448)>] Mi 29. 
Jan 09:07:16 CET 2020: 2020-01-29 08:06:56,145 | WARNING | job_monitor | pilot.control.job | check_job_monitor_waiting_time | no jobs in monitored_payloads queue (waited for 74 s) Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:56,145 | DEBUG | job_monitor | pilot.control.job | job_monitor | [job] job monitor thread has finished Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:56,679 | INFO | MainThread | pilot.workflow.generic | run | end of generic workflow (traces error code: 0) Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:56,679 | INFO | MainThread | root | wrap_up | traces error code: 0 Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:56,679 | INFO | MainThread | root | wrap_up | pilot has finished Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:57,354 [wrapper] ==== pilot stdout END ==== Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:57,609 [wrapper] ==== wrapper stdout RESUME ==== Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:58,639 [wrapper] Pilot exit status: 0 Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:59,802 [wrapper] STATUSCODE: 0 Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:06:59,989 [wrapper] apfmon messages muted Mi 29. Jan 09:07:16 CET 2020: ---- find pandaIDs.out ---- Mi 29. Jan 09:07:16 CET 2020: total 60 Mi 29. Jan 09:07:16 CET 2020: -rw------- 1 boinc4 boinc 11357 Jul 25 2019 LICENSE Mi 29. Jan 09:07:16 CET 2020: -rw------- 1 boinc4 boinc 20 Sep 9 13:04 MANIFEST.IN Mi 29. Jan 09:07:16 CET 2020: -rw------- 1 boinc4 boinc 8 Dec 12 19:00 PILOTVERSION Mi 29. Jan 09:07:16 CET 2020: -rw------- 1 boinc4 boinc 2212 Nov 14 11:01 README.md Mi 29. Jan 09:07:16 CET 2020: -rw------- 1 boinc4 boinc 221 Jul 25 2019 TODO.md Mi 29. Jan 09:07:16 CET 2020: -rw------- 1 boinc4 boinc 11 Jan 29 01:37 pandaIDs.out Mi 29. Jan 09:07:16 CET 2020: drwx------ 14 boinc4 boinc 320 Jan 29 01:37 pilot Mi 29. Jan 09:07:16 CET 2020: -rwx------ 1 boinc4 boinc 21225 Dec 12 19:00 pilot.py Mi 29. Jan 09:07:16 CET 2020: -rw------- 1 boinc4 boinc 766 Oct 10 16:01 setup.py Mi 29. 
Jan 09:07:16 CET 2020: Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:07:00,779 [wrapper] pandaIDs.out files: Mi 29. Jan 09:07:16 CET 2020: -rw------- 1 boinc4 boinc 11 Jan 29 01:37 /home/boinc4/BOINC_ATLAS/slots/2/pilot2/pandaIDs.out Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:07:03,550 [wrapper] pandaIDs.out content: Mi 29. Jan 09:07:16 CET 2020: 4622986354 Mi 29. Jan 09:07:16 CET 2020: Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:07:04,906 [wrapper] Test setup, not cleaning Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:07:05,829 [wrapper] ==== wrapper stdout END ==== Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:07:06,373 [wrapper] ==== wrapper stderr END ==== Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:07:08,893 [wrapper] wrapper wrapperexiting ec=0, duration=27042 Mi 29. Jan 09:07:16 CET 2020: 2020-01-29 08:07:09,882 [wrapper] apfmon messages muted Mi 29. Jan 09:07:16 CET 2020: *** Error codes and diagnostics *** Mi 29. Jan 09:07:16 CET 2020: "exeErrorCode": 0, Mi 29. Jan 09:07:16 CET 2020: "exeErrorDiag": "", Mi 29. Jan 09:07:16 CET 2020: "pilotErrorCode": 0, Mi 29. Jan 09:07:16 CET 2020: "pilotErrorDiag": "", Mi 29. Jan 09:07:16 CET 2020: *** Listing of results directory *** Mi 29. Jan 09:07:16 CET 2020: insgesamt 372420 Mi 29. Jan 09:07:16 CET 2020: -rw-r--r-- 1 boinc4 boinc 267260 28. Jan 22:24 pilot2.tar.gz Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 4492 28. Jan 22:36 queuedata.json Mi 29. Jan 09:07:17 CET 2020: -rwx------ 1 boinc4 boinc 16499 28. Jan 22:36 runpilot2-wrapper.sh Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 100 29. Jan 01:35 wrapper_26015_x86_64-pc-linux-gnu Mi 29. Jan 09:07:17 CET 2020: -rwxr-xr-x 1 boinc4 boinc 5432 29. Jan 01:35 run_atlas Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 105 29. Jan 01:35 job.xml Mi 29. Jan 09:07:17 CET 2020: drwxrwx--x 2 boinc4 boinc 120 29. Jan 01:35 shared Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 6392 29. Jan 01:35 init_data.xml Mi 29. 
Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 8192 29. Jan 01:35 boinc_mmap_file Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 0 29. Jan 01:35 boinc_lockfile Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 15218 29. Jan 01:36 start_atlas.sh Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 858 29. Jan 01:36 RTE.tar.gz Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 276437 29. Jan 01:36 input.tar.gz Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 246141021 29. Jan 01:36 EVNT.19609587._000498.pool.root.1 Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 2903 29. Jan 01:36 pandaJob.out Mi 29. Jan 09:07:17 CET 2020: drwxr-xr-x 3 boinc4 boinc 60 29. Jan 01:36 APPS Mi 29. Jan 09:07:17 CET 2020: -rw------- 1 boinc4 boinc 397 29. Jan 01:36 setup.sh.local Mi 29. Jan 09:07:17 CET 2020: -rw------- 1 boinc4 boinc 3671389 29. Jan 01:37 agis_schedconf.cvmfs.json Mi 29. Jan 09:07:17 CET 2020: -rw------- 1 boinc4 boinc 7849349 29. Jan 01:37 agis_ddmendpoints.json Mi 29. Jan 09:07:17 CET 2020: drwx------ 3 boinc4 boinc 300 29. Jan 01:37 pilot2 Mi 29. Jan 09:07:17 CET 2020: -rw------- 1 boinc4 boinc 119402527 29. Jan 09:04 HITS.20360903._025614.pool.root.1 Mi 29. Jan 09:07:17 CET 2020: -rw------- 1 boinc4 boinc 801 29. Jan 09:05 memory_monitor_summary.json Mi 29. Jan 09:07:17 CET 2020: -rw------- 1 boinc4 boinc 313307 29. Jan 09:06 log.20360903._025614.job.log.tgz.1 Mi 29. Jan 09:07:17 CET 2020: -rw------- 1 boinc4 boinc 58912 29. Jan 09:06 heartbeat.json Mi 29. Jan 09:07:17 CET 2020: -rw------- 1 boinc4 boinc 8586 29. Jan 09:06 pilotlog.txt Mi 29. Jan 09:07:17 CET 2020: -rw------- 1 boinc4 boinc 1416718 29. Jan 09:07 log.20360903._025614.job.log.1 Mi 29. Jan 09:07:17 CET 2020: -rw------- 1 boinc4 boinc 463 29. Jan 09:07 output.list Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 9372 29. Jan 09:07 runtime_log.err Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 692 29. Jan 09:07 runtime_log Mi 29. 
Jan 09:07:17 CET 2020: -rw------- 1 boinc4 boinc 1802240 29. Jan 09:07 result.tar.gz Mi 29. Jan 09:07:17 CET 2020: -rw------- 1 boinc4 boinc 568 29. Jan 09:07 5jSKDm7hnGwn9Rq4apoT9bVoABFKDmABFKDmANOODmABFKDm30xRam.diag Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 2336 29. Jan 09:07 stderr.txt Mi 29. Jan 09:07:17 CET 2020: HITS file was successfully produced: Mi 29. Jan 09:07:17 CET 2020: -rw------- 1 boinc4 boinc 119402527 29. Jan 09:04 shared/HITS.pool.root.1 Mi 29. Jan 09:07:17 CET 2020: *** Contents of shared directory: *** Mi 29. Jan 09:07:17 CET 2020: insgesamt 359036 Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 246141021 29. Jan 01:35 ATLAS.root_0 Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 15218 29. Jan 01:35 start_atlas.sh Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 858 29. Jan 01:35 RTE.tar.gz Mi 29. Jan 09:07:17 CET 2020: -rw-r--r-- 1 boinc4 boinc 276437 29. Jan 01:35 input.tar.gz Mi 29. Jan 09:07:17 CET 2020: -rw------- 1 boinc4 boinc 119402527 29. Jan 09:04 HITS.pool.root.1 Mi 29. Jan 09:07:17 CET 2020: -rw------- 1 boinc4 boinc 1802240 29. Jan 09:07 result.tar.gz |
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
I think this is caused by the way logging to stderr works in the new version, where stdout is redirected to a function. If the function is slow (e.g. because it calls "date" for every line), it writes the output with some delay. The boinc_finish() messages, on the other hand, are written directly by the BOINC wrapper, so they appear immediately. If someone knows a better way to redirect stdout to stderr with timestamps, the new wrapper is hosted on GitHub and contributions are welcome :) https://github.com/davidgcameron/boinc-scripts/blob/master/native/run_atlas.sh |
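
One possible answer to the question in the last post, as a minimal sketch rather than a tested patch for run_atlas.sh: format the timestamp with bash's builtin printf '%(...)T' (available in bash 4.2 or newer) instead of forking "date" for every line, and run the payload in a pipeline so that the shell waits for the logger to drain before the script exits. The function names log_with_timestamp and run_payload are hypothetical and are not taken from the existing script. Whether this removes the delay and the misplaced boinc_finish() lines seen in the log above is an assumption that would need checking on a real task.

```bash
#!/usr/bin/env bash

# Hypothetical helper: prefix every line read from stdin with a timestamp
# and write it to stderr. The %(...)T format is handled by bash itself
# (bash >= 4.2), so no external "date" process is started per line.
log_with_timestamp() {
    local line
    # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%(%Y-%m-%d %H:%M:%S)T: %s\n' -1 "$line" >&2
    done
}

# Hypothetical helper: run a command with stdout and stderr sent through the
# logger. A pipeline only finishes once every stage has exited, so all log
# lines are written before this function returns.
run_payload() {
    "$@" 2>&1 | log_with_timestamp
    # Hand back the exit status of the command, not that of the logger.
    return "${PIPESTATUS[0]}"
}
```

A call such as run_payload echo hello writes one timestamped line to stderr and returns the exit status of echo. The timestamp format here is ISO-like rather than the locale-dependent format in the log above; that is a choice made for this sketch, not something the existing script does.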
©2025 CERN