Interactive app environment in 1.7

While testing OOD 1.7, I noticed that the interactive apps seem to be starting with a limited environment, which is different from 1.6.

In our case, we source Lmod through a series of profile.d-like scripts that are called from /etc/profile.d, shown below. In 1.6 the module.sh would get sourced, while in 1.7 it does not. We do have a condition below that UID < 500 does not source these files, so, is there any chance that the interactive app sessions start as a non-user? I don't see why/how that could be, but I just want to make sure.

Or, any other thoughts?

Below is the profile.d file structure:

```shell
$ cat /etc/profile.d/chpc.sh
if [[ $UID -ge 500 ]]
then
  if [ -f /uufs/chpc.utah.edu/sys/etc/chpc.sh ]
  then
    source /uufs/chpc.utah.edu/sys/etc/chpc.sh
  fi
fi

$ cat /uufs/chpc.utah.edu/sys/etc/chpc.sh
for i in /uufs/chpc.utah.edu/sys/etc/profile.d/*.sh; do
  if [ -r "$i" ]; then
    if [ "$PS1" ]; then
      . "$i"
    else
      . "$i" >/dev/null 2>&1
    fi
  fi
done

$ ls /uufs/chpc.utah.edu/sys/etc/profile.d/*.sh
/uufs/chpc.utah.edu/sys/etc/profile.d/module.sh
```

(module.sh is a somewhat longer file that sets up the appropriate Lmod.)

Thanks,
MC

I should add that we do have a workaround for this: add the following to the template/script.sh.erb:

```shell
if [ -z "$LMOD_VERSION" ]; then
  source /etc/profile.d/chpc.sh
fi
```

You should use `script_wrapper` or `header` in your cluster config instead of editing the template scripts (`script_wrapper` needs the `%s` because it wraps the script, so you can put things above or below it).

```yaml
batch_connect:
  vnc:
    header: "#!/bin/bash"
    script_wrapper: |
      if [ -z "$LMOD_VERSION" ]; then
        source /etc/profile.d/chpc.sh
      fi
      %s
  basic:
    # same result as above
    header: |
      #!/bin/bash
      if [ -z "$LMOD_VERSION" ]; then
        source /etc/profile.d/chpc.sh
      fi
```

As to the difference between 1.6 and 1.7, I can’t say off the top of my head why they’d be different.


Thanks Jeff, I did not think about the cluster configs.

Actually, @mcuma I can think of what’s different now, especially as it relates to environment variables. We added something to copy_environment for all the schedulers.

In, say, SLURM it’s `--export`. Do you set and/or use the `job_environment` map? (I found some Utah docs that seem to indicate you use SLURM.)

The behaviour now is: if you don’t use `job_environment`, it won’t use the `--export` flag at all, which is what it did in 1.6.

If you do use `job_environment`, then it’ll do `--export=NONE,FOO,BAR` if you don’t set `copy_environment`, and `--export=ALL,FOO,BAR` if you do set `copy_environment` to true (I know that sentence has a lot of ifs in it and may be convoluted).
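The flag logic described above can be sketched as a small shell function (`build_export_flag` is a hypothetical helper for illustration, not actual OOD code):

```shell
#!/bin/bash
# Hypothetical helper mimicking the described --export behavior; not OOD code.
build_export_flag() {
  local copy_env="$1"; shift     # "true" if copy_environment is set
  local vars="$*"                # job_environment variable names, if any
  if [ -z "$vars" ]; then
    echo ""                      # no job_environment: no --export flag (1.6-like)
  elif [ "$copy_env" = "true" ]; then
    echo "--export=ALL,${vars// /,}"
  else
    echo "--export=NONE,${vars// /,}"
  fi
}

build_export_flag false FOO BAR   # prints --export=NONE,FOO,BAR
build_export_flag true  FOO BAR   # prints --export=ALL,FOO,BAR
```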

This is the only thing I can think of. The sessions couldn’t be starting as a non-user: OOD submits the job through SLURM as the UID of the given user (unless you have some wrapper script in front of srun).

Hi Jeff,

I think you may be on the right track, since the environment seems to be passed, but the Lmod "module" command is an alias, which probably does not get passed. Though the job should be opening a new terminal on the compute node and sourcing Lmod, which is what a standard SLURM job does, and there the alias is functional. So it still looks like the /etc/profile.d/… part is not being sourced.
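To illustrate the alias point: Lmod actually defines `module` as a shell function, and child shells do not inherit functions unless they are explicitly exported. A small sketch with a stand-in function (not real Lmod):

```shell
#!/bin/bash
# Stand-in for Lmod's `module` shell function; not the real thing.
module() { echo "pretend Lmod module command"; }

# A fresh child shell does not see the function...
bash -c 'type module >/dev/null 2>&1' || echo "not inherited"

# ...until it is exported (bash-specific), after which it is visible:
export -f module
bash -c 'type module >/dev/null 2>&1' && echo "inherited"
```

This is why simply passing environment variables to the job is not equivalent to re-sourcing /etc/profile.d on the compute node.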

I don’t think we do anything with the job environment, and the default should be --export=ALL

Is there any documentation on the job_environment?

Thanks,
MC

In updating from 1.6 to 1.7, I had to add sourcing the module environment setup .sh script to script_wrapper.

Can we add this to the docs somewhere, maybe in the release notes? This burned me as well.

Thanks,
Morgan

:frowning_face: Well there’s clearly something awry here. Looking over the code again, the default SLURM behavior for job environments should be the same. That is, the commands executed should be the same as before, with the same environment.

@milberg & @mjbludwig do you also use SLURM?

Looking at SLURM help tickets I see stuff like this (the SLURM FAQ also says something similar):

> The user environment is re-populated from a copy of the environment taken when the job was submitted through sbatch, with the SLURM_* environment variables added in to it.

So somehow we corrupted the environment when we run srun.
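A rough local illustration of what a scrubbed job environment looks like, using `env -i` to stand in for the effect of `--export=NONE` (`LMOD_VERSION` is just an example variable):

```shell
#!/bin/bash
export LMOD_VERSION=8.2

# A normally spawned child shell inherits exported variables:
bash -c 'echo "inherited: ${LMOD_VERSION:-<unset>}"'

# A child started with a scrubbed environment does not (env -i stands in
# here for what --export=NONE does to a job's environment):
env -i bash -c 'echo "scrubbed: ${LMOD_VERSION:-<unset>}"'
```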

@jeff.ohrstrom Yes, we are using Slurm. Besides adding sourcing the module environment setup .sh script in the cluster config, I later found that I needed to add it to the Job Composer templates as well.

OK, we clearly broke previous behavior.

We used to set `SBATCH_EXPORT=NONE`, which makes it unclear to me how SLURM used to find this function definition. We now just use the `--export` argument, but only if there are `job_environment` variables.

There’s a lot of talk in the tickets that generated this change, but it’s likely we’ll have to patch this, as it looks like the previous behavior (however it happened) is expected and indeed a much better experience.

Glad I found this thread. We updated to 1.7.11 in production today and found that it broke all our jupyter apps. I had forgotten to test this in dev. It doesn’t know the ‘module’ command so nothing launches.

We will be fixing this in 1.7.13 so that the behavior is reverted. In the meantime, the fixes in this PR should address the problem and be unaffected when 1.7.13 is released.


We have an additional issue at our center: users and groups create their own modules and source them in their .bashrc files. The workaround provided doesn’t help with this. Any idea when you’ll be releasing version 1.7.13?

Thanks,
Dori

We have the patch in and we’re building now, so tomorrow we should be able to verify everything is working as it should.

Then we’ll be able to promote it to stable, so maybe tomorrow (5-28) or maybe Friday (5-29), depending (and it’s 1.7.14 now, to also pick up a Safari/noVNC patch).